Terraform at Scale: Managing Multi-Environment Infrastructure Without Losing Your Mind

Kicked Team · February 12, 2026 · 6 min read

Your first Terraform project is magical. You write some HCL, run terraform apply, and infrastructure appears. Your 50th project is a nightmare of copy-pasted modules, state file conflicts, and a CI pipeline that takes 40 minutes.

We manage Terraform across dozens of client environments. Here's every pattern we've learned the hard way.

The Monorepo vs. Polyrepo Decision

This is the first fork in the road, and most teams choose wrong.

Monorepo (all infrastructure in one repo):

  • ✅ Easy to search and refactor across environments
  • ✅ Single CI pipeline to maintain
  • ❌ Blast radius — a bad merge can affect everything
  • ❌ Slow plans as the codebase grows

Polyrepo (one repo per environment/project):

  • ✅ Isolation — changes can't leak across environments
  • ✅ Independent release cycles
  • ❌ Module versioning becomes critical
  • ❌ Harder to maintain consistency

We use a hybrid approach: one repo per logical boundary (e.g., per client or per product), with a shared module registry. This gives us isolation where it matters and reuse where it helps.
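
In practice, every repo consumes the shared modules the same way. A minimal sketch of what that looks like, assuming the modules are published to a private registry under a kicked namespace (the registry path and version number are illustrative):

# Sketch: consuming a shared, versioned vpc module from the registry
# (registry namespace and version are assumptions, not our actual paths)
module "vpc" {
  source  = "app.terraform.io/kicked/vpc/aws"
  version = "2.3.1"  # production pins an exact release; staging can track "~> 2.3"

  environment = "production"
  cidr_block  = "10.0.0.0/16"
  az_count    = 3
}

Because the pin lives in each consuming repo, a new module release only rolls out when a repo chooses to bump it.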

Repository Structure

After years of iteration, this is our standard layout:

infrastructure/
├── environments/
│   ├── production/
│   │   ├── networking/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── terraform.tfvars
│   │   ├── compute/
│   │   ├── database/
│   │   └── dns/
│   ├── staging/
│   │   └── ... (mirrors production)
│   └── shared/
│       ├── iam/
│       └── monitoring/
├── modules/
│   ├── vpc/
│   ├── server/
│   ├── database/
│   └── monitoring-stack/
└── terragrunt.hcl

Key principles:

  1. One state file per resource group — networking, compute, database each have their own state. A bad terraform apply in compute can't destroy your network.
  2. Environments mirror each other — Staging is structurally identical to production. If it works in staging, it works in production.
  3. Modules are versioned — Tagged releases, semantic versioning. Production pins to stable versions.

State Management

Remote state is non-negotiable. Local state files are a ticking time bomb.

# backend.tf
terraform {
  backend "s3" {
    bucket         = "kicked-terraform-state"
    key            = "production/networking/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

Rules we enforce:

  • State locking — DynamoDB (AWS) or equivalent. Two engineers running apply simultaneously will corrupt your state.
  • Encryption at rest — State files contain secrets (database passwords, API keys). Always encrypt.
  • No manual state surgery — terraform state mv and terraform import only through CI, with an audit trail.
  • State file per resource group — Not per environment. A single state file for all of production is a recipe for 20-minute plan times.

Terragrunt: The DRY Layer

Raw Terraform gets repetitive fast. Every environment needs the same backend config, the same provider setup, the same variable patterns. Terragrunt eliminates the duplication:

# environments/production/networking/terragrunt.hcl
terraform {
  source = "../../../modules/vpc"
}

include "root" {
  path = find_in_parent_folders()
}

inputs = {
  environment = "production"
  cidr_block  = "10.0.0.0/16"
  az_count    = 3
}

# root terragrunt.hcl
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "kicked-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

The ${path_relative_to_include()} call automatically generates a unique state key from the directory path (for environments/production/networking, the key becomes environments/production/networking/terraform.tfstate). No more copy-pasting backend configs.

Module Design Principles

Bad modules are worse than no modules. Here's what makes a good one:

1. Opinionated Defaults, Full Overrides

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"  # Sensible default
}

variable "monitoring_enabled" {
  description = "Enable detailed monitoring"
  type        = bool
  default     = true  # Safe default
}

A module should work with none of its optional variables set. But every default should be overridable.
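
For example, a caller should be able to get a working instance while supplying only the inputs that have no defaults (a sketch; treating environment as the sole required input is an assumption):

# Sketch: calling the module with only its required inputs
module "app_server" {
  source      = "../../modules/server"
  environment = "staging"
  # instance_type and monitoring_enabled fall back to the module defaults
}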

2. Outputs Are Your API

# Good: outputs everything downstream might need
output "vpc_id" { value = aws_vpc.main.id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }
output "public_subnet_ids" { value = aws_subnet.public[*].id }
output "nat_gateway_ips" { value = aws_eip.nat[*].public_ip }

# Bad: "just look at the state file"

3. Validate Early

variable "environment" {
  type = string
  validation {
    condition     = contains(["production", "staging", "development"], var.environment)
    error_message = "Environment must be production, staging, or development."
  }
}

Catch mistakes at plan time, not apply time.

CI/CD Pipeline

Nobody runs terraform apply locally. Our pipeline:

# Simplified CI flow
on:
  pull_request:
    paths: ["environments/**"]
  push:
    branches: [main]
    paths: ["environments/**"]

jobs:
  plan:
    steps:
      - name: Detect changed environments
        # Only plan what changed
      - name: terraform init
      - name: terraform validate
      - name: terraform plan
        # Output saved as artifact
      - name: Post plan as PR comment
        # Reviewers see exact changes

  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - name: terraform apply
        # Uses the saved plan — no drift between plan and apply

Critical rules:

  • Plan on PR — Every reviewer sees the exact changes before merge
  • Apply on merge — Merging to main triggers the apply. No manual steps.
  • Saved plans — The plan artifact from PR is what gets applied. No surprises. (See the sketch after this list.)
  • Changed-only — Don't plan all 50 environments when one file changed
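
The saved-plan handoff is the piece teams most often skip. A minimal sketch of the two halves, assuming GitHub Actions (artifact names are illustrative; downloading across separate workflow runs additionally needs the PR run's id):

# Plan job: write the plan to a file and publish it as an artifact
- name: terraform plan
  run: terraform plan -out=tfplan
- name: Upload plan artifact
  uses: actions/upload-artifact@v4
  with:
    name: tfplan-production-networking
    path: tfplan

# Apply job: download that exact plan and apply it, byte for byte
- name: Download plan artifact
  uses: actions/download-artifact@v4
  with:
    name: tfplan-production-networking
- name: terraform apply
  run: terraform apply tfplan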

Drift Detection

Infrastructure drifts. Someone clicks in a console, a security team adds a rule manually, an auto-scaler changes instance counts. You need to detect it:

# Run nightly via cron
terragrunt run-all plan -detailed-exitcode

# Exit code 0 = no changes (in sync)
# Exit code 1 = error
# Exit code 2 = changes detected (DRIFT!)

We run drift detection nightly. Any drift triggers a Slack alert and a ticket. The resolution is always the same: either update the Terraform to match reality, or re-apply to match Terraform.
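
A sketch of how the nightly run can be wired up as a scheduled CI job, again assuming GitHub Actions (the schedule and the Slack webhook step are illustrative; terraform/terragrunt setup steps are omitted):

on:
  schedule:
    - cron: "0 3 * * *"  # nightly at 03:00 UTC

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Plan all stacks
        id: plan
        run: terragrunt run-all plan -detailed-exitcode
        continue-on-error: true
      - name: Alert on drift or errors
        if: steps.plan.outcome == 'failure'  # fires on exit code 1 or 2
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          curl -X POST -H 'Content-type: application/json' \
            --data '{"text":"Nightly Terraform drift check found changes or errors"}' \
            "$SLACK_WEBHOOK_URL"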

Secrets Handling

Never put secrets in .tfvars files. Options:

  1. Environment variables — TF_VAR_db_password set in CI
  2. Vault provider — Read secrets from HashiCorp Vault at plan time
  3. SOPS — Encrypt secret .tfvars files with age/KMS keys
  4. 1Password/AWS SSM — Reference secrets by path, resolve at apply time

We prefer the Vault provider for production and SOPS for smaller setups.
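
As a sketch of the Vault approach (the secret path and the consuming resource are illustrative; assumes the hashicorp/vault provider is already configured):

# Sketch: reading a database password from Vault at plan time
data "vault_generic_secret" "db" {
  path = "secret/production/database"
}

resource "aws_db_instance" "main" {
  # ... other arguments ...
  password = data.vault_generic_secret.db.data["password"]
}

The value still ends up in the state file, which is one more reason the backend must be encrypted.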

Common Mistakes We Fix

Mistake                              Fix
One giant state file                 Split by resource group
Copy-pasting between environments    Use Terragrunt or workspaces
Running apply locally                CI/CD only, saved plans
No state locking                     DynamoDB/equivalent, always
Hardcoded values in modules          Variables with validation
No drift detection                   Nightly plan cron
Secrets in state, unencrypted        Encrypted backend + sensitive outputs (sketch below)
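
The "sensitive outputs" fix from the last row is a one-line change (the output name is illustrative):

# Sketch: marking an output as sensitive so Terraform redacts it in plan/apply output
output "db_password" {
  value     = aws_db_instance.main.password
  sensitive = true  # still stored in state, but never printed to CI logs
}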

Getting Started

If your Terraform is already a mess, don't try to fix everything at once:

  1. Week 1 — Move state to a remote encrypted backend with locking
  2. Week 2 — Split your monolithic state into resource groups
  3. Week 3 — Extract your first shared module
  4. Week 4 — Set up CI/CD with plan-on-PR

Or let us handle it. We've migrated teams from "one big main.tf" to scalable multi-environment setups in under a month.