Terraform at Scale: Managing Multi-Environment Infrastructure Without Losing Your Mind
Your first Terraform project is magical. You write some HCL, run `terraform apply`, and infrastructure appears. Your 50th project is a nightmare of copy-pasted modules, state file conflicts, and a CI pipeline that takes 40 minutes.
We manage Terraform across dozens of client environments. Here's every pattern we've learned the hard way.
The Monorepo vs. Polyrepo Decision
This is the first fork in the road, and most teams choose wrong.
Monorepo (all infrastructure in one repo):
- ✅ Easy to search and refactor across environments
- ✅ Single CI pipeline to maintain
- ❌ Blast radius — a bad merge can affect everything
- ❌ Slow plans as the codebase grows
Polyrepo (one repo per environment/project):
- ✅ Isolation — changes can't leak across environments
- ✅ Independent release cycles
- ❌ Module versioning becomes critical
- ❌ Harder to maintain consistency
We use a hybrid approach: one repo per logical boundary (e.g., per client or per product), with a shared module registry. This gives us isolation where it matters and reuse where it helps.
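Consuming a module from the shared registry looks roughly like this — a sketch, where the registry namespace and version constraint are assumptions:

```hcl
# Consuming a shared, versioned module (namespace and version are illustrative)
module "vpc" {
  source  = "app.terraform.io/acme/vpc/aws"
  version = "~> 2.1" # staging may track newer; production pins stable

  environment = "production"
  cidr_block  = "10.0.0.0/16"
}
```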
Repository Structure
After years of iteration, this is our standard layout:
```
infrastructure/
├── environments/
│   ├── production/
│   │   ├── networking/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── terraform.tfvars
│   │   ├── compute/
│   │   ├── database/
│   │   └── dns/
│   ├── staging/
│   │   └── ... (mirrors production)
│   └── shared/
│       ├── iam/
│       └── monitoring/
├── modules/
│   ├── vpc/
│   ├── server/
│   ├── database/
│   └── monitoring-stack/
└── terragrunt.hcl
```
Key principles:
- One state file per resource group — `networking`, `compute`, and `database` each have their own state. A bad `terraform apply` in compute can't destroy your network.
- Environments mirror each other — Staging is structurally identical to production. If it works in staging, it works in production.
- Modules are versioned — Tagged releases, semantic versioning. Production pins to stable versions.
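Pinning to a tagged release can also be done straight from git — a sketch; the repository URL and tag are placeholders:

```hcl
# Production pins to a tagged module release (URL and tag are illustrative)
module "database" {
  source = "git::https://github.com/example/terraform-modules.git//database?ref=v1.4.0"

  environment = "production"
}
```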
State Management
Remote state is non-negotiable. Local state files are a ticking time bomb.
```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket         = "kicked-terraform-state"
    key            = "production/networking/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```
Rules we enforce:
- State locking — DynamoDB (AWS) or equivalent. Two engineers running `apply` simultaneously will corrupt your state.
- Encryption at rest — State files contain secrets (database passwords, API keys). Always encrypt.
- No manual state surgery — `terraform state mv` and `terraform import` only through CI, with an audit trail.
- State file per resource group — Not per environment. A single state file for all of production is a recipe for 20-minute plan times.
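The locking half of this setup is a single DynamoDB table keyed on `LockID`, which is the hash key Terraform's S3 backend expects — a minimal sketch:

```hcl
# Lock table for the S3 backend; LockID is the hash key Terraform requires
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST" # no capacity planning needed for a lock table

  hash_key = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```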
Terragrunt: The DRY Layer
Raw Terraform gets repetitive fast. Every environment needs the same backend config, the same provider setup, the same variable patterns. Terragrunt eliminates the duplication:
```hcl
# environments/production/networking/terragrunt.hcl
terraform {
  source = "../../../modules/vpc"
}

include "root" {
  path = find_in_parent_folders()
}

inputs = {
  environment = "production"
  cidr_block  = "10.0.0.0/16"
  az_count    = 3
}
```
```hcl
# root terragrunt.hcl
remote_state {
  backend = "s3"

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "kicked-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```
The `path_relative_to_include()` function automatically generates unique state keys based on the directory path. No more copy-pasting backend configs.
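Assuming the root `terragrunt.hcl` sits at the repository root, the file Terragrunt generates for `environments/production/networking` would look roughly like:

```hcl
# Generated backend.tf (sketch)
terraform {
  backend "s3" {
    bucket         = "kicked-terraform-state"
    key            = "environments/production/networking/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```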
Module Design Principles
Bad modules are worse than no modules. Here's what makes a good one:
1. Opinionated Defaults, Full Overrides
```hcl
variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium" # Sensible default
}

variable "monitoring_enabled" {
  description = "Enable detailed monitoring"
  type        = bool
  default     = true # Safe default
}
```
A module should work with zero optional variables. But every default should be overridable.
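In practice that means a staging call site can be nearly empty while production overrides selectively — a sketch (the module path and values are illustrative, and the two calls live in separate root modules):

```hcl
# Staging: defaults are good enough
module "app_server" {
  source = "../../modules/server"
}

# Production (a separate root module): override only what differs
module "app_server" {
  source        = "../../modules/server"
  instance_type = "m5.xlarge"
}
```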
2. Outputs Are Your API
```hcl
# Good: outputs everything downstream might need
output "vpc_id"             { value = aws_vpc.main.id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }
output "public_subnet_ids"  { value = aws_subnet.public[*].id }
output "nat_gateway_ips"    { value = aws_eip.nat[*].public_ip }

# Bad: "just look at the state file"
```
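A downstream stack then consumes those outputs through `terraform_remote_state` instead of poking at the state file directly — a sketch; the resource arguments are illustrative:

```hcl
# Downstream: read the networking stack's outputs via remote state
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "kicked-terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "eu-central-1"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id # illustrative
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}
```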
3. Validate Early
```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["production", "staging", "development"], var.environment)
    error_message = "Environment must be production, staging, or development."
  }
}
```
Catch mistakes at plan time, not apply time.
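The same pattern works for anything with a checkable shape — for example, validating the `cidr_block` input with `can()`:

```hcl
variable "cidr_block" {
  type = string

  validation {
    # can() returns false instead of raising if cidrhost() rejects the value
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "cidr_block must be a valid CIDR, e.g. 10.0.0.0/16."
  }
}
```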
CI/CD Pipeline
Nobody runs `terraform apply` locally. Our pipeline:
```yaml
# Simplified CI flow
on:
  pull_request:
    paths: ["environments/**"]

jobs:
  plan:
    steps:
      - name: Detect changed environments
        # Only plan what changed
      - name: terraform init
      - name: terraform validate
      - name: terraform plan
        # Output saved as artifact
      - name: Post plan as PR comment
        # Reviewers see exact changes

  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - name: terraform apply
        # Uses the saved plan — no drift between plan and apply
```
Critical rules:
- Plan on PR — Every reviewer sees the exact changes before merge
- Apply on merge — Merging to main triggers the apply. No manual steps.
- Saved plans — The plan artifact from PR is what gets applied. No surprises.
- Changed-only — Don't plan all 50 environments when one file changed
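The "saved plans" rule boils down to two commands — a sketch; the artifact name is a placeholder, and moving the file between CI jobs is elided:

```shell
# On the PR: write the plan to a file and render it for the PR comment
terraform plan -out=tfplan.binary
terraform show -no-color tfplan.binary

# On merge: apply exactly that plan file (no re-plan, no drift)
terraform apply tfplan.binary
```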
Drift Detection
Infrastructure drifts. Someone clicks in a console, a security team adds a rule manually, an auto-scaler changes instance counts. You need to detect it:
```shell
# Run nightly via cron
terragrunt run-all plan -detailed-exitcode

# Exit code 0 = no changes (in sync)
# Exit code 1 = error
# Exit code 2 = changes detected (DRIFT!)
```
We run drift detection nightly. Any drift triggers a Slack alert and a ticket. The resolution is always the same: either update the Terraform to match reality, or re-apply to match Terraform.
Secrets Handling
Never put secrets in `.tfvars` files. Options:

- Environment variables — `TF_VAR_db_password` set in CI
- Vault provider — Read secrets from HashiCorp Vault at plan time
- SOPS — Encrypt secret `.tfvars` files with age/KMS keys
- 1Password/AWS SSM — Reference secrets by path, resolve at apply time
We prefer the Vault provider for production and SOPS for smaller setups.
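For the Vault route, the read happens through the Vault provider's KV data source — a sketch; the address, mount, and secret path are assumptions:

```hcl
provider "vault" {
  address = "https://vault.example.com" # illustrative
}

data "vault_kv_secret_v2" "db" {
  mount = "secret"              # assumed KV v2 mount
  name  = "production/database" # assumed secret path
}

resource "aws_db_instance" "main" {
  # ...
  password = data.vault_kv_secret_v2.db.data["password"]
}
```

Note that the value still lands in the state file, which is one more reason the backend must be encrypted.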
Common Mistakes We Fix
| Mistake | Fix |
|---|---|
| One giant state file | Split by resource group |
| Copy-pasting between environments | Use Terragrunt or workspaces |
| Running apply locally | CI/CD only, saved plans |
| No state locking | DynamoDB/equivalent, always |
| Hardcoded values in modules | Variables with validation |
| No drift detection | Nightly plan cron |
| Secrets in state, unencrypted | Encrypted backend + sensitive outputs |
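The last row pairs an encrypted backend with sensitive outputs; marking an output `sensitive` keeps the value out of plan output and PR comments (it still lands in state, hence the encryption):

```hcl
output "db_connection_string" {
  value     = "postgres://${aws_db_instance.main.address}:5432/app" # illustrative
  sensitive = true
}
```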
Getting Started
If your Terraform is already a mess, don't try to fix everything at once:
- Week 1 — Move state to a remote encrypted backend with locking
- Week 2 — Split your monolithic state into resource groups
- Week 3 — Extract your first shared module
- Week 4 — Set up CI/CD with plan-on-PR
Or let us handle it. We've migrated teams from "one big `main.tf`" to scalable multi-environment setups in under a month.