Terraform is powerful, but it’s easy to create a mess. We’ve seen 10,000-line main.tf files, state files checked into git, and teams afraid to run terraform apply.
It doesn’t have to be this way.
Structure: Start Right
Module Organization
Break your infrastructure into logical modules:
```
terraform/
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── database/
│   ├── compute/
│   └── monitoring/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
└── global/
    └── iam/
```
Each module should have a single responsibility. Networking configures VPCs and subnets. Compute manages ECS or EC2. Don’t mix them.
State Management
Never commit state files. Use remote backends:
```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```
The DynamoDB table enables state locking, preventing two people from running applies simultaneously and corrupting state.
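The state bucket and lock table have to exist before any configuration using this backend can initialize, so they're usually bootstrapped once in a small, separate root module with local state. A minimal sketch using the names from the backend block above (versioning is optional but makes old state recoverable):

```hcl
# Bootstrap module for remote state. Bucket and table names match the
# backend example above -- substitute your own.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled" # keeps prior state versions for recovery
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the attribute name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```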
Variables and Locals
Use Variables for Inputs
Anything that changes between environments should be a variable:
```hcl
variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "instance_count" {
  description = "Number of EC2 instances"
  type        = number
  default     = 2

  validation {
    condition     = var.instance_count >= 1 && var.instance_count <= 10
    error_message = "Instance count must be between 1 and 10."
  }
}
```
Notice the validation blocks: they catch bad values at plan time, before anything is applied.
Use Locals for Computed Values
Locals are for DRY (Don’t Repeat Yourself) values:
```hcl
locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Project     = var.project_name
  }

  name_prefix = "${var.project_name}-${var.environment}"

  # Conditional logic
  db_instance_class = var.environment == "prod" ? "db.r5.xlarge" : "db.t3.medium"
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = var.instance_type

  tags = merge(
    local.common_tags,
    {
      Name = "${local.name_prefix}-app-server"
    }
  )
}
```
Data Sources: Use Them Wisely
Data sources query existing infrastructure. Use them to reference resources you don’t manage with Terraform:
```hcl
# Look up existing VPC
data "aws_vpc" "main" {
  tags = {
    Name = "main-vpc"
  }
}

# Look up that VPC's subnets (the aws_vpc data source does not
# export subnet IDs itself)
data "aws_subnets" "main" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }
}

# Find the latest AMI
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux_2.id
  subnet_id     = data.aws_subnets.main.ids[0]
  instance_type = "t3.medium"
}
```
Warning: Data sources run on every plan/apply. If the data source queries something that changes frequently, your plans become unpredictable.
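The `most_recent = true` AMI lookup above is a classic example: every new Amazon Linux release shows up as a diff that wants to replace your instances. One way to tame it is to ignore AMI changes after creation, so new AMIs are only picked up when you deliberately replace the instance. A sketch:

```hcl
resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"

  lifecycle {
    # New AMI releases no longer appear as a diff on every plan;
    # the fresh AMI is used only when this instance is recreated.
    ignore_changes = [ami]
  }
}
```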
Modules: Think Like a Library
Good modules are reusable, well-documented, and have clear interfaces.
Example: A VPC Module
```hcl
# modules/networking/main.tf

variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
}

variable "availability_zones" {
  description = "List of AZs to use"
  type        = list(string)
}

variable "environment" {
  description = "Environment name"
  type        = string
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
  }
}

resource "aws_subnet" "private" {
  count = length(var.availability_zones)

  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, count.index)
  availability_zone = var.availability_zones[count.index]

  tags = {
    Name = "${var.environment}-private-${var.availability_zones[count.index]}"
    Type = "private"
  }
}

output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of private subnets"
  value       = aws_subnet.private[*].id
}
```
Using the Module
```hcl
# environments/prod/main.tf

module "networking" {
  source = "../../modules/networking"

  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  environment        = "prod"
}

module "database" {
  source = "../../modules/database"

  vpc_id     = module.networking.vpc_id
  subnet_ids = module.networking.private_subnet_ids
}
```
Secrets: Never in Code
Bad:
```hcl
resource "aws_db_instance" "main" {
  username = "admin"
  password = "super-secret-password" # ❌ NO!
}
```
Good:
```hcl
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/master-password"
}

resource "aws_db_instance" "main" {
  username = "admin"
  password = jsondecode(data.aws_secretsmanager_secret_version.db_password.secret_string)["password"]
}
```
Or use variables and pass secrets via environment variables:
```shell
export TF_VAR_db_password=$(aws secretsmanager get-secret-value \
  --secret-id prod/db/password --query SecretString --output text)

terraform apply
```
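For `TF_VAR_db_password` to land anywhere, a matching variable has to be declared; marking it `sensitive` keeps the value out of plan and apply output (though it is still stored in state, which is one more reason to encrypt the backend):

```hcl
variable "db_password" {
  description = "Master password for the database (supplied via TF_VAR_db_password)"
  type        = string
  sensitive   = true # redacted in plan/apply output
}

resource "aws_db_instance" "main" {
  username = "admin"
  password = var.db_password
}
```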
Count vs For_Each
Use count for simple resource duplication:
```hcl
resource "aws_instance" "web" {
  count = var.instance_count

  ami           = var.ami_id
  instance_type = "t3.medium"

  tags = {
    Name = "web-${count.index + 1}"
  }
}
```
Use for_each when resources have distinct identities:
```hcl
variable "users" {
  type = map(object({
    role = string
  }))

  default = {
    "alice" = { role = "admin" }
    "bob"   = { role = "developer" }
  }
}

resource "aws_iam_user" "users" {
  for_each = var.users

  name = each.key

  tags = {
    Role = each.value.role
  }
}
```
The key difference: for_each uses stable identifiers (keys), so removing “alice” doesn’t affect “bob”. With count, removing index 0 shifts every later instance down by one, which Terraform sees as a destroy-and-recreate of resources you never meant to touch.
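If a resource like `aws_iam_user.users` was originally written with count and you're converting it to for_each, `moved` blocks (Terraform 1.1+) record the address renames so nothing is destroyed. A sketch, assuming two existing instances:

```hcl
# Map old count indices to the new for_each keys.
moved {
  from = aws_iam_user.users[0]
  to   = aws_iam_user.users["alice"]
}

moved {
  from = aws_iam_user.users[1]
  to   = aws_iam_user.users["bob"]
}
```

The next plan shows the resources as moved, not replaced; once applied, the `moved` blocks can be deleted.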
Drift Detection
Infrastructure changes outside Terraform happen. Catch it:
```shell
# Regular drift detection in CI
terraform plan -detailed-exitcode

# Exit code 0 = no changes
# Exit code 1 = error
# Exit code 2 = successful plan with changes
```
Set up a daily job that runs terraform plan and alerts if drift is detected.
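The job itself just branches on that exit code. A minimal bash sketch; the alerting step is left as a comment, so wire in whatever your team uses:

```shell
#!/usr/bin/env bash
# Drift check sketch: relies on terraform's -detailed-exitcode contract.
check_drift() {
  terraform plan -detailed-exitcode -input=false -no-color > plan.txt 2>&1
  local code=$?
  case "$code" in
    0) echo "no drift" ;;
    2) echo "DRIFT DETECTED" # alert here: post plan.txt to Slack/PagerDuty
       return 1 ;;
    *) echo "plan failed (exit $code)" >&2
       return "$code" ;;
  esac
}
```

Run it from cron or a scheduled CI workflow; a nonzero return fails the job and pages whoever is on call.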
Testing
Use terraform validate and terraform fmt in CI:
```yaml
# .github/workflows/terraform.yml
name: Terraform

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
      - name: Terraform Init
        run: terraform init -backend=false
      - name: Terraform Validate
        run: terraform validate
```
For integration testing, consider Terratest.
Common Anti-Patterns
❌ Massive monolithic main.tf files
✅ Break into modules
❌ Copy-paste between environments
✅ Use shared modules with environment-specific variables
❌ Hardcoded values everywhere
✅ Use variables and data sources
❌ No state locking
✅ Use DynamoDB for state locks
❌ Manual terraform apply in production
✅ Use CI/CD with approvals
The Payoff
Teams following these practices typically see:
- Dramatically fewer “state conflict” incidents
- Faster onboarding for new team members
- Environment parity (dev actually matches prod)
- Confident, routine infrastructure changes
Terraform done right is a force multiplier. Done wrong, it’s a liability.
Start with good structure, and the rest follows.