Terraform: Infrastructure as Code, the DevOps Way

Terraform workflow: Write .tf, terraform plan, terraform apply, remote state backend

Click-ops is fast for one server. It is slow for ten and unmaintainable past thirty. Terraform replaces the cloud console with declarative configuration files under version control. You write what infrastructure should exist, and Terraform figures out the API calls needed to make reality match.

The payoff is enormous: every environment is rebuildable from git, every change is reviewable, and on-call has a much shorter list of mysteries to investigate at 3am.

The three commands you live in

terraform init — download providers and configure the backend.
terraform plan — preview what will change.
terraform apply — make it so.

Two more commands matter:

terraform fmt — autoformat .tf files; run it in pre-commit and CI.
terraform destroy — tear down everything in the state file. Powerful in dev, devastating in prod.

A working AWS example

terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket         = "my-tfstate"
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-locks"
    encrypt        = true
  }
}

provider "aws" { region = "us-east-1" }

data "aws_availability_zones" "available" { state = "available" }

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = { Name = "prod-vpc", Environment = "prod" }
}

resource "aws_subnet" "public" {
  for_each                = toset(["a", "b", "c"])
  vpc_id                  = aws_vpc.main.id
  availability_zone       = "us-east-1${each.key}"
  cidr_block              = "10.0.${index(["a","b","c"], each.key)}.0/24"
  map_public_ip_on_launch = true
  tags = { Name = "public-${each.key}" }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  for_each       = aws_subnet.public
  subnet_id      = each.value.id
  route_table_id = aws_route_table.public.id
}

Run terraform init && terraform apply and you have a VPC with three public subnets, an internet gateway, and the route table that lets traffic flow. Run apply again and Terraform reports "No changes" because the live infrastructure already matches the configuration.

State — the file you must protect

Terraform tracks what it created in a state file. The state file maps the resources in your .tf files to the actual IDs returned by the cloud provider — without it, Terraform has no idea that the VPC vpc-0123abcd corresponds to your aws_vpc.main resource.

Treat state like a database:

Store it remotely (S3 + DynamoDB lock, GCS, Azure Blob, Terraform Cloud). Never check it into git — it can contain secrets.
Lock it during operations. DynamoDB locking prevents two engineers from running apply against the same state simultaneously.
Encrypt it at rest. Both the bucket and the DynamoDB table.
Restrict access. Only the CI role and a small ops group should be able to read it.
Never edit it by hand. terraform import and terraform state mv are the supported tools for adjusting state.
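
The storage side of that checklist can itself be written in Terraform. The following is a minimal sketch, assuming the bucket and table names from the backend block above (my-tfstate, tf-locks) and assuming it lives in a small bootstrap configuration of its own, since a backend cannot store state in a bucket that does not exist yet. The resource labels (tfstate, tf_locks) and the choice of KMS encryption are illustrative, not prescribed by the S3 backend.

```hcl
# Bootstrap configuration for the remote state backend (a sketch, applied once
# with local state before any other configuration points at the bucket).

resource "aws_s3_bucket" "tfstate" {
  bucket = "my-tfstate" # assumed: matches the backend "s3" block
}

# Keep every previous version of the state file so it can be rolled back.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest.
resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State can contain secrets: block every form of public access.
resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table. The S3 backend expects a string hash key named "LockID".
resource "aws_dynamodb_table" "tf_locks" {
  name         = "tf-locks" # assumed: matches dynamodb_table in the backend
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  server_side_encryption {
    enabled = true
  }
}
```

Restricting who can read the bucket is an IAM policy question and depends on your account layout, so it is left out of this sketch.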

Remote state and locking architecture

Terraform CLI talks to S3 for state, DynamoDB for the lock, and the cloud provider APIs for actual resource changes

The diagram shows the typical AWS-flavoured remote state setup, but the same architecture applies to GCS (which locks state natively), Azure Blob Storage with native leasing, or Terraform Cloud / Terraform Enterprise managing both storage and locking for you.

When you run terraform apply, the CLI executes a fixed sequence:

Acquire the lock. Terraform writes an item to the DynamoDB lock table keyed by the state file path. If an item already exists, the apply aborts with "Error acquiring the state lock" and prints the lock holder's metadata (lock ID, who holds it, and when it was created). This is one of the most important guarantees Terraform provides: two engineers cannot mutate the same state simultaneously.
Refresh state. Terraform reads the current state file from S3 and queries the cloud provider to detect drift between the recorded state and reality.
Compute the plan. Terraform diffs desired (your .tf) against current (refreshed state) and produces a list of create/update/delete actions.
Apply changes. Each action becomes a provider API call. As resources are created or updated, their new attributes are written back into the state file.
Persist state. The new state is uploaded to S3 (if bucket versioning is enabled, the previous state stays recoverable). The lock row is deleted.

Two failure modes deserve specific operational responses:

A crashed apply leaves the lock orphaned. terraform force-unlock <LOCK_ID> removes it manually. Always check the actual cloud state before unlocking — if the apply made partial changes, the next run needs to know.
State file corruption (rare, but possible) is recoverable from the bucket's version history. aws s3api list-object-versions and rolling back to the previous version is a documented procedure that should live in your runbook.

For production accounts, three additional controls are worth the small operational cost:

Enable bucket versioning and MFA delete on the state bucket. Accidental s3api delete-object of the state file is otherwise unrecoverable.
Use a separate state file per blast radius. A network-layer state file separate from per-service state files means a refresh on one does not delay the others, and a corrupted file affects only one service.
Run a nightly drift check. A scheduled CI job runs terraform plan against production with the same backend and posts to Slack if the plan is non-empty. Drift is almost always someone making a manual change in the console; catching it within 24 hours keeps state and reality from diverging too far.
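
Splitting state by blast radius raises the question of how a service configuration learns about resources owned by the network configuration. One common approach is the terraform_remote_state data source, sketched below. It assumes the network configuration declares an output named vpc_id (the example earlier in this article does not, so you would need to add one), and the security group is a hypothetical consumer used only for illustration.

```hcl
# environments/prod/service/main.tf (hypothetical path)

# Read-only view of the network layer's state. Only its outputs are exposed.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "my-tfstate"
    key    = "prod/network.tfstate"
    region = "us-east-1"
  }
}

# The service owns this resource in its own state file, but places it in the
# VPC that the network layer created.
resource "aws_security_group" "service" {
  name   = "service-sg" # hypothetical name
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id # assumed output
}
```

The service's plan and apply never lock or rewrite the network state; they only read it, which is what keeps the two blast radii separate.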

Modules — reuse, don't copy-paste

The fastest way to drown in Terraform is to copy the same VPC + subnet + IGW pattern into every environment. Extract it into a module:

# modules/vpc/main.tf
variable "name"       { type = string }
variable "cidr_block" { type = string }

resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
  tags = { Name = var.name }
}
output "vpc_id" { value = aws_vpc.this.id }

# environments/prod/main.tf
module "vpc" {
  source     = "../../modules/vpc"
  name       = "prod-vpc"
  cidr_block = "10.0.0.0/16"
}

Fix a bug once and every environment benefits. Pin module versions when sourcing from a separate repo (source = "git::ssh://git@github.com/org/tf-modules.git//vpc?ref=v1.4.0") so a breaking change cannot slip into production unnoticed.
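
Written out as a full module call, a pinned source might look like the sketch below. It assumes the tf-modules repository hosts the same vpc module shown above under a vpc/ subdirectory and that a v1.4.0 tag exists; both are placeholders carried over from the inline example.

```hcl
# environments/prod/main.tf
module "vpc" {
  # "//vpc" selects the subdirectory, "?ref=" pins the git tag.
  source = "git::ssh://git@github.com/org/tf-modules.git//vpc?ref=v1.4.0"

  name       = "prod-vpc"
  cidr_block = "10.0.0.0/16"
}
```

Upgrading the module then becomes an explicit, reviewable change to the ref value, followed by terraform init -upgrade.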

Workspaces vs directories

Terraform workspaces let you reuse one config with multiple state files. They are tempting for "dev/stage/prod" splits but in practice cause confusion: someone forgets to switch workspaces and applies dev changes to prod.

A safer pattern is separate directories per environment, each with its own backend block pointing at a distinct state file. The duplication is a small price for explicitness.
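
As a rough sketch of the directory-per-environment layout, each environment carries its own backend block that differs only in the state key. The file paths and the dev key are assumptions for illustration; the bucket and lock table reuse the names from the example above.

```hcl
# environments/dev/backend.tf (hypothetical path)
terraform {
  backend "s3" {
    bucket         = "my-tfstate"
    key            = "dev/network.tfstate" # dev state lives here
    region         = "us-east-1"
    dynamodb_table = "tf-locks"
    encrypt        = true
  }
}

# environments/prod/backend.tf (hypothetical path, a separate directory)
terraform {
  backend "s3" {
    bucket         = "my-tfstate"
    key            = "prod/network.tfstate" # prod state lives here
    region         = "us-east-1"
    dynamodb_table = "tf-locks"
    encrypt        = true
  }
}
```

Because the environment is determined by the directory you are standing in, there is no hidden "current workspace" to forget about.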

Variables and secrets

Variables make modules reusable, but never put secrets directly in *.tfvars files committed to git. Use:

Environment variables: TF_VAR_db_password.
A secret manager (AWS Secrets Manager, Vault) read via a data source.
CI-injected variables marked sensitive: variable "db_password" { sensitive = true }.

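Marking a variable sensitive redacts its value from plan and apply output, which is otherwise one of the most common accidental leak vectors. The value is still stored in plain text in the state file, which is one more reason to encrypt state and restrict access to it.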
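
The options above can be sketched together. The database resource, its settings, and the secret name prod/db/password are hypothetical and exist only to show where the value flows; the variable would be supplied through TF_VAR_db_password or a CI-injected value.

```hcl
# Option 1: a sensitive input variable, set via TF_VAR_db_password or CI.
variable "db_password" {
  type      = string
  sensitive = true # redacted in plan/apply output, still present in state
}

# Option 2: read the secret from AWS Secrets Manager at plan time.
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/db/password" # hypothetical secret name
}

# Hypothetical consumer of the secret.
resource "aws_db_instance" "main" {
  identifier          = "prod-db"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "app"
  password            = var.db_password
  skip_final_snapshot = true

  # To use option 2 instead, replace the password line with:
  # password = data.aws_secretsmanager_secret_version.db.secret_string
}
```

Either way, nothing secret is committed to git, and the only place the value rests is the encrypted, access-restricted state file.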

CI/CD integration

A healthy Terraform workflow looks like this:

Engineer opens a PR with .tf changes.
CI runs terraform fmt -check, terraform validate, tflint, and tfsec.
CI runs terraform plan -out=tfplan and posts the human-readable plan as a PR comment.
Reviewers approve based on the plan.
After merge, a separate CI job runs terraform apply tfplan against the merged commit.

A few CI rules that have saved teams from incidents:

Always plan against the merge commit, not the branch tip. Otherwise the plan you reviewed and the plan you applied diverge.
Reject plans that contain destructive actions (destroy or replace) unless explicitly approved.
Run drift detection nightly. A scheduled terraform plan against production should report "no changes." If it does not, someone made a manual change in the console — investigate.

Best practices, distilled

Pin provider versions with ~>. Note that "~> 5.0" allows any 5.x release, while "~> 5.0.0" allows patch updates only. Module versions should be pinned to exact tags.
One module per concept — VPC, RDS, ECS service. Keep modules small enough to understand in one screen.
Use for_each, not count, for collections of similar resources. Removing an item from the middle of a count list shifts the index of every later item, so Terraform plans to change or recreate resources you never meant to touch.
Tag everything. A consistent tagging policy (Environment, Owner, CostCenter) makes cost reports and incident response far easier.
Document modules with examples in an examples/ folder. Examples double as integration tests.
Run terraform plan on every PR even for "trivial" changes. Surprises in the plan are why this workflow exists.
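
The for_each advice is easiest to see side by side. This is an illustrative sketch; the queue names and resource labels are made up, and SQS is used only because a queue's name cannot be changed in place.

```hcl
locals {
  queue_names = ["orders", "billing", "emails"] # hypothetical names
}

# count: instances are addressed by position (by_index[0], [1], [2]).
# Removing "orders" from the list shifts "billing" to [0] and "emails" to [1],
# so Terraform plans to replace queues that should have been left alone.
resource "aws_sqs_queue" "by_index" {
  count = length(local.queue_names)
  name  = "idx-${local.queue_names[count.index]}"
}

# for_each: instances are addressed by key (by_name["orders"], ...).
# Removing "orders" destroys only by_name["orders"]; the others are untouched.
resource "aws_sqs_queue" "by_name" {
  for_each = toset(local.queue_names)
  name     = "key-${each.key}"
}
```

The subnet example earlier in this article uses for_each for the same reason.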

Common pitfalls

Editing resources in the cloud console after they were created by Terraform — the next apply will revert your changes or, worse, fail with confusing errors.
Sharing one state file across unrelated services. State is monolithic; a slow refresh on a giant state file blocks everyone.
Using local-exec provisioners as a substitute for proper provider resources. They are not idempotent and break in CI.
Missing a "destroy and recreate" action on a critical resource (database, EBS volume) buried in a long plan. Always read plans carefully.
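
Careful reading can be backed by a guard rail. One option, not mentioned above but built into Terraform, is the prevent_destroy lifecycle argument, which makes any plan that would destroy the resource fail with an error. The volume below is a hypothetical example with assumed settings.

```hcl
resource "aws_ebs_volume" "data" {
  availability_zone = "us-east-1a"
  size              = 100 # GiB, illustrative
  type              = "gp3"
  encrypted         = true

  lifecycle {
    # Any plan that would destroy (or destroy and recreate) this volume
    # errors out instead of proceeding.
    prevent_destroy = true
  }
}
```

Removing the resource deliberately then takes two reviewed steps: first drop prevent_destroy, then remove the resource.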

Testing your Terraform code

Treating infrastructure as code means you can test it like code. The testing pyramid for Terraform looks roughly like this:

Static checks first. terraform fmt -check, terraform validate, tflint, and tfsec run in seconds and catch most typos, deprecated arguments, missing required attributes, and obvious security misconfigurations.
Plan as a contract. A PR that produces a plan you did not expect is a failed test. CI should post the plan output as a PR comment so reviewers see what apply will actually do.
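Module unit tests with Terraform's native test framework (Terraform 1.6+). Write .tftest.hcl files that run a module against test inputs and assert on resource attributes and outputs. By default each run block applies real infrastructure; set command = plan, or use mock providers (Terraform 1.7+), to keep tests cheap, fast, and free of real cloud calls.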
Integration tests with Terratest. A Go test runs terraform apply against a real (usually ephemeral) account, makes assertions against the live infrastructure, then terraform destroys it. Slow but high-confidence; reserve for shared modules and critical paths.
Policy as code. Open Policy Agent / Conftest or HashiCorp Sentinel rules block plans that violate organizational policy ("no public S3 buckets," "all RDS instances must be encrypted"). They run in CI between plan and apply, before changes hit production.

Each layer catches a different class of bug. A team that runs only fmt and validate in CI ships a meaningful number of preventable production issues; a team that runs the full pyramid catches most of them before merge.
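
As a sketch of the native test layer, a test file for the modules/vpc module shown earlier might look like the following. The file location (tests/vpc.tftest.hcl inside the module directory) and the input values are assumptions. The mock_provider block requires Terraform 1.7 or later; with it, the run needs no AWS credentials and makes no AWS API calls.

```hcl
# modules/vpc/tests/vpc.tftest.hcl (assumed location)

# Replace the real AWS provider with a mock (Terraform 1.7+).
mock_provider "aws" {}

# Inputs passed to the module under test.
variables {
  name       = "test-vpc"
  cidr_block = "10.1.0.0/16"
}

run "passes_inputs_through_to_the_vpc" {
  # Plan only: nothing is created, and values set in configuration are known.
  command = plan

  assert {
    condition     = aws_vpc.this.cidr_block == "10.1.0.0/16"
    error_message = "VPC CIDR block does not match the cidr_block input."
  }

  assert {
    condition     = aws_vpc.this.tags["Name"] == "test-vpc"
    error_message = "Name tag does not match the name input."
  }
}
```

Run it from the module directory with terraform init followed by terraform test. Attributes that only exist after creation, such as the VPC ID, are not known in a plan-only run, so assertions on them belong in apply-mode or integration tests.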

Importing existing infrastructure

Most teams adopting Terraform do not start from a greenfield account. They have years of click-ops to bring under management. terraform import adds an existing resource to state without recreating it:

# Write a placeholder resource block in your .tf
resource "aws_vpc" "legacy" {}

# Import the live VPC into that block
terraform import aws_vpc.legacy vpc-0123abcd

# Now run plan; fill in attributes until plan reports no changes
terraform plan

The workflow is iterative: import, run plan, copy the live attributes into your .tf until plan shows zero diff. It is tedious for hundreds of resources, but tools like terraformer and Terraform 1.5+ import blocks (declarative imports) speed it up considerably. Take it one resource type at a time — VPCs first, then subnets, then security groups, then workloads. Within a few sprints, even a long-neglected account becomes git-managed.
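
The declarative form of the same import, available in Terraform 1.5 and later, might look like this sketch. It reuses the aws_vpc.legacy address and the vpc-0123abcd ID from the example above.

```hcl
# imports.tf (hypothetical file name)

# Tells Terraform to adopt the live VPC into state on the next apply.
import {
  to = aws_vpc.legacy
  id = "vpc-0123abcd"
}

# If no resource "aws_vpc" "legacy" block exists yet, Terraform can draft one:
#   terraform plan -generate-config-out=generated.tf
# Config generation only works for import targets that have no resource block,
# so leave out the empty placeholder block when using it. Review and clean up
# the generated file before committing.
```

Because the import is part of the configuration, it shows up in the plan and goes through the same PR review as any other change. The import block can be deleted once the apply has succeeded.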

Where to go next

Once your infra lives in Terraform, the natural next steps are policy as code (OPA/Conftest, Sentinel, or HashiCorp's checks), cost estimation in CI (Infracost), and a module registry (private if your modules are sensitive).

The mindset shift matters more than any individual feature: infrastructure is software, deserves the same review process as software, and benefits from the same compounding leverage that good software engineering practices provide.

A maturity checkpoint worth aiming for: every change to production infrastructure starts as a PR, runs the same set of CI checks every other PR runs, and gets a recorded plan attached. When that workflow is the only path that exists, infrastructure incidents drop sharply, audit conversations become five-minute exercises, and onboarding a new engineer means pointing them at the same repo every other engineer already lives in.
