
Most slow or failing Terraform runs trace back to a short list of causes: oversized dependency graphs, provider processes eating memory, caching that was configured in the wrong place, and credentials that fail in non-obvious ways. This guide covers those causes and their fixes — error by error, with numbers from incidents we've actually debugged in Scalr's support queue.
Before diving into advanced debugging, start with these quick checks that resolve many common errors in seconds.
Read the Error Message: Start here. Terraform's error messages are descriptive, pointing to exact line numbers and the nature of the problem—whether it's a syntax error, missing argument, or provider issue.
Run terraform validate: Check your configuration files for syntax and internal consistency without accessing remote state or services. This catches typos and structural mistakes immediately.
Run terraform plan: This "dry run" shows exactly what Terraform intends to create, modify, or destroy. It catches logical errors like misconfigured dependencies or unexpected changes to existing resources.
Check Provider Credentials and Permissions: A common source of issues is insufficient permissions. Ensure your credentials (access keys, IAM roles) have necessary permissions to create and manage your resources.
When initial troubleshooting doesn't provide enough information, enable Terraform's detailed logging with the TF_LOG environment variable:
export TF_LOG=DEBUG
terraform plan
# Save logs to file for easier analysis
export TF_LOG=TRACE
export TF_LOG_PATH="terraform-debug.log"
terraform apply
The state file is a JSON document mapping resources in your configuration to real-world infrastructure. Inconsistencies here cause errors.
Use these commands to inspect state:
# List all resources Terraform is managing
terraform state list
# Inspect attributes of a specific resource
terraform state show <resource_address>
# Update state to reflect current infrastructure
# (terraform refresh is deprecated in favor of the -refresh-only mode)
terraform apply -refresh-only
The terraform console is an interactive environment for evaluating expressions and testing logic:
terraform console
# Check variable values
> var.instance_type
"t2.micro"
# Test complex expressions
> flatten([["a", "b"], ["c"]])
[ "a", "b", "c" ]
The InvalidClientTokenId error (HTTP 403) is one of the most frustrating AWS authentication issues. It appears in different forms but always halts deployments.
Based on analysis of thousands of error reports:
Step 1: Clear conflicting environment variables
unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN AWS_SECURITY_TOKEN
Step 2: Regenerate credentials without special characters. If your AWS secret key contains +, /, or other special characters, regenerate new credentials immediately.
Step 3: Verify credential status
aws configure list
aws sts get-caller-identity
Step 4: Add missing session token for temporary credentials
provider "aws" {
  access_key = "ASIA..." # Temporary keys start with ASIA
  secret_key = "..."
  token      = "..." # Often missing for temporary credentials
  region     = "us-east-1"
}
AWS secret access keys are 40 characters long and can contain letters, digits, forward slash (/), and plus (+). These special characters cause failures through three mechanisms:
URL encoding failures: AWS signature calculation requires proper URL encoding. Forward slashes must be %2F, plus signs must be %2B.
Shell interpretation issues: Different shells process special characters differently, corrupting the credentials during transmission.
Base64 encoding corruption: Credentials often undergo base64 encoding where special characters can be corrupted during encode/decode cycles.
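A quick way to see whether a given secret key is affected is to test it for the two characters named above. This is a minimal shell sketch, not part of any AWS tooling; the function name has_special_chars is hypothetical.

```shell
# has_special_chars SECRET
# Returns 0 (true) if SECRET contains "+" or "/", the two non-alphanumeric
# characters that can appear in an AWS secret access key; returns 1 otherwise.
# Hypothetical helper for illustration only.
has_special_chars() {
  case "$1" in
    *[+/]*) return 0 ;;
    *)      return 1 ;;
  esac
}
```

For example, has_special_chars "$AWS_SECRET_ACCESS_KEY" && echo "key contains + or /" flags a key without printing it.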
Most problematic characters ranked by frequency:
Option 1 (Recommended): Keep regenerating AWS access keys until you get ones without special characters.
Option 2: Use credential files instead of environment variables:
# ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = 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
Credential files handle special characters more reliably than environment variables.
Option 3: Proper shell escaping (less reliable)
export AWS_SECRET_ACCESS_KEY='kWcrlUX5JEDGM/LtmEENI+aVmYvHNif5zB+d9+ct'
Terraform follows this strict credential hierarchy:
Environment variables override credential files, causing authentication failures when old variables exist.
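Because exported variables win silently, it can help to list which credential variables are present before a run. The sketch below assumes the four variable names from Step 1 are the ones that matter in your environment; list_aws_env_overrides is a hypothetical helper name, and it prints variable names only, never their values.

```shell
# list_aws_env_overrides
# Prints the name of every AWS credential variable that is exported and
# non-empty in the current environment, one per line.
# Hypothetical helper for illustration only.
list_aws_env_overrides() {
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN AWS_SECURITY_TOKEN; do
    # printenv only sees exported variables, which are the ones a child
    # process such as terraform inherits.
    if [ -n "$(printenv "$name")" ]; then
      echo "$name"
    fi
  done
}
```

If the function prints anything while you expect credentials to come from ~/.aws/credentials, those variables are the first thing to unset.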
GitHub Actions with OIDC (Recommended):
name: Terraform
on: [push]
permissions:
  id-token: write
  contents: read
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions
          aws-region: us-west-2
      - name: Terraform Init
        run: terraform init
      - name: Terraform Plan
        run: terraform plan
Docker volume mount (production recommended):
docker run -v "$HOME/.aws:/root/.aws:ro" \
  -v "$(pwd):/workspace" \
  -w /workspace \
  hashicorp/terraform:latest plan
State locks prevent concurrent modifications that lead to corruption. Common scenarios and solutions are covered in detail in our emergency solutions and prevention guide for state lock errors.
When a lock persists after an operation completes, use:
terraform force-unlock LOCK_ID
Always enable state locking with remote state backends (S3 with DynamoDB, Google Cloud Storage, Scalr, Terraform Cloud) to prevent concurrent writes. If state corruption does occur, see empty Terraform state file recovery for restoration steps.
Memory problems are common when scaling infrastructure. Terraform and OpenTofu are written in Go and must load massive amounts of data into RAM for safety. The AWS provider is one of the worst offenders — see our AWS Provider memory explosion survival guide for v4.67.0+ for AWS-specific mitigations.
Provider Schema Loading (Primary Cause): When you run plan, Terraform loads the entire schema for every provider. Large providers like AWS (400-800MB+ RAM), Azure (500MB-1GB+), and Google (300-600MB) have enormous schemas. Multiple provider versions or aliases spawn separate processes, each with its own memory overhead.
The multiplication is what catches people. One customer hit The run has reached the memory limit (3072m) after changing a single variable on a workspace that had been stable for months. The debug output showed azurerm instantiated 11 times — a module plus a remote-state read each spun up their own provider copy — and each instantiation is a separate process consuming several hundred megabytes. The workspace had been sitting at 99% of its 3 GB ceiling the whole time (3.047 GB at the kill); a >= 4.57.0 version constraint had let the provider auto-float to a heavier release, and that pushed it over. The fixes were unglamorous: consolidate provider blocks, pin provider versions, and eventually split the workspace.
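To count how often a provider was started during a run, you can search a TF_LOG=DEBUG or TRACE log for plugin launch lines. The sketch below assumes each launch is logged as a line containing "starting plugin" followed by the provider binary path (terraform-provider-NAME...); check that wording against your own log before trusting the count. count_provider_launches is a hypothetical helper name.

```shell
# count_provider_launches LOGFILE PROVIDER
# Prints how many lines in LOGFILE look like a launch of the given provider,
# for example "azurerm". The "starting plugin" wording is an assumption
# about the log format; adjust the pattern if your log differs.
count_provider_launches() {
  # grep -c prints the number of matching lines; "|| true" keeps a zero
  # count (grep exit status 1) from failing a caller that uses set -e.
  grep -c "starting plugin.*terraform-provider-$2" "$1" || true
}
```

With logging written to terraform-debug.log, count_provider_launches terraform-debug.log azurerm gives a number to compare against the count of provider blocks you expected.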
Dependency Graph Complexity: Before creating any resource, the engine builds a directed acyclic graph (DAG). A configuration with 1,000 resources consumes hundreds of megabytes tracking every dependency and variable.
Graph cost is driven by dependency edges, not resource count. One customer blew through a 4096m memory limit during plan on a workspace with only seven resources, right after upgrading the AWS provider to 6.39.0. The memory was going to graph construction — specifically the AttachDependenciesTransformer step — because a broad module-level depends_on combined with four to five levels of module nesting (module.a.module.b.module.c.module.d) produced dependency lists spanning nearly the entire configuration. The graph build alone ran about two minutes before the kill. Scoping depends_on to specific outputs and flattening the nesting resolved it.
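Since edges rather than resources drive the cost, one way to spot an over-connected node is to count outgoing edges in the output of terraform graph, which is in DOT format. The sketch below assumes each dependency appears on its own line in the form "source" -> "target"; edge_counts is a hypothetical helper name.

```shell
# edge_counts
# Reads DOT text on stdin and prints "<count> <node>" for every node that
# has outgoing edges, highest count first.
# Hypothetical helper for illustration only.
edge_counts() {
  awk -F ' -> ' '
    NF == 2 {
      src = $1
      gsub(/^[ \t]+|[ \t]+$/, "", src)   # trim indentation around the node name
      count[src]++
    }
    END {
      for (s in count) printf "%d %s\n", count[s], s
    }
  ' | sort -k1,1nr -k2
}
```

Running terraform graph | edge_counts | head shows the nodes with the most dependencies; a node whose count approaches the size of the whole configuration is a likely sign of a broad depends_on.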
State File Bloat: Large monolithic state files must be fully parsed and held in memory during the "Refreshing State" phase.
1. Enable Provider Schema Caching (Terraform 1.6+, OpenTofu 1.7+):
Create or edit ~/.terraformrc:
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"
Or set as environment variable:
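export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
Terraform links provider binaries from the cache rather than copying them into each working directory, which reduces I/O and the memory spikes that come with it. The cache directory must already exist; Terraform does not create it.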
One caveat for remote execution: a team running agents on ECS set TF_PLUGIN_CACHE_DIR and saw no effect at all. In remote runs the runner manages init, so an environment variable set in the workspace never reaches it — caching has to be configured at the agent layer instead. On Scalr agents that means SCALR_AGENT_CACHE_DIR plus SCALR_AGENT_PROVIDER_CACHE_ENABLED (module caching is available the same way), with the cache directory on persistent storage such as EFS so it survives container replacement. Once moved to the agent layer, caching worked as expected.
2. Use AWS Enhanced Region Support (AWS Provider 6.0+):
Before: 20 aliases = 20 processes (8GB+ RAM)
After: 1 provider = 1 process (800MB RAM)
Old approach with aliases:
provider "aws" {
  region = "us-east-1"
}
provider "aws" {
  alias  = "west"
  region = "us-west-2"
}
resource "aws_vpc" "west" {
  provider   = aws.west
  cidr_block = "10.1.0.0/16"
}
New approach with resource-level region:
provider "aws" {
  region = "us-east-1"
}
resource "aws_vpc" "west" {
  region     = "us-west-2" # No alias needed
  cidr_block = "10.1.0.0/16"
}
3. Reduce Parallelism:
terraform plan -parallelism=3
Trade speed for stability by reducing concurrent operations from the default 10.
4. Split Monolithic State Files:
If state files exceed 50MB, refactor into logical stacks (network, data-layer, app-layer). Smaller dependency graphs mean lower memory ceilings.
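To check a workspace against the 50MB guideline, you can compare the size of a pulled state snapshot with a threshold. This is a minimal sketch; state_exceeds_mb is a hypothetical helper name.

```shell
# state_exceeds_mb FILE LIMIT_MB
# Returns 0 (true) if FILE is larger than LIMIT_MB mebibytes, 1 otherwise.
# Hypothetical helper for illustration only.
state_exceeds_mb() {
  # The arithmetic expansion strips the padding some wc implementations add.
  size_bytes=$(( $(wc -c < "$1") ))
  [ "$size_bytes" -gt $(( $2 * 1024 * 1024 )) ]
}
```

One way to use it is terraform state pull > state.json followed by state_exceeds_mb state.json 50 && echo "consider splitting this state".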
If you're upgrading the AWS provider, also review what's breaking in AWS Provider v6.0 before bumping versions.
Skip expensive validation calls and optimize retries:
provider "aws" {
  region = "us-east-1"
  # Skip expensive validation calls
  skip_credentials_validation = true
  skip_metadata_api_check     = true
  skip_region_validation      = true
  # Retry behavior for throttled or transient API errors
  retry_mode  = "standard"
  max_retries = 25
}
Use resource-specific timeout configuration:
resource "aws_db_instance" "main" {
  identifier = "primary-database"
  engine     = "postgres"
  timeouts {
    create = "40m"
    update = "80m"
    delete = "40m"
  }
}
When resources fail to apply due to dependency problems, verify with:
# terraform graph emits DOT; each line containing -> is one dependency edge
terraform graph | grep -- '->'
depends_on deserves more suspicion than it usually gets, and module-level depends_on most of all. A team running self-hosted agents watched plan times degrade from 8 minutes to 40 while their resource count barely moved (1,150 to 1,190). Hardware was ruled out first: the agents ran on r6a.2xlarge EKS nodes with 2-CPU/19Gi pods, nowhere near saturated. The debug logs pointed elsewhere. A single module-level depends_on referencing entire modules gave every resource in the configuration 1,347 dependencies during the plan walk, and zero during the validate walk, which is why terraform validate stayed fast and made the problem look like infrastructure rather than configuration.
The cost compounds because Terraform walks the graph multiple times per run: once for validate, once for plan, and once preparing the apply. The logs showed three nearly identical silent gaps — 624, 617, and 615 seconds — accounting for roughly 30 of the 38 minutes. Scoping the depends_on to the specific output it actually needed brought plans back to normal. The general rule: a module-level depends_on makes every resource in the dependent module wait on every resource in the referenced module, and you pay that bill on every graph walk.
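Silent gaps like these can be located mechanically in a TF_LOG file. The sketch below assumes every log line of interest starts with an ISO-8601 timestamp such as 2024-01-15T10:23:45 and that the run does not cross midnight; find_log_gaps is a hypothetical helper name.

```shell
# find_log_gaps LOGFILE [THRESHOLD_SECONDS]
# Prints one line for every pause of at least THRESHOLD_SECONDS (default 60)
# between consecutive timestamped log lines.
# Assumes timestamps at the start of the line and a run within one day.
find_log_gaps() {
  awk -v threshold="${2:-60}" '
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
      # Characters 12-19 of the line hold HH:MM:SS; convert to seconds of day.
      secs = substr($0, 12, 2) * 3600 + substr($0, 15, 2) * 60 + substr($0, 18, 2)
      if (seen && secs - prev >= threshold) {
        printf "%d seconds of silence before line %d\n", secs - prev, NR
      }
      prev = secs
      seen = 1
    }
  ' "$1"
}
```

Calling find_log_gaps terraform-debug.log 300 lists every pause of five minutes or more together with the line number where logging resumed, which is where to start reading.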
Apply specific resources during troubleshooting:
terraform plan -target=aws_instance.web
terraform apply -target=aws_instance.web
Understand the difference:
terraform refresh: Updates the state file without planning changes (deprecated in favor of terraform apply -refresh-only)
terraform plan: Plans changes based on current state
Use -refresh=false on known-stable infrastructure for 20-40% faster plans
Terraform doesn't maintain a native "undo" history. A rollback is a "roll forward" operation using a previous configuration.
The most reliable rollback approach keeps Git as your source of truth:
Run git revert <commit_id> to create a new commit inverting the breaking changes
Run terraform plan to verify the delta, then terraform apply
This keeps Git history synchronized with actual infrastructure.
If deployment fails severely and the state file is corrupted:
Run terraform plan to verify Terraform recognizes the restored state
Warning: Manual state manipulation is risky. Use terraform state push only as an advanced user.
Minimize rollback risk by running two identical environments:
Build Green from scratch, verify it fully, then switch traffic via DNS or load balancer rules. Rollback is instantaneous by redirecting traffic back to Blue. Once Green is stable, decommission Blue with terraform destroy.
Limit Blast Radius: Avoid monolithic state files. Decouple infrastructure into smaller modules and workspaces so failures don't affect unrelated components.
Protect Persistent Data: Use prevent_destroy = true on critical resources:
resource "aws_db_instance" "production" {
  lifecycle {
    prevent_destroy = true
  }
}
Enforce State Locking: Use remote backends that support state locking (S3 with DynamoDB, Google Cloud Storage, Scalr, Terraform Cloud) and enable versioning as a safety net.
Validate with Testing: Integrate terraform test into CI/CD pipelines to verify expected resource attributes before touching real infrastructure. Complement with static analysis tools like Checkov or TFSec.
For an in-depth look at running Terraform at scale across many teams and environments, see Terraform operations at scale. Concurrency tuning is a key lever — read why Terraform concurrency matters for the trade-offs.
Terraform performance decreases based on resource count and state complexity:
Memory consumption scales at ~512MB per 1,000 resources, while plan time increases exponentially beyond 2,000 resources due to dependency graph complexity.
Before optimizing anything, measure where the time actually goes. In Scalr's support queue we regularly see runs reported as "taking 15 minutes" where 5-10 minutes is queue latency — the run waiting for a free runner slot — rather than execution. Teams hitting 10- or 60-concurrent-run caps end up queuing each other's work, and no amount of HCL tuning fixes a concurrency ceiling. In every one of those cases the fix was a support request, not an optimization: Scalr raises the concurrency limit on request at no extra cost, because the cap is a fraud-and-abuse control rather than a billing lever. That is also why concurrency-based pricing is a poor fit for IaC platforms — when parallel slots are the pricing metric, every queue-latency incident turns into a procurement exercise.
Slow execution also has causes outside the graph. A team migrating off Terraform Cloud saw a GCP workspace of about 290 resources go from 4-minute to 30-minute runs after the move: every run succeeded, just about 7.5x slower. The silent ~28 minutes turned out to be the google-beta provider, pulled in by a project-factory module, repeatedly retrying authentication with backoff instead of failing fast against credentials that weren't wired up the same way in the new environment. Exporting the provider configuration as shell variables cut runs to under two minutes. When a run is slow but its plan output is unremarkable, check what the providers are doing during the quiet stretches before restructuring any code.

The single most impactful optimization is strategic state file splitting. Organizations report 70-90% reduction in operation times by dividing monolithic state into components:
Before: Monolithic state with 2,900 resources
terraform/
├── main.tf (all resources)
└── terraform.tfstate (300MB+)
After: Component-based splitting
terraform/
├── networking/
│   ├── main.tf (VPCs, subnets, security groups)
│   └── terraform.tfstate (15MB, 200 resources)
├── compute/
│   ├── main.tf (EC2 instances, ASGs, ELBs)
│   └── terraform.tfstate (25MB, 400 resources)
└── data/
    ├── main.tf (RDS, ElastiCache, S3)
    └── terraform.tfstate (20MB, 300 resources)
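Use Terraform 1.1+ moved blocks to change resource addresses without destroying and recreating resources. A moved block only refactors addresses within a single state; relocating a resource into a different state file still means removing it from the old state and importing it into the new one:
moved {
  from = module.monolith.aws_vpc.main
  to   = aws_vpc.main
}
Calculate optimal parallelism with: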
AVAILABLE_MEMORY_GB=16
CPU_CORES=8
PROVIDER_RATE_LIMIT=100
# Memory constraint
MAX_MEMORY_PARALLELISM=$((AVAILABLE_MEMORY_GB * 1024 / 512))
# CPU constraint (10 ops per remaining core, reserve 2 cores)
MAX_CPU_PARALLELISM=$(((CPU_CORES - 2) * 10))
# Provider constraint
MAX_PROVIDER_PARALLELISM=$((PROVIDER_RATE_LIMIT / 2))
# Use minimum of all constraints
OPTIMAL_PARALLELISM=$(echo "$MAX_MEMORY_PARALLELISM $MAX_CPU_PARALLELISM $MAX_PROVIDER_PARALLELISM" | tr ' ' '\n' | sort -n | head -1)
terraform plan -parallelism="$OPTIMAL_PARALLELISM"
Well-designed modules following single responsibility principles improve performance:
# Good: Focused module with clear boundaries
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"
  name    = "production-vpc"
  cidr    = "172.16.0.0/16"
  azs     = data.aws_availability_zones.available.names
}
# Avoid: Overly complex module with 50+ variables managing everything
module "everything" {
  source = "./modules/kitchen-sink"
  # Results in 1000+ resources in single module
}
Composition patterns outperform inheritance:
# Enables parallel execution
module "base_network" {
  source = "./modules/network"
}
module "application_layer" {
  source = "./modules/application"
  vpc_id = module.base_network.vpc_id
}
module "data_layer" {
  source = "./modules/database"
  vpc_id = module.base_network.vpc_id
}
| Technique | Impact | Complexity | When to Apply |
|---|---|---|---|
| State Splitting | 70-90% reduction | Medium | > 500 resources or > 50MB state |
| Parallelism Tuning | 30-50% improvement | Low | > 100 resources |
| Provider Optimization | 40-60% API call reduction | Low | All deployments |
| Module Architecture | 40-60% faster init | High | New projects or major refactors |
| Disable Refresh | 20-40% faster plans | Low | Known-stable infrastructure |
| Provider Caching | 90% faster initialization | Medium | All CI/CD pipelines |
| Resource Targeting | 85-95% scope reduction | Low | Emergency fixes only |
| Backend Optimization | 10-30% I/O improvement | Medium | Large state files |
Initialization deserves its own line item if you run Terragrunt. One Terragrunt shop found that terragrunt run --all spent 3.5 minutes of a 5-minute operation window on init alone, before a single resource was planned — on top of Aurora and OpenSearch applies that need 30+ minutes by themselves. Init cost scales with unit count, which is exactly the cost that provider and module caching attacks.
The Problem with count: When you remove an item from the middle of a list, all subsequent resources shift indices, causing unnecessary destruction and recreation.
Example with count:
variable "user_names_count" {
  type    = list(string)
  default = ["alice", "bob", "charlie"]
}
resource "aws_iam_user" "user_count" {
  count = length(var.user_names_count)
  name  = var.user_names_count[count.index]
}
If you remove "bob", aws_iam_user.user_count[1] (was "bob") now maps to "charlie", causing Terraform to recreate it.
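The index shift can be reproduced outside Terraform. The shell sketch below only illustrates positional addressing and says nothing about Terraform internals: it compares two space-separated lists position by position, the way count.index addresses list elements. index_shift_report is a hypothetical name.

```shell
# index_shift_report "OLD LIST" "NEW LIST"
# Prints every position whose value differs between the two space-separated
# lists, which is the set of indices a count-based resource would touch.
# Hypothetical helper for illustration only.
index_shift_report() {
  old_list=$1
  new_list=$2
  i=0
  for old in $old_list; do
    # Pick the word at the same position in the new list (empty if absent).
    new=$(printf '%s\n' $new_list | sed -n "$((i + 1))p")
    if [ "$old" != "$new" ]; then
      echo "[$i] $old -> ${new:-(removed)}"
    fi
    i=$((i + 1))
  done
}
```

Calling index_shift_report "alice bob charlie" "alice charlie" prints "[1] bob -> charlie" and "[2] charlie -> (removed)": removing one name from the middle touches two positions even though only one user was meant to go.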
The for_each Solution: Use stable key-based iteration:
variable "user_names_for_each" {
  type    = set(string)
  default = ["alice", "bob", "charlie"]
}
resource "aws_iam_user" "user_for_each" {
  for_each = var.user_names_for_each
  name     = each.key
}
Removing "bob" only targets that user for destruction. "alice" and "charlie" remain untouched.
Key Distinction:
Best Practice: Prefer for_each for non-trivial cases.
Dynamic blocks generate nested blocks by iterating over collections, eliminating verbose HCL:
variable "ingress_rules" {
  type = list(object({
    port        = number
    cidr_blocks = list(string)
    protocol    = string
  }))
  default = [
    { port = 80, cidr_blocks = ["0.0.0.0/0"], protocol = "tcp" },
    { port = 443, cidr_blocks = ["0.0.0.0/0"], protocol = "tcp" },
    { port = 22, cidr_blocks = ["10.0.0.0/16"], protocol = "tcp" },
  ]
}
resource "aws_security_group" "example" {
  name = "example-sg"
  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = ingress.value.protocol
      cidr_blocks = ingress.value.cidr_blocks
    }
  }
}
This is much cleaner than repeating ingress blocks manually.
Break down complex expressions using locals to improve readability:
# Hard to read
tags = {
  Name = "app-${var.environment}-${var.app_name}-${var.is_primary_region ? "primary" : "secondary"}-${random_id.server.hex}"
}
# Clear with locals
locals {
  region_type   = var.is_primary_region ? "primary" : "secondary"
  base_name     = "app-${var.environment}-${var.app_name}"
  instance_name = "${local.base_name}-${local.region_type}-${random_id.server.hex}"
}
tags = {
  Name = local.instance_name
}
Each part of the logic is now clearly named and easier to understand.
Monolithic modules try to manage too many related pieces (VPCs, subnets, security groups, load balancers, databases all together). This leads to inflexibility, high complexity, wide blast radius, and poor testability.
Composable modules focus on single responsibility. Combine in root configuration:
module "vpc" {
  source = "./modules/vpc"
}
module "app_sg" {
  source = "./modules/security_group"
  vpc_id = module.vpc.vpc_id
}
module "database" {
  source = "./modules/rds"
  vpc_id = module.vpc.vpc_id
}
Benefits: Flexibility, simplicity, clear boundaries, reduced blast radius.
Input Variables (variable blocks):
variable "instance_type" {
  description = "The EC2 instance type"
  type        = string
  default     = "t3.micro"
}
Local Values (locals blocks):
locals {
  common_tags = {
    Owner   = "DevTeam"
    Project = "WebApp"
  }
  instance_name = "app-server-${var.environment}"
}
Key Distinction: Variables are for inputs, locals are for internal calculations and DRY principles.
Common Misconception: Using workspaces to manage dev/staging/prod from one codebase by varying inputs based on terraform.workspace. This becomes unmanageable with conditional logic littering your configuration.
Appropriate Use: Workspaces manage multiple instances of identical infrastructure differing only by input variables:
Better Approach for Environments: Use separate configuration directories or a platform hierarchy that maps environments to isolated scopes:

environments/
├── dev/
│   ├── main.tf
│   └── terraform.tfvars
├── staging/
│   ├── main.tf
│   └── terraform.tfvars
└── prod/
    ├── main.tf
    └── terraform.tfvars
modules/
└── my_app/
Each environment directory instantiates common modules with environment-specific variables.
For the condensed version, see our top 5 best practices for Terraform. The list below expands on those plus more advanced practices:
Integrate terraform test and tftest into CI/CD to verify that configurations produce the expected resource attributes before applying.
