TL;DR

The pipeline inspects which .hcl files changed and maps them to the exact scope in the directory hierarchy — change root.hcl and all environments plan; change one module’s terragrunt.hcl and only that module plans. The output is a JSON matrix that GitHub Actions fans out into parallel jobs. PR runs plan-only; merge to main applies. OIDC handles credentials per environment. The whole thing is ~300 lines of YAML and shell — no external orchestration layer.

The matrix that blew up production (almost)

We had just added a fourth AWS account to the live repo. The pipeline was generating the change-detection matrix, but the script had a regex that matched account directories by looking for account.hcl. The new account directory didn’t have one yet — we’d added the region.hcl and env.hcl but hadn’t created the account file.

The matrix came back empty. The pipeline “succeeded” with no jobs. Nobody noticed for two days because there were no PR failures. When someone finally ran plan manually and saw drift, we traced it back to the gap.

The fix was simple: add a validation step that asserts the matrix is non-empty when relevant .hcl files have changed. But the lesson was sharper: an empty matrix is a silent failure. Build the assertion in from day one.
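A minimal sketch of that assertion, assuming the detection script already holds the matrix JSON and the list of changed files. The function name assert_matrix_nonempty and its two arguments are hypothetical, not taken from the real script:

```shell
# Hypothetical guard (name and arguments are assumptions, not the real script).
# $1 = matrix JSON, $2 = newline-separated list of changed files.
assert_matrix_nonempty() {
  # Strip whitespace so "[ ]" and "[]" compare equal
  matrix=$(printf '%s' "$1" | tr -d '[:space:]')
  # Nothing to assert when no .hcl file changed
  if ! printf '%s\n' "$2" | grep -qE '\.hcl$'; then
    return 0
  fi
  if [ -z "$matrix" ] || [ "$matrix" = "[]" ]; then
    # ::error:: is the GitHub Actions annotation syntax
    echo "::error::.hcl files changed but the matrix is empty" >&2
    return 1
  fi
}
```

Calling it right before the matrix is published turns the silent empty run into a red job, which is the whole point.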

The change detection hierarchy

The detection logic has five levels, evaluated in priority order:

| Changed file | Scope | Matrix result |
| --- | --- | --- |
| root.hcl, proj.hcl | All environments, all accounts | Full matrix |
| {account}/account.hcl | All envs in that account | Account-scoped matrix |
| {account}/{region}/region.hcl | All envs in that account/region | Region-scoped matrix |
| {account}/{region}/{env}/env.hcl | That environment only | Single-env matrix |
| {account}/{region}/{env}/{module}/** | That module only (if terragrunt.hcl present) | Single-module matrix |
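The hierarchy above boils down to "directory depth plus file name". Here is a rough sketch of how a single changed path could be classified. The function name classify_change is hypothetical, and only the module_level label appears in this article; the other labels are assumed for illustration. The check for a terragrunt.hcl in the module directory is left out to keep the sketch filesystem-free:

```shell
# Hypothetical helper: map one changed path to a scope label.
# Only "module_level" comes from the article; the other labels are assumed.
classify_change() {
  path="$1"
  file="${path##*/}"   # basename of the changed file
  # Directory depth = number of "/" separators in the path
  depth=$(( $(printf '%s' "$path" | tr -cd '/' | wc -c) ))
  case "$depth:$file" in
    0:root.hcl|0:proj.hcl) echo "root_level" ;;
    1:account.hcl)         echo "account_level" ;;
    2:region.hcl)          echo "region_level" ;;
    3:env.hcl)             echo "env_level" ;;
    *)
      # Anything at module depth or deeper is a module change
      if [ "$depth" -ge 4 ]; then echo "module_level"; else echo "unmatched"; fi
      ;;
  esac
}
```

When several files change in one PR, the widest scope would win: one root_level hit makes the narrower matches irrelevant.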

The detection script outputs a JSON array that maps directly to the filesystem:

[
  {
    "env": "dev/eu-west-1/dev",
    "env_dir": "dev/eu-west-1/dev",
    "account": "dev",
    "region": "eu-west-1",
    "environment": "dev",
    "change_type": "module_level",
    "module": "eks"
  }
]

When change_type is module_level, the plan job runs terragrunt plan on just that module directory. Any other change_type runs terragrunt run --all plan on the full environment. This is the precision that keeps the pipeline fast on targeted changes — a terragrunt.hcl edit in dev/eu-west-1/dev/eks/ doesn’t trigger terraform init for the VPC, RDS, and secrets modules.
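Because the matrix entry mirrors the directory layout, building one is mostly string splitting. A small sketch, assuming the path has the shape account/region/environment/module; the function name matrix_entry is hypothetical:

```shell
# Hypothetical helper: turn a module directory into one matrix entry.
# $1 = "account/region/environment/module"
matrix_entry() {
  # Split the path on "/" into its four components
  IFS='/' read -r account region environment module <<EOF
$1
EOF
  env_dir="$account/$region/$environment"
  printf '{"env":"%s","env_dir":"%s","account":"%s","region":"%s","environment":"%s","change_type":"module_level","module":"%s"}\n' \
    "$env_dir" "$env_dir" "$account" "$region" "$environment" "$module"
}
```

Running matrix_entry dev/eu-west-1/dev/eks prints a single-line version of the entry shown in the JSON example.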

The PR workflow

# .github/workflows/pr.yaml
on:
  pull_request:
    branches: [main]
    paths: ['**/*.hcl', '**/*.tf', '**/*.tfvars']

permissions:
  id-token: write    # OIDC
  contents: read
  pull-requests: write  # post plan comments

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.detect.outputs.matrix }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history for accurate diff
      - id: detect
        run: .github/scripts/detect-changes.sh

  terragrunt-plan:
    runs-on: ubuntu-latest
    needs: detect-changes
    if: needs.detect-changes.outputs.matrix != '[]'
    strategy:
      matrix:
        include: ${{ fromJson(needs.detect-changes.outputs.matrix) }}
      fail-fast: false  # one env failing doesn't cancel others
    environment: ${{ matrix.account }}-${{ matrix.environment }}
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ matrix.region }}

      - uses: autero1/action-terragrunt@v3
        with:
          terragrunt-version: "0.77.20"

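      - name: Init
        run: |
          if [ -n "${{ matrix.module }}" ]; then
            # Module-level change: init only the changed module
            terragrunt init \
              --non-interactive \
              --working-dir ${{ matrix.env_dir }}/${{ matrix.module }}
          else
            # Environment-level change: init every module in the environment
            terragrunt run --all init \
              --non-interactive \
              --working-dir ${{ matrix.env_dir }}
          fi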

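      - name: Plan
        id: plan
        run: |
          # Keep terragrunt's exit code when piping through tee
          set -o pipefail
          if [ -n "${{ matrix.module }}" ]; then
            # Module-level change: plan just that module
            terragrunt plan \
              --non-interactive \
              --working-dir ${{ matrix.env_dir }}/${{ matrix.module }} 2>&1 | tee plan.txt
          else
            # Environment-level change: fan out across all modules
            terragrunt run --all plan \
              --non-interactive \
              --working-dir ${{ matrix.env_dir }} 2>&1 | tee plan.txt
          fi

      - name: Comment plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            // A plain run step has no stdout output, so read the plan from disk
            const fs = require('fs');
            // Keep the tail: PR comment bodies are capped at 65,536 characters
            const plan = fs.readFileSync('plan.txt', 'utf8').slice(-60000);
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `### Plan: \`${{ matrix.env }}\`\n\`\`\`\n${plan}\n\`\`\``
            });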

fail-fast: false is non-negotiable. You want to see the plan for every affected environment, even if one fails. Stopping the matrix at the first failure hides information at exactly the moment you need it most.
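The detect job in the workflow above only works if the script publishes a step output named matrix. A minimal sketch of how the script could do that; the function names changed_files and publish_matrix are hypothetical, while GITHUB_BASE_REF and GITHUB_OUTPUT are variables the runner provides:

```shell
# Hypothetical pieces of detect-changes.sh (function names are assumptions).

changed_files() {
  # Three-dot diff: files changed on the PR branch since it forked from the base.
  # GITHUB_BASE_REF is set by the runner on pull_request events.
  git diff --name-only "origin/${GITHUB_BASE_REF:-main}...HEAD"
}

publish_matrix() {
  # $1 = matrix as single-line JSON; the key=value output format is line-based.
  # GITHUB_OUTPUT is provided by the runner; print to stdout when run locally.
  printf 'matrix=%s\n' "$1" >> "${GITHUB_OUTPUT:-/dev/stdout}"
}
```

The three-dot diff is why the checkout uses fetch-depth: 0. With a shallow clone the merge base may not exist locally and the diff fails.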

The apply workflow

The apply workflow is structurally identical to PR — same detection logic, same matrix shape. The only differences are the trigger (push to main) and the Terragrunt command (apply instead of plan):

# .github/workflows/ci-cd.yaml (delta from PR workflow)
on:
  push:
    branches: [main]
    paths: ['**/*.hcl', '**/*.tf', '**/*.tfvars']  # same filter as the PR workflow, so every planned change is also applied

# ...same jobs, different final step:
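      - name: Apply
        run: |
          if [ -n "${{ matrix.module }}" ]; then
            # Module-level change: apply only the module that was planned
            terragrunt apply \
              --non-interactive \
              --working-dir ${{ matrix.env_dir }}/${{ matrix.module }} \
              -auto-approve
          else
            # Environment-level change: apply across all modules
            terragrunt run --all apply \
              --non-interactive \
              --working-dir ${{ matrix.env_dir }}
          fi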

Apply runs automatically on merge. The PR plan is the approval gate — reviewers see exactly what will happen, the merge is the consent. If your org requires an explicit human click between plan and apply, GitHub environment protection rules support required_reviewers — set it on the prod-prod environment, leave dev-dev automatic.

The role ARN problem: three solutions

Every matrix entry needs to assume the right IAM role for its account. Three approaches, in order of preference:

Option 1 — GitHub environment secrets (recommended)

Name environments {account}-{environment} (e.g., dev-dev, prod-prod). Each environment has its own AWS_ROLE_ARN secret. The workflow sets environment: ${{ matrix.account }}-${{ matrix.environment }}, and GitHub resolves the right secret automatically.

This is the cleanest: per-environment secrets, per-environment protection rules, zero logic in the workflow.

Option 2 — Naming convention with account ID secrets

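role-to-assume: arn:aws:iam::${{ secrets[format('AWS_ACCOUNT_ID_{0}', matrix.account)] }}:role/GitHubActionsRole

Requires AWS_ACCOUNT_ID_DEV, AWS_ACCOUNT_ID_PROD, etc. as repo secrets. GitHub Actions expressions have no upper() function, so the lookup relies on secret names not being case-sensitive: for account dev, AWS_ACCOUNT_ID_dev resolves to AWS_ACCOUNT_ID_DEV. Works well for 2–3 accounts.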

Option 3 — Hardcoded map

role-to-assume: |
  ${{ matrix.account == 'dev'  && 'arn:aws:iam::222222222222:role/GitHubActionsRole' ||
      matrix.account == 'prod' && 'arn:aws:iam::333333333333:role/GitHubActionsRole' ||
      secrets.AWS_ROLE_ARN }}

Avoid for more than two accounts. Account IDs in pipeline YAML are a maintenance hazard.

Manual override inputs

For hotfixes and post-incident re-applies:

on:
  workflow_dispatch:
    inputs:
      target_account:
        description: 'Scope to a single account (dev / prod / mgmt / all)'
        default: 'all'
      force_run:
        description: 'Ignore change detection — run all environments'
        type: boolean
        default: false

force_run: true bypasses the diff entirely. target_account filters the matrix to a single account’s environments. These two inputs cover 95% of the “I need to manually trigger a specific environment” scenarios.
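One way the target_account filter could work, sketched under the assumption that the script holds a list of environment directories (account/region/environment, one per line) before it builds the JSON. The function name filter_by_account is hypothetical:

```shell
# Hypothetical filter: keep only the env dirs that belong to one account.
# $1 = target account, or "all"; env dirs arrive on stdin, one per line.
filter_by_account() {
  target="$1"
  while IFS= read -r env_dir; do
    # The account is the first path component of the env dir
    if [ "$target" = "all" ] || [ "${env_dir%%/*}" = "$target" ]; then
      printf '%s\n' "$env_dir"
    fi
  done
}
```

With force_run: true, the input list would be every directory containing an env.hcl rather than the diff result, and the same filter applies on top.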

Do you need Atlantis or Terraform Cloud?

Honest answer: maybe. Here’s the heuristic I use:

If your team already lives in GitHub Actions, wants the orchestration logic version-controlled alongside the infrastructure, or operates in a restricted or air-gapped environment where SaaS isn’t an option, this pipeline is the right fit. ~300 lines of YAML and shell, full visibility, no external state.

If you want UI-driven approval workflows, cost estimation, or drift detection out of the box — Terraform Cloud, env0, or Spacelift are worth evaluating. If you want GitOps-native PR automation with minimal config — Atlantis or Terrateam get you there faster than building this pipeline from scratch.

The pipeline in this series isn’t for everyone. It’s for teams who want to understand and own what’s running.

Coming up next

Part 5 covers the GitLab CI equivalent: the web_identity_token OIDC pattern for keyless AWS auth, shared CI templates for DRY pipeline definitions across multiple repos, and the hard rule that destroy is always when: manual.

Reference: tf-live demo repo · hagzag/tf-modules

