CI/CD Pipelines for Terragrunt: GitLab CI

TL;DR

GitLab CI authenticates to AWS without static keys using id_tokens — a short-lived JWT that AWS trusts via the GitLab OIDC provider. A YAML anchor before_script writes the token to disk and configures an AWS profile in four lines. Job templates (.init_template, .plan_template, .apply_template) define the Terragrunt commands once. Concrete jobs extend the templates with one variable each: TG_ENV_DIR. Apply is always when: manual. Destroy doesn’t exist in the default pipeline — ever.

The accidental destroy that wasn’t

During a GitLab CI migration, a junior engineer set up a new pipeline and copied an apply job template. The copy included a rules block that ran on main branch push without the when: manual flag. Nobody caught it in review because the job name was apply:staging and it looked legitimate.

The next push to main after a routine env.hcl update auto-applied the staging environment. It worked correctly — no outage. But it was a coin flip. If the change had been something that required state migration or had a dependency ordering issue, it would have partially applied and left staging in an inconsistent state with no human at the wheel.

We added a pipeline lint step that day: CI fails if any apply or destroy job is missing when: manual. Enforce it as policy, not convention.
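A minimal sketch of what such a lint check could look like, assuming the pipeline has already been parsed into a dictionary (for example with a YAML loader such as PyYAML’s yaml.safe_load). The function names, the apply/destroy name-prefix convention, and the EXAMPLE data are illustrative assumptions, not the check we actually run. It resolves extends with a shallow merge, which is enough for rules and when because GitLab replaces those keys wholesale rather than merging them.

```python
from typing import Any

# Assumption: guarded jobs are recognisable by name, e.g. "apply:staging".
GUARDED_PREFIXES = ("apply", "destroy")


def resolve_job(name: str, pipeline: dict[str, Any],
                seen: frozenset = frozenset()) -> dict[str, Any]:
    """Merge a job with the templates it extends (child keys win)."""
    if name in seen:
        raise ValueError(f"circular extends at {name!r}")
    job = pipeline[name]
    parents = job.get("extends", [])
    if isinstance(parents, str):
        parents = [parents]
    merged: dict[str, Any] = {}
    for parent in parents:
        merged.update(resolve_job(parent, pipeline, seen | {name}))
    merged.update(job)
    return merged


def is_manual_only(job: dict[str, Any]) -> bool:
    """True if the job can only ever start manually (or never starts)."""
    default_when = job.get("when", "on_success")
    rules = job.get("rules")
    if rules is None:
        return default_when == "manual"
    # A rule without its own `when` falls back to the job-level value.
    return all(rule.get("when", default_when) in ("manual", "never")
               for rule in rules)


def unguarded_jobs(pipeline: dict[str, Any]) -> list[str]:
    """Names of apply/destroy jobs that could run without a click."""
    bad = []
    for name, body in pipeline.items():
        if name.startswith(".") or not isinstance(body, dict):
            continue  # hidden templates and non-job keys such as `stages`
        if name.startswith(GUARDED_PREFIXES) and not is_manual_only(
                resolve_job(name, pipeline)):
            bad.append(name)
    return sorted(bad)


# Hypothetical pipeline: apply:staging overrides rules and loses the gate.
EXAMPLE = {
    ".apply_template": {
        "rules": [
            {"if": '$CI_COMMIT_BRANCH == "main"', "when": "manual"},
            {"when": "never"},
        ],
    },
    "apply:global": {"extends": ".apply_template"},
    "apply:staging": {
        "extends": ".apply_template",
        "rules": [{"if": '$CI_COMMIT_BRANCH == "main"'}],
    },
    "plan:global": {"script": ["terragrunt run --all plan"]},
}

print(unguarded_jobs(EXAMPLE))  # ['apply:staging']
```

In a real lint job the script would exit non-zero whenever the returned list is non-empty, which is what turns the convention into policy.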

OIDC authentication: four lines that replace static keys

GitLab CI’s id_tokens block mints a JWT when the job starts. The audience claim (aud) must match what your AWS IAM OIDC provider expects.

# The OIDC setup — shared via YAML anchor
.aws_profile_setup:
  before_script: &aws_setup
    - mkdir -p ~/.aws
    - echo "${GITLAB_OIDC_TOKEN}" > /tmp/web_identity_token
    - |
      cat <<EOF > ~/.aws/config
      [profile ${AWS_PROFILE}]
      role_arn = ${ROLE_ARN}
      web_identity_token_file = /tmp/web_identity_token
      EOF

Four lines. Every job that needs AWS declares the id_tokens block (so that GITLAB_OIDC_TOKEN exists) and references before_script: *aws_setup. The JWT is written to /tmp/web_identity_token, the AWS CLI and SDKs read it via web_identity_token_file and exchange it for short-lived credentials, and your Terragrunt commands run with the resulting session.

Verify it’s working:

- aws sts get-caller-identity --profile ${AWS_PROFILE}

If this returns the expected role ARN, the credential chain works. If it fails with InvalidIdentityToken, AWS rejected the token itself: either no IAM OIDC provider exists for the GitLab issuer URL, or the token’s aud claim is not in the provider’s audience list. If it fails with AccessDenied, the role ARN is wrong or the trust policy’s sub condition doesn’t match the GitLab project path.
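The sub condition lives in the IAM role’s trust policy. As a rough sketch, the policy for a GitLab.com project could be generated like this. The helper name and the project path example-group/tf-live are hypothetical, and the account ID is the placeholder used elsewhere in this post. The sub format shown (project_path:…:ref_type:branch:ref:…) is the one GitLab documents for branch pipelines.

```python
import json


def gitlab_trust_policy(account_id: str, project_path: str, branch: str,
                        gitlab_host: str = "gitlab.com") -> dict:
    """Build a trust policy that a single project and branch can assume."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                # The IAM OIDC provider registered for the GitLab instance
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{gitlab_host}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Must equal the `aud` value set under id_tokens
                    f"{gitlab_host}:aud": f"https://{gitlab_host}",
                    # Must equal the token's `sub` claim exactly
                    f"{gitlab_host}:sub": (
                        f"project_path:{project_path}"
                        f":ref_type:branch:ref:{branch}"
                    ),
                }
            },
        }],
    }


# Hypothetical project path; substitute your own group/project.
policy = gitlab_trust_policy("111111111111", "example-group/tf-live", "main")
print(json.dumps(policy, indent=2))
```

Pinning sub to one project and branch with StringEquals is the strict option. Teams that need several branches or projects usually switch that condition to StringLike with a wildcard, at the cost of a wider trust boundary.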

ℹ️ OIDC audience for self-hosted GitLab

For GitLab.com, use aud: https://gitlab.com. For a self-hosted instance, use your GitLab instance URL (e.g., aud: https://gitlab.example.com). On the AWS side, the IAM OIDC provider’s URL must be the GitLab instance URL (the token’s iss claim), and its audience list must contain the exact aud value. Mismatches between the aud claim and the OIDC provider configuration are the most common auth failure.
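When the aud or sub values are in doubt, decoding the token’s claims inside a job shows exactly what AWS will receive. The sketch below is a debugging aid only: it reads the payload without verifying the signature, and the function name is my own. It assumes the token is exposed as GITLAB_OIDC_TOKEN, as in the templates in this post.

```python
import base64
import json
import os


def decode_jwt_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    try:
        _header, payload, _signature = token.split(".")
    except ValueError:
        raise ValueError("expected three dot-separated JWT segments") from None
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Inside a CI job: print only the claims AWS matches on, never the raw token.
token = os.environ.get("GITLAB_OIDC_TOKEN")
if token:
    claims = decode_jwt_claims(token)
    for key in ("iss", "aud", "sub"):
        print(f"{key}: {claims.get(key)}")
```

Compare iss with the IAM OIDC provider URL, aud with the provider’s audience list, and sub with the trust policy condition. Whichever one differs is the mismatch.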

The job template pattern

Three abstract templates, each extending the OIDC setup:

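# Custom stage names must be declared, or GitLab rejects jobs that use them
stages:
  - init
  - plan
  - apply

variables: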
  AWS_REGION: "eu-west-1"
  AWS_PROFILE: "oidc"
  ROLE_ARN: "arn:aws:iam::111111111111:role/gitlab-ci-oidc-role"
  LOG_LEVEL: "info"
  TG_EXCLUDE_ARGS: ""

.init_template:
  stage: init
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: https://gitlab.com
  before_script: *aws_setup
  script:
    - aws sts get-caller-identity --profile ${AWS_PROFILE}
    - |
      git config --global url."https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/".insteadOf "https://gitlab.com/"
      terragrunt run --all init \
        --non-interactive \
        --log-level ${LOG_LEVEL} \
        --working-dir ${TG_ENV_DIR} \
        ${TG_EXCLUDE_ARGS}

.plan_template:
  stage: plan
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: https://gitlab.com
  before_script: *aws_setup
  script:
    - |
      git config --global url."https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/".insteadOf "https://gitlab.com/"
      terragrunt run --all plan \
        --non-interactive \
        --log-level ${LOG_LEVEL} \
        --working-dir ${TG_ENV_DIR} \
        ${TG_EXCLUDE_ARGS}

.apply_template:
  stage: apply
  rules:
    - if: '$CI_COMMIT_BRANCH == "main" && $CI_PIPELINE_SOURCE == "push"'
      when: manual      # ← non-negotiable
    - when: never
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: https://gitlab.com
  before_script: *aws_setup
  script:
    - |
      terragrunt run --all apply \
        --non-interactive \
        --log-level ${LOG_LEVEL} \
        --working-dir ${TG_ENV_DIR} \
        ${TG_EXCLUDE_ARGS}

The git config rewrite in init and plan is necessary for private GitLab module repos: it rewrites https://gitlab.com/ URLs to embed CI_JOB_TOKEN for authentication, so terragrunt init can clone module sources from private repos. If your apply jobs start from a clean workspace with no cached module downloads, they need the same rewrite.

ℹ️ run --all in GitLab CI

Same as the GitHub Actions pipeline — terragrunt run --all is the current syntax (v0.54+). Older pipelines used terragrunt run-all. If your CI image pins an older Terragrunt version, use --experiment cli-redesign as a bridge. Update the image to >= 0.54 and drop the flag.

Concrete jobs: static per environment

Each environment gets three concrete jobs:

# global environment
"init:global":
  extends: .init_template
  variables:
    TG_ENV_DIR: "${CI_PROJECT_DIR}/mgmt/eu-west-1/global"

"plan:global":
  extends: .plan_template
  needs: ["init:global"]
  variables:
    TG_ENV_DIR: "${CI_PROJECT_DIR}/mgmt/eu-west-1/global"

"apply:global":
  extends: .apply_template
  needs: ["plan:global"]
  variables:
    TG_ENV_DIR: "${CI_PROJECT_DIR}/mgmt/eu-west-1/global"

# production — with module exclusions for anything not CI-ready
"init:production":
  extends: .init_template
  variables:
    TG_ENV_DIR: "${CI_PROJECT_DIR}/prod/us-east-1/prod"
    TG_EXCLUDE_ARGS: "--queue-exclude-dir ${CI_PROJECT_DIR}/prod/us-east-1/prod/legacy-module"

The TG_EXCLUDE_ARGS variable uses --queue-exclude-dir to skip modules during run --all. This is the escape hatch for modules being migrated, requiring manual input, or managed by a different team — they stay in the directory tree but are skipped by CI.

This is a static pipeline — unlike GitHub Actions’ dynamic matrix, each environment is an explicit job definition. Adding an environment means editing .gitlab-ci.yml. The tradeoff: the pipeline is completely visible in the UI without matrix expansion, and GitLab’s DAG view gives you the dependency graph for free.

Why apply is always `when: manual` — and why the rule is absolute

The story at the top of this post is why. But there’s a systemic reason too.

GitLab CI pipelines run on every push that matches the branch rules. If apply is automatic, then every typo fix, every README update, and every dependency bump that touches a .hcl file triggers an apply across every environment defined in the pipeline. Most of the time, nothing bad happens. The one time something does, you don’t have a human in the loop to catch it.

when: manual costs one click per environment per deployment. In exchange, you get:

A human confirms they intend the apply before it runs
The apply doesn’t race with someone else’s manual local run
You have a clear audit trail in the GitLab pipeline UI: who clicked, when

This is different from the GitHub Actions setup in Part 4, where apply is automatic on merge (with optional environment protection rules). Both are defensible. The manual gate is the right default for teams that are still building trust in the automation.

Shared CI templates for module repos

The shared-ci/tf-versioning-semantic-release.gitlab-ci.yml template (used in Part 2 for module repos) is included via GitLab’s include: directive:

include:
  - project: 'example-group/shared-ci'
    ref: main
    file: 'tf-versioning-semantic-release.gitlab-ci.yml'

Module repos get validation and semantic-release CI with zero per-repo configuration — they just need conventional commit messages and the include. The template writes a default .releaserc.yml if none exists.

What comes next in this series

This is the last post in the Terraform + Terragrunt track. But the series isn’t done.

Declarative IaC is a broad tent. Terraform HCL is one dialect. Over the coming posts I’ll cover:

CDK for Terraform (CDKTF) — writing your infrastructure in TypeScript or Python, compiled to Terraform JSON. Same providers, same state, different mental model. Where it wins over HCL, where it makes things harder.
Pulumi — infrastructure as actual code, with real type systems, real loops, and real functions. A fundamentally different approach to the “declare what you want” problem.
OpenTofu — the open-source Terraform fork and where it stands today relative to HashiCorp’s BSL licensing change.

The 2025 series was Terragrunt and GitHub/GitLab CI because that’s what was in production at the time. The 2026 posts will add the alternatives I’ve been running in parallel. Same problem space, different tools.

Reference: tf-live demo repo · hagzag/tf-modules
