TL;DR
DNS is a load balancer. It’s the first load balancer your traffic hits, and it’s the one that decides which region, which datacenter, and which IP your user reaches before any L4 or L7 proxy ever gets involved. This post walks the load balancer taxonomy across AWS, GCP, and Azure, explains why CNAME-vs-alias matters here more than anywhere else, and tells you the truth about TTLs that nobody on a sales call will say out loud.
Introduction
Pager: “the site is down.” LB dashboard: all green. CloudWatch: traffic flat at zero. Synthetic checks from Tokyo: failing. From Frankfurt: passing. From São Paulo: failing.
It’s always DNS, and specifically it’s almost always DNS-based traffic management — health checks, latency routing, geolocation policies, TTLs that lied to you about how fast failover would be.
Part 1 gave us the mechanics. Part 2 gave us the cluster-internal view. This post zooms back out to the public internet, where DNS does the heaviest lifting in your high-availability story.

The load balancer taxonomy in one paragraph
A load balancer is fundamentally a thing that picks a backend. The interesting question is: what information does it have to pick with?
Layer 3/4 load balancers see IPs and ports. They route TCP or UDP flows: AWS Network Load Balancer (NLB), GCP’s passthrough Network Load Balancer, Azure Load Balancer (Standard SKU). They are fast, they are cheap, they preserve client IPs, and they can’t tell a /login request apart from a /static/logo.png request because they don’t read HTTP. They route by 5-tuple: source IP, source port, destination IP, destination port, protocol.
Layer 7 load balancers terminate the connection and read the protocol. AWS Application Load Balancer (ALB), GCP’s external Application Load Balancer (formerly HTTP(S) Load Balancing), Azure Application Gateway. They can route by host header, path, cookies, headers, query strings. They do TLS termination, WAF integration, request rewrites. They cost more and they’re slower per request, but they let you do the routing your application actually needs.
You usually want both: an L4 layer for raw throughput and IP preservation, an L7 layer for HTTP semantics. And you want DNS in front of all of it.
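To make the L7 point concrete, here’s a sketch of a path-based ALB listener rule via the AWS CLI. The ARNs are placeholders, and this is illustrative rather than drop-in:
# Hypothetical: route /login* to a hardened target group; everything else falls through.
$ aws elbv2 create-rule \
    --listener-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-app/PLACEHOLDER \
    --priority 10 \
    --conditions Field=path-pattern,Values='/login*' \
    --actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/login/PLACEHOLDER
An NLB can’t do this; it never sees the path.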
Where DNS sits — and the apex problem returns
When you provision an ALB, AWS gives you back a hostname:
my-app-1234567890.us-east-1.elb.amazonaws.com
You point your application at that hostname. But your customer-facing names are app.example.com and the bare example.com (the apex, no www). Two problems:
- The ALB hostname’s IPs change. AWS reserves the right to swap them whenever it scales the LB. So you must use a hostname, not an IP.
- You can’t CNAME the apex of example.com (we covered why in Part 1: CNAME forbids siblings, and the apex needs SOA and NS records).
Route 53’s alias record solves both. At the protocol level it’s an A or AAAA record. But Route 53 resolves the LB’s hostname server-side at query time and returns the current IPs to the client directly. No client-side CNAME chase, no apex violation. Azure DNS has “alias records” with similar semantics; GCP mostly sidesteps the problem, since its global load balancers hand out a stable anycast IP you can publish in a plain A record.
This is one of the genuinely useful cloud-vendor inventions. If you’re doing real production work, you should know the difference between a CNAME and an alias and pick the right one every time.
# CNAME: a chain of lookups the resolver has to chase
$ dig +noall +answer app.example.com
app.example.com. 60 IN CNAME my-app.us-east-1.elb.amazonaws.com.
my-app.us-east-1.elb.amazonaws.com. 60 IN A 52.x.x.x
# Alias: single answer, no chase
$ dig +noall +answer example.com
example.com. 60 IN A 52.x.x.x
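For completeness, roughly what creating that alias looks like through the Route 53 API. The zone IDs below are placeholders, and note the classic gotcha: the alias’s HostedZoneId is the load balancer’s canonical hosted zone, not your own.
$ aws route53 change-resource-record-sets \
    --hosted-zone-id Z_YOUR_ZONE \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "example.com",
          "Type": "A",
          "AliasTarget": {
            "HostedZoneId": "Z_ALB_CANONICAL_ZONE",
            "DNSName": "my-app-1234567890.us-east-1.elb.amazonaws.com",
            "EvaluateTargetHealth": true
          }
        }
      }]
    }'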

DNS-based traffic management
Once DNS is the entry point, you can make it smart. Route 53, Cloud DNS routing policies, and Azure Traffic Manager (Azure’s DNS-based traffic router) all offer some version of these routing policies:
Weighted routing. Send 90% of traffic to v1 and 10% to v2. The DNS server picks responses according to assigned weights. This is your canary deployment lever at the DNS layer — clumsy but effective for blue-green at scale.
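A sketch of the 90/10 split in Route 53, assuming the CLI (zone ID and LB hostnames are placeholders): two record sets share a name and differ only by SetIdentifier and Weight.
$ aws route53 change-resource-record-sets --hosted-zone-id Z_YOUR_ZONE --change-batch '{
  "Changes": [
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "v1", "Weight": 90,
      "ResourceRecords": [{"Value": "v1-lb.us-east-1.elb.amazonaws.com"}]}},
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "v2", "Weight": 10,
      "ResourceRecords": [{"Value": "v2-lb.us-east-1.elb.amazonaws.com"}]}}
  ]}'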
Latency-based routing. Return the IP of the region closest to the resolver, not the user. This is an important distinction. The resolver is usually nearby, but for users on 8.8.8.8 from anywhere, the resolver geography is Google’s, not the user’s. ECS (EDNS Client Subnet) helps but isn’t universally honored.
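You can poke at ECS yourself with dig. The subnets below are documentation ranges, purely illustrative, and plenty of resolvers ignore the option entirely:
# Ask the same question while claiming two different client subnets.
$ dig +subnet=203.0.113.0/24 +short app.example.com
$ dig +subnet=198.51.100.0/24 +short app.example.com
# Different answers mean the resolver/authority pair honors ECS.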
Geolocation routing. Hard rules: traffic from EU goes to EU servers, traffic from APAC goes to APAC servers. Useful for compliance (GDPR data residency) more than performance.
Geoproximity routing. Like geolocation but with bias adjustments. Route 53 calls this “geoproximity”; the others have similar features under different names.
Failover routing. Primary IP / secondary IP. Health checks decide which one DNS returns. The failover is “automatic” in the sense that DNS returns a different answer — but how fast that propagates depends entirely on TTL.
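A failover policy is only as good as the health check behind it. A hedged sketch of creating one in Route 53 (hostname and path are placeholders); the PRIMARY and SECONDARY record sets then reference the returned check ID.
# Fail after 3 missed checks at 10-second intervals, i.e. ~30 s detection.
$ aws route53 create-health-check \
    --caller-reference "app-primary-$(date +%s)" \
    --health-check-config '{
      "Type": "HTTPS",
      "FullyQualifiedDomainName": "primary.example.com",
      "Port": 443,
      "ResourcePath": "/healthz",
      "RequestInterval": 10,
      "FailureThreshold": 3
    }'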
The brutal truth about TTLs
Every DNS-based failover story ends at this paragraph, so let’s just go there directly.
When a record has TTL 60, that means “any resolver may cache this for up to 60 seconds.” If your authoritative server changes the answer mid-cache-window, the resolver doesn’t know. It serves the old answer until expiry. Worse, some resolvers ignore TTL and cache longer. Some ISP resolvers used to cache for hours regardless of what you set. The behavior has improved, but it has not become deterministic.
So when AWS tells you Route 53 health checks fail over in “tens of seconds,” that’s the time for the authoritative server to change its answer. The time for users to actually see the new answer is that + TTL + downstream resolver caching + browser DNS cache + OS DNS cache. In practice, planning for 2-3 minutes of partial failover is a reasonable safety margin for a 60-second TTL. With longer TTLs, you simply cannot do fast failover. Split your zone accordingly: long TTLs (5 minutes or more) on stable records, short TTLs (60 seconds or less) on anything you actually expect to fail over.
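Back-of-envelope, with illustrative numbers (yours will differ):
# Worst-case user-visible failover for a 60 s TTL, illustrative only:
#   detection: 3 failed checks x 10 s interval   ~30 s
#   authoritative answer flips                    ~0 s
#   resolver caches drain                        <=60 s
#   OS and browser DNS caches                     ~30-60 s (varies wildly)
#   total                                         ~2-3 minutes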
The answer most production-grade architectures land on: don’t rely on DNS for sub-minute failover. Rely on the load balancer’s health checks for fast failover within a region. Rely on DNS for region-level failover, which is rare enough that tens of seconds of impact is acceptable.
# Watch the TTL count down in real time (second column of the answer;
# query a caching resolver, since the authoritative server always returns the full TTL)
$ while true; do dig +noall +answer app.example.com; sleep 5; done
Multi-region patterns that actually work
A few patterns that hold up under real load:
Active-active with latency routing. Two or more regions, each with its own L7 LB, each healthy. DNS hands out the closest region per query. Each region serves its share. If one region fails, health checks remove it from DNS rotation. Pros: uses your capacity, simple to reason about. Cons: data layer must be replicated, which is the actually-hard part.
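The DNS half of active-active is just one more routing policy. A trimmed Route 53 sketch, same placeholder caveats as earlier: one record set per region, each tied to a health check so a dead region drops out of rotation.
$ aws route53 change-resource-record-sets --hosted-zone-id Z_YOUR_ZONE --change-batch '{
  "Changes": [
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "use1", "Region": "us-east-1", "HealthCheckId": "HC_ID_USE1",
      "ResourceRecords": [{"Value": "use1-lb.us-east-1.elb.amazonaws.com"}]}},
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "euw1", "Region": "eu-west-1", "HealthCheckId": "HC_ID_EUW1",
      "ResourceRecords": [{"Value": "euw1-lb.eu-west-1.elb.amazonaws.com"}]}}
  ]}'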
Active-passive with failover routing. One region is primary, one is warm standby. DNS only returns the standby’s IP if the primary fails its health check. Pros: cheaper (you only run the primary at full capacity). Cons: failover is slow (TTL + warmup), and DR drills are critical because you discover everything broken about your standby on the day you need it.
Per-customer routing. Geolocation rules for compliance, latency rules for everyone else. Common in SaaS with EU customer data residency requirements.
Anycast at the DNS layer itself. Route 53, Cloud DNS, Azure DNS are all anycast — the same IPs are advertised from many locations. Your authoritative DNS responds from whichever PoP is closest to the resolver. This is invisible to you as a customer, but it’s why public DNS feels fast.
The five-minute production checklist
Things I find missing on most audits:
- Use alias records at the apex, not CNAMEs (you can’t anyway, but I see attempts).
- Set TTLs intentionally. 300s is a reasonable default. 60s on records you expect to flip. Anything longer than 3600s is a failover landmine.
- Health-check at the right layer. DNS health checks for region selection. LB health checks for instance selection. Don’t try to make DNS do what an LB does well.
- Monitor your TTL drift. If someone bumps a TTL from 60 to 86400 “to reduce query costs,” your next failover takes a day. Alert on TTL increases (a one-line check follows this list).
- Test failover in non-prod. Annually. Quarterly is better. Real chaos engineering at the DNS layer is rare and almost always worth it.
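That TTL-drift check is a one-liner. A minimal sketch, assuming a record you expect to stay at 60 seconds:
# Read the TTL column of the answer and complain if it exceeds the ceiling.
$ ttl=$(dig +noall +answer app.example.com A | awk '{print $2; exit}')
$ [ "${ttl:-0}" -le 60 ] || echo "ALERT: app.example.com TTL is ${ttl}s, expected <=60"
Wire it into whatever already runs your synthetic checks.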
What’s next
We’ve spent three posts assuming the answers DNS gives us back are true. About that.
Part 4 is the security payoff: cache poisoning, the Kaminsky attack, on-path MITM, registrar hijacks, subdomain takeover — and the layered defenses that actually work (DNSSEC, DoT/DoH, CAA, registrar lock). With a hands-on lab where you’ll see a DNS spoof succeed against an unprotected resolver, then watch DNSSEC validation defeat the same attack.
Lab — Try It Yourself
Repo: github.com/hagzag/dns-evolution-in-practice
Lab: practice/part3/
The Part 3 lab simulates weighted DNS routing and health-check failover locally on k3d — no cloud account required. You’ll watch traffic shift between two regional backends, then kill one and observe how TTL controls the failover window.
task part3:run
Further Reading
- AWS Route 53 routing policies
- GCP Cloud DNS routing policies
- Azure Traffic Manager routing methods
- RFC 4786: Operation of Anycast Services