TL;DR

I’ve been teaching networking, sysadmin, and DevOps in Israeli colleges and bootcamps since the early 2000s. In all those years, DNS has consistently been the topic students least respected and engineers most underestimated — and the topic that quietly underpins every load balancer, service mesh, canary release, and high-availability story we now take for granted. This is the introduction to a four-part series I’ve been wanting to write for a long time, and the post I plan to send students who ask “why do we even learn this?”

Introduction

I started teaching in Israeli colleges and high-tech bootcamps somewhere around 2003. Different rooms over the years — Linux fundamentals, networking, sysadmin tracks, SRE programs, DevOps cohorts — but the same pattern every time. We’d get to the DNS chapter. The energy in the room would drop. Phones would come out. Someone would ask, politely, whether we could “just skip ahead to the Kubernetes part.”

I get it. DNS doesn’t demo well in a classroom. You can show a dig command, you can draw a hierarchical tree on the whiteboard, you can talk about caching and TTLs, and none of it lights up the way a live kubectl rollout does. Compared to containers, service meshes, GitOps pipelines, IaC modules, and observability stacks, DNS feels like infrastructure plumbing from a previous era. Like teaching someone TCP windowing when they wanted to learn React.

For years I let it slide. I’d cover the basics, hand out the cheat sheet, move on. The students who’d later end up running production systems would — without exception — discover the hard way that DNS was where their incidents lived. They’d email me a year or two later from their first SRE job: “Hey, remember that DNS chapter we kind of breezed through? Yeah.”

Why DNS keeps getting underestimated

There are a few reasons DNS gets disrespected, and they’re worth naming because they’re the same reasons it keeps biting people in production.

It looks deceptively simple. Name in, IP out. What’s there to learn? The dig man page is dense, sure, but the conceptual model fits on an index card. So curricula treat it as a one-week topic between “ports and protocols” and “HTTP fundamentals,” and move on.

The interesting stuff is invisible. When DNS works, you don’t notice it. When it fails, the symptoms look like everything else — timeouts, certificate errors, slow page loads, mystery 500s. Most engineers spend their careers benefiting from DNS without ever having to debug it from first principles. Until they do.

It hides under newer abstractions. Modern engineers learn “Kubernetes Services” without realizing that’s a CoreDNS-backed service catalog underneath. They learn “Route 53 weighted routing” without realizing that’s just A records with TTL games. They learn “Consul service discovery” without realizing the magic is a DNS interface on port 8600. The abstractions are real, but they’re abstractions over DNS — not replacements for it.

It’s old. RFC 1034 was published in 1987. Engineers raised on cloud-native treat anything older than Kubernetes (2014) as “legacy” by reflex. DNS feels like a historical artifact. It is not. It is, with TLS, among the most actively load-bearing decades-old protocols on the internet.

Why now

The topic finally caught up with me. Over the last few years of consulting through Tikal — across healthcare AI, fintech, SaaS platforms, federal compliance work — every meaningful infrastructure story I’ve been pulled into has had DNS sitting at its center. Not as a footnote. As the substrate.

Multi-region high availability? That’s DNS-layer routing decisions plus TTL trade-offs.

Canary releases at the edge? That’s weighted DNS records modulating traffic to v2 before the L7 LB ever sees it.

Service mesh? It’s mTLS and traffic policy on top of, fundamentally, a DNS-resolved endpoint discovery layer.

Zero-trust networking? Half the policy enforcement points key off DNS-resolved identities, and the failure modes are DNS failure modes.

Email security in 2026? It’s three TXT records (SPF, DKIM, DMARC) plus MTA-STS, TLSRPT, and DANE — all DNS.

Cert issuance? CAA records gate which CAs can mint certs for your domain. And ACME’s DNS-01 validation is nothing more than a TXT-record challenge.

Subdomain takeover, the highest-prevalence cloud security issue I see in audits? Pure DNS hygiene. Forgotten CNAME records pointing at deprovisioned resources.
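To make a couple of those items concrete, here is roughly what the email-security records and a dangling CNAME look like in a zone file. Every name, selector, and key below is a placeholder for illustration, not a real configuration:

```
; Email security lives in TXT records (all names and values are placeholders)
example.com.                  IN TXT   "v=spf1 include:_spf.mail-provider.example -all"
mail._domainkey.example.com.  IN TXT   "v=DKIM1; k=rsa; p=MIGf..."   ; truncated placeholder key
_dmarc.example.com.           IN TXT   "v=DMARC1; p=reject; rua=mailto:dmarc@example.com"

; Subdomain-takeover material: a CNAME that outlived its target
shop.example.com.             IN CNAME retired-app.some-saas-platform.example.
```

If the SaaS platform lets anyone claim retired-app, that last record hands them shop.example.com.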

The thing my students didn’t want to learn turns out to be the thing the modern stack is built on. So I’m writing it down.

What this series is

Four posts, each with a hands-on k3d lab in a companion repo at github.com/hagzag/dns-evolution-in-practice. The series covers, in order:

Part 1 — From /etc/hosts to BIND-9: The Origin Story Every SRE Should Know. Where DNS came from, the BIND lineage at Berkeley, the cloud-native turn (SkyDNS → CoreDNS), how resolution actually flows, and the record types you should be able to read fluently — A, AAAA, CNAME (and the apex problem), SOA, NS, MX, TXT.

Part 2 — DNS at Scale: Service Discovery with Consul and CoreDNS. Why ephemeral workloads broke traditional DNS. Consul’s DNS interface. CoreDNS’s plugin architecture. SRV records and Kubernetes headless services. The ndots:5 performance trap costing your cluster milliseconds on every external lookup.
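As a preview of why ndots:5 hurts, here is a minimal sketch of the resolver’s search-list expansion rule, using the default search domains a Kubernetes pod typically gets. This is not the actual resolver code, just the candidate-ordering logic:

```python
# Sketch of glibc/musl-style search-list expansion under a Kubernetes-like
# resolv.conf ("options ndots:5" plus the cluster search domains).

NDOTS = 5
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]

def candidates(name: str) -> list[str]:
    """Return the fully qualified names the resolver will try, in order."""
    if name.endswith("."):                # trailing dot: absolute, no expansion
        return [name]
    if name.count(".") >= NDOTS:          # "dotty enough": try as-is first
        return [name + "."] + [f"{name}.{d}." for d in SEARCH]
    # Fewer than ndots dots: walk the search list BEFORE the bare name,
    # so every lookup of api.example.com pays for three doomed queries first.
    return [f"{name}.{d}." for d in SEARCH] + [name + "."]

print(candidates("api.example.com"))
```

With only two dots, api.example.com gets tried against all three cluster suffixes (each a guaranteed NXDOMAIN) before the real name is ever asked, and each of those attempts is typically an A plus an AAAA query.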

Part 3 — DNS as a Load Balancer: AWS, GCP, Azure and the L3-to-L7 Stack. The load balancer taxonomy across three clouds. Why the apex CNAME problem matters here more than anywhere. Weighted, latency, geolocation, and failover routing. The brutal truth about TTLs in production (spoiler: don’t rely on DNS for sub-minute failover).
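The TTL arithmetic behind that spoiler is worth doing once. A rough worst-case model, with illustrative numbers rather than any cloud’s actual defaults:

```python
# Back-of-the-envelope worst-case DNS failover window. A client that cached
# the answer just before the outage keeps using the dead endpoint until its
# TTL expires -- on top of however long detection took.

def worst_case_failover_seconds(ttl: int,
                                health_check_interval: int,
                                failures_to_trip: int) -> int:
    """Seconds a worst-placed client may keep hitting the dead region."""
    detection = health_check_interval * failures_to_trip  # time to declare it down
    return detection + ttl                                # then wait out the cache

# Illustrative numbers: 60s TTL, 30s health checks, 3 strikes to fail over.
print(worst_case_failover_seconds(ttl=60, health_check_interval=30, failures_to_trip=3))
# -> 150 seconds, before counting resolvers that clamp or ignore low TTLs
```

Two and a half minutes, under optimistic assumptions, is why Part 3 argues against relying on DNS alone for sub-minute failover.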

Part 4 — When DNS Lies: Cache Poisoning, Spoofing, and How to Defend Yourself. The Kaminsky attack, DigiNotar, Sea Turtle, MyEtherWallet — the real public incidents and what each taught us. The layered defenses that hold up: DNSSEC, DoT/DoH, CAA records, registrar lock, subdomain takeover scanning. Includes a contained k3d lab where you’ll watch a DNS spoof succeed against an unprotected resolver, then watch DNSSEC validation defeat the same attack. (Educational use only — full disclaimer in the post.)

Who this is for

  • DevOps engineers and SREs who’ve inherited DNS configurations they didn’t write and don’t fully understand
  • Platform engineers building service catalogs, mesh layers, or multi-region architectures
  • Security-curious developers who keep reading “it’s always DNS” memes without quite knowing why
  • Students currently sitting in someone’s networking class wondering if this DNS chapter is actually going to matter

If you can read a dig output and you’ve used kubectl once or twice, you’re ready. The labs assume Docker and k3d on your laptop. No cloud account required.

How to read it

You can read straight through (Part 1 → Part 4), or jump to whichever post matches what you’re currently fighting:

  • “My Kubernetes pods can’t reach an external API and DNS is slow.” → Part 2.
  • “My multi-region failover takes longer than my SLA allows.” → Part 3.
  • “My security team flagged our domain in a recent audit.” → Part 4.
  • “I have no idea what an SOA record is and I’m too embarrassed to ask.” → Part 1.

Each post has its own lab. Run them. Break them. The fastest way to internalize DNS is to watch it return wrong answers in a sandbox you control.

What I tell students now

When students ask why we have to learn this, I have a different answer than I had ten years ago. It goes roughly like this:

Every system you’ll ever build assumes DNS works correctly, returns answers fast enough, and tells the truth. None of those three things are guaranteed. Your job, eventually, will be to know exactly when each of those assumptions breaks and what to do about it. Everything else in your stack — Kubernetes, service mesh, multi-cloud, zero-trust — is built on this 1980s protocol that we somehow never replaced. You’d better learn it.

This series is the long form of that answer. I’ll be sending it to every student who asks from now on.

Onwards to Part 1 — From /etc/hosts to BIND-9.


The Companion Repo

github.com/hagzag/dns-evolution-in-practice

Four k3d-based labs, one per post. Run them locally:

git clone https://github.com/hagzag/dns-evolution-in-practice
cd dns-evolution-in-practice
task part1:run   # or part2, part3, part4