TL;DR

Static zone files were never going to survive a 500-pod autoscaling event. Cloud-native rebuilt DNS as a real-time service catalog: Consul gave us the registry pattern, CoreDNS gave us the plugin pattern, and Kubernetes wired them into every cluster you operate. This post shows how the pieces fit, what SRV records actually do for you, and why your ndots:5 setting is probably costing you milliseconds on every external lookup.

Introduction

In Part 1 we walked from HOSTS.TXT at SRI-NIC to BIND 9 to the cloud-native turn. We left off at a teaser: what happens when “the IP behind a name” changes every 90 seconds?

The answer is “BIND breaks, and you go on a journey.” That journey is what this post is about.

Why static DNS broke

Traditional DNS assumes records change rarely. You buy a server. You give it a name. The name and the IP stay together for years. You edit a zone file maybe once a quarter. Refresh a slave. Done.

Now imagine your “server” is a pod. Its IP is assigned by the CNI on creation, valid for the lifetime of the pod, and reclaimed when it dies. The pod might live 90 seconds during a deploy, or be evicted by Karpenter during scale-down, or be replaced by a HorizontalPodAutoscaler reacting to traffic. There are 50 of them behind the same logical service. They come and go faster than any human can edit a zone file.

This is not a corner case. This is the default mode of every Kubernetes cluster. The problem is structural: the registry needs to be the source of truth, not the zone file.

The service discovery patterns

Three patterns are worth distinguishing before we get into specific tools.

Client-side discovery. The client asks the registry directly, picks an instance, and connects. Netflix’s Eureka popularized this. The client is smart, the registry is dumb, and load balancing happens on the caller. Pros: minimal infrastructure. Cons: every language needs a client library, and you couple your apps to the registry’s API.

Server-side discovery. The client connects to a stable endpoint (a load balancer, an Envoy sidecar, a service mesh proxy). The endpoint asks the registry. The client knows nothing. Pros: language-agnostic. Cons: you now operate the proxy layer.

DNS-based discovery. The registry exposes itself as a DNS server. Clients use DNS — which every operating system on earth already supports natively. The TTL becomes your refresh interval. Pros: zero client changes. Cons: DNS caching is real, and DNS lacks rich metadata (you only get IPs and ports, no health state on the wire).

Cloud-native, in practice, uses all three at once, but DNS-based discovery is the substrate they all build on. Which is why Consul and CoreDNS matter so much.
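The client-side pattern is easy to sketch. Here is a minimal Python illustration; the instance-list shape is hypothetical (loosely modeled on what a registry health API returns), not any particular product's wire format:

```python
import random

def pick_instance(instances):
    """Client-side load balancing: filter to healthy instances, pick one.

    `instances` stands in for a registry response; the dict shape here
    is illustrative, not any particular registry's format.
    """
    healthy = [i for i in instances if i["healthy"]]
    if not healthy:
        raise LookupError("no healthy instances")
    return random.choice(healthy)  # the caller does the load balancing

# A real client would fetch this list from the registry's HTTP API on a
# short timer, then open a connection to whatever pick_instance returns.
catalog = [
    {"address": "10.0.1.12", "port": 8080, "healthy": True},
    {"address": "10.0.1.45", "port": 8080, "healthy": True},
    {"address": "10.0.2.31", "port": 8080, "healthy": False},  # failing check
]
target = pick_instance(catalog)
```

The cost the pattern description mentions is visible here: this logic has to be reimplemented (or shipped as a library) in every language your services use.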

Consul: the registry that speaks DNS

HashiCorp shipped Consul in 2014 with four things bundled into one binary:

  1. A service catalog that records every service and its instances
  2. Health checks that mark instances healthy or failing
  3. A KV store for configuration
  4. A DNS interface at port 8600

That last one was the trojan horse. Consul speaks BIND-style DNS to anything that asks. You register a service called web running on three nodes, and every node running a Consul agent will answer web.service.consul with the healthy instances:

$ dig @127.0.0.1 -p 8600 web.service.consul +short
10.0.1.12
10.0.1.45
10.0.2.31

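Behind that answer sits a registration loaded by the local agent, in Consul's standard JSON service-definition format. A minimal sketch (the health-check endpoint is illustrative):

```json
{
  "service": {
    "name": "web",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "1s"
    }
  }
}
```

Drop it in the agent's config directory (or load it with `consul services register`) and the instance joins the catalog; once its check fails, it falls out of DNS answers.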
Failing instances are automatically excluded from responses. The DNS layer becomes load-aware without the client knowing anything about Consul. You can also query SRV records to get ports back:

$ dig @127.0.0.1 -p 8600 web.service.consul SRV +short
1 1 8080 web-1.node.dc1.consul.
1 1 8080 web-2.node.dc1.consul.

For datacenter operations on bare metal or VMs, this is still my default answer. Consul is the unsung hero of pre-Kubernetes service discovery and it remains genuinely excellent.

CoreDNS: the plugin pattern

CoreDNS is the cluster DNS for Kubernetes, but its design is more interesting than that one fact suggests.

When Miek Gieben built CoreDNS in 2016, he forked the Caddy web server because he liked Caddy’s plugin architecture. Each plugin handles one concern (caching, forwarding, Kubernetes integration, Prometheus metrics, etc.) and they’re chained together via a Corefile:

.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

That’s the entire default configuration of the DNS server running inside every Kubernetes cluster. Compare that to a named.conf from BIND 9 — same job, an order of magnitude less ceremony.

The kubernetes plugin is where the magic happens. It watches the API server for Service and EndpointSlice objects (Endpoints on older clusters) and answers DNS queries for *.cluster.local directly from that data. No zone files, no reloads, no replication protocol. The Kubernetes API is the zone file.
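The mapping is mechanical enough to sketch. This toy Python function mirrors what the plugin does conceptually (synthesize answers from API objects held in memory); it is an illustration, not CoreDNS's actual code:

```python
def service_dns_name(service, namespace, zone="cluster.local"):
    """Synthesize the name CoreDNS answers for a ClusterIP Service."""
    return f"{service}.{namespace}.svc.{zone}"

# The plugin's watch keeps an in-memory view of Services keyed by
# (name, namespace); answering a query is then a map lookup, not a
# zone-file read. This table is a stand-in for that view.
services = {("kubernetes", "default"): "10.96.0.1"}

def resolve(qname, zone="cluster.local"):
    for (svc, ns), cluster_ip in services.items():
        if qname.rstrip(".") == service_dns_name(svc, ns, zone):
            return cluster_ip
    return None  # the real server answers NXDOMAIN
```

When a Service is created or deleted, the watch updates the map and the very next query sees the change; that is the whole "no reloads" story.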

SRV records, and why they finally matter

Most engineers know A and CNAME. Few use SRV in anger. Cloud-native changed that.

An SRV record carries four fields: priority, weight, port, target. The use case it was standardized for (RFC 2782 in 2000, refining 1996’s RFC 2052) was protocol-aware service location — find me the LDAP server for this domain, find me the SIP server, find me the Kerberos KDC. It never went mainstream on the public internet, but it absolutely went mainstream inside clusters.

Kubernetes headless services (services with clusterIP: None) return SRV records that name every pod individually:

$ kubectl exec -it debug -- dig +short SRV _http._tcp.echo.default.svc.cluster.local
0 33 8080 echo-0.echo.default.svc.cluster.local.
0 33 8080 echo-1.echo.default.svc.cluster.local.
0 33 8080 echo-2.echo.default.svc.cluster.local.

This is how StatefulSets give every pod a stable, queryable identity. It’s how Kafka brokers get stable per-pod addresses to advertise, how Cassandra nodes reach their seed peers, how etcd’s DNS discovery bootstraps a quorum. SRV is the protocol substrate for stateful workloads on Kubernetes.
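RFC 2782's selection rules are worth internalizing: sort by priority (lowest first), then order within each priority group at random with probability proportional to weight. A compact sketch of that ordering logic, assuming the SRV answers are already parsed into tuples:

```python
import random
from itertools import groupby

def order_srv(records):
    """Order (priority, weight, port, target) tuples per RFC 2782:
    ascending priority; within a priority group, weighted random order."""
    ordered = []
    by_priority = sorted(records, key=lambda r: r[0])
    for _, group in groupby(by_priority, key=lambda r: r[0]):
        pool = list(group)
        while pool:
            weights = [r[1] for r in pool]
            if sum(weights) == 0:
                pick = random.choice(pool)  # all-zero weights: uniform
            else:
                pick = random.choices(pool, weights=weights)[0]
            ordered.append(pick)
            pool.remove(pick)
    return ordered

# The three equal-weight answers from the headless-service query above
# come back in a random but priority-respecting contact order:
answers = [
    (0, 33, 8080, "echo-0.echo.default.svc.cluster.local."),
    (0, 33, 8080, "echo-1.echo.default.svc.cluster.local."),
    (0, 33, 8080, "echo-2.echo.default.svc.cluster.local."),
]
contact_order = order_srv(answers)
```

Most client libraries do some version of this for you, but knowing the rules explains why equal weights behave like round-robin and why a priority-0 instance always gets tried first.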

The ndots:5 trap

Here’s a thing that costs every Kubernetes cluster more than it should: the default ndots setting.

ndots is a /etc/resolv.conf option. It says “if a hostname has fewer than N dots, try it against each search domain before trying it as an absolute name.” Kubernetes sets ndots:5, which means any short hostname goes through the search list first.

$ kubectl exec -it debug -- cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

So when your pod resolves api.github.com (only 2 dots), the resolver tries:

  1. api.github.com.default.svc.cluster.local → NXDOMAIN
  2. api.github.com.svc.cluster.local → NXDOMAIN
  3. api.github.com.cluster.local → NXDOMAIN
  4. api.github.com → ✅

That’s three wasted round-trips (six queries, if the client resolves A and AAAA in parallel) before every external lookup. At scale, this lights up CoreDNS metrics with NXDOMAIN noise and adds latency to every API call your pods make.
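The expansion logic is simple enough to reproduce. This sketch mirrors resolv.conf search-list semantics for the default Kubernetes setup:

```python
def candidate_names(hostname,
                    search=("default.svc.cluster.local",
                            "svc.cluster.local",
                            "cluster.local"),
                    ndots=5):
    """Return the names the stub resolver tries, in order.

    Mirrors resolv.conf semantics: a trailing dot means absolute (no
    search list); fewer than `ndots` dots means search domains go first.
    """
    if hostname.endswith("."):
        return [hostname]
    expanded = [f"{hostname}.{domain}" for domain in search]
    if hostname.count(".") >= ndots:
        return [hostname] + expanded   # absolute name first
    return expanded + [hostname]       # search list first: the ndots:5 trap

tries = candidate_names("api.github.com")  # 2 dots < 5: four lookups
```

With ndots:1 (or a trailing dot), the same name resolves on the first try, which is exactly what the dnsConfig fix buys you.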

The fix is per-pod dnsConfig:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"

Or use FQDNs in your code (api.github.com. with the trailing dot). Either way, please measure your DNS latency. NodeLocal DNSCache helps too, but ndots:5 is the gift that keeps on giving.

Failure modes worth knowing

Three patterns that have burned every team I’ve worked with:

Stale CoreDNS caches. Default cache is 30 seconds. After a service deletion, a client can keep resolving the old IPs for that long. If you’re rolling out, you’ve already moved on. If you’re failing over, that 30 seconds matters.

Negative caching of NXDOMAIN. A typo’d hostname can poison caches for the negative TTL (per RFC 2308, the lesser of the SOA record’s TTL and its MINIMUM field). Lookups that “should work after I fix it” often don’t, until the cache expires.

The 2017-2019 era of CoreDNS outages. Multiple high-profile clusters lost DNS resolution during node pressure events because CoreDNS pods got evicted. The remediation was NodeLocal DNSCache — a per-node DNS cache that runs as a DaemonSet and absorbs the per-pod query load before it ever reaches the cluster CoreDNS. If you run any production Kubernetes, you should run NodeLocal DNSCache. Full stop.

What’s next

We’ve turned DNS from a static zone into a real-time service catalog. Names now move with workloads. But we haven’t yet asked the harder question: when traffic crosses regions and clouds, which IP does DNS return, and how fast does it change when something fails?

That’s Part 3: DNS as a load balancer. We’ll walk through the L3-to-L7 LB taxonomy across AWS, GCP, and Azure, and look at weighted, latency, geolocation, and failover routing — and the brutal truth about TTLs in production.


Lab — Try It Yourself

Repo: github.com/hagzag/dns-evolution-in-practice
Lab: practice/part2/

The Part 2 lab spins up Consul and CoreDNS in a k3d cluster, registers demo services, walks through SRV record queries, and reproduces the ndots:5 performance problem so you can measure it for yourself before applying the fix.

task part2:run

Further Reading