Beyond the Prompt: The Architecture of Loop Engineering

TL;DR

Prompt engineering didn’t get better — it got demoted. It’s now a subroutine inside something bigger: a loop that observes, acts, evaluates, and corrects on its own, with a compiler or a test suite standing in for your “please double-check your work” follow-up message. The skill that matters in 2026 isn’t writing the perfect instruction — it’s designing the harness the model runs inside, and pricing that harness so it doesn’t quietly burn your budget. A loop that isn’t cost-optimized isn’t a feature. It’s a bug.

Originally published at portfolio.hagzag.com.

The Prompt That Worked on Tuesday

Somewhere on your team there’s a prompt template with a comment like # DO NOT TOUCH — took 3 hours to get this right. It had a persona (“You are a senior Terraform engineer with 10 years of experience…”), a few-shot example lifted from the one time it worked, and a magic phrase someone swore made the model stop hallucinating resource names. It shipped. It worked. Then a model version bumped, or the input got 200 tokens longer, and it quietly started failing in ways nobody noticed until a PR review.

That’s not a bug in your prompt. That’s the whole discipline hitting its ceiling.

Prompt engineering treated a non-deterministic system like a deterministic API — as if the right string of words was a function signature you could nail down once and call forever. It couldn’t scale, because the contract it was implicitly promising (“this exact phrasing produces this exact behavior”) was never one the underlying model actually offered. We’ve moved through two paradigm shifts since, and the destination is a discipline most teams haven’t named yet: Loop Engineering.

Wave One: Prompt Engineering, or Micro-Managing a Black Box

Prompt engineering was real work, and dismissing it as a fad undersells what it taught the industry: framing changes output, examples anchor format, and specificity beats vagueness. Few-shot prompting, persona-setting, chain-of-thought nudges — all legitimate techniques, all still in use today, just no longer the unit of engineering.

The failure mode was structural, not a skill issue. Every prompt was a hand-tuned artifact, coupled tightly to a specific model version and context length. Teams built regression suites for prompts the same way you’d unit-test a function — and watched them go stale every time a provider shipped a quiet update. “Prompt drift” wasn’t an edge case; it was the natural entropy of pinning deterministic behavior onto a probabilistic system using nothing but English.

Wave Two: Context Engineering, or the Hydration Problem

The industry’s answer was reasonable: if the model’s behavior depends on what it knows at inference time, stop fighting the prompt and start engineering the data going into it. That meant RAG pipelines, vector databases, embedding strategies, and (as context windows exploded past a million tokens) the brute-force option of just stuffing in the whole codebase.

This is where platform engineers started earning their keep on the AI side of the SDLC. The work stopped being “what do I say to the model” and became “what does the model need to see to answer correctly”: chunking strategy, retrieval relevance, index freshness, and the uncomfortable discovery that a bigger context window doesn’t mean a better answer. Needle-in-a-haystack degradation is real; a model with a million tokens of mostly-irrelevant code can perform worse than one with 20K tokens of exactly the right code. Context engineering’s real contribution was reframing answer quality as a data infrastructure problem, one that engineers already knew how to solve.

It still had a ceiling, though: a perfectly hydrated model that gets something wrong on the first pass still just… gets it wrong. Nothing in the pipeline told it to check.

Wave Three: Loop Engineering, Where the Model Grades Its Own Homework

Loop engineering is the shift from informing the model to instrumenting it. You’re no longer crafting the one perfect input — you’re designing a closed system: the model acts, something deterministic evaluates the action, the result feeds back in, and the model corrects. Observe, act, evaluate, correct, repeat until a real gate — not a vibe — says stop.

The gate is the entire point. Give the loop a compiler, a linter, and a test runner, and you’ve replaced “trust me, this looks right” with “it doesn’t merge until it’s green.” This is what tool use and function calling actually unlocked: not a chatbot that can describe running pytest, but an agent that runs it, reads the failure, and tries again. Protocols like the Model Context Protocol standardized how a model discovers and calls those tools instead of every team hand-rolling brittle glue code. Spec-driven development pushes the gate earlier still — write the spec and acceptance tests first, and the loop has an unambiguous target instead of an inferred one.
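To make the act, evaluate, correct cycle concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: run_loop, acceptance_gate, and fake_model are hypothetical names, the "model" is a stub that returns a broken answer first and a fixed one once it sees feedback, and the gate runs candidate code in-process with exec, which a real harness would replace with a sandboxed compiler or test runner.

```python
from typing import Callable, Optional

Gate = Callable[[str], tuple[bool, str]]   # candidate -> (passed, feedback)
Model = Callable[[str, str], str]          # (task, feedback) -> candidate


def acceptance_gate(candidate: str) -> tuple[bool, str]:
    """Deterministic gate: the candidate must define add() and pass the check.

    Assumption: running exec in-process is for illustration only. A real
    harness would run untrusted code in a sandboxed test runner.
    """
    namespace: dict = {}
    try:
        exec(compile(candidate, "<candidate>", "exec"), namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:  # any failure becomes feedback, not a crash
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


def run_loop(task: str, model: Model, gate: Gate,
             max_iterations: int = 5) -> Optional[str]:
    """Act, evaluate, feed the result back, and stop only when the gate passes."""
    feedback = ""
    for _ in range(max_iterations):
        candidate = model(task, feedback)    # act
        passed, feedback = gate(candidate)   # evaluate deterministically
        if passed:
            return candidate                 # the gate, not a vibe, says stop
    return None                              # did not converge: hand off to a human


def fake_model(task: str, feedback: str) -> str:
    """Hypothetical stand-in for an LLM call: wrong first, corrected after feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


result = run_loop("write add(a, b)", fake_model, acceptance_gate)
```

The design choice worth noticing is that the model never decides it is finished. Only the gate's return value ends the loop, and a None result is the off-ramp to a person.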

The engineer’s job title didn’t change, but the job did. You’re not writing the function anymore. You’re deciding what “done” means well enough that a machine can verify it without you in the loop for every iteration — and you’re deciding when a human has to step back in.

The Token Economy: Where Architecture Meets the Invoice

Here’s the part most “agentic AI” content skips, and it’s the part that turns a demo into a production system: every iteration of that loop costs real money, and an unbounded loop is a blank check.

The cost curve mirrors the three waves exactly. Prompt engineering’s financial discipline was trivial — shave a few hundred tokens off a template. Context engineering introduced what I’ve started calling the hydration tax: stuffing a full codebase into every request multiplies your input-token bill by however many times you re-send it, and it compounds fast when nothing is cached. Loop engineering introduces a sharper risk — an autonomous agent that hits an edge case, misreads a test failure, and loops nine more times on the same fix, each pass re-reading the whole conversation history. Nobody signed off on that ninth iteration. Nobody designed the loop to stop.
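The arithmetic behind that ninth iteration is worth spelling out. The sketch below uses made-up figures (a 20,000-token base context and 5,000 tokens of test output and diffs appended per pass) purely to show the shape of the curve; cumulative_input_tokens is a hypothetical helper, not a real billing API.

```python
def cumulative_input_tokens(base_tokens: int, added_per_pass: int, passes: int) -> int:
    """Total input tokens billed when every pass re-sends the full history."""
    total = 0
    for pass_index in range(passes):
        # Pass 0 sends only the base context. Each later pass also re-sends
        # everything the earlier passes appended (test output, diffs).
        total += base_tokens + pass_index * added_per_pass
    return total


# Hypothetical figures, chosen only to illustrate the growth.
BASE_TOKENS = 20_000
ADDED_PER_PASS = 5_000

one_pass = cumulative_input_tokens(BASE_TOKENS, ADDED_PER_PASS, passes=1)
ten_passes = cumulative_input_tokens(BASE_TOKENS, ADDED_PER_PASS, passes=10)
flat_ten = cumulative_input_tokens(BASE_TOKENS, 0, passes=10)  # history never grows

print(one_pass, ten_passes, flat_ten)  # 20000 425000 200000
```

Under these assumptions, ten passes bill about 21 times the input tokens of a single pass rather than 10 times, because the history term grows with every iteration. Holding per-pass input flat cuts the ten-pass total by more than half.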

Artificial Analysis’s pricing data makes the stakes concrete: blended token prices across frontier and open models span roughly $0.01 per million tokens (Qwen3.5 0.8B) to well over $20 per million for top-tier reasoning models, a spread of more than 2,000x for the same request. I made this argument with real numbers in The $1,892 Agent and the follow-up Agent Cost Wars: MiniMax M2.5 landed within striking distance of Opus-tier performance at roughly 3% of the cost, and M2.7 later matched a large share of GLM-5’s coding benchmark score at a fraction of the price. Route every tool-execution check through a frontier reasoning model, and you’re paying reasoning-model prices for work a $0.25-per-million-token model could gate just as reliably.

Three patterns separate a production-grade loop from a demo that racked up a surprise bill:

Prompt caching — architect the loop so the system prompt, tool schemas, and unchanging codebase context are cache hits, not fresh input tokens, on every single iteration.
Dynamic model routing — route deterministic, low-ambiguity work (linting, formatting checks, simple tool calls) to small, cheap models, and reserve frontier reasoning models for the genuinely hard decision points.
State compaction — summarize and condense the loop’s running history instead of letting the model re-digest raw logs and full diffs on every pass; token cost per iteration should shrink or stay flat, not grow with iteration count.
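The three patterns can be sketched together in a few lines of Python. Treat this as a toy under stated assumptions: the model names and task kinds are hypothetical, the summary line is a placeholder for what a real summarizer would produce, and the SHA-256 of the stable prefix only stands in for however a given provider keys its prompt cache (commonly an identical leading span, but check your provider's documentation).

```python
import hashlib

# Hypothetical model tiers and task kinds, for illustration only.
CHEAP_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"
DETERMINISTIC_TASKS = {"lint", "format_check", "simple_tool_call"}


def route(task_kind: str) -> str:
    """Dynamic routing: low-ambiguity work goes to the cheap tier."""
    return CHEAP_MODEL if task_kind in DETERMINISTIC_TASKS else FRONTIER_MODEL


def compact(history: list[str], keep_last: int = 2) -> list[str]:
    """State compaction: keep recent passes verbatim, collapse older ones."""
    if len(history) <= keep_last:
        return list(history)
    split = len(history) - keep_last
    # Placeholder summary; a real loop would summarize the content itself.
    summary = f"[summary of {split} earlier passes]"
    return [summary] + history[split:]


def build_prompt(stable_prefix: str, history: list[str]) -> tuple[str, str]:
    """Prompt caching: stable content first, volatile history last.

    The returned key is an assumption standing in for a provider-side cache key.
    """
    prompt = stable_prefix + "\n" + "\n".join(compact(history))
    prefix_key = hashlib.sha256(stable_prefix.encode()).hexdigest()
    return prompt, prefix_key


STABLE_PREFIX = "SYSTEM PROMPT\nTOOL SCHEMAS\nCODEBASE CONTEXT"
history = [f"pass {n}: tests failed" for n in range(1, 6)]

prompt_a, key_a = build_prompt(STABLE_PREFIX, history[:3])
prompt_b, key_b = build_prompt(STABLE_PREFIX, history)
```

Because the system prompt, tool schemas, and codebase context sit at the front and never change, both iterations share the same prefix key, and the compacted history keeps the volatile tail at three entries whether the loop is on pass three or pass five.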

Financial predictability and architectural efficiency aren’t two separate concerns you trade off against each other — in a well-designed loop, they’re the same design decision viewed from two angles.

What Actually Went Wrong

I’ve seen the shape of this failure often enough across the industry’s current wave of agent rollouts to know it isn’t hypothetical, even where I can’t cite a specific engagement: a team wires an autonomous loop to fix failing tests overnight, unattended, with no iteration cap and no cost circuit breaker. The agent hits a genuinely flaky test, not a broken one, and spends the night trying dozens of “fixes,” each pass re-sending the full test output and diff history because nobody built state compaction into the loop. Morning arrives with a green build, a plausible commit, and a token bill that dwarfs what a human would have cost to glance at the flaky test for five minutes.

The lesson isn’t “don’t trust the loop.” It’s that the loop needed a budget, a max-iteration count, and a human-escalation path from day one — the same way you wouldn’t ship a CI pipeline without a timeout. Autonomy without a spending gate isn’t agentic engineering. It’s an uncapped retry loop with a good PR description.
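A spending gate does not need to be elaborate. The sketch below is one possible shape, not a prescribed implementation: LoopBudget, EscalateToHuman, and run_with_budget are hypothetical names, and step is assumed to be a callable that runs one pass of the loop and reports whether the gate passed plus what the pass cost in dollars.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopBudget:
    max_iterations: int
    max_cost_usd: float


class EscalateToHuman(Exception):
    """Raised when the loop must stop and hand the problem to a person."""


def run_with_budget(step: Callable[[], tuple[bool, float]],
                    budget: LoopBudget) -> tuple[int, float]:
    """Run passes until the gate passes, the cost cap trips, or passes run out."""
    spent = 0.0
    for iteration in range(1, budget.max_iterations + 1):
        passed, cost = step()   # one pass: (gate passed?, cost of this pass)
        spent += cost
        if passed:
            return iteration, spent
        if spent >= budget.max_cost_usd:
            raise EscalateToHuman(
                f"cost cap hit after {iteration} passes (${spent:.2f})")
    raise EscalateToHuman(
        f"no convergence after {budget.max_iterations} passes (${spent:.2f})")


def flaky_step() -> tuple[bool, float]:
    """Hypothetical pass that never converges and costs $0.50 each time."""
    return False, 0.5


budget = LoopBudget(max_iterations=50, max_cost_usd=2.0)
try:
    run_with_budget(flaky_step, budget)
except EscalateToHuman as reason:
    print(f"paging a human: {reason}")
```

With these made-up numbers the overnight flaky-test scenario ends after four passes and two dollars instead of dozens of passes, and the morning starts with an escalation message rather than an invoice.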

Better Loops, Not Better Prompts

None of this makes the earlier waves obsolete — you still need decent framing, and you still need good context. But neither is where the leverage is anymore. The engineers pulling ahead right now aren’t the ones with the best prompt library; they’re the ones who can design a feedback loop with the right gate, the right cost profile, and the right off-ramp to a human when the loop can’t converge on its own.

That’s also, not coincidentally, an architecture problem — which means it’s one senior engineers are already equipped to solve, even if the tooling is new. Next up on this thread: what one of these loops actually looks like wired into a real pipeline, gate by gate, cost meter running the whole time.