From Test-Driven to Loop-Driven Development

The evolution of AI coding from tests and prompts to agents, harnesses, and supervised loops.

Jun 11, 2026

Software engineering was always a loop. You change something, run it, check the result, and repeat. Test-driven development made this loop explicit: write a failing test, make it pass, then refactor. The loop is small, fast, and grounded in feedback.

Other development practices widened the same idea. Behavior-Driven Development (BDD) moved the loop toward shared behavior and examples. Acceptance testing moved it toward user-visible definitions of done. CI moved it into the delivery pipeline. The shape stayed the same: define the next bit of intent, run the system, check the result, and tighten the design.

AI is widening the loop again. The thing inside one iteration used to be a line, a function, or a failing test. Now it can be a task, a pull request, a migration, or a recurring workflow. That is what I mean by loop-driven development: the engineer designs the trigger, goal, context, harness, verifier, and state around an agent loop.

*The Evolution of AI Coding. Autocomplete, Prompt Engineering, Context Engineering, Harness Engineering, and Loop Engineering.*

This is not a claim that agents can safely own arbitrary software delivery. It is the opposite. The more autonomy you give the loop, the stronger the checks have to become. TDD did not remove engineering judgment. It pushed judgment into tests and refactoring. Loop-driven development does the same at a larger scale.

The progression is additive

The useful parts of each era do not disappear. Each era keeps the previous layer and adds a new control surface.

The progression looks like this:

code completion -> prompt loop -> repo context -> harness -> supervised loop

The unit of work keeps getting wider, and the engineer’s leverage point keeps moving up.

1. Autocomplete

2021-2022

Autocomplete put the model inside the editor. GitHub Copilot made this mainstream by drawing context from the code being edited and suggesting whole lines or functions. Cursor Tab belongs in the same era. It is the completion side of Cursor, where the model predicts the next edit and the developer accepts or rejects it while writing code.

The loop still lives mostly in the developer’s hands. You type, inspect the suggestion, accept or reject it, and continue. The benefit is speed: less boilerplate, fewer mechanical edits, and faster movement through familiar code. The limit is scope. The model helps with the next edit, but it does not own the task.

What got added:

Model
Local file context
Inline completion

This was the Autocomplete era.

2. Prompt Engineering

2022-2023

The next step was to move from completion to task steering. ReAct was not a coding assistant, but it gave agents an important primitive: reason, act, observe, repeat. The model could think about a step, call a tool, read the result, and continue.
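The reason, act, observe, repeat cycle can be sketched in a few lines. This is a minimal illustration, not the ReAct implementation: `Step`, `react_loop`, `scripted_policy`, and the `count_lines` tool are hypothetical names, and the scripted policy stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    thought: str   # the policy's reasoning for this step
    action: str    # a tool name, or "finish"
    argument: str  # tool input, or the final answer when action == "finish"


def react_loop(
    policy: Callable[[list[str]], Step],
    tools: dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> tuple[Optional[str], list[str]]:
    """Reason, act, observe, and repeat until the policy finishes or the budget runs out."""
    transcript = [f"question: {question}"]
    for _ in range(max_steps):
        step = policy(transcript)                          # reason
        transcript.append(f"thought: {step.thought}")
        if step.action == "finish":
            return step.argument, transcript
        observation = tools[step.action](step.argument)    # act
        transcript.append(f"observation: {observation}")   # observe
    return None, transcript                                # budget exhausted


def scripted_policy(transcript: list[str]) -> Step:
    """Hypothetical stand-in for a model: call one tool, then answer from its result."""
    observations = [line for line in transcript if line.startswith("observation: ")]
    if not observations:
        return Step("I need to count the lines first.", "count_lines", "a\nb\nc")
    answer = observations[-1].split(": ", 1)[1]
    return Step("The observation answers the question.", "finish", answer)


tools = {"count_lines": lambda text: str(len(text.splitlines()))}
answer, transcript = react_loop(scripted_policy, tools, "How many lines are there?")
```

The transcript is the only thing the policy sees on each turn, which is why the quality of what gets appended to it matters so much in later eras.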

AutoGPT made the idea feel autonomous. Instead of asking for one answer, you gave the system a goal and let it prompt itself. That shift created the first native discipline of this era: prompt engineering. The developer was no longer only writing code. The developer was writing instructions that caused code to be written.

I still believe this skill never fully goes away. In one form or another, you have to know how to talk to these models well. The patterns for doing that are what I have been collecting in my Prompt Patterns book.

The benefit was delegation at the task level. You could ask for a script, a test suite, an investigation, or a migration plan. The limit was convergence. A prompt loop without disciplined context and a reliable stop condition can drift, repeat itself, or optimize for the wrong thing.
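A reliable stop condition can be as simple as three exits: the goal is met, the loop starts repeating itself, or the step budget runs out. The sketch below assumes the loop's progress can be summarized as a string; `run_with_guards`, `step`, and `is_done` are hypothetical names, not part of any tool mentioned here.

```python
from typing import Callable


def run_with_guards(
    step: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_steps: int = 10,
) -> tuple[str, int]:
    """Run a prompt loop with a goal check, a repetition check, and a step budget."""
    seen: set[str] = set()
    state = ""
    for i in range(1, max_steps + 1):
        state = step(state)
        if is_done(state):
            return "done", i        # the goal check passed
        if state in seen:
            return "stalled", i     # the loop produced a state it already produced
        seen.add(state)
    return "budget_exhausted", max_steps


# A loop that makes progress converges.
converging = run_with_guards(lambda s: s + "x", lambda s: len(s) >= 3)

# A loop that keeps producing the same output is cut off early.
repeating = run_with_guards(lambda s: "retry", lambda s: False)

# A loop that changes but never satisfies the goal hits the budget.
drifting = run_with_guards(lambda s: s + "x", lambda s: False, max_steps=4)
```

Exact-match repetition detection is a crude choice made for brevity. A real loop would likely compare diffs, test results, or some other summary of progress.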

What got added:

✓ Model
Tools
Goal
Prompt loop

This was the Prompt Engineering era.

3. Context Engineering

2024-2025

Once agents could act, the bottleneck became what they could see. A coding agent needs repo context, not only a prompt. It needs files, tests, logs, conventions, architecture notes, issue history, and the current state of the work.

This is where Cursor Agent, Devin, and Ralph-style loops fit. Cursor Agent moves beyond tab completion into autonomous coding tasks, terminal commands, and file edits. Devin is positioned as an autonomous software engineer that can write, run, and test code. Ralph made a narrower but important point: durable state should live in files and git, not only in the chat transcript.
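The point about durable state can be illustrated with a progress file that each run reads and rewrites. This is a sketch under assumptions, not how Ralph itself works: the `progress.json` name, its layout, and the task names are hypothetical, and the git commit that would normally follow each write is left out.

```python
import json
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read state from disk, or start from a hypothetical initial task list."""
    if path.exists():
        return json.loads(path.read_text())
    return {"done": [], "todo": ["add tests", "fix lint", "update docs"]}


def run_once(path: Path) -> dict:
    """One fresh run: load state, complete one task, write the state back."""
    state = load_state(path)
    if state["todo"]:
        task = state["todo"].pop(0)
        state["done"].append(task)  # stand-in for real agent work on the task
    path.write_text(json.dumps(state, indent=2))
    return state


with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "progress.json"
    run_once(progress)          # first run starts from the initial task list
    final = run_once(progress)  # second run resumes from the file, not from chat memory
```

Nothing is shared between the two calls except the file, so each run could be a brand-new model session with an empty transcript.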

The benefit is scope. Agents can work across files, run commands, inspect failures, and make repo-aware changes. The limit is that context is not correctness. A well-contextualized agent can still finish the wrong task unless the environment can tell it what done means.

I wrote about the broader tool landscape in AI Coding Assistants Landscape. The pattern that matters here is the move from assistant-in-editor to agent-in-codebase.

What got added:

✓ Model
✓ Tools
✓ Goal
Repo context
Terminal / files
Tests

This was the Context Engineering era.

4. Harness Engineering

2025-2026

A harness is the environment a single agent runs inside. It includes the prompt, repo context, tools, sandbox, permissions, tests, linters, type checks, CI, evals, and review gates. The point of a harness is not to make the model magically correct. The point is to make the work observable, constrained, and checkable.

OpenAI Codex is a clear example. Codex runs in isolated cloud containers, works against the provided repository, edits files, runs commands, and can propose changes for review. OpenAI’s own harness engineering write-up describes the role shift directly: engineers design environments, specify intent, and build feedback loops that let agents do reliable work. Claude Code fits the same era from the terminal side: it understands a codebase, edits files, runs commands, and handles git workflows.

This is why I think the harness framing is more useful than another prompt taxonomy. In 12 Agentic Harness Patterns from Claude Code, I broke the harness into reusable patterns: persistent instructions, scoped context, memory tiers, tool permissions, lifecycle hooks, and workflow separation. In The Missing Quality Layer for AI Coding Agents, I argued that the next bottleneck is proving the diff is safe to review, not generating the diff.

The benefit is repeatability. The agent no longer just generates code. It runs inside a system that can reject bad work. Deterministic checks should come first: tests, builds, type checks, lint, contract tests, benchmarks, screenshots, traces, and CI. Model-based judging can help with subjective checks, but maker and checker should be separated. These checks matter most when they push back on the agent in the moment rather than after the fact, giving the loop backpressure so it self-corrects before a human has to step in.
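A small sketch of that maker and checker split: deterministic checks run in order, and the first failure is fed back to the maker before any human sees the result. All names here (`run_checks`, `maker_checker`, `stub_maker`) are hypothetical, the stub maker stands in for an agent, and the in-process `exec` is for illustration only; a real harness would run candidates in a sandbox.

```python
from typing import Callable, NamedTuple, Optional

Check = Callable[[str], tuple[bool, str]]


class CheckResult(NamedTuple):
    name: str
    passed: bool
    detail: str


def run_checks(candidate: str, checks: list[tuple[str, Check]]) -> CheckResult:
    """Run checks in order and stop at the first failure."""
    for name, check in checks:
        ok, detail = check(candidate)
        if not ok:
            return CheckResult(name, False, detail)
    return CheckResult("all", True, "")


def maker_checker(
    maker: Callable[[str], str],
    checks: list[tuple[str, Check]],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """The maker proposes, the checks decide, and failures flow back as feedback."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = maker(feedback)
        result = run_checks(candidate, checks)
        if result.passed:
            return candidate, attempt
        feedback = f"{result.name}: {result.detail}"  # backpressure on the next attempt
    return None, max_attempts


def compiles(src: str) -> tuple[bool, str]:
    try:
        compile(src, "<candidate>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, str(exc)


def add_works(src: str) -> tuple[bool, str]:
    namespace: dict = {}
    exec(src, namespace)  # illustration only; sandbox this in a real harness
    got = namespace["add"](2, 3)
    return got == 5, f"add(2, 3) returned {got}, expected 5"


def stub_maker(feedback: str) -> str:
    """Hypothetical agent: ships a bug first, then fixes it once it sees feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


candidate, attempts = maker_checker(stub_maker, [("syntax", compiles), ("behavior", add_works)])
```

The maker never decides whether its own output is acceptable. It only receives the name and detail of the check that rejected it.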

What got added:

✓ Model
✓ Tools
✓ Goal
✓ Repo context
Sandbox
Verifier
CI / eval harness

This was the Harness Engineering era.

5. Loop Engineering

Now

Once the harness is reliable, the next layer is the loop that runs it. A loop is not just an automation. An automation executes fixed steps. A loop has a decision inside it. It checks whether the goal is met and decides whether to continue.

A practical agent loop has five parts:

Trigger: human kickoff, schedule, or event
Goal: the desired end state
Harness: the environment the agent runs in
Verifier: the check that decides whether to continue
State: memory outside the current model call
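The five parts map naturally onto a small data structure. The sketch below is one possible shape, not the API of any tool named in this post: `AgentLoop`, `stub_harness`, and the `failing` counter are hypothetical, and the harness is reduced to a function that takes the goal and state and returns new state.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentLoop:
    trigger: str                          # "manual", "schedule", or "event"
    goal: str                             # the desired end state, in words
    harness: Callable[[str, dict], dict]  # one agent run: (goal, state) -> new state
    verifier: Callable[[dict], bool]      # decides whether the goal is met
    state: dict = field(default_factory=dict)  # memory outside the model call
    max_iterations: int = 5

    def run(self) -> bool:
        """Keep running the harness until the verifier passes or the budget ends."""
        for _ in range(self.max_iterations):
            if self.verifier(self.state):  # the decision inside the loop
                return True
            self.state = self.harness(self.goal, self.state)
        return self.verifier(self.state)


def stub_harness(goal: str, state: dict) -> dict:
    """Hypothetical agent run that fixes one failing test per iteration."""
    return {
        "failing": max(0, state["failing"] - 1),
        "iterations": state["iterations"] + 1,
    }


loop = AgentLoop(
    trigger="manual",
    goal="no failing tests",
    harness=stub_harness,
    verifier=lambda s: s["failing"] == 0,
    state={"failing": 3, "iterations": 0},
)
converged = loop.run()
```

Remove the `verifier` call and `run` becomes a fixed sequence of steps, which is the difference between an automation and a loop.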

This is where current tools are converging. Codex has /goal for long-running work with a verifiable stopping condition and Automations for recurring tasks. Claude Code has /goal, /loop, and scheduled tasks for recurring work. MCP gives agents a standard way to connect to external tools and data sources. Addy Osmani’s Loop Engineering framing captures the same ingredient set: automations, worktrees, skills, connectors, sub-agents, and memory.

The benefit is leverage. A loop can watch CI, triage issues, update dependencies, fix flaky tests, chase review feedback, prepare PRs, and keep working until a condition holds. The risk grows with the leverage. A bad prompt wastes a turn. A bad loop can waste hours, mutate the repo, and generate a pile of plausible work that still needs human judgment.

Skills and playbooks become especially important here. A loop that has to rediscover project conventions every run is fragile. A loop that can call well-scoped skills has a better chance of doing the same work consistently. That is why I see skills as part of the loop substrate, not only better prompt files. I wrote more about that in 9 Principles That Separate Useful Skills from Markdown Essays.

What got added:

✓ Model
✓ Tools
✓ Goal
✓ Repo context
✓ Verifier
Automations
Worktrees
Skills / playbooks
Connectors (MCP)
Durable memory
Orchestration

This is the Loop Engineering era.

The Engineering Leverage Stack

What the stack below shows is not five tools but a single control point moving up. Each era lets you author less of the code directly and more of the system that produces it, trading fine-grained control for reach. The higher you stand, the more a single decision is worth, and the more it leans on the checks in the layers beneath it.

A harness is the environment for one agent run. A loop is the control system around that harness. A factory is a system of loops: one loop finds work, another implements it, another verifies it, another opens or updates the PR, and another escalates what needs human judgment.

That is the leverage shift. The important word is wrap. Prompting did not replace coding. Context did not replace prompting. Harnesses did not replace context. Loops do not replace harnesses. Each layer wraps the one below and changes where engineering judgment is applied.
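The factory described above, a system of loops, can be sketched as a queue of work passing through implement, verify, and escalate stages. This is a toy under assumptions: `factory`, `implement`, and `verify` are hypothetical names, the "find work" loop is assumed to have already filled the queue, and opening a PR is reduced to appending to a list.

```python
from collections import deque
from typing import Callable


def factory(
    issues: list[str],
    implement: Callable[[str, int], str],
    verify: Callable[[str], bool],
    max_attempts: int = 2,
) -> tuple[list[tuple[str, str]], list[str]]:
    """Route each issue through implement and verify; escalate what never passes."""
    queue = deque(issues)  # assumed output of a separate loop that finds work
    opened: list[tuple[str, str]] = []
    escalated: list[str] = []
    while queue:
        issue = queue.popleft()
        for attempt in range(1, max_attempts + 1):
            patch = implement(issue, attempt)   # implementation loop
            if verify(patch):                   # verification loop
                opened.append((issue, patch))   # stand-in for opening a PR
                break
        else:
            escalated.append(issue)             # needs human judgment
    return opened, escalated


# Hypothetical stubs: second attempts pass unless the issue is a hard one.
opened, escalated = factory(
    ["typo", "hard-migration"],
    implement=lambda issue, attempt: f"{issue}-v{attempt}",
    verify=lambda patch: patch.endswith("v2") and not patch.startswith("hard"),
)
```

The escalation list is the part worth designing first, because it defines what the factory is not allowed to decide on its own.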

What loop-driven development means

Loop-driven development is TDD at a larger unit of intent. In TDD, the loop wraps a unit of behavior: write the failing test, make it pass, refactor. In loop-driven development, the loop can wrap a task, a PR, a migration, or a recurring workflow.

The verifier is the difference between a loop and a vibe. Without a verifier, you have repeated prompting. With a verifier, the loop can converge. The verifier can be deterministic, like tests and builds, or probabilistic, like a separate reviewer model, but it has to exist outside the agent’s desire to be done.
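One way to keep the verifier outside the agent's desire to be done is to ignore the agent's own claim entirely. In this sketch, which assumes a reviewer model that returns a score between 0 and 1, `Attempt`, `verified`, and the 0.8 threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    claims_done: bool    # the agent's own self-report
    tests_passed: bool   # result of a deterministic check
    review_score: float  # score from a separate reviewer model, 0.0 to 1.0


def verified(attempt: Attempt, review_threshold: float = 0.8) -> bool:
    """Decide whether the loop may stop, without consulting the agent's claim."""
    # attempt.claims_done is deliberately never read here.
    if not attempt.tests_passed:  # the deterministic gate comes first
        return False
    return attempt.review_score >= review_threshold  # the probabilistic gate comes second


confident_but_failing = verified(Attempt(claims_done=True, tests_passed=False, review_score=0.99))
modest_but_passing = verified(Attempt(claims_done=False, tests_passed=True, review_score=0.90))
passing_but_weak = verified(Attempt(claims_done=True, tests_passed=True, review_score=0.50))
```

The ordering is a design choice: a high reviewer score cannot rescue a failing test, but a low one can still block a passing build.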

This is also where the human role becomes more important, not less. The engineer chooses the goal, designs the context, sets the permissions, defines the checks, reviews the result, and decides what risks are acceptable. The loop can run faster than you can type, but it cannot decide what should matter.

The takeaway

Software was always written in a loop. TDD made the loop explicit around behavior. BDD and acceptance testing widened it toward product intent. AI is widening it again around agents, harnesses, and recurring workflows.

That is the shift from test-driven to loop-driven development.

Not because tests stop mattering. Because tests, evals, reviewers, sandboxes, worktrees, skills, memory, and CI are becoming parts of a larger loop.

Build the loop. Stay the engineer.


Daniel Schermele

Jul 3

This was so well framed. The backpressure-vs-gate distinction really clarified something for me, your line that a failure the agent sees while working is backpressure, while a check after it's done is just a gate. I run my loop on that same split.

It left me chewing on a problem I keep hitting with reward hacking, though, and I'm curious whether you've landed anywhere on it. A gate can prove the code does what the spec says. But nothing in the loop checks the spec itself. So a confidently wrong or ambiguous spec produces a confident green, the agent built the wrong thing faithfully, and every gate downstream passes it. The trust boundary just moves up a level to the spec, where there's no backpressure at all. Have you found any strategy that puts pressure on the spec the way we've learned to put it on the code?

I've been building a gate that chases the version of this one level down, where the agent that writes the code can't change the test that grades it, enforced by tooling rather than asked for. It handles the collusion between code and test, but it dead-ends exactly at the spec problem above, which is why I'm asking. Wrote up what held and what didn't here if it's useful context: https://misterscherm.substack.com/p/please-allow-me-to-backpressure-myself?r=r32hs.

Morty Smith

Jun 13

Insightful and engaging blog post, thank you.
