What a production AI agent actually looks like
Everyone pictures an agent as a model and a clever prompt. I built one that runs my own engineering work end to end. The model is maybe a tenth of it. Here are the other nine.
AI writes code ten times faster. The rest of the job hasn’t moved. And that gap is where everything happens.
When code ships ten times faster, the reflex is to skip whatever slows you down: no ticket, no analysis, no spec, you just code. It feels fast in the moment. Six months later, it’s the debt you drag around.
The standard development cycle exists for a reason: an idea, a ticket, an analysis, a design, a plan, the code, a review. Each step catches what the previous one left fuzzy. The cycle was never the problem. The problem is that at the speed AI produces code, nobody follows it all the way through anymore.
I didn’t want to trade speed for rigor. So I built, for my own projects, an agent that runs the whole cycle end to end, at the speed AI works. Not a chatbot. A pipeline.
This is the concrete companion to the earlier piece on why most AI projects die between POC and production. That piece argued that the models work and the systems around them don’t. Here is one of those systems, in detail.
The real shape of the system
The agent isn’t a chatbot. It’s a pipeline. Work enters as a ticket and moves through phases:
idea → discovery → tech spec → planning → implementation → review
Each phase is a separate, autonomous AI session with one responsibility and one deliverable. Discovery writes the functional spec. Tech spec turns it into a design. Planning derives a plan. Implementation writes the code. Review reads it back.
No single monolithic prompt covers all of it. Five focused sessions do (the idea is the ticket that enters, not a session), each handed only the context it needs. The first architecture decision that matters isn’t which model, it’s how to cut the work: into units small enough that one session can carry one and be judged on that single deliverable.
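The chain of sessions can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the phase identifiers, the `READS` wiring (which earlier deliverable each phase receives) and the `run_session` callable are all assumptions made for the example, and `echo_session` stands in for a real model call.

```python
from typing import Callable, Dict

# Phase names follow the pipeline above; the identifiers are illustrative.
PHASES = ["discovery", "tech_spec", "planning", "implementation", "review"]

# Assumed wiring: each phase reads exactly one earlier deliverable.
READS = {
    "discovery": "ticket",
    "tech_spec": "discovery",
    "planning": "tech_spec",
    "implementation": "planning",
    "review": "implementation",
}

# A session takes (phase name, context) and returns one deliverable.
Session = Callable[[str, str], str]


def run_pipeline(ticket: str, run_session: Session) -> Dict[str, str]:
    """Run each phase as its own session, handing it only the input it needs."""
    deliverables: Dict[str, str] = {"ticket": ticket}
    for phase in PHASES:
        context = deliverables[READS[phase]]  # one deliverable, not the whole history
        deliverables[phase] = run_session(phase, context)
    return deliverables


def echo_session(phase: str, context: str) -> str:
    """Stand-in for a model call: wraps its input so the data flow is visible."""
    return f"{phase}({context})"


result = run_pipeline("T-1", echo_session)
print(result["review"])
```

Each session sees a single string as context, which keeps the unit of work and the unit of judgment the same thing.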
The model, in all this, is maybe a tenth of it. Here are the other nine.
Not every task takes the same path
A single pipeline for everything is the first thing that breaks on contact with reality. So the agent routes each task by its shape, not just its content.
- A feature runs the full five phases.
- A hotfix skips tech spec and planning, and runs on a prompt built for diagnosis. You don’t handle a bug like a feature: a one-off defect needs no design document, and forcing it through one wastes two LLM round trips producing deliverables no one will read.
- An epic stops after discovery and tech spec, then splits into child tickets that each run their own pipeline.
This routing has nothing flashy about it, and it’s the most decisive step there is. Cost and latency are settled here, before a single token is generated, by giving each task only the phases it warrants.
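As a rough sketch, the routing described above fits in one small function. The `TaskKind` enum, the phase identifiers and the `"diagnosis"` / `"standard"` prompt labels are hypothetical names chosen for the example; the text does not say which hotfix phase carries the diagnosis prompt, so the label is attached to the task as a whole.

```python
from enum import Enum
from typing import Tuple


class TaskKind(Enum):
    FEATURE = "feature"
    HOTFIX = "hotfix"
    EPIC = "epic"


FULL_PIPELINE: Tuple[str, ...] = (
    "discovery", "tech_spec", "planning", "implementation", "review",
)


def phases_for(kind: TaskKind) -> Tuple[str, ...]:
    """Return only the phases a task of this shape warrants."""
    if kind is TaskKind.HOTFIX:
        # A hotfix skips the two design-heavy phases.
        return tuple(p for p in FULL_PIPELINE if p not in ("tech_spec", "planning"))
    if kind is TaskKind.EPIC:
        # An epic stops here, then splits into child tickets (not modelled).
        return ("discovery", "tech_spec")
    return FULL_PIPELINE


def prompt_profile(kind: TaskKind) -> str:
    """Hypothetical prompt label: hotfixes get a prompt built for diagnosis."""
    return "diagnosis" if kind is TaskKind.HOTFIX else "standard"


for kind in TaskKind:
    print(kind.value, phases_for(kind), prompt_profile(kind))
```

The decision is made from the task's shape alone, before any model is called, which is why cost and latency are settled at this step.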
The cheapest model that fits
The model question comes only afterward. The agent doesn’t run on a single model: at each phase, it takes the smallest model that can do it.
Not every phase needs the same horsepower. The heaviest ones, designing an architecture or building the plan, go to the most capable model: that’s where everything is decided, and a bad plan poisons the whole rest of the pipeline. Phases whose scope is already fixed upstream, like writing code from a plan and tests, or reviewing, get by on a lighter model. None of this is a matter of principle: it’s the system’s default setting, and any phase can move up to a stronger model the day it needs to.
The logic is deliberately dull, and that’s the point. Paying top-model rates for an already-scoped task is, in my experience, one of the most common ways to watch an AI bill blow up in production. The fix is one routing decision, made once.
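That routing decision can be as plain as a lookup table with an override. A minimal sketch under stated assumptions: the tier names `"strong"`, `"light"` and `"cheapest"` are placeholders rather than real model identifiers, and the tier given to `discovery` is a guess, since the text does not say which model that phase uses.

```python
from typing import Dict, Optional

# Default tier per phase. Tier names are placeholders, not model identifiers.
DEFAULT_TIER: Dict[str, str] = {
    "discovery": "strong",       # assumption: not stated in the text
    "tech_spec": "strong",       # designing the architecture
    "planning": "strong",        # a bad plan poisons everything downstream
    "implementation": "light",   # scope is already fixed by plan and tests
    "review": "light",
    "validator": "cheapest",     # checks, does not reason
}


def model_for(phase: str, overrides: Optional[Dict[str, str]] = None) -> str:
    """Pick the default tier for a phase, unless an override moves it up."""
    merged = {**DEFAULT_TIER, **(overrides or {})}
    return merged[phase]


print(model_for("planning"))
print(model_for("implementation"))
print(model_for("implementation", {"implementation": "strong"}))
```

The override argument is the "any phase can move up" escape hatch: the defaults stay in one place and a single entry changes when a phase turns out to need more.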
Tests before code
Implementation doesn’t start by writing code. It first writes the failing tests, derived from the spec. The code then has a single goal: make them pass.
That changes two things. The definition of “done” becomes objective: the tests pass, or they don’t, and there’s no grey zone where the agent calls itself satisfied too easily. And the tests become a safety net: the session that touches this code later sees at once whether it broke something that worked. It’s also why implementation can run on a more modest model. The work is bounded by tests that say, with no ambiguity, when it’s right.
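To make the "red, then green" order concrete, here is a small self-contained sketch using Python's standard `unittest` module. The `slugify` function and its two test cases are invented for the illustration; only the order of operations (tests written first, failing, then code written until they pass) comes from the text.

```python
import unittest


def slugify(title: str) -> str:
    """Step 1: only a stub exists, so the tests below must fail."""
    raise NotImplementedError


class SlugifySpec(unittest.TestCase):
    """Tests derived from a (hypothetical) spec, written before the code."""

    def test_lowercases_and_joins_words(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  Tests   before code "), "tests-before-code")


def suite_passes(case: type) -> bool:
    """Objective definition of done: the whole test case passes, or it does not."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


red = suite_passes(SlugifySpec)  # before the implementation exists


def slugify(title: str) -> str:  # noqa: F811 (deliberate redefinition)
    """Step 2: the implementation, whose single goal is to pass the tests."""
    return "-".join(title.lower().split())


green = suite_passes(SlugifySpec)  # after the implementation
print(red, green)
```

Checking that the suite fails first matters: a test that passes against a stub is not testing anything.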
What makes it an agent: the validator loop
This is where the system stops being a plain script.
When a phase’s session ends, the pipeline doesn’t move on by itself. A validator takes over (another AI session, another model, a deliberately narrow role) and reads the deliverable just produced against the previous one. Does the design really respect the functional spec? Does the plan really cover the design? If it passes, the pipeline advances. If not, the producing session resumes with the validator’s precise objections fed back into its context, and corrects the deliverable. The validator doesn’t reason, it checks: so it runs on the cheapest model of the bunch, an order of magnitude below the rest.
Three details only came right through real production runs, and they’re what separate a demo from a system:
- The session carries its own history. A resumed session gets the feedback from all its previous attempts, not just the last one. Without that, it fixes the new comment while quietly undoing what it had already repaired, and the loop oscillates forever.
- It stops on purpose. The number of automatic attempts before handing off isn’t a round number picked for comfort. I tuned it on real tickets: it’s the point where the loop still converges. Past it, the odds of success collapse. Either the agent never gets there, or it ends up satisfying the validator with a solution that ticks the boxes but is still wrong. Either way, better to hand off to a human than to grind toward a bad answer.
- No conductor. No separate orchestrator process, no scheduler, no event-bus subscriber coordinating any of this. The “a step delivers, you validate” cycle is the loop, on its own. The design that survived production is the simplest one: the fewest moving parts that can break at 2am.
That loop (produce, check against the source of truth, correct, know when to stop) is the agent. The model is just one component of it.
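The loop and its first two details (accumulated feedback, a deliberate stop) can be sketched as a single function. This is an illustration under assumptions: `produce` and `validate` stand in for the two AI sessions, the `Verdict` and `Outcome` types are invented for the example, and `max_attempts=3` is a placeholder, since the text gives no number for the tuned limit.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Verdict:
    ok: bool
    objections: List[str] = field(default_factory=list)


@dataclass
class Outcome:
    status: str          # "passed" or "needs_human"
    deliverable: str
    attempts: int
    history: List[str]   # every objection raised along the way


Producer = Callable[[str, List[str]], str]   # (source, all feedback so far)
Validator = Callable[[str, str], Verdict]    # (source, deliverable)


def run_phase(source: str, produce: Producer, validate: Validator,
              max_attempts: int = 3) -> Outcome:
    """Produce, check against the source, correct, and stop on purpose."""
    history: List[str] = []
    deliverable = ""
    for attempt in range(1, max_attempts + 1):
        # The producer sees feedback from ALL earlier attempts, not just the last.
        deliverable = produce(source, list(history))
        verdict = validate(source, deliverable)
        if verdict.ok:
            return Outcome("passed", deliverable, attempt, history)
        history.extend(verdict.objections)
    return Outcome("needs_human", deliverable, max_attempts, history)


# Stand-ins for the two sessions, so the loop can run without a model.
REQUIRED = ["auth", "rollback"]


def fake_validate(source: str, deliverable: str) -> Verdict:
    missing = [r for r in REQUIRED if r not in deliverable]
    return Verdict(ok=not missing, objections=[f"missing: {m}" for m in missing[:1]])


def fake_produce(source: str, feedback: List[str]) -> str:
    covered = [item.split(": ")[1] for item in feedback]
    return "design covering " + ", ".join(covered)


outcome = run_phase("functional spec", fake_produce, fake_validate)
handed_off = run_phase("functional spec", fake_produce, fake_validate, max_attempts=2)
print(outcome.status, outcome.attempts, handed_off.status)
```

In this toy run, a producer that only saw the latest objection would drop `auth` while adding `rollback` and never converge; passing the full history is what lets the third attempt satisfy both. There is no orchestrator here either: the loop is the coordination.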
The details that decide everything
A system that holds up in production comes down to architecture details no tutorial mentions. Three that cost me dearly:
- Each ticket in its own git worktree. Two sessions running in parallel on the same repo corrupt each other’s work. Worktree isolation isn’t a convenience, it’s what makes parallelism possible without constant manual reconciliation.
- CI is the judge, not the model. The agent never declares itself “done.” It pushes, waits for CI to go green, and if it breaks, a dedicated phase reads the logs, understands the failure, and fixes it. Success is proven by an outside authority, not asserted by whoever just wrote the code.
- Context is a budget, not a dumping ground. On a long pipeline, each session gets only the deliverable it strictly needs, never the whole history. Reinjecting everything blows up the cost, inflates latency, and degrades quality: a model drowning in context misses what matters in the middle.
None of this is in the model. And all of it decides whether the model ever does its job.
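Two of these details lend themselves to a short sketch: one worktree per ticket, and CI as the judge. Everything named here is an assumption for illustration: the `agent/<ticket>` branch scheme, the sibling `worktrees/` directory, the `main` base branch, and the `push`, `wait_for_ci` and `fix_from_logs` callables, which stand in for the real integrations. The only real interface used is `git worktree add -b <branch> <path> <commit-ish>`, and the command is built but not executed.

```python
from pathlib import PurePosixPath
from typing import Callable, List, Tuple


def worktree_command(repo_root: PurePosixPath, ticket_id: str,
                     base: str = "main") -> List[str]:
    """Build the git command that gives a ticket its own isolated worktree."""
    path = repo_root.parent / "worktrees" / ticket_id   # hypothetical layout
    branch = f"agent/{ticket_id}"                        # hypothetical naming
    return ["git", "-C", str(repo_root), "worktree", "add",
            "-b", branch, str(path), base]


def ship(push: Callable[[], None],
         wait_for_ci: Callable[[], Tuple[bool, str]],
         fix_from_logs: Callable[[str], None],
         max_rounds: int = 3) -> bool:
    """Done means CI is green. The agent's own opinion is never consulted."""
    for round_no in range(max_rounds):
        push()
        green, logs = wait_for_ci()
        if green:
            return True
        if round_no < max_rounds - 1:
            fix_from_logs(logs)   # dedicated fix phase reads the CI logs
    return False


# Stand-ins: CI fails once, then goes green after one fix.
ci_results = iter([(False, "test_login failed"), (True, "")])
pushes: List[str] = []
fixes: List[str] = []

shipped = ship(
    push=lambda: pushes.append("push"),
    wait_for_ci=lambda: next(ci_results),
    fix_from_logs=fixes.append,
)
command = worktree_command(PurePosixPath("/srv/repo"), "T-42")
print(shipped, len(pushes), fixes)
print(" ".join(command))
```

`ship` returns a boolean decided entirely by the CI result, so "done" cannot be asserted by the session that wrote the code.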
What’s left when you swap the model
Swap the underlying model tomorrow: the system keeps running. You adjust a threshold, maybe a default routing value, and you move on.
And that freedom is very concrete. You move from one provider to another without rewriting the agent. You add a backup model that takes over when the main provider goes down. You hand sensitive tasks to an open model hosted in-house, and the rest to a market API. Each time, routing decides who does what. The architecture itself doesn’t move.
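A minimal sketch of that freedom, assuming each provider is wrapped behind the same plain callable. The `ProviderDown` exception, the provider names and the `chain_for` helper are invented for the example; real provider SDKs raise their own error types, which a wrapper would need to translate.

```python
from typing import Callable, List, Tuple


class ProviderDown(Exception):
    """Hypothetical error a provider wrapper raises when it is unavailable."""


Provider = Tuple[str, Callable[[str], str]]   # (name, call)


def chain_for(sensitive: bool, in_house: List[Provider],
              market: List[Provider]) -> List[Provider]:
    """Routing decides who does what: sensitive tasks never leave the house."""
    return in_house if sensitive else market


def call_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider name, answer) from the first that works."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderDown as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderDown("; ".join(errors))


# Stand-ins for real clients.
def primary(prompt: str) -> str:
    raise ProviderDown("timeout")


def backup(prompt: str) -> str:
    return f"backup:{prompt}"


def local_model(prompt: str) -> str:
    return f"local:{prompt}"


MARKET: List[Provider] = [("primary", primary), ("backup", backup)]
IN_HOUSE: List[Provider] = [("in_house", local_model)]

print(call_with_fallback("plan this", chain_for(False, IN_HOUSE, MARKET)))
print(call_with_fallback("audit this", chain_for(True, IN_HOUSE, MARKET)))
```

Nothing upstream of `call_with_fallback` knows which provider answered, which is what lets the rest of the architecture stay still when the model changes.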
Strip out the routing, the validator loop, the resume that carries its history, the worktree isolation, and the best model on the market produces nothing but an expensive, unobservable mess.
The model is interchangeable. The architecture is not. That’s the whole difference between a project that ships a pretty demo in spring and one that’s still running, on call, in production, the spring after.
If you have an agent stuck on the wrong side of that line, working in a demo but not yet a system, that’s exactly the ground I work on. Shall we talk?