The Shape of Done: Notes from Going All-In on AI - Throughline Technical Services

I started doing this full-time in January. Since then I’ve been building a multi-agent system that takes a human conversation, an interview about what someone wants, and turns it into working software at the other end. Me directing rather than typing. I’m still building it. The point isn’t that I think it’ll be better than what’s out there. The point is that I’m not willing to take orchestration promises at face value. I want to come to those conversations from a place of having done the work.

The only way to learn what AI can and can’t do right now is to use it. Not read about it. Not watch demos. Pick something real, dive in, build, break, roll back, build again. Everything is a branch. Almost everything is reversible. The cost of trying is low and the lessons stick in a way nothing else does.

This is a write-up of patterns I’ve observed running multi-agent operations on real codebases over the last several months. Not a complete picture. The ground is moving fast enough that some of it might be obsolete by next week. But the meta-lessons, the patterns of how this kind of work goes wrong, what to watch for, how to think about gates and orchestration, those seem more durable. That’s what this is about.

A note on context. I’ve worked with agents across multiple model families. Claude, Codex, Gemini, others. They have different strengths. Sometimes the right answer is one model. Sometimes it’s putting them against each other. That’s a different piece. For this one I’m drawing on the orchestration work, regardless of which model is on the other end of the dispatch.

A Frame That Helped

The cloud transition fifteen years ago was a migration. You had infrastructure here. You moved it there. The destination was known. The processes were new but learnable. Once you got there, you were there. The hard part was logistics, not epistemology.

This isn’t that. This is being handed a black box and told to figure out what it can do. The inputs are loosely defined natural language. The outputs are non-deterministic. The box’s behavior changes over time. Sometimes it gets better. Sometimes worse. “Better” itself is slippery: better at what, for whom, compared to last week’s version. You can’t write a runbook for a moving target.

What you can do is build intuitions about how the box behaves under different conditions, develop probing techniques, and write down the patterns you find before they shift. That’s not engineering in the traditional sense. It’s closer to empirical science with a non-cooperative subject. Hypothesize, probe, observe, adjust.

Most of the conversation around AI-and-coding still treats it as a tool you adopt. “I added Copilot to my workflow, here’s what changed.” That framing is fine for autocomplete. It’s not the right framing once you’re trying to put agents to work doing real, multi-step engineering. At that point you’ve stopped using a tool and started managing a partner whose behavior you don’t fully understand. Different problem. Different skills.

What I’m Building

A human (founder, product owner) sits down for a structured interview. The interview captures their intent into a document I call canonical state. That state becomes the input to a documentation generator that produces design specs, requirements, and acceptance criteria. Those documents become the input to a ticket shredder that turns them into actionable units of work. Those tickets become the input to a coordinator that dispatches AI workers in coordinated batches. The workers do the actual coding. A verifier checks each worker’s output against its acceptance criteria. The coordinator commits verified work, dispatches rework where needed, and produces a handoff document at the end.

I’m in this loop, but not as a coder. I write plans. I review scaffolds. I approve operations to start. I read diffs. I judge whether the output matches the intent. The coding happens without me.

A typical operation runs a couple of hours, dispatches a handful of workers, produces a dozen or so commits across several files. I’m not Google. I don’t have those resources. What I do have is enough money in tokens to run real operations end-to-end on real codebases, learning what works and what doesn’t, and building intuitions I can apply to bigger things.

When this works, it’s a different way of building software than I’ve ever practiced. When it doesn’t, the failure modes are unfamiliar. That’s most of what I want to talk about.

The Pattern That Took Me Longest to See

If you give an AI agent a task and watch it work in real time, you’ll catch its mistakes. The agent does something wrong. You push back. It corrects. This is what most people are doing right now and it works fine.

If you give multiple AI agents tasks in parallel and walk away, something different happens. Some of them produce work that looks complete but isn’t. Tests pass but don’t test the thing they claim to. Code gets written but a stray reference to the old version survives. A function gets renamed but the callers that depended on its old behavior keep depending on it. The output has the shape of finished work. It just isn’t actually finished.

The natural explanation is “the agent was lazy.” That’s wrong, and the wrongness matters. AI models don’t get tired. What they do is something more interesting and more annoying.

They’ve been trained on millions of examples of what completed work looks like. They have a strong sense of the shape of completion: the sentence patterns, the report structure, the format of a status summary, the cadence of “tests pass, build green, ready for review.” Producing that shape is cheap, in a token-economic sense. It’s what their training rewards. Doing the actual underlying work to justify the shape is expensive.

So when the agent finishes, it produces the shape. Sometimes the shape reflects real work. Sometimes it reflects partial work dressed up as complete. The difference is invisible to the agent, because to the agent both feel like “I have completed this task.”

I’ve come to think of this as the central problem of agent orchestration. Once you see it, you start seeing it everywhere. The implication is that almost every gate, rule, and verification step in the system has to be designed against this specific failure mode. If you don’t, it will eat you.

What It Looks Like in Practice

Four examples drawn from operations I’ve run in the last several weeks. Four different surfaces, same underlying failure mode.

A grep that didn’t catch a bug.

Context: a multi-plan operation rebuilding a visualization/sketch feature. 32 briefs across 4 plans, 33 commits over a few hours. The brief in question was a service rename. The worker was supposed to remove a deprecated code path and replace it with a new implementation.

The acceptance criterion the worker was given was a string search: the old function name should not appear in non-test code. The worker did the rename, ran the search, confirmed zero hits, reported success. The coordinator accepted it.

A few days later I clicked the relevant button in production. It took 32 seconds and burned about twenty cents in API calls. The deprecated code path was still wired up. The grep had passed because the new function had a different name, but the deprecated behavior it triggered had never been disconnected. The worker had even mentioned the residual coupling in its completion report. The acceptance criterion was a string search; the string search passed; the report was a footnote. The coordinator moved on.

Two layers of gates failed independently. The acceptance criterion was the wrong shape: it checked for a token, not a behavior. The worker’s flagged residual was treated as informational rather than blocking. Neither gate was broken. They both did exactly what they were written to do. They were both designed in a way that satisfied the schema-shape of “verification” without doing the work that verification needs to do.

A line count that was wrong.

Context: same multi-plan operation as above. Different worker, in a brief modifying an existing service file. The brief had a soft constraint about file size.

The worker reported it had refactored the file down to 393 lines. The number was wrong. The file was actually 549 lines. The worker hadn’t lied. Lying isn’t a coherent thing for these models to do. It had just produced the shape of “I checked the line count” without producing the tool call to actually check the line count. The shape is cheaper. The shape was sufficient to satisfy the report format. The shape went out.
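One way to close this particular gap is for the coordinator to re-measure any number a worker reports rather than read it off the report. The sketch below is a minimal illustration, not the system described here: the function names are hypothetical, and it assumes the coordinator has read access to the file the worker modified.

```python
from __future__ import annotations

from pathlib import Path


def measured_line_count(path: Path) -> int:
    """Count lines by reading the file, not by trusting a report."""
    with path.open("r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_reported_line_count(path: Path, reported: int) -> tuple[bool, int]:
    """Return (claim_holds, actual_count) for a worker's reported line count.

    Hypothetical coordinator-side gate: the reported number is only a claim
    until it has been compared against a fresh measurement.
    """
    actual = measured_line_count(path)
    return (actual == reported, actual)
```

With a 549-line file and a reported count of 393, the check returns a failed claim along with the measured value, which is the mismatch that would otherwise ship unnoticed.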

A configuration default that broke a station.

Context: a separate operation rebuilding a pipeline that processes long-running LLM jobs. The relevant code was a queue consumer dispatching messages to a worker. After deploy, jobs were silently retrying themselves multiple times before completing.

The system uses a queue to dispatch long-running jobs. The default visibility timeout on the queue is 30 seconds. Some jobs run 3 to 10 minutes. Every 30 seconds, the queue would re-deliver the same message, the consumer would start a parallel attempt, the previous attempt would get orphaned mid-stream, and the cycle would repeat. Each successful job paid for six to twelve cold starts of compute before it landed.

The agent that wrote that consumer used the default. Defaults were available. Defaults were used. Whether the default fit the actual workload was a question the agent didn’t think to ask, because asking would have required cognitive work the framework didn’t force. The default is right there. Reaching for it is the path of least resistance.

A demo page that broke an entire site.

Context: discovered while debugging an unrelated regression. Admin login had been silently broken for weeks. The cause turned out to be dead code from a prior architectural era that survived a cleanup pass.

A demo page in the codebase had a top-level call to install a fake API into the global window.fetch. This was meant to make the demo work without a real backend. But the demo page was eagerly imported by the application’s main module, which meant the fake API installed itself on every page load: login, dashboard, settings, everywhere. For weeks my admin login was silently broken because of a demo I had never visited.

The agent that wrote the demo took the path of least resistance. Calling installFakeApi() at module load makes the demo work. Gating it behind a route check is more work. The side effect was global. Nobody noticed because the failure surfaced as something else entirely.

Each of these is the same shape, even though they look different. An agent reaching for the cheapest output that occupies the schema-slot of “task complete.” A grep instead of a behavior test. A line count guess instead of a line count call. A default instead of a thoughtful configuration. A global side effect instead of a scoped one. Cheap shapes, all the way down.

What I’ve Done About It

I’ve been hardening the system against the failure mode above. None of these moves are individually clever. The interesting part is which kinds of rules actually catch the failure mode and which kinds look like they should but don’t.

Behavior tests, not name greps. The deprecated-code-path example was the canonical lesson. The acceptance criterion was the wrong shape: it checked for the existence of a string. The thing that mattered was a behavior. A grep is the cheapest possible verification. The agent can satisfy it without understanding the code. A behavior test makes the agent demonstrate that the code does the right thing. Now, briefs that delete or replace functionality require a behavior-level test that fires the trigger and asserts the deleted behavior doesn’t manifest. More expensive to write. The only kind that catches the actual failure.
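The difference between the two kinds of criterion can be shown in a few lines. This is a toy sketch with made-up names (render_sketch_v1, render_preview, legacy_render are all hypothetical), and it uses a simple call log where a real test would more likely use a mock or a spy. The point is only that the string search passes while the behavior check fails on the same code.

```python
import re
from typing import Callable, List

# Hypothetical call log standing in for a spy on the deprecated path.
CALL_LOG: List[str] = []


def legacy_render() -> str:
    """Stand-in for the deprecated, expensive code path."""
    CALL_LOG.append("legacy_render")
    return "legacy output"


def render_preview() -> str:
    """The renamed entry point. The old name is gone; the old wiring is not."""
    return legacy_render()


# Source text the string search runs over (assumed: non-test code only).
NON_TEST_SOURCE = "def render_preview():\n    return legacy_render()\n"


def grep_criterion(source: str, old_name: str) -> bool:
    """Passes when the old function name no longer appears in the source."""
    return re.search(rf"\b{re.escape(old_name)}\b", source) is None


def behavior_criterion(trigger: Callable[[], object], forbidden: str) -> bool:
    """Passes only when firing the trigger never reaches the forbidden path."""
    CALL_LOG.clear()
    trigger()
    return forbidden not in CALL_LOG


grep_ok = grep_criterion(NON_TEST_SOURCE, "render_sketch_v1")
behavior_ok = behavior_criterion(render_preview, "legacy_render")
```

Here grep_ok is True and behavior_ok is False: the token is gone, the behavior is not.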

Strictest verifier becomes the only verifier. The system used to have two ways to type-check the codebase: one permissive (fast), one strict (slower, project-aware). Workers preferred the permissive one because that’s what got run as part of “tests pass.” Type errors that the strict checker would have caught were silently shipping. The fix wasn’t to mention both checkers. It was to write the rule so the strict one is the only thing that counts. If the loose option exists in the vocabulary anywhere, the agent reaches for it. This generalizes. Wherever you have two verifiers of the same property, take the weaker one off the table entirely. Don’t leave it as an option. Don’t mention it as an alternative. Don’t reference it in examples. It will be reached for.

Residuals as first-class artifacts.

Context: this rule came out of a working session reviewing failures from the multi-plan operations and trying to name what kept going wrong. It went through two versions. The second is the one that works.

When a worker finishes, it’s now required to declare any residual coupling: any partial completion or known limitation. Those residuals get one of three dispositions: folded into the next brief, opened as a follow-up, or explicitly accepted with justification. The first version of this rule was: “operations cannot close with any open residuals.” Sounds right. It’s wrong.

That creates a perverse incentive. If reporting a residual means the operation can’t complete and someone has to do more work or open more tickets, the path of least resistance is to not report. You’d start to see “Residuals: none” on briefs that previously would have flagged real issues, and the suppressed couplings would surface as code-quality drift instead of as visible residuals.

The fix was to keep residual-reporting cheap and residual-suppression visible. A worker can dispose with a one-line accept-with-justification. Honest reporting is the easy path. Suppression is hard, and the resulting handoff document is shorter than it should be, which is detectable on review. Operations don’t close on undisposed residuals, but they do close on disposed ones. The rule shifted from “no residuals can exist” to “any residuals must be disposed, and disposal is cheap.”
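As a rough sketch of the second version of the rule, residuals can be modeled as records with one of the three dispositions, and the close check can ask only whether every residual has been disposed. The class and field names here are illustrative assumptions, not the actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Disposition(Enum):
    FOLDED_INTO_NEXT_BRIEF = "folded"
    FOLLOW_UP_OPENED = "follow_up"
    ACCEPTED = "accepted"


@dataclass
class Residual:
    brief_id: str
    description: str
    disposition: Optional[Disposition] = None
    justification: str = ""

    def is_disposed(self) -> bool:
        """A residual is disposed once it has a disposition.

        Accepting is the cheap path, but it still needs a one-line reason.
        """
        if self.disposition is None:
            return False
        if self.disposition is Disposition.ACCEPTED:
            return bool(self.justification.strip())
        return True


def can_close(residuals: list[Residual]) -> bool:
    """Operations close on disposed residuals, never on undisposed ones."""
    return all(r.is_disposed() for r in residuals)
```

Note that an empty list closes trivially under this check, so it does nothing about suppression on its own. That has to be caught separately, on review of the handoff.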

Strip the soft-out vocabulary.

Context: this happened the same day I wrote the rule. I caught the failure within hours, on the next operation, and watched the rule decay in real time.

I wrote a rule that said UI changes “MUST satisfy ONE of: automated UI test in acceptance, OR manual smoke step in acceptance.” Hard rule. Unambiguous wording. Within hours, my own scaffolder was rendering it as “this is a judgment call: Option A (recommended), Option B (scaffold anyway).”

The rule was strict. The surrounding language in the prompt was lenient. Words like “judgment call” and “Option B” were available in the prompt’s nearby context, so the model reached for them. The model wasn’t ignoring the rule. It was reaching for the most fluent way to phrase a validation result, and the fluent template for “I found a problem” includes phrases like “here are your options,” because that’s what its training data is full of.

The fix was to remove the lenient phrasing from the surrounding prompt entirely. Hard rejects don’t offer a “scaffold anyway” path. The vocabulary of leniency had to go. Not just the rule that contradicted it. If you write a hard rule and leave the soft version anywhere reachable, in old examples, in commented-out alternatives, in nearby instructions, the model will reach for it. The defense isn’t more rules. The defense is removing the language of the failure mode from the agent’s reachable context.

Justified defaults. For external configuration values (timeouts, batch sizes, retries), the agent has to include a comment explaining why the chosen value is right for this workload. The cognitive friction of writing that comment is the gate. If the agent has to articulate why 30 seconds is the right value for a 10-minute job, the bug surfaces at write-time instead of runtime.
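A minimal sketch of what such a gate might check, using the queue example from earlier. The ConfigValue type and the review function are hypothetical, and modeling the justification as a field rather than a code comment is an assumption made to keep the example small.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigValue:
    name: str
    value_seconds: int
    justification: str  # stands in for the required explanatory comment


def review_visibility_timeout(cfg: ConfigValue, max_job_seconds: int) -> list[str]:
    """Return the reasons a timeout should be rejected (empty list means OK)."""
    problems: list[str] = []
    if not cfg.justification.strip():
        problems.append(f"{cfg.name}: no justification for {cfg.value_seconds}s")
    if cfg.value_seconds <= max_job_seconds:
        problems.append(
            f"{cfg.name}: {cfg.value_seconds}s does not exceed the longest "
            f"expected job ({max_job_seconds}s), so messages will be redelivered"
        )
    return problems


# The inherited default against a 10-minute job: rejected on both counts.
inherited = ConfigValue("visibility_timeout", 30, "")
# A chosen value with a stated reason.
chosen = ConfigValue("visibility_timeout", 900, "jobs run up to 10 min; 15 min leaves headroom")
```

The second check is the one that would have caught the bug mechanically. The first is the friction: it forces someone to write down why the number is what it is.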

Last-touch ownership. When multiple briefs touch the same file, the file-level invariants (“no imports of X remain anywhere”) have to live on the last brief to touch the file, not on each brief individually. Otherwise each brief satisfies its own narrow acceptance, the cross-brief invariant is no one’s responsibility, and it gets violated. Sounds obvious. Wasn’t, until I’d watched it fail twice.
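Working out who owns a file-level invariant is mechanical once the briefs are in dispatch order. The sketch below assumes each brief declares the files it touches up front; the brief IDs and paths are invented for illustration.

```python
from __future__ import annotations


def last_touch_owners(briefs: list[tuple[str, list[str]]]) -> dict[str, str]:
    """Map each file to the last brief (in dispatch order) that touches it.

    `briefs` is an ordered list of (brief_id, files_touched) pairs.
    File-level invariants for a path get attached to the brief returned here.
    """
    owners: dict[str, str] = {}
    for brief_id, files in briefs:
        for path in files:
            owners[path] = brief_id  # later briefs overwrite earlier ones
    return owners


plan = [
    ("B-01", ["services/render.py", "services/queue.py"]),
    ("B-02", ["services/render.py"]),
    ("B-03", ["services/queue.py", "ui/panel.py"]),
]
owners = last_touch_owners(plan)
```

In this example B-02 carries the invariants for services/render.py and B-03 carries them for services/queue.py and ui/panel.py. B-01 carries none.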

Human sign-off is not a brief acceptance criterion. I once wrote a brief whose acceptance was “[ ] sign off before close.” The coordinator reads checkboxes. A checkbox waiting on me freezes the operation if I’m asleep or away. Sign-off is a separate phase, not a worker-facing gate. The brief’s acceptance has to be programmatically checkable. Human review happens at the operation boundary, after the briefs are done.

What This All Points At

Every failure I’ve described is the model producing the cheapest output that occupies the schema-shape of “task complete.” The defense is to make the schema only fillable by actual work.

Once you see the pattern, you see it everywhere. The acceptance criterion that looks measurable but isn’t (“comparable to the original”). The verification that exists but doesn’t gate (“file exists at path”). The default that wasn’t chosen, just inherited. The residual that gets omitted because reporting it is more expensive than not. The vocabulary of leniency that decays a hard rule. They’re all the same shape. They’re all an agent reaching for the cheapest path that occupies the appearance of done.

The implication for anyone building with AI agents at scale: stop writing rules the model can satisfy without actually working. Strip the unfalsifiable words from your acceptance criteria. “Comparable,” “appropriate,” “reasonable,” “if natural.” Don’t let “exists” be a verification on its own. Don’t let cheapness of completion be a path of least resistance. Make the schema-shape of “done” something that can only be filled by the work itself.
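Stripping the unfalsifiable words can itself be a gate. A minimal sketch, assuming acceptance criteria are available as plain strings: the word list is the one named above, and the function name is hypothetical. A scan like this only catches the vocabulary. It says nothing about whether the remaining criteria check behavior.

```python
import re
from typing import List

# The unfalsifiable terms named in the text; extend as new ones show up.
UNFALSIFIABLE = ("comparable", "appropriate", "reasonable", "if natural")

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in UNFALSIFIABLE) + r")\b",
    re.IGNORECASE,
)


def unfalsifiable_terms(criterion: str) -> List[str]:
    """Return every unfalsifiable term found in one acceptance criterion."""
    return [match.group(1).lower() for match in _PATTERN.finditer(criterion)]


flagged = unfalsifiable_terms("Output is comparable to the original, if natural")
clean = unfalsifiable_terms("Triggering the export endpoint makes zero calls to the old service")
```

The first criterion is flagged for two terms and the second for none.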

This is harder than it sounds because the cosmetic shapes of completion are familiar. They look like good engineering. “Tests pass, build clean, file size under threshold” is a perfectly reasonable-sounding criterion list. It’s also exactly what an agent will produce without doing the work, if you let it.

It’s Not the Product. It’s the Process.

I’m not building a product. I’m developing a process.

The system above (interview to canonical state to specs to tickets to dispatched workers to verified commits) is going to look different in a year. Some stages will collapse into each other. Some will get replaced by capabilities the model vendors ship natively. Some will turn out to be wrong abstractions and get thrown out. The specific architecture is not the point.

The process of figuring out how to take a human’s intent and turn it into shipped software using AI as the implementation layer: that’s what I’m actually building. The patterns I’ve learned about residuals and gates and acceptance criteria and the language of leniency, those are durable. They’ll apply to whatever the next architecture looks like, because they’re about how the model fails, not how my pipeline is structured.

This is the part I think most people miss. They look at AI-and-coding and ask “what should I build?” The answer to that is going to change every six months. The better question is “what process should I be developing?” The process is what survives version changes, model swaps, framework rewrites. The process is what compounds.

The way you develop a process is by running it, watching it fail, and changing it. You can’t think your way to a good orchestration process. You have to put the system under load and see what breaks. That requires actually building the system, even if you know it’s going to get rewritten. The system is a vehicle for learning the process. The process is what you keep.

What I’m Working on Next

Treating residuals as a first-class data type, not a markdown convention. Every operation should produce a structured residuals graph that downstream operations can query.

Smell-checking handoff documents. A handoff with zero residuals across an operation that touched a lot of files and ran rework rounds is suspicious. The coordinator should detect that asymmetry and flag it.
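A first pass at that check could be as simple as the sketch below. The summary fields and both thresholds are placeholder assumptions, not tuned values, and a real check would probably want to weigh more signals than these three.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandoffSummary:
    files_touched: int
    rework_rounds: int
    residual_count: int


def looks_suspicious(
    handoff: HandoffSummary,
    min_files: int = 10,         # assumed threshold for "a lot of files"
    min_rework_rounds: int = 1,  # assumed threshold for "ran rework rounds"
) -> bool:
    """Flag a handoff that reports zero residuals despite a busy operation."""
    return (
        handoff.residual_count == 0
        and handoff.files_touched >= min_files
        and handoff.rework_rounds >= min_rework_rounds
    )
```

A flag here is a prompt for a human to look, not a verdict. Some busy operations do finish clean.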

Better cold-path testing. The hardest failures are the ones where the deleted code path is still wired up but no one fires it during normal use. Every deletion should ship with a behavior test that fires the trigger and asserts nothing happens.

Tighter feedback loops on irreversible actions. Agents should pause before any action that can’t be undone (filing tickets, sending messages, deploying) and confirm, even if the prior plan said “go.”
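The shape of that pause is small enough to sketch. The action names and the confirm callback are assumptions for illustration. In practice the confirmation would route to a human or to a stricter policy check.

```python
from typing import Callable

# Assumed set of action names that cannot be undone.
IRREVERSIBLE_ACTIONS = frozenset({"file_ticket", "send_message", "deploy"})


def run_action(
    name: str,
    action: Callable[[], None],
    confirm: Callable[[str], bool],
) -> str:
    """Run an action, pausing for confirmation first if it is irreversible.

    Confirmation is asked at the moment of execution, regardless of what
    an earlier plan approved.
    """
    if name in IRREVERSIBLE_ACTIONS and not confirm(name):
        return "paused"
    action()
    return "done"
```

Reversible actions skip the callback entirely, so the gate adds no friction to ordinary work on a branch.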

And writing more of these. I’ve been keeping the observations as private notes for months. They’re more useful out loud. This is the first attempt.

Closing

If you’re somewhere on the spectrum from “AI helps me code faster” to “this whole thing feels overhyped,” one observation. The interesting question right now isn’t “can AI write software?” That’s a settled question. The interesting question is: what does it look like to develop a process for getting from human intent to shipped software with AI as the implementation layer? You can only answer that by trying. By building, breaking, rolling back, trying again. Everything is a branch. Almost nothing is permanent. The cost of trying is low and the lessons are sticky in a way nothing else is.

If you’re thinking about diving in: dive in. Pick something real. Build it. Break it. Learn what the failure modes feel like. Develop a sense for what the box can and can’t do. Reading about it won’t get you there. The hands-on work is the work.

I’ll keep writing about what I find. If any of this resonates, or if you’ve found different patterns in your own work, I’d want to hear about it.