For a while I’d been “experimenting” with AI-assisted development, which meant opening a chat window, describing what I wanted, and hoping for the best. This is about what I built to make it more reliable than that — and how every fix turned out to be something good engineering teams already do.
The project
I work in a field where software reliability genuinely matters — we have contracts with local government, SLAs, and the kind of accountability where “the AI did it” is not a defensible answer. So before bringing AI-first development into that context, I wanted to stress-test it somewhere lower-stakes but structurally similar.
In my spare time I build and maintain Muddle — a meal planner for my own household. It’s a real Rails app, in daily use, and my household depends on it to actually plan meals. I can’t vibe-code changes into production — there are hungry people who would notice if the weekly plan disappeared. The constraints that mattered at work showed up here too: existing project, real users, code quality that another human could read and maintain. Just at a friendlier scale.
Stage one: just asking, and why that didn’t work
The obvious starting point is also the one everyone tries first: describe what you want, let the model run with it. And it’s fine for throwaway scripts or isolated functions. But for a living codebase with existing conventions and real users, it breaks down quickly.
The problems I kept running into were:
Plain old misunderstood requirements. I’d ask for a change and get something that technically works but not what I actually wanted. It was in the wrong place, or it worked in a way that didn’t make sense to me or the users.
Drift from conventions. The app has a particular way of doing things — Rails idioms, ViewComponent patterns, specific test structures. A generic prompt produced generic code that didn’t fit.
No checkpoint. The model would make three decisions I’d have made differently, and I’d only find out when I looked at the diff. By then I was either accepting the whole thing or untangling it manually.
None of these are catastrophic on their own. But they compound. After a few sessions of cleanup work following “successful” prompts, I realised I was spending more time correcting Claude than I would have spent just writing the code.
There’s an old mantra in software engineering: the computer does exactly what you tell it, not what you want. That’s still true with AI — it’s just harder to see. The gap between “what you said” and “what you meant” is buried in the interpretation of natural language rather than a compiler error. The AI fills in blanks well enough that a misunderstanding feels like a reasonable attempt rather than a crash. At least a compiler tells you immediately that you got it wrong.
Stage two: telling Claude how we do things here
Even when Claude understood what to do, the code it produced didn’t look like the rest of the codebase. Not broken — just wrong for this project. Instance variables where we use locals. RuboCop defaults instead of Standard Ruby. Minor things, but the kind that get flagged in a code review. Someone has to clean it up, and that someone is me.
The fix was CLAUDE.md — a file that Claude reads automatically as context for every session. Coding conventions, style rules, migration handling, test structure. It’s the same thing a good CONTRIBUTING.md does in an open source repo: here’s how we work, here’s what we care about, here’s what will get your PR rejected. The agent is just a new contributor who needs the same onboarding.
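To make that concrete, here’s roughly the shape of mine. The rules below are condensed and illustrative rather than a copy of the real file:

```markdown
# CLAUDE.md

## Code style
- Standard Ruby, not a hand-tuned RuboCop config. Run the formatter before committing.
- ViewComponents take explicit locals; don't reach for instance variables in templates.

## Tests
- Every behaviour change gets a test, matching the structure of the existing suite.

## Migrations
- Never edit a migration that has already run; add a new one.
```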
This worked. The code started looking like it belonged in the codebase.
Stage three: only load what’s relevant
The CLAUDE.md approach introduced a different problem almost immediately. When I was having a conversation about the logo, or thinking through a product decision, the coding rules were sitting in context the whole time. Suggestions had a slight pull toward code. Discussions about what something should look like would drift toward how it might be implemented.
The fix was treating CLAUDE.md as a map rather than a dump. The top-level file became a short index pointing to linked documents, and each conversation only loads what’s relevant to it. None of them load all of it.
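Roughly this shape — the file names below are illustrative:

```markdown
# CLAUDE.md

Muddle is a Rails meal planner for one household. Keep this file short;
the detail lives in the documents below, loaded only when a task needs them.

- Coding conventions and test structure: docs/conventions.md
- Product positioning and the user archetype: docs/product-positioning.md
- Visual identity and tone of voice: docs/visual-identity.md
- What the app already does: docs/features.md
- Roadmap and parked ideas: docs/roadmap.md
```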
The reason for the structure wasn’t agents — it was regular prompts drifting in the wrong direction because of rules that had nothing to do with the conversation at hand. Agents benefited from the same approach, but that came later. It’s the same instinct as not handing a developer every company policy document before asking them to fix a bug — give people what’s relevant to the task, not everything that exists.
Stage four: better briefs, and the proto-epic
Instead of describing what I wanted, I started asking Claude to reflect the requirement back at me — what it understood, what it would change, what it wouldn’t. This caught misunderstandings before they became diffs. It also forced me to think more clearly about what I actually wanted.
This evolved into something more structured: drafting the requirement in a file, iterating until we both agreed on what success looked like. That’s just acceptance criteria — getting shared agreement on what “done” means before any work starts. I wasn’t inventing anything; I was rediscovering why that practice exists. Those files became the basis for the epic documents in the project. It just turned out that “a well-reasoned, scoped description of what to build and why” is exactly what an epic is.
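A condensed sketch of the shape those files settled into (the feature here is made up; the headings are the point):

```markdown
# Epic: share the weekly plan

## Problem
Only the person who made the plan can see it; everyone else has to ask.

## What we're building
A read-only link to the current week's plan. Nothing more.

## Out of scope
Editing through the link, past weeks, notifications.

## Done means
- The plan can be shared from the week view.
- The link works without signing in.
- Existing planning flows are untouched.
```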
Stage five: a Head of Product agent
The next gap was that I was still the one deciding what to build. I’m the founder (that’s what Claude keeps calling me — I never asked it to), which means I have the usual blind spots — attached to ideas, optimistic about scope, bad at killing the darlings.
I created a Head of Product agent with a specific brief: consistent product thinking, honest challenge, anchored to the user archetype. Not “build what I say” but “is this actually the right thing to build?” The HoP would push back on briefs that were solving the wrong problem, suggest smaller versions of ideas, or flat out tell me something wasn’t worth building yet.
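In Claude Code an agent is just a markdown file with a short system prompt. A condensed, paraphrased version of the brief looks something like this:

```markdown
---
name: head-of-product
description: Product thinking for Muddle. Use for briefs, scoping, and "is this worth building?" questions.
---

You are the Head of Product for Muddle, a meal planner for one household.

- Anchor every decision to the user archetype in docs/product-positioning.md.
- Challenge briefs that solve the wrong problem; propose the smallest version that tests the idea.
- "Not worth building yet" is a valid answer. Say it, and say why.
- You don't write code and you don't make implementation decisions.
```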
A side effect of those conversations was that decisions needed somewhere to live. Which ideas were worth building, which were parked and why, what order things should happen in — that’s a roadmap, and it emerged from the conversations rather than being written upfront. Now it lives in the project as a document agents read when making prioritisation decisions, rather than reconstructing the logic from scratch every time.
Stage six: visual identity and brand documentation
Every time I worked on a brief that touched the UI or copy, the same questions came up. What’s the tone? What’s the font? Is this on brand? And I was answering them from scratch each time — or worse, answering them slightly differently each time.
The decisions had been made. I knew the typeface, the colour direction, the brand personality. None of it was written down, which meant every brief that touched design or copy was relitigating it.
So I wrote docs/visual-identity.md and docs/product-positioning.md — not to prevent drift in output, but to stop having the same conversations. Written down once, referenced forever. Briefs got shorter and sharper, and the decisions became easier to challenge and update rather than existing as a fuzzy shared understanding that no one could quite articulate.
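Neither document is long. A flavour of the visual identity one, with placeholder specifics rather than Muddle’s actual choices:

```markdown
# Visual identity

- Typeface: one display face for headings, system fonts for body text. Don't substitute in mockups.
- Colour: a small, warm palette; the accent colour is reserved for primary actions.
- Tone: practical and friendly; no exclamation marks.
- If a brief conflicts with this document, say so rather than silently picking a side.
```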
Stage seven: feature documentation
Once I started being deliberate about what each prompt loaded, it became obvious that the HoP agent had a different problem: not enough context about the app to do its job without going looking for it.
Left to its own devices, it would explore the entire codebase on every task — burning tokens re-discovering things it had found last session. Or it would ask me questions I’d answered before. Or — most frustrating — write briefs full of conditionals: “if the app already has email infrastructure, do X, otherwise do Y.” That’s not a brief, that’s a brief with a question buried in it.
The fix was feature documentation — a clear enough picture of what the app does and how it’s put together that the HoP could answer those questions itself. Does the app send emails? Yes, here’s how. Is there already a notifications system? No. Written down once, the agent stops asking and the briefs become definitive rather than conditional.
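The entries read like answers to the questions the agent kept asking. These are illustrative, not lifted from the real document:

```markdown
# What the app already does

## Email
Yes: transactional email for sign-in links, sent through ActionMailer.
No marketing or digest email infrastructure exists.

## Notifications
None. No in-app notifications, no push.

## Planning
One plan per household per week; recipes are attached to days.
```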
This is the same problem that exists on real teams — a product owner who has to ask a developer how something works before writing a sensible brief, or doesn’t ask and hedges with conditionals instead. The agent just makes the cost of not having documentation more visible, because it’ll ask you the same question every single session.
Stage eight: a Head of Engineering agent
By this point the product thinking was solid and the documentation was good, but I still had the checkpoint problem. Agents would get a brief, understand it correctly, then immediately start making changes — including changes I’d have stopped if I’d been asked.
The Head of Engineering agent was the answer. Its job is to turn a brief into an implementation plan before any code is written. The plan gets reviewed. I can push back on the approach, flag something dangerous, or redirect before anything gets changed. Good teams do this as a matter of course: a tech spec or design review gets signed off before anyone writes a line. The HoE is just that process applied to an agent. It’s much easier to say “don’t touch the authentication layer” at the plan stage than to untangle a diff where it got touched anyway.
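The plan itself is just another document to mark up before anything runs. A condensed sketch, reusing the made-up sharing feature from earlier:

```markdown
# Implementation plan: share the weekly plan

## Approach
A signed, read-only token on the plan, plus a controller that renders the week without a session.

## Files touched
- app/models/plan_share.rb (new)
- app/controllers/shared_plans_controller.rb (new)
- config/routes.rb

## Explicitly not touching
Authentication, existing plan editing flows.

## Open questions
Should links expire when the week rolls over?
```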
What this whole thing actually is
What I ended up building is a lightweight version of the organisational infrastructure that already exists in good engineering teams. The documentation is the same documentation you’d write for a new human engineer. The agents are the roles you’d see in a small product team. The review checkpoints are the checkpoints you’d expect in any professional codebase.
None of this was the plan. I started by trying to get Claude to make a small change to a meal planner. The setup emerged from noticing where things went wrong and adding structure to prevent it — a series of informal retrospectives. Something doesn’t work, you figure out why, you add the structure that would have prevented it. Same thing good teams do after every sprint, just without the sticky notes.