Vibe engineering · 9 min read
Code review in the age of agents: who reviews AI code
Two reviews: the agent catches style and types, the human catches the misread domain.
When an agent writes the code, the first question is not “does it work” but “who reviews it”. The old intuition — the author writes, the reviewer reads — stops fitting, because the “author” is a model that bears no consequences and does not remember yesterday's decision. Code review does not disappear. Its structure changes. The best teams I watch set up two reviews: an agent reviewer first, then a human. Each catches a different class of bugs, and that is the whole point.
The two-review model
The first layer is the agent reviewer: a separate model pass that reads the diff before a human does. The second layer is the human, who reads what is left after the agent. The order is not accidental. The agent is cheap and never tired, so it goes first — it sifts out the noise before it starts costing senior time. The human comes last, because human attention is the most expensive resource in the whole process and must not be wasted on things a machine catches by itself.
This is not “the same thing twice”. If the agent and the human caught the same bugs, the second layer would be waste. The value appears precisely because their fields of vision do not overlap. The agent sees patterns in the code. The human sees meaning in the problem.
Why each review catches different bugs
The agent reviewer is excellent at the mechanical and repetitive: inconsistent style, missing types, an unhandled error branch, a naive code path nobody will test, duplication the author missed because they only saw their own file. The model reads a thousand lines without boredom and flags twenty places where null handling is missing. It does this consistently at 9am and at 11pm, on the first PR of the day and on the thirtieth.
The human catches what the model structurally cannot see: a misread of the domain. The agent was told “compute the discount” and it computed the discount — except that in this company the discount is calculated on the net price, not the gross, and that fact is nowhere in the code because it lives in the heads of three people. The model does not know it is writing a correct answer to the wrong question. That is caught by a human who knows the context.
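A minimal sketch of what such a bug looks like in code. The 23% VAT rate, the 10% discount, and both function names are hypothetical, chosen only for illustration. Both functions are typed, tidy, and would pass a mechanical review. Only someone who knows the company rule can say which one answers the right question.

```python
from decimal import Decimal

# Hypothetical figures, chosen only for illustration.
VAT_RATE = Decimal("0.23")
DISCOUNT = Decimal("0.10")


def discount_on_gross(net_price: Decimal) -> Decimal:
    """What an agent might plausibly write: discount amount taken from the gross price."""
    gross = net_price * (1 + VAT_RATE)
    return (gross * DISCOUNT).quantize(Decimal("0.01"))


def discount_on_net(net_price: Decimal) -> Decimal:
    """The assumed company rule: discount amount taken from the net price."""
    return (net_price * DISCOUNT).quantize(Decimal("0.01"))


price = Decimal("100.00")
print(discount_on_gross(price))  # 12.30
print(discount_on_net(price))    # 10.00
```

Nothing in the diff distinguishes the two. The difference lives entirely in the domain knowledge the reviewer brings.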
The second thing reserved for the human: the “half-solved problem”. An agent can close a ticket with code that handles the happy path and looks complete, while staying silent about the skipped data migration, the empty-list edge case, and API backward compatibility. The code compiles, tests pass, the PR looks done. Except the problem is 50% solved, and the remaining 50% is exactly what blows up in production. The model has no sense of “does this actually close the topic”. The human does.
The agent reviewer comments, never approves alone
This is the rule you do not break: the agent reviewer may comment, but it may not approve a PR on its own. The difference between a helper and a decision-maker is absolute here. The moment the agent starts approving or merging on the strength of its own remarks, without a human decision, you close a loop in which the machine judges the machine, and both have the same blind spots.
The reason is practical, not philosophical. The agent reviewer typically comes from the same model family as the agent author and was trained on largely the same data. The same assumptions the author made silently, the reviewer treats as obvious and waves through. A domain bug invisible to the author is also invisible to a reviewer from the same model family. That is why human sign-off is not a formality: it is the only point in the loop where context from outside the training data enters.
In practice: the agent reviewer leaves comments on the PR, flags risky spots, suggests fixes. The “approved” status is given by a human, deliberately, taking responsibility for what lands on main. The agent is never the last signature.
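The rule can be expressed as a small merge gate. This is a sketch under assumptions, not the API of any particular code host: the `Review` record, its `state` values, and `can_merge` are hypothetical names. The point is only that an agent approval never counts towards the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    reviewer: str
    is_agent: bool  # True for the agent reviewer, False for a human
    state: str      # "commented", "approved" or "changes_requested"


def can_merge(reviews: list[Review]) -> bool:
    """Allow a merge only with a human approval and no human change request."""
    human_reviews = [r for r in reviews if not r.is_agent]
    approved = any(r.state == "approved" for r in human_reviews)
    blocked = any(r.state == "changes_requested" for r in human_reviews)
    return approved and not blocked


agent_only = [Review("review-bot", True, "approved")]
with_human = agent_only + [Review("dana", False, "approved")]
print(can_merge(agent_only))  # False
print(can_merge(with_human))  # True
```

Agent reviews are filtered out before the decision is made, so an agent “approval” is treated as just another comment.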
What happens when the agent ships 5x more PRs
This is where the real scaling problem begins. The agent does not write five times better — it writes five times more. One engineer with a well-tuned agent opens as many PRs as a small team used to. The bottleneck stops being writing code. It becomes review.
Teams that did not anticipate this fall into one of two modes. Either review becomes a rubber stamp — people click “approve” because the queue has forty items and nobody reads them honestly. Or review becomes the bottleneck that erases the entire speed-up from the agents — you write 5x faster but ship at the old pace anyway, because PRs sit in the queue for three days.
The way out is not to read faster. It is to read less, but smarter. The first mechanism is filtering: the agent reviewer takes over the whole bottom of the pyramid, meaning style, types, and obvious gaps. The human receives a pre-filtered PR with risk spots already marked, and reads only what needs human judgement. The second mechanism is differentiation by risk: a PR changing footer text does not need the same review as a change in the billing code. The third is small PRs. An agent that ships 5x more should ship 5x smaller chunks, because reviewing ten small diffs is more honest than reviewing one enormous one.
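Risk differentiation can start as a very plain routing rule. The sketch below is one possible shape, assuming hypothetical path prefixes, a hypothetical size limit, and invented tier names. A real team would derive these from its own repository and incident history.

```python
from dataclasses import dataclass

# Hypothetical path prefixes and size limit; tune them to the repository.
HIGH_RISK_PREFIXES = ("billing/", "migrations/", "auth/")
MAX_LINES_FOR_LIGHT_REVIEW = 50


@dataclass(frozen=True)
class PullRequest:
    changed_files: tuple[str, ...]
    lines_changed: int


def review_tier(pr: PullRequest) -> str:
    """Return "deep" for risky paths, "standard" for large diffs, "light" otherwise."""
    if any(path.startswith(HIGH_RISK_PREFIXES) for path in pr.changed_files):
        return "deep"
    if pr.lines_changed > MAX_LINES_FOR_LIGHT_REVIEW:
        return "standard"
    return "light"


print(review_tier(PullRequest(("web/footer.html",), 3)))     # light
print(review_tier(PullRequest(("billing/invoice.py",), 3)))  # deep
print(review_tier(PullRequest(("web/app.py",), 400)))        # standard
```

The risky-path check runs first on purpose: a three-line change in billing still gets the deep review.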
Review fatigue and how to fight it
Review fatigue is not laziness — it is the physiology of attention. After the twentieth PR of the same day a human no longer reads, they scan. The brain switches to pattern matching: “looks like the previous ones, probably fine”. That is exactly the moment a domain bug slips through, because by definition it looks correct.
With agents the problem intensifies, because agent code is uniformly neat. No typos, no odd indentation, nothing to stop the eye. Everything looks good, so the reviewer stops pausing. Paradoxically, the cosmetic perfection of model code dulls vigilance faster than a junior's rough code that forces you to think.
What works in practice. Cap reviews per person per day: past a certain threshold the net value of review turns negative, meaning you wave through more bugs than you catch. Rotate reviewers so nobody reads thirty PRs in a row. Require the author (the agent too) to describe “what and why” in the PR, because reading starts from intent, not from the diff. And separate “trivial” reviews from “risky” ones. Do the latter in the morning, with a fresh head, not at 6pm as the twentieth in a row.
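Caps and rotation are easy to automate. A minimal sketch, assuming a hypothetical cap of eight reviews per day and a simple “fewest reviews today goes next” rotation. The right number is something each team has to find for itself.

```python
from collections import Counter
from typing import Optional

DAILY_CAP = 8  # hypothetical cap; the right value is team-specific


def assign_reviewer(reviewers: list[str], done_today: Counter) -> Optional[str]:
    """Pick the reviewer with the fewest reviews today, or None if all hit the cap."""
    available = [r for r in reviewers if done_today[r] < DAILY_CAP]
    if not available:
        return None  # leave the PR queued rather than overload someone
    chosen = min(available, key=lambda r: done_today[r])
    done_today[chosen] += 1
    return chosen


counts = Counter({"ana": 8, "ben": 7})
print(assign_reviewer(["ana", "ben"], counts))  # ben
print(assign_reviewer(["ana", "ben"], counts))  # None
```

Returning `None` is the design choice that matters: when everyone is at the cap, the PR waits, and the growing queue becomes a visible signal that the process needs fixing.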
A practical PR checklist for the age of agents
This is the list a human goes through deliberately, after the agent reviewer has already left its comments. The agent handles the mechanical points; this list is about what the model will not catch:
- Does this code solve the right problem, not the one next to it?
- Is the problem closed, or is this only the happy path? Where are the edge cases?
- Are the domain assumptions correct for this company, not for the textbook?
- What happens to data migration and backward compatibility?
- Do the tests check behaviour, or only that a line executed?
- Did the agent reviewer flag risk spots, and was each one resolved?
- Is the PR small enough that it could be read honestly?
- Am I taking responsibility for this merge — or am I clicking “approve” out of fatigue?
The last point matters most. If the answer is “clicking out of fatigue”, the process has too many PRs for too little human attention — and you fix the process, you do not push the PR through.
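One checklist question deserves a concrete illustration: whether the tests check behaviour or only that a line executed. The sketch below uses a hypothetical `apply_discount` function. Both tests pass and both raise coverage, but only the second would fail if the calculation were wrong.

```python
def apply_discount(net_price: float, rate: float) -> float:
    """Return the net price after a percentage discount."""
    return net_price * (1 - rate)


def test_executes_only():
    # Runs the line and counts toward coverage, but asserts nothing about the result.
    apply_discount(100.0, 0.1)


def test_checks_behaviour():
    # Pins down the expected outcome, including the no-discount edge case.
    assert abs(apply_discount(100.0, 0.1) - 90.0) < 1e-9
    assert apply_discount(100.0, 0.0) == 100.0
```

A green test suite made mostly of the first kind is exactly the sort of “looks done” signal a tired reviewer accepts.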
Myth: since the agent writes it, let the agent review it
Tempting and false. Yes, a second agent will catch some things, and it should: that is the entire role of the first layer. But two models from the same family share the same blind spots. Domain interpretation, the “half-solved” problem, and responsibility for production stay human. The agent reviewer scales review at the bottom of the pyramid, where bugs are numerous and mechanical. The human guards the top, where bugs are rare, expensive, and invisible to the machine. Remove the human and you remove the only point in the loop that sees beyond the training data.
TL;DR
Code review in the age of agents is two reviews: agent reviewer first, human second. The agent catches style, types, and mechanical gaps; the human catches the misread domain and the half-solved problem. The agent comments, never approves alone — because it shares the author's blind spots. When the agent ships 5x more PRs, review becomes the bottleneck: save it with filtering, risk differentiation, and small PRs. Fight review fatigue with caps and rotation. Stick to the checklist — and the last question is always: am I taking responsibility for this merge.