Vibe engineering · 6 min read
How to keep code quality in vibe engineering: 5 practices
Five practices that hold quality when an agent writes most of the code.
Quality in vibe engineering does not come from “good discipline”. It comes from five practices that together form a shield against regressions. A single practice is not enough. Two are enough to start. Three give comfort. Four and five — that is where most “why is it broken again, tests passed” problems disappear.
Practice 1: spec as a gate
Every change touching more than three files has a SPEC.md or issue with formal non-functional requirements (latency, LLM cost, rollback path) and functional ones. The spec is short (400-800 words) and ends with an “Out of scope” section. Without that section, every implementation drifts by 30%.
You (human) write the spec. The agent can help with the first draft, but you sign off. No compromises: without human acceptance the spec becomes a generative placeholder.
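A spec gate like this can be checked mechanically before a human signs off. The following is a minimal sketch, not a real tool: checkSpec, SpecCheckInput and the keyword list are hypothetical names, and the limits simply mirror the numbers above. It only tests that the phrase “Out of scope” appears somewhere, not that it is the final section.

```typescript
// Hypothetical spec gate. Names and limits mirror the article, not an existing tool.
interface SpecCheckInput {
  changedFiles: string[];   // paths touched by the change
  specText: string | null;  // contents of SPEC.md or the issue body, null if absent
}

interface SpecCheckResult {
  ok: boolean;
  problems: string[];
}

const FILE_THRESHOLD = 3;   // "more than three files" requires a spec
const MIN_WORDS = 400;
const MAX_WORDS = 800;
// Assumption: a crude keyword check stands in for "has non-functional requirements".
const REQUIRED_TOPICS = ["latency", "cost", "rollback"];

function countWords(text: string): number {
  return text.split(/\s+/).filter((w) => w.length > 0).length;
}

function checkSpec(input: SpecCheckInput): SpecCheckResult {
  const problems: string[] = [];
  if (input.changedFiles.length <= FILE_THRESHOLD) {
    return { ok: true, problems }; // small change: no spec required
  }
  if (input.specText === null) {
    return { ok: false, problems: ["missing SPEC.md"] };
  }
  const words = countWords(input.specText);
  if (words < MIN_WORDS || words > MAX_WORDS) {
    problems.push(`spec has ${words} words, expected ${MIN_WORDS}-${MAX_WORDS}`);
  }
  const lower = input.specText.toLowerCase();
  for (const topic of REQUIRED_TOPICS) {
    if (lower.indexOf(topic) === -1) {
      problems.push(`no mention of ${topic}`);
    }
  }
  // Presence check only: it does not verify that this is the last section.
  if (lower.indexOf("out of scope") === -1) {
    problems.push('missing "Out of scope" section');
  }
  return { ok: problems.length === 0, problems };
}
```

A check like this catches a missing or hollow spec. It cannot replace the human sign-off, which is the actual gate.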
Practice 2: two reviews
First an agent reviews the PR (e.g. a dedicated reviewer subagent). Then a human does. The two reviews catch different bug classes. The agent catches style inconsistencies, missing types and naive code paths. The human catches misread domain logic and half-solved problems.
Crucially: the agent-reviewer cannot approve alone. It can only comment. That is the difference between a helper and a decision-maker.
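One way to enforce the comment-only rule is to downgrade whatever verdict the agent submits before it reaches the code host. This is a sketch under assumptions: the three event names match the values GitHub's pull request review endpoint accepts, while ReviewRequest, the isAgent flag and both functions are hypothetical.

```typescript
// Review verdicts, named after the event values GitHub's review endpoint accepts.
type ReviewEvent = "APPROVE" | "REQUEST_CHANGES" | "COMMENT";

// Hypothetical request shape. How a reviewer is flagged as an agent is an assumption.
interface ReviewRequest {
  author: { login: string; isAgent: boolean };
  event: ReviewEvent;
  body: string;
}

// Anything an agent submits is downgraded to a plain comment.
function enforceCommentOnly(req: ReviewRequest): ReviewRequest {
  if (!req.author.isAgent) {
    return req;
  }
  return { ...req, event: "COMMENT" };
}

// A PR counts as approved only when at least one human approved it.
function hasHumanApproval(reviews: ReviewRequest[]): boolean {
  return reviews.some((r) => !r.author.isAgent && r.event === "APPROVE");
}
```

The second function is the merge-side half of the same rule: even if an agent approval slips through, it never counts toward the decision.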
Practice 3: tests with mutation or property
80% line coverage tells you only that those lines executed, not that any assertion checked the result. Mutation testing (Stryker, PIT, Cosmic Ray) changes one instruction at a time and checks whether a test fails. If none does, the tests covering that line were a crutch.
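The gap between coverage and mutation score is easy to see by hand. In this illustrative sketch the mutant is written manually, where a mutation tool would generate it. shippingCost and both tests are made-up examples.

```typescript
// Code under test: free shipping for orders of 100 or more.
function shippingCost(orderTotal: number): number {
  return orderTotal >= 100 ? 0 : 10;
}

// Hand-written mutant: ">=" became ">", the kind of one-token change a mutation tool makes.
function shippingCostMutant(orderTotal: number): number {
  return orderTotal > 100 ? 0 : 10;
}

type Impl = (orderTotal: number) => number;

// Weak test: executes both branches (full line coverage) but never probes the boundary.
function weakTest(impl: Impl): boolean {
  return impl(500) === 0 && impl(5) === 10;
}

// Stronger test: adds the boundary value, where original and mutant disagree.
function strongTest(impl: Impl): boolean {
  return weakTest(impl) && impl(100) === 0;
}

// A test kills the mutant when it passes on the original and fails on the mutant.
function kills(test: (impl: Impl) => boolean): boolean {
  return test(shippingCost) && !test(shippingCostMutant);
}
```

Here kills(weakTest) is false and kills(strongTest) is true: both tests give the same coverage number, but only one of them would notice the change.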
Property-based testing (fast-check, Hypothesis) generates hundreds of random inputs and checks that a stated property holds for each of them. One of them catching a regression before prod can save you many hours of postmortem work.
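To show the idea without depending on a library, here is a hand-rolled sketch of a property check. Libraries such as fast-check add richer generators and shrinking of counterexamples, so treat this as an illustration only. dedupe, the generator and the seed are all assumptions made for the example.

```typescript
// Small seeded linear congruential generator, so a failing run can be reproduced.
function lcg(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Random array of up to maxLen integers in [-10, 10].
function randomInts(rand: () => number, maxLen: number = 20): number[] {
  const len = Math.floor(rand() * (maxLen + 1));
  const xs: number[] = [];
  for (let i = 0; i < len; i++) {
    xs.push(Math.floor(rand() * 21) - 10);
  }
  return xs;
}

// Code under test: remove duplicates, keep first occurrences.
function dedupe(xs: number[]): number[] {
  return xs.filter((x, i) => xs.indexOf(x) === i);
}

// Buggy variant: only removes duplicates that sit next to each other.
function adjacentOnlyDedupe(xs: number[]): number[] {
  return xs.filter((x, i) => i === 0 || xs[i - 1] !== x);
}

// The property: output has no duplicates, loses no input value and invents none.
function dedupeProperty(impl: (xs: number[]) => number[]): (xs: number[]) => boolean {
  return (xs) => {
    const out = impl(xs);
    const noDuplicates = out.every((x, i) => out.indexOf(x) === i);
    const keepsAll = xs.every((x) => out.indexOf(x) !== -1);
    const addsNothing = out.every((x) => xs.indexOf(x) !== -1);
    return noDuplicates && keepsAll && addsNothing;
  };
}

interface PropertyResult {
  ok: boolean;
  counterexample?: number[];
}

// Run the property against many generated inputs and stop at the first failure.
function checkProperty(
  property: (xs: number[]) => boolean,
  runs: number = 200,
  seed: number = 42,
): PropertyResult {
  const rand = lcg(seed);
  for (let i = 0; i < runs; i++) {
    const xs = randomInts(rand);
    if (!property(xs)) {
      return { ok: false, counterexample: xs };
    }
  }
  return { ok: true };
}
```

A hand-picked example such as [1, 1, 2] passes for both implementations. The generated inputs are what expose adjacentOnlyDedupe, because they include duplicates that are not adjacent.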
In vibe engineering we recommend at least one of the two. Ideally both on the key modules.
Practice 4: CI gates
Gates you cannot bypass:
- TypeScript strict mode with no errors;
- lint with no warnings (or explicit allow list);
- unit + e2e tests: full run (no .skip without a linked issue);
- mutation testing: 70%+ killed mutants;
- bundle-size delta < 5% (alert on anything above that);
- security scan: zero high+critical, zero secrets in diff;
- LLM evals on prompts: 90%+ pass rate on regression set.
Each gate catches a different bug class. Skipping any “just this once” starts a quality debt you pay back at night.
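The gate list above can be expressed as data, so that thresholds live in one place and a failed run names the gate that blocked it. This is a sketch under assumptions: CiReport and its field names are hypothetical, and the thresholds are the ones from the list.

```typescript
// Hypothetical summary of one CI run. Field names are illustrative.
interface CiReport {
  typeErrors: number;
  lintWarnings: number;             // warnings not covered by the allow list
  failedTests: number;
  skippedTestsWithoutIssue: number;
  mutationScore: number;            // percent of mutants killed, 0-100
  bundleSizeDeltaPercent: number;   // positive means the bundle grew
  highOrCriticalVulns: number;
  secretsInDiff: number;
  evalPassRate: number;             // percent passed on the regression set, 0-100
}

interface Gate {
  name: string;
  passes: (r: CiReport) => boolean;
}

const gates: Gate[] = [
  { name: "typescript strict", passes: (r) => r.typeErrors === 0 },
  { name: "lint", passes: (r) => r.lintWarnings === 0 },
  { name: "tests", passes: (r) => r.failedTests === 0 && r.skippedTestsWithoutIssue === 0 },
  { name: "mutation testing", passes: (r) => r.mutationScore >= 70 },
  { name: "bundle size", passes: (r) => r.bundleSizeDeltaPercent < 5 },
  { name: "security scan", passes: (r) => r.highOrCriticalVulns === 0 && r.secretsInDiff === 0 },
  { name: "LLM evals", passes: (r) => r.evalPassRate >= 90 },
];

// Names of every gate that blocks the merge. An empty list means the run is green.
function failedGates(report: CiReport): string[] {
  return gates.filter((g) => !g.passes(report)).map((g) => g.name);
}
```

Keeping the gates in one list also makes a “just this once” bypass visible: it shows up as a diff to this file rather than as a quiet flag on a single run.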
Practice 5: production telemetry + auto-rollback
After deploy: alerts on p99 latency, 5xx errors and LLM cost per request. A threshold breach triggers an automatic rollback (for example through the Vercel deployment rollback API or a Kubernetes rollback).
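The decision part of an auto-rollback can stay small. The sketch below is an assumption-heavy illustration: the metric names, the limit values used in any example and the “N consecutive windows” rule are not from a specific platform. The rollback itself, for instance kubectl rollout undo on Kubernetes, sits outside the sketch.

```typescript
// One aggregation window of post-deploy metrics (hypothetical shape).
interface WindowMetrics {
  p99LatencyMs: number;
  errorRate5xx: number;          // fraction of requests, 0-1
  llmCostPerRequestUsd: number;
}

// Limits use the same shape. Concrete values are deployment-specific.
type Limits = WindowMetrics;

// Which metrics in this window exceed their limit.
function breaches(m: WindowMetrics, limits: Limits): string[] {
  const out: string[] = [];
  if (m.p99LatencyMs > limits.p99LatencyMs) out.push("p99 latency");
  if (m.errorRate5xx > limits.errorRate5xx) out.push("5xx rate");
  if (m.llmCostPerRequestUsd > limits.llmCostPerRequestUsd) out.push("LLM cost per request");
  return out;
}

// Assumption: roll back only when the most recent `consecutive` windows all breach,
// so that a single noisy window does not revert a healthy deploy.
function shouldRollback(
  windows: WindowMetrics[],
  limits: Limits,
  consecutive: number = 2,
): boolean {
  if (windows.length < consecutive) {
    return false;
  }
  return windows.slice(-consecutive).every((w) => breaches(w, limits).length > 0);
}
```

Requiring consecutive breaches trades a slightly slower reaction for fewer false rollbacks. Whether that trade is right depends on how costly a bad minute in production is for you.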
Second layer: incident response, with a runbook for “what to do when the alert fires”. Third layer: a blameless postmortem after every incident, using the five whys.
Myth: AI will provide quality on its own
It will not. AI is very good at generating code that “looks like it works” but has subtle problems. Without the five practices above, an LLM-generated PR ships faster but lands in production with more risk than a junior PR under mentor supervision. Not because AI is worse. Because AI writes faster, so every gap in your process scales with that speed.
Team objections
“Five practices is too much for our small team.” Each practice costs a few hours a week. Having no practices costs a few days a month rescuing prod. The maths holds even for a three-person team.
“We are a startup, we do not have time.” Spec, two reviews and one type of test are the minimum for a startup after the first paying customer. The rest comes with time.
Phased rollout
- Week 1-2: spec for every change > 3 files.
- Week 3-4: agent-reviewer in CI (comments, does not block).
- Week 5-8: mutation testing on one (most critical) module.
- Week 9-12: CI gates with full enforcement.
- Week 13+: production telemetry, auto-rollback, runbooks.
TL;DR
Vibe-engineering quality comes from five practices: spec, two reviews, mutation/property tests, CI gates, production telemetry with auto-rollback. No single practice is enough on its own. Together they turn vibe engineering into a way of working that scales to real production without nightly drama.