Vibe engineering · 7 min read
Spec-driven development with AI: spec first, code second
Write the spec, let the agent implement. Why a written spec beats prompting from memory.
Most teams work with AI by prompting from memory. You open an agent, describe the task in three sentences, get code, fix it, get more code. It works until the task no longer fits in three sentences. Then the agent guesses, and you find yourself fixing the consequences of your own underspecified intent. Spec-driven development flips the order: first you write the specification, then the agent implements it. Code is the output, not the starting point.
Why a written spec beats prompting from memory
A prompt from memory has three flaws that scale with the agent's speed. First, it is ephemeral — it falls out of context after a few turns, so the agent starts working from its own interpretation from ten messages ago. Second, it is incomplete — in your head you hold constraints (time budget, data format, rollback path) you never say out loud because “they're obvious”. To an agent, nothing is obvious. Third, it is unverifiable — since you never wrote down what was supposed to be built, you have no way to check whether it was.
A written spec removes all three. It is durable, so the agent returns to it every turn instead of guessing. It is complete, because the act of writing forces you to name the constraints that would otherwise stay in your head. It is verifiable, because it contains acceptance criteria you can hold the finished code against. This is not bureaucracy: it moves thinking out of the fixing phase and into the design phase, where mistakes are far cheaper to correct.
Anatomy of a good SPEC.md
A good spec has a fixed structure, so both the human and the agent know where to find what. Five sections cover the large majority of cases.
Goal. One sentence, two at most: what and for whom. Not “how”, but “why”. If you cannot summarize the goal in two sentences, you probably have two tasks pretending to be one.
Functional requirements. A list of observable behaviors. Each item describes something the system does, ideally as “when X, then Y”. This is the meat of the spec and where you spend most of your time.
Non-functional requirements. The constraints that are easy to miss: latency, LLM call cost, memory limits, backward compatibility, rollback path, data safety. These are usually what decides whether an implementation is production-grade.
Out of scope. An explicit list of what we are not doing this iteration. Without this section the agent — helpful to a fault — adds features nobody asked for, and you review code that should never have existed.
Acceptance criteria. A checkable list of “done” conditions, ideally in a form you can turn into tests. If a criterion cannot be verified, it is not a criterion — it is a wish.
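Because the structure is fixed, a missing section can be caught mechanically before the spec reaches the agent. The following is a minimal sketch, not a prescribed tool: it assumes the spec is a Markdown file whose sections are second-level `## ` headings named exactly as above. The function name `missing_sections` and the sample text are hypothetical.

```python
import re

# Section names come from the article; the "## Heading" layout is an assumption.
REQUIRED_SECTIONS = [
    "Goal",
    "Functional requirements",
    "Non-functional requirements",
    "Out of scope",
    "Acceptance criteria",
]


def missing_sections(spec_text: str) -> list[str]:
    """Return the required sections that have no matching '## ' heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##\s+(.+?)\s*$", spec_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


# Hypothetical, deliberately incomplete spec used only to exercise the check.
EXAMPLE_SPEC = """\
# SPEC: invoice export
## Goal
Export a client's invoices as PDF for the billing panel.
## Functional requirements
- When a client requests an export, return a PDF.
## Acceptance criteria
- A no-permission test returns 403.
"""

if __name__ == "__main__":
    print(missing_sections(EXAMPLE_SPEC))
    # prints ['Non-functional requirements', 'Out of scope']
```

A check like this only proves the sections exist, not that they say anything useful; that part still needs a human reader.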
How the agent uses the spec as a contract
Once the spec is ready, it stops being a note and becomes a contract. The agent receives it as stable context and works inside its boundaries: functional requirements tell it what to build, non-functional ones the limits to stay within, out of scope where to stop, acceptance criteria when to consider the task done.
This shift has practical consequences. First, you can ask the agent to check its own work against the acceptance criteria before declaring it done, and to fix whatever fails without your involvement. Second, when the agent wants to do something outside the spec, it is immediately visible: either it is breaking the contract, or the spec was incomplete. In both cases you get the signal early, not at review after two hours of generation.
The contract also changes the nature of your review. Without a spec you read the code and ask “does this do what I wanted?” — leaning on your memory of what you wanted. With a spec you ask “does this meet the acceptance criteria?”, and the answer is binary and independent of your mood. Review stops being a negotiation and becomes a check against a standard agreed in advance. The same holds for an agent-reviewer: it has an explicit reference point, so its comments are concrete instead of vague.
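The self-check described above can be pictured as a small loop: generate, run every acceptance criterion, feed the failures back, and stop only when none remain. This is a rough sketch under stated assumptions. `Criterion`, `implement_until_done`, and `fake_agent` are hypothetical names, the agent is a plain Python callable standing in for a real agent call, and the checks are toy string matches rather than real tests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # receives the produced artifact, returns pass/fail


def failing(criteria: list[Criterion], artifact: str) -> list[str]:
    """Names of the acceptance criteria the artifact does not meet."""
    return [c.name for c in criteria if not c.check(artifact)]


def implement_until_done(generate, criteria, max_rounds=3):
    """Call generate(feedback) until every criterion passes or rounds run out."""
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        artifact = generate(feedback)
        feedback = failing(criteria, artifact)
        if not feedback:
            return artifact, round_no  # binary outcome: all criteria met
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")


# Toy criteria and a toy agent, for illustration only.
CRITERIA = [
    Criterion("export returns a PDF", lambda a: "return pdf" in a),
    Criterion("no-permission returns 403", lambda a: "return 403" in a),
]


def fake_agent(feedback: list[str]) -> str:
    # Handles the 403 case only after being told that criterion failed.
    base = "return pdf"
    return base + "; return 403" if "no-permission returns 403" in feedback else base


if __name__ == "__main__":
    print(implement_until_done(fake_agent, CRITERIA))
    # prints ('return pdf; return 403', 2)
```

The round limit matters: if the criteria still fail after a few attempts, that is usually a sign the spec is ambiguous, and the loop should hand control back to a human instead of retrying.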
Iterating spec-first instead of code-first
The key mental shift: when the implementation goes wrong, you do not fix the code — you fix the spec. Code is cheap and reproducible; the agent will regenerate it in a minute. The spec is the source of truth, so that is where you fix the cause, not the symptom.
In practice it looks like this: the agent delivers an implementation, you read it through the lens of the acceptance criteria, you find a mismatch. Instead of typing “change this and that” in chat, you go back to the spec, add the missing requirement or clarify the ambiguous one, and ask for the affected part to be re-implemented. After a few iterations the spec is precise and the code is its faithful reflection. Chat fixes do the opposite: the code lines up, but the spec falls behind and nobody knows what is true anymore.
This approach also protects you from the most common AI trap: quickly shipping something that looks like it works but solves half the problem. Acceptance criteria are merciless — either all of them are met or the task is not done, no matter how nice the code looks.
A short spec is a good spec
A spec should fit in 400–800 words. This is not an arbitrary limit: within that range the spec stays readable at a glance for a human and takes up little of the agent's context window. A longer spec is usually a symptom of one of two problems: either the task is too big and should be split into several smaller specs, or you are describing the implementation instead of the intent.
The second trap is sneaky. The more “how” you put into the spec, the less room you leave the agent for a solution that might beat your first idea, and the faster the spec goes stale with every code change. The spec describes what and why; leave the “how” to the agent, unless the “how” is a hard non-functional requirement.
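The word budget is easy to enforce automatically, for example as a pre-commit or CI step. The sketch below assumes that only the upper end of the 400–800 range is a hard limit and that words are whitespace-separated tokens; both are interpretations, and the names `word_count` and `over_budget` are hypothetical.

```python
MAX_WORDS = 800  # upper end of the article's 400-800 word range


def word_count(spec_text: str) -> int:
    """Count whitespace-separated tokens; Markdown markers count as words too."""
    return len(spec_text.split())


def over_budget(spec_text: str, max_words: int = MAX_WORDS) -> bool:
    """True when the spec is longer than the budget and should be split or trimmed."""
    return word_count(spec_text) > max_words


if __name__ == "__main__":
    draft = "word " * 950  # stand-in for a spec that has grown too long
    print(word_count(draft), over_budget(draft))
    # prints 950 True
```

A failing check does not say which of the two problems you have. Deciding whether to split the task or strip out implementation detail is still a judgment call.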
An example spec skeleton that fits the limit:
- Goal: an endpoint to export a client's invoices as PDF, used by the billing panel.
- Functional: when a client requests an export, the system returns a PDF with all invoices from the chosen period; when the period is empty, it returns a clear message; when the client lacks permission, it returns 403.
- Non-functional: response under 2 s for 100 invoices; no PDF stored on disk; logging without personal data.
- Out of scope: email delivery, export to other formats, recurring billing.
- Acceptance: a test for a period with invoices returns a valid PDF; an empty-period test returns a message, not an error; a no-permission test returns 403; p95 response time under 2 s.
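The acceptance line of this skeleton maps almost one-to-one onto tests. The sketch below shows that mapping with a stand-in endpoint. Everything in it is an assumption made for illustration: `export_invoices` is a hypothetical stub, not a real implementation; the empty period is assumed to answer with status 200 and a plain-text message; the latency samples are made-up numbers; and the nearest-rank method is just one common way to compute p95.

```python
PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


def export_invoices(invoices: list[str], has_permission: bool) -> tuple[int, str, bytes]:
    """Hypothetical stand-in for the endpoint: (status, content_type, body)."""
    if not has_permission:
        return 403, "text/plain", b"Forbidden"
    if not invoices:
        return 200, "text/plain", b"No invoices in the selected period."
    # A real implementation would render a PDF; the stub only fakes the header.
    fake_pdf = PDF_MAGIC + f"1.7 stub with {len(invoices)} invoices".encode()
    return 200, "application/pdf", fake_pdf


def p95_nearest_rank(samples_ms: list[int]) -> int:
    """95th percentile by the nearest-rank method."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def test_period_with_invoices_returns_pdf():
    status, content_type, body = export_invoices(["inv-1", "inv-2"], True)
    assert status == 200 and content_type == "application/pdf"
    assert body.startswith(PDF_MAGIC)


def test_empty_period_returns_message_not_error():
    status, content_type, body = export_invoices([], True)
    assert status == 200 and content_type == "text/plain" and body


def test_no_permission_returns_403():
    status, _, _ = export_invoices(["inv-1"], False)
    assert status == 403


def test_p95_under_two_seconds():
    made_up_samples_ms = [100 * i for i in range(1, 21)]  # 100 ms .. 2000 ms
    assert p95_nearest_rank(made_up_samples_ms) < 2000


if __name__ == "__main__":
    for test in (
        test_period_with_invoices_returns_pdf,
        test_empty_period_returns_message_not_error,
        test_no_permission_returns_403,
        test_p95_under_two_seconds,
    ):
        test()
    print("all acceptance checks passed")
```

The test names mirror the criteria, so a failing test points straight back at the line of the spec that is not yet met.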
How this scales across a team
Spec-first stops being a personal discipline and becomes a team protocol. The spec is a shared artifact you can put in a PR next to the code, link in an issue, and review just like a diff. Reviewing a spec is cheaper than reviewing an implementation, because you read it in a minute and it catches wrong assumptions before anyone writes a line of code.
For a team this gives you three things. Onboarding: a new person reads the spec and understands the intent without code archaeology. Consistency: two people with the same spec get comparable solutions from the agent, because the contract is the same. Auditability: after the fact you know what was supposed to be built and why, so decisions are reproducible rather than locked in someone's memory.
In practice, adopt a simple rule: every change touching more than a few files gets a spec that a human approves before implementation. The agent can write the first draft — it is good at that — but the sign-off is human. Without that decision the spec stops being a contract and becomes just another generated text nobody trusts.
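A rule like this can be encoded as a small gate in whatever tooling starts the agent. The sketch below is one possible shape, not an established workflow: the threshold of three files is an assumption (the rule above only says "more than a few"), and `Change`, `may_start_implementation`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

FILE_THRESHOLD = 3  # assumption: stands in for "more than a few files"


@dataclass
class Change:
    files: list[str]
    spec_path: Optional[str] = None         # e.g. "SPEC.md"; None if no spec exists
    spec_approved_by: Optional[str] = None  # a human reviewer; None if unapproved


def may_start_implementation(change: Change) -> tuple[bool, str]:
    """Decide whether the agent may start, and explain why or why not."""
    if len(change.files) <= FILE_THRESHOLD:
        return True, "small change, no spec required"
    if change.spec_path is None:
        return False, "spec required before implementation"
    if not change.spec_approved_by:
        return False, "spec is waiting for human sign-off"
    return True, "spec approved"


if __name__ == "__main__":
    big_change = Change(files=["a.py", "b.py", "c.py", "d.py"], spec_path="SPEC.md")
    print(may_start_implementation(big_change))
    # prints (False, 'spec is waiting for human sign-off')
```

The gate deliberately records who approved the spec and never sets that field itself, which keeps the sign-off a human decision even when the agent wrote the first draft.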
TL;DR
Spec-driven development with AI flips the order: first you write a short (400–800 word) SPEC.md with a goal, functional and non-functional requirements, an out-of-scope section, and acceptance criteria, and only then does the agent implement it. The spec is a contract — the agent works inside its boundaries, and when something goes wrong you fix the spec, not the code. A short spec stays readable and current; as a team artifact it scales onboarding, consistency, and auditability. A written spec beats prompting from memory because it is durable, complete, and verifiable — and the sign-off on it always belongs to a human.