Prompt engineering · 7 min read

Chain-of-thought vs structured outputs: when to use which

When you need reasoning and when you need rigid JSON. The tension, the costs and a decision guide.

Chain-of-thought (CoT) and structured outputs are two tools that pull a model in opposite directions. CoT wants the model to think out loud — verbose, free-form, step by step. Structured output wants the model to lock its answer into a rigid shape — JSON that passes schema validation. That tension is real, and it costs you money, latency and reliability if you choose wrong. This piece breaks down both approaches and gives a concrete decision guide for people building on the API.

What chain-of-thought is and when it actually helps

CoT is a prompting technique that asks the model to generate reasoning steps before the final answer. The classic form is “think step by step”, but today it goes much further: reasoning models produce a separate thinking stream before emitting a user-facing answer.

CoT helps when the task needs multi-step reasoning: maths and arithmetic, debugging logic (“why does this code return undefined”), planning a sequence of agent actions, comparing many conditions, legal inference or diagnostics. Where the answer is a function of a chain of intermediate conclusions, forcing that chain genuinely raises accuracy. The model does not “guess” the ending — it derives it.

CoT does not help on single-step tasks: classifying a short text, extracting a field from an invoice, a yes/no decision from one rule. There, reasoning adds tokens, latency and sometimes noise that lowers quality. Generating 400 tokens of “thoughts” to return one label is waste.

What structured outputs are

Structured outputs are mechanisms that force the model to return data in a predictable shape. Three main forms:

  • JSON mode: the output is guaranteed to be syntactically valid JSON, but you have no control over which fields appear.
  • response_format with a schema (e.g. JSON Schema / constrained decoding): the output is guaranteed to conform to a schema such as {"type":"object","required":["score"]}. This is grammar-constrained decoding: the sampler rejects tokens that would break the schema.
  • Function calling / tool schemas: the model picks a tool and fills its arguments per the declaration, e.g. get_weather(city: string).
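
The gap between the first two forms can be sketched in a few lines. The two output strings below are written by hand, and hasRequiredScore is a hypothetical helper, not part of any provider SDK; it only mimics what a schema with a required score field would enforce.

```typescript
// Hypothetical model outputs, written by hand for illustration.
const jsonModeOutput = '{"rating": "good"}'; // valid JSON, but no "score" field
const schemaOutput = '{"score": 7}'; // conforms to the schema

// Minimal check for the schema {"type":"object","required":["score"]}.
function hasRequiredScore(raw: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // not even syntactically valid JSON
  }
  return (
    typeof parsed === "object" &&
    parsed !== null &&
    !Array.isArray(parsed) &&
    "score" in parsed
  );
}

console.log(hasRequiredScore(jsonModeOutput)); // false
console.log(hasRequiredScore(schemaOutput)); // true
```

JSON mode alone would accept both strings, because both parse. Only the schema check rejects the first one.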

The benefit is obvious: a production pipeline gets data that drops straight into JSON.parse and Zod validation without regex parsing. No “the model added a sentence before the brace”. Reliability at the level of a contract, not a hope.

Where the tension lies

Here it gets interesting. Constrained decoding limits the token space the model may generate — at each step only tokens that lead to valid JSON are allowed. But reasoning is by nature free-form. If you force the model to emit {"answer": from the very first token, you take away its room to think. The effect: on hard maths or logic tasks quality can drop, because the model “jumps to the answer” instead of deriving it.
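
A toy version of the mechanism makes the effect visible. This is a sketch under heavy assumptions: the six-token vocabulary, the list of valid outputs and the greedy "model" are all invented for illustration, and real constrained decoders work on a grammar and token logits, not on a list of strings.

```typescript
// Toy "grammar": the complete outputs we are willing to accept.
const validOutputs = ['{"score":1}', '{"score":2}', '{"score":3}'];

// Toy vocabulary, listed in the order a hypothetical model prefers the tokens.
const vocabulary = ["Let me think. ", '{"score":', "1", "2", "3", "}"];

// A token is allowed if prefix + token can still grow into a valid output.
function allowedTokens(prefix: string): string[] {
  return vocabulary.filter((token) =>
    validOutputs.some((output) => output.startsWith(prefix + token))
  );
}

// Greedy decoding: always take the most preferred token that is still allowed.
function decode(): string {
  let text = "";
  while (!validOutputs.includes(text)) {
    const candidates = allowedTokens(text);
    if (candidates.length === 0) throw new Error("no valid continuation");
    text += candidates[0];
  }
  return text;
}

console.log(allowedTokens("")); // only the opening of the JSON object is allowed
console.log(decode()); // {"score":1}
```

The model's preferred first token, "Let me think. ", is masked out at step one. The output is well-formed, but the reasoning never had a chance to happen.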

This is not theory. Anyone who has wired a rigid response_format schema into a task requiring analysis has watched answers turn shallow. The model gets a shape but loses the process. And on reasoning tasks the process is the quality.

How modern models let you combine them

Good news: you do not have to choose. There are three proven patterns for combining one with the other.

Pattern 1: reasoning, then a structured field

A schema with two fields: {"reasoning": string, "answer": ...}, where reasoning comes first. The model fills it with free text (it has room for CoT) and then, already “having thought”, produces the structured answer. Field order in the schema matters: reasoning must come before the answer, because the model generates sequentially. A reasoning field after answer is useless: that is rationalization, not reasoning.
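
A minimal sketch of such a schema, plus a small lint that guards the field order. It assumes the provider emits fields in the order the schema declares them, which is worth checking in your provider's documentation; reasoningComesFirst is a hypothetical helper.

```typescript
// JSON Schema-style object. Property order is the order we want fields generated in.
// Assumption: the provider emits fields in schema order.
const reasoningFirstSchema = {
  type: "object",
  properties: {
    reasoning: { type: "string" }, // free-text room for CoT
    answer: { type: "number" }, // structured result, generated after the reasoning
  },
  required: ["reasoning", "answer"],
};

// Hypothetical lint: true only if "reasoning" is declared before "answer".
function reasoningComesFirst(schema: { properties: Record<string, unknown> }): boolean {
  const keys = Object.keys(schema.properties); // insertion order for string keys
  const reasoningIndex = keys.indexOf("reasoning");
  const answerIndex = keys.indexOf("answer");
  return reasoningIndex !== -1 && answerIndex !== -1 && reasoningIndex < answerIndex;
}

console.log(reasoningComesFirst(reasoningFirstSchema)); // true
console.log(reasoningComesFirst({ properties: { answer: {}, reasoning: {} } })); // false
```

A check like this can run in CI, so a later refactor of the schema does not silently move the answer in front of the reasoning.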

Pattern 2: reasoning tokens + a structured answer

Reasoning models separate “thinking” from “answer” at the API level. The model thinks freely in a separate stream, and the final user-facing answer can be constrained by a schema. You get full CoT and clean JSON on output, without trading one for the other. Where the model supports it, this is currently the strongest option for hard tasks that must land in a pipeline.

Pattern 3: two calls

Call A: free-form CoT, returns text. Call B: takes that text and “translates” it into a strict schema (a cheap, single-step extraction). More expensive and slower (two round-trips), but it gives full control and works on any model, including ones with no native reasoning mode. A good fallback for older models and full error isolation: if parsing fails, you know call B is to blame, not the reasoning.
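
The two-call flow, including the error isolation, might look roughly like this. ModelCall is a hypothetical synchronous stand-in so the sketch runs without a network; a real client would be asynchronous, and the prompts are placeholders.

```typescript
// Hypothetical stand-in for a model call; a real client would be asynchronous.
type ModelCall = (prompt: string) => string;

type Result =
  | { ok: true; answer: number }
  | { ok: false; stage: "reasoning" | "extraction"; error: string };

function solveInTwoCalls(question: string, reason: ModelCall, extract: ModelCall): Result {
  // Call A: free-form chain-of-thought, plain text out.
  const reasoning = reason(`Think step by step.\n\n${question}`);
  if (reasoning.trim() === "") {
    return { ok: false, stage: "reasoning", error: "empty reasoning" };
  }

  // Call B: cheap single-step extraction into a strict shape.
  const raw = extract(`Return JSON {"answer": number} for this reasoning:\n${reasoning}`);
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null && "answer" in parsed) {
      const answer = (parsed as { answer: unknown }).answer;
      if (typeof answer === "number") return { ok: true, answer };
    }
    return { ok: false, stage: "extraction", error: "missing numeric answer" };
  } catch {
    return { ok: false, stage: "extraction", error: "invalid JSON" };
  }
}

// Stubbed calls so the sketch runs offline.
const fakeReason: ModelCall = () => "17 + 25 = 42, so the answer is 42.";
const fakeExtract: ModelCall = () => '{"answer": 42}';
console.log(solveInTwoCalls("What is 17 + 25?", fakeReason, fakeExtract));
```

The stage field is what makes failures easy to attribute: a parse error is always tagged as an extraction problem, never mixed up with the reasoning step.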

Latency and cost

Every reasoning token costs time and money. CoT can multiply output tokens 5–20× compared with a bare answer. With reasoning models, reasoning tokens are usually billed as output tokens and can be expensive. Three consequences:

  • Latency: longer output means longer time to last token. For synchronous UI (chat, autocomplete) that can be a killer.
  • Cost: a task with CoT can cost several times more for the same input. At a million requests a month that is a real budget line.
  • The two-call pattern doubles network round-trips — you add connection latency overhead, not just generation.

Rule: do not pay for reasoning where you do not need it. Classification on a small model with no CoT is often 10× cheaper and faster, with identical quality.
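
A back-of-the-envelope calculation shows how quickly this adds up. Every number below (prices, token counts, request volume) is a made-up assumption for illustration, not a quote from any provider's price list.

```typescript
// All prices and token counts below are made-up assumptions for illustration.
interface Workload {
  requestsPerMonth: number;
  inputTokens: number; // per request
  outputTokens: number; // per request, including any reasoning tokens
  inputPricePerMTok: number; // USD per million input tokens
  outputPricePerMTok: number; // USD per million output tokens
}

function monthlyCostUsd(w: Workload): number {
  // tokens * (USD per million tokens) gives millionths of a dollar per request
  const microUsdPerRequest =
    w.inputTokens * w.inputPricePerMTok + w.outputTokens * w.outputPricePerMTok;
  return (w.requestsPerMonth * microUsdPerRequest) / 1_000_000;
}

const bareAnswer: Workload = {
  requestsPerMonth: 1_000_000,
  inputTokens: 500,
  outputTokens: 20,
  inputPricePerMTok: 1,
  outputPricePerMTok: 4,
};
// Same task with CoT: 20 output tokens become 400.
const withCot: Workload = { ...bareAnswer, outputTokens: 400 };

console.log(monthlyCostUsd(bareAnswer)); // 580
console.log(monthlyCostUsd(withCot)); // 2100
```

Under these assumptions the input is identical, yet the monthly bill grows by a factor of about 3.6 purely from the extra output tokens.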

Reliability in production pipelines

Production has a different bar than a prototype. Here what counts is how many requests per 10,000 need manual intervention. Pure CoT returning text is brittle: you must regex-parse free form, and every so often the model changes formatting and the parser falls over. Structured output with constrained decoding eliminates that bug class: the JSON is valid by definition.

So for any output that feeds another system (a database, a queue, the next agent step), default to structured output. Leave free text for output a human reads. And when you need both reasoning quality and shape reliability, combine them with pattern 1 or 2. Always validate the result with Zod at the system boundary, even with constrained decoding: the provider's schema and your domain contract are not the same thing.
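
Here is what boundary validation can catch that the provider's schema cannot. To stay dependency-free, the sketch uses a hand-rolled parseReview function as a stand-in for a Zod schema; the Review shape and its 1 to 5 range are invented for illustration.

```typescript
// Hand-rolled stand-in for a Zod schema, to keep the sketch dependency-free.
interface Review {
  score: number; // domain rule: integer from 1 to 5
  sentiment: "positive" | "negative";
}

const isSentiment = (v: unknown): v is Review["sentiment"] =>
  v === "positive" || v === "negative";

function parseReview(raw: string): Review {
  const data: unknown = JSON.parse(raw); // throws on invalid JSON
  if (typeof data !== "object" || data === null) throw new Error("expected an object");
  const { score, sentiment } = data as Record<string, unknown>;
  if (typeof score !== "number" || !Number.isInteger(score) || score < 1 || score > 5) {
    throw new Error("score must be an integer from 1 to 5");
  }
  if (!isSentiment(sentiment)) {
    throw new Error("sentiment must be positive or negative");
  }
  return { score, sentiment };
}

console.log(parseReview('{"score": 4, "sentiment": "positive"}'));

// Passes a {"score": number} schema, but violates the domain contract:
try {
  parseReview('{"score": 250, "sentiment": "positive"}');
} catch (e) {
  console.log((e as Error).message); // score must be an integer from 1 to 5
}
```

The second payload is exactly the kind of output constrained decoding will happily produce: correct shape, correct types, wrong value.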

Decision guide

Use CoT when: the task is multi-step (maths, logic, debugging, agent planning), accuracy rises with explicit reasoning, and either a human reads the output or you have the token and latency budget. Warning sign: if you cannot name the intermediate steps, CoT probably buys you nothing.

Use structured output when: the result feeds a pipeline, the task is single-step (extraction, classification, routing, filling tool arguments), and parsing reliability and low latency are what matter. A small model plus a schema is often the cheapest and most dependable solution.

Combine when: you need reasoning and a machine-readable output. Pick reasoning tokens + a structured answer (pattern 2) if the model supports it. Otherwise, use a {reasoning, answer} schema with reasoning first (pattern 1). Keep the two-call pattern as a fallback for older models or when you need full error isolation.
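
The guide above can be condensed into one function. The Task fields and strategy labels are names chosen for this sketch, and the last branch (single-step output read by a human) is an assumption, since the guide does not cover that case explicitly.

```typescript
type Strategy =
  | "structured output"
  | "CoT"
  | "reasoning tokens + structured answer"
  | "reasoning-first schema"
  | "two calls"
  | "plain text, no CoT";

interface Task {
  multiStep: boolean; // does the answer depend on intermediate conclusions?
  machineReadable: boolean; // is the output parsed by another system?
  hasReasoningMode: boolean; // model exposes a separate thinking stream
  supportsSchema: boolean; // model supports schema-constrained output
}

function chooseStrategy(t: Task): Strategy {
  if (t.multiStep && t.machineReadable) {
    if (t.hasReasoningMode && t.supportsSchema) return "reasoning tokens + structured answer"; // pattern 2
    if (t.supportsSchema) return "reasoning-first schema"; // pattern 1
    return "two calls"; // pattern 3, works on any model
  }
  if (t.multiStep) return "CoT"; // a human reads the output
  if (t.machineReadable) return "structured output"; // single-step pipeline task
  return "plain text, no CoT"; // assumption: not covered by the guide
}

console.log(
  chooseStrategy({ multiStep: true, machineReadable: true, hasReasoningMode: false, supportsSchema: true })
); // reasoning-first schema
```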

Common mistakes

  • A reasoning field after answer in the schema. The model already answered — the rest is theatre. Reasoning always first.
  • A rigid schema on a hard task with no room to think. You get a shape, you lose quality. Add a CoT field or use reasoning tokens.
  • CoT on a classification at 400 tokens where one label would do. You pay for noise.
  • No Zod validation “because the provider schema guarantees JSON”. It guarantees syntax and shape, not your domain contract or value ranges.

TL;DR

CoT raises quality on multi-step tasks but costs tokens and latency. Structured output gives parsing reliability but a rigid schema chokes reasoning. For single-step pipeline tasks: structured. For reasoning a human reads: CoT. When you need both, combine them: reasoning tokens + a structured answer, or a schema with a reasoning field before answer. And always validate the result at the system boundary.

Chain-of-thought vs structured outputs: when to use which | vibecoding.pl