How Reasoning Models Work: Chain of Thought at Scale

THE CORE IDEA: THINKING BEFORE ANSWERING

Standard language models answer directly: the prompt goes in and the response is generated token by token, with no separate deliberation step before the model commits to an answer. This works well for tasks where the answer is surface-accessible, such as summarisation, translation, and simple Q&A. It fails on tasks that require working through a problem step by step, where getting the intermediate steps right determines whether the final answer is correct.

Reasoning models address this by generating a chain of thought before producing the final answer. The model works through the problem explicitly — exploring approaches, catching contradictions, revising intermediate conclusions — before committing to a response. This scratchpad thinking is what makes them dramatically better at mathematics, logic, and complex code generation.

The thinking happens in a separate sequence of tokens, often called a reasoning trace; the cap on how long that trace may run is usually called the thinking budget. The trace is usually hidden from the user: you see the polished final answer, not the messy deliberation. But the quality of the final answer depends heavily on how much the model was allowed to think, which is why reasoning models are slower and more expensive than their standard counterparts.
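As a rough illustration of how a hidden trace is separated from the visible answer, here is a minimal Python sketch. It assumes the raw model output wraps its reasoning in <think>...</think> tags, a convention some open-weight reasoning models use; hosted APIs often return the trace in a separate field or not at all. The function name split_reasoning and the sample string are hypothetical.

```python
def split_reasoning(raw, open_tag="<think>", close_tag="</think>"):
    """Split raw model output into (reasoning_trace, visible_answer).

    Assumes at most one tagged reasoning block; the tag names are an
    assumption, not a universal standard.
    """
    start = raw.find(open_tag)
    end = raw.find(close_tag)
    if start == -1 or end == -1 or end < start:
        # No well-formed trace found: treat everything as the answer.
        return "", raw.strip()
    trace = raw[start + len(open_tag):end].strip()
    answer = (raw[:start] + raw[end + len(close_tag):]).strip()
    return trace, answer


raw_output = "<think>2 + 2 is 4. Check: 4 - 2 = 2.</think>The answer is 4."
trace, answer = split_reasoning(raw_output)
print(answer)  # only this part would be shown to the user
```

An application would typically log or discard the trace and display only the answer, which matches the behaviour described above.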

HOW THEY'RE TRAINED

Training a reasoning model requires teaching it which thinking strategies lead to correct answers. The approach used by most 2025 reasoning models is reinforcement learning from outcomes: the model generates many candidate reasoning traces for a problem, and the traces that lead to correct final answers are reinforced. Over many iterations, the model learns reasoning patterns that generalise to new problems.
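The outcome-based signal described above can be sketched in a few lines of Python. This is a toy illustration, not any lab's training code: it scores each sampled trace only by whether its final answer matches a reference, then centres the rewards within the group so that above-average traces get a positive weight and below-average ones a negative weight. The group-normalised baseline is one common choice and is an assumption here; the function names and the exact-match check are hypothetical simplifications.

```python
import statistics


def outcome_rewards(final_answers, reference):
    """Reward 1.0 for a trace whose final answer matches the reference, else 0.0."""
    return [1.0 if a.strip() == reference.strip() else 0.0 for a in final_answers]


def group_relative_advantages(rewards):
    """Normalise rewards within one group of samples for the same problem.

    Returns zeros when every sample got the same reward, because the group
    then carries no signal about which trace was better.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Four sampled traces for one problem, reduced to their final answers.
sampled_answers = ["42", "41", "42 ", "7"]
rewards = outcome_rewards(sampled_answers, reference="42")
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In a real training loop, each advantage would scale the gradient on the log-probability of the corresponding trace's tokens. Note that nothing here inspects the reasoning itself: only the final answer is checked.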

DeepSeek's R1 paper, released in January 2025, was particularly influential because it described this process in detail. The paper showed that hand-curated reasoning traces are not strictly required as training data: its R1-Zero variant discovered good reasoning strategies through RL alone, starting from a base language model, while the released R1 model added a small amount of curated "cold-start" data, largely to make the traces more readable. This significantly lowered the barrier to training reasoning models and led to a wave of replication attempts across the open-source community.

THE BENCHMARK PICTURE

The improvements on hard benchmarks were substantial. On AIME 2024 (American Invitational Mathematics Examination), reported scores were around 9% for standard GPT-4o, 74% for o1, and about 96% for o3. On SWE-Bench Verified, a benchmark of real GitHub issues requiring code changes, reasoning models consistently outperformed standard models by 15–25 percentage points. Gemini 2.5 Pro reached a reported 63.8% on SWE-Bench Verified using a custom agent setup, placing it among the strongest coding models of 2025.

The caveat is that benchmarks measure specific, structured tasks. The gap between reasoning and standard models is largest on tasks that closely resemble the benchmark format — competition math, formal logic, algorithmic coding. On open-ended, ambiguous tasks, the advantage is smaller, and the higher cost and latency of reasoning models may not be justified.

THE COST TRADEOFF

Reasoning models generate significantly more tokens than standard models — the thinking trace can be 5–10x longer than the visible output. This means they're slower (latency of 30–90 seconds for complex tasks is common) and more expensive (often 3–5x the cost of equivalent standard models per task). For interactive applications where users expect sub-second responses, this is a hard constraint.
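To make the arithmetic concrete, the following Python sketch estimates cost and latency for one request with and without a thinking trace. Every number in it is a hypothetical placeholder: the prices are not any provider's real rates, the generation speed of 100 tokens per second is assumed, and the 8x multiplier is simply picked from the 5–10x range above. It also assumes thinking tokens are billed and generated like ordinary output tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per million tokens."""
    input_per_million: float
    output_per_million: float


def estimate(input_tokens, visible_tokens, thinking_multiplier, pricing, tokens_per_second):
    """Return (cost_usd, latency_seconds) for a single request.

    thinking_multiplier is the trace length relative to the visible
    output; 0 models a standard model with no trace.
    """
    thinking_tokens = int(visible_tokens * thinking_multiplier)
    generated = visible_tokens + thinking_tokens
    cost = (input_tokens * pricing.input_per_million
            + generated * pricing.output_per_million) / 1_000_000
    latency = generated / tokens_per_second
    return cost, latency


pricing = Pricing(input_per_million=2.50, output_per_million=10.00)  # assumed rates
standard_cost, standard_latency = estimate(2000, 500, 0, pricing, 100)
reasoning_cost, reasoning_latency = estimate(2000, 500, 8, pricing, 100)

print(f"standard:  ${standard_cost:.3f}, {standard_latency:.0f} s")
print(f"reasoning: ${reasoning_cost:.3f}, {reasoning_latency:.0f} s")
print(f"cost ratio: {reasoning_cost / standard_cost:.1f}x")
```

Under these assumptions the reasoning request costs about five times as much and takes 45 seconds instead of 5. The cost ratio is lower than the 9x growth in generated tokens because the input tokens cost the same either way.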

The practical rule we've landed on: use reasoning models for tasks where correctness has high value and latency tolerance is high — batch code review, mathematical verification, complex data analysis. Use standard models for tasks where throughput and responsiveness matter more — real-time chat, simple extraction, classification at scale. The gap is real, but so is the cost.
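That rule can be written down as a small routing function. The sketch below is one possible encoding, not a recommendation of specific thresholds: the Task fields, the 30-second cutoff, and the tier names are all hypothetical and would need tuning against real workloads.

```python
from dataclasses import dataclass

# Hypothetical cutoff: below this latency budget a reasoning model is
# assumed to be too slow, whatever the value of correctness.
REASONING_MIN_LATENCY_BUDGET_S = 30.0


@dataclass(frozen=True)
class Task:
    name: str
    correctness_critical: bool
    latency_budget_s: float


def choose_model_tier(task):
    """Return "reasoning" only when correctness matters and latency allows it."""
    if task.correctness_critical and task.latency_budget_s >= REASONING_MIN_LATENCY_BUDGET_S:
        return "reasoning"
    return "standard"


tasks = [
    Task("batch code review", correctness_critical=True, latency_budget_s=600.0),
    Task("real-time chat", correctness_critical=False, latency_budget_s=1.0),
    Task("inline autocomplete", correctness_critical=True, latency_budget_s=2.0),
]
for task in tasks:
    print(f"{task.name}: {choose_model_tier(task)}")
```

The third task shows the constraint that tends to bite in practice: correctness matters, but the latency budget rules out a long thinking trace, so the standard tier is chosen anyway.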

THE 2025 LANDSCAPE

By the end of 2025, every major AI provider had at least one reasoning model in production. OpenAI's o3 and o3-mini sat at different cost-performance points in the o-series. Anthropic's Claude 3.7 Sonnet introduced extended thinking that could be toggled on or off per request. Gemini 2.5 Pro offered Deep Think mode for tasks requiring the highest accuracy. DeepSeek R1 provided an open-source alternative at a fraction of the API cost.

The convergence is significant. When every provider has reasoning capability, the question shifts from "which provider has reasoning?" to "which reasoning model is best for my specific task?" That's a healthier question — it focuses evaluation on actual performance rather than feature availability.