Do you actually need a reasoning model?

When the extra cost and latency of a thinking model pays off, and when it's wasted.

Figures verified against official sources, 30 May 2026

You're about to send a request. There's a toggle next to it that says, more or less, "think harder." Flip it on and the model will reason longer before it answers. You'll pay more and you'll wait longer. The question in front of you isn't philosophical. It's whether this particular call is worth the upcharge.

That's the whole decision, and it's worth getting right, because you make it constantly. The mistake is treating it as a provider choice you set once. It's a per-call call. Some requests deserve the deep think. Most don't.

What you're actually buying

A reasoning model, or "thinking" model, runs a stretch of internal reasoning before it writes the answer you see. You don't get to read most of that work through the API, but you pay for all of it. OpenAI puts it flatly: reasoning tokens aren't visible, they still take up room in the context window, and they're billed as output tokens.
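You can usually see the split in the usage data that comes back with a response. What follows is a minimal sketch, assuming a usage payload shaped like the one OpenAI's Chat Completions API returns (`completion_tokens`, plus `completion_tokens_details.reasoning_tokens`). The sample numbers are made up for illustration, and field names differ by provider and endpoint, so check your own responses before relying on it.

```python
def split_output_tokens(usage: dict) -> tuple[int, int]:
    """Return (hidden_reasoning_tokens, visible_tokens) from a usage payload.

    Assumes the payload carries `completion_tokens` (the billed output count)
    and, optionally, `completion_tokens_details.reasoning_tokens`.
    Other providers and endpoints name these fields differently.
    """
    billed_output = usage.get("completion_tokens", 0)
    details = usage.get("completion_tokens_details") or {}
    hidden = details.get("reasoning_tokens", 0)
    return hidden, billed_output - hidden


# Illustrative payload, not a real API response.
sample_usage = {
    "prompt_tokens": 300,
    "completion_tokens": 2500,
    "completion_tokens_details": {"reasoning_tokens": 2000},
}

hidden, visible = split_output_tokens(sample_usage)
print(f"hidden reasoning: {hidden}, visible answer: {visible}")
```

Logging this pair per request is a cheap way to find out which calls are spending most of their output budget on work you never read.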

Here's the part that catches people. Output is the expensive token category, and thinking is made of output tokens. On Claude, output bills at five times the input rate, with Opus at $5 per million tokens in and $25 out, Sonnet 4.6 at $3 and $15, and Haiku 4.5 at $1 and $5. So an Opus request that emits, say, 2,000 hidden thinking tokens before a 500-token reply is billed for all 2,500 output tokens at the $25-per-million rate. You can't see four-fifths of the output you're paying for.
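The arithmetic is easy to check yourself. Here's a small sketch that uses the per-million prices quoted above; the 1,000-token prompt is an assumed figure for illustration, and the model keys are just labels, not API model IDs.

```python
# USD per million tokens, as quoted in this article.
PRICES = {
    "opus": {"input": 5.00, "output": 25.00},
    "sonnet-4.6": {"input": 3.00, "output": 15.00},
    "haiku-4.5": {"input": 1.00, "output": 5.00},
}


def request_cost(model: str, input_tokens: int,
                 thinking_tokens: int, visible_tokens: int) -> float:
    """Dollar cost of one request; thinking tokens bill at the output rate."""
    price = PRICES[model]
    billed_output = thinking_tokens + visible_tokens
    total = input_tokens * price["input"] + billed_output * price["output"]
    return total / 1_000_000


# Same prompt and same visible reply, with and without hidden thinking.
with_thinking = request_cost("opus", 1_000, 2_000, 500)
without_thinking = request_cost("opus", 1_000, 0, 500)

print(f"with thinking:    ${with_thinking:.4f}")
print(f"without thinking: ${without_thinking:.4f}")
```

Under those assumptions the thinking version comes to about 6.75 cents against 1.75 cents for the same visible reply. The whole difference is output tokens you never see.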

It's not a Claude quirk. Google's Gemini pricing page labels its output column "including thinking tokens." Anthropic charges for the full internal thinking, not the short summary it shows you. The billed output count won't match the visible response, and trimming the summary cuts your latency but not your cost.

So two separate bills come due when you flip the toggle: dollars and seconds. The dollar cost is obvious once you know thinking bills as output. The time cost is sneakier, because the model is busy before it shows you a single word. On a chat interface, a multi-minute pause isn't a slow answer. It's a broken one.

The buy-or-skip table

Here's the call laid out by task type. The pattern is simple: reasoning earns its tax when the problem has real steps and a wrong answer is costly. It's dead weight on anything you could check at a glance.

When a reasoning model is worth its cost and latency, by task type, May 2026
| Task | Reasoning worth it? | Why |
| --- | --- | --- |
| Competition-level math, proofs, hard logic | Yes | Anthropic reports its biggest extended-thinking gains here; the steps are real and a wrong step ruins the answer |
| Multi-file or agentic coding | Yes | Chained decisions across files; competition coding is on Anthropic's list of where thinking helps most |
| In-depth planning and analysis | Yes | The work is genuinely multi-step, and the upside of getting it right outweighs the token tax |
| Simple Q&A and chat | No | OptimalThinkingBench found models burn 700+ tokens overthinking easy questions with no accuracy gain |
| Data extraction and document parsing | No | LlamaIndex saw cost and latency rise 5–8x while quality stayed flat near 0.79; a non-reasoning parser scored higher |
| Classification and formatting | No | Single-step, easy to verify; thinking tokens add bill and delay without changing the output |

Buy the think when the task is hard and the steps are real. Anthropic's own list of where extended thinking pays is math, physics, competition coding, and in-depth analysis, and that's the list to trust. These are the problems where chained reasoning changes the answer, not just the token count. Pay for thinking exactly where it moves the needle.

Skip it for anything you can eyeball. The evidence here is blunt. Meta FAIR and Carnegie Mellon's OptimalThinkingBench found thinking models routinely burn more than 700 tokens chewing on simple questions, where the most efficient model in the test averaged around 135, with no accuracy payoff. The model isn't thinking. It's stalling, on your dime.
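To get a feel for what that gap costs at volume, here's a rough sketch. It treats 700 and 135 as typical per-request output counts, which is a simplification of the paper's "more than 700" and "around 135". The one million requests and the $5-per-million output price are assumptions picked for illustration.

```python
def overthinking_cost(requests: int, tokens_used: int, tokens_needed: int,
                      output_price_per_million: float) -> float:
    """Dollars spent on output tokens beyond what the task needed."""
    extra_tokens = max(tokens_used - tokens_needed, 0)
    return requests * extra_tokens * output_price_per_million / 1_000_000


# Assumed workload: one million simple questions at $5 per million output tokens.
wasted = overthinking_cost(
    requests=1_000_000,
    tokens_used=700,
    tokens_needed=135,
    output_price_per_million=5.00,
)
print(f"spent on overthinking: ${wasted:,.2f}")
```

With those inputs the extra 565 tokens per request add up to $2,825, and none of it buys accuracy.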

Document parsing makes the trap concrete. In a controlled LlamaIndex test, dialing reasoning up moved cost from about $0.029 to $0.246 and time from 47.89 seconds to 241.70 seconds per task, while quality sat flat near 0.79 the whole way. A non-reasoning agentic parser actually scored higher, at 0.821. You paid about eight times as much, waited five times longer, and got a slightly worse result than the non-reasoning parser. That's the worst trade in the building.
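The multiples fall straight out of the figures quoted above. This short sketch recomputes them; the variable names are illustrative, and the numbers are the article's rounded values, not LlamaIndex's raw data.

```python
# Per-task figures quoted above for the lowest and highest reasoning settings.
low_reasoning = {"cost_usd": 0.029, "seconds": 47.89}
high_reasoning = {"cost_usd": 0.246, "seconds": 241.70}

cost_multiple = high_reasoning["cost_usd"] / low_reasoning["cost_usd"]
time_multiple = high_reasoning["seconds"] / low_reasoning["seconds"]

# Quality scores quoted above: reasoning runs vs. the non-reasoning parser.
quality_gap = 0.821 - 0.79

print(f"cost multiple: {cost_multiple:.1f}x")
print(f"time multiple: {time_multiple:.1f}x")
print(f"non-reasoning parser lead: {quality_gap:.3f}")
```

Cost lands near the top of the 5–8x range and latency near the bottom, while the quality gap runs in the non-reasoning parser's favor.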

On easy work, reasoning doesn't buy you a better answer. It buys you a bigger bill and a longer wait.

If you're chasing this from the cost side, benchr's guide to cutting your token bill treats effort level as one of the four big leaks, and the price-per-use-case breakdown puts dollar figures on what each workload pattern actually runs. Reasoning is the same lever seen from the other end: it's the most expensive token you can buy, so only buy it when it changes the answer.

A default that won't burn you

If you're not sure, don't reach for the most powerful model. Match it to the task. Anthropic's own cost guidance is the cleanest version of this: Haiku for simple tasks, Sonnet for most production workloads, and Opus only for the most complex reasoning. The default tier should be fast and cheap, not the heavy one with thinking maxed out.

The setup that scales is a router. Send most traffic to a quick, non-reasoning tier and escalate only the genuinely hard slice to a reasoning model. Most real workloads are a fat majority of easy requests and a thin minority of hard ones, and a router lets you pay the thinking tax on just the minority that earns it.
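A router can start out as a few lines of code. The sketch below assumes each request already carries a task label and a flag for whether a wrong answer is costly. The labels, tier names, and sample traffic are all hypothetical; a production router would need its own way to classify requests.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    tier: str       # hypothetical tier name, not a real model ID
    thinking: bool  # whether to turn on the reasoning toggle


FAST = Route(tier="fast-standard", thinking=False)
DEEP = Route(tier="reasoning", thinking=True)

# Task labels drawn from the "Yes" rows of the table above (names are made up).
HARD_TASKS = {"math_proof", "multi_file_coding", "agentic_coding", "planning_analysis"}


def route(task_type: str, costly_if_wrong: bool) -> Route:
    """Escalate only when the task is hard AND a wrong answer is expensive."""
    if task_type in HARD_TASKS and costly_if_wrong:
        return DEEP
    return FAST


# Illustrative traffic: mostly easy requests, a thin slice of hard ones.
traffic = [
    ("simple_qa", False),
    ("classification", True),
    ("data_extraction", True),
    ("formatting", False),
    ("math_proof", True),
]
tiers = Counter(route(task, costly).tier for task, costly in traffic)
print(dict(tiers))
```

Defaulting to the fast tier and requiring both conditions to escalate keeps the expensive path opt-in, which is the whole point of the router.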

One more caution on the accuracy side: a reasoning model isn't a truth machine. Thinking can reduce slips on hard logic, but it doesn't make a model stop making things up, and on easy questions the extra tokens often just produce a more confident wrong answer. benchr's look at where hallucinations stand in 2026 covers what thinking does and doesn't fix. And if you've been picking models off leaderboard scores, why benchmarks stopped telling you much is worth a read before you let a reasoning headline number set your default.

The verdict is short. Reach for reasoning when the problem is hard, multi-step, and costly to get wrong. Stick with a fast standard tier for everything else, which is most of what you send. The toggle is yours to flip per request, so flip it like it costs you, because it does.

Frequently asked

What is a reasoning model?

A reasoning model, also called a thinking model, generates hidden internal reasoning before it writes the answer you see. Those reasoning tokens aren't shown through the API, but they fill the context window and they're billed. OpenAI says it plainly: reasoning tokens aren't visible, they occupy the context window, and they're billed as output tokens. The promise is more reliable answers on hard, multi-step problems in exchange for that extra work.

Do reasoning tokens cost more than a standard answer?

Usually yes, often a lot more. Thinking tokens bill as output at all three major providers: OpenAI says so verbatim, Google labels its output price as including thinking tokens, and Anthropic charges for the full internal thinking, not just the summary. On Claude, output runs at five times the input rate, with Opus at $5 per million in and $25 out, so a few thousand hidden thinking tokens are charged at that higher rate. OpenAI even recommends reserving at least 25,000 tokens for reasoning and outputs when you start experimenting with its reasoning models.

When should I turn on thinking mode?

When the task is genuinely multi-step and a wrong answer is expensive: competition-level math, proofs and hard logic, multi-file or agentic coding, and in-depth planning or analysis. Anthropic reports the biggest gains from extended thinking in math, physics, competition coding, and detailed analysis. If the work needs real chained reasoning, paying for thinking tokens buys accuracy a fast standard tier is unlikely to match.

Is a reasoning model worth the latency?

Only on hard problems. The latency can be severe, because thinking happens before the visible answer. In LlamaIndex's document-parsing test, dialing reasoning to its highest level pushed processing from about 48 seconds to roughly 242 seconds per task with no accuracy gain. On a hard, high-stakes problem that wait can pay off. For anything chat-style or real-time, it breaks the interaction.

Which tasks do not need a reasoning model?

Simple Q&A, data extraction, classification, formatting, and chat. Meta FAIR and CMU's OptimalThinkingBench found thinking models burn more than 700 tokens overthinking simple questions with no accuracy gain. LlamaIndex's parsing test saw cost and latency rise five to eight times with reasoning turned up while quality stayed flat near 0.79, and a non-reasoning parser scored higher at 0.821. On easy work, reasoning mostly buys a bigger bill and a slower response.

Changelog

  • May 30, 2026 — Originally published. Pricing and thinking-token billing verified against OpenAI, Google, and Anthropic docs; cost and latency figures verified against LlamaIndex and the OptimalThinkingBench paper.

References

  1. Anthropic, "Pricing," platform.claude.com, accessed May 2026.
  2. Anthropic, "Building with extended thinking," platform.claude.com, accessed May 2026.
  3. OpenAI, "Reasoning models," developers.openai.com, accessed May 2026.
  4. Google, "Gemini Developer API pricing," ai.google.dev, accessed May 2026.
  5. Anthropic, "Claude 3.7 Sonnet," anthropic.com, accessed May 2026.
  6. LlamaIndex, "The Cost of Overthinking: Why Reasoning Models Fail at Document Parsing," llamaindex.ai, accessed May 2026.
  7. Aggarwal & Saha et al. (Meta FAIR / CMU), "OptimalThinkingBench," arxiv.org, accessed May 2026.