Essay · May 2026

Cutting your token bill


Where the spend comes from, and the five levers that bring it down. Diagnose first, then pull the one that's leaking.

By the benchr team · Reviewed May 30, 2026

A token bill that's climbing faster than your usage almost always traces to a handful of habits, not to the model being expensive. The good news is that the levers are concrete and the savings are large. The trap is reaching for an optimization before you know which part of the bill is bleeding. So start with the diagnosis.

Where the spend comes from

Four causes account for most surprise bills. The first is output length. On every current Claude tier, output costs five times the input rate. Per million tokens, Opus is $5 in and $25 out, Sonnet $3 and $15, Haiku $1 and $5. A chatty model that pads every answer with caveats is the most common silent cost, and the one people look at last.
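To see why output dominates, here is a minimal sketch that prices a single request from the list prices above, assuming they are quoted per million tokens. The `PRICES` table and the `request_cost` helper are illustrative names, not part of any SDK.

```python
# Per-million-token list prices quoted in this article (USD): (input, output).
PRICES = {
    "opus": (5.00, 25.00),
    "sonnet": (3.00, 15.00),
    "haiku": (1.00, 5.00),
}

def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at list price."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# A 2,000-token prompt with a 1,000-token answer on Sonnet.
total = request_cost("sonnet", 2_000, 1_000)
input_part = request_cost("sonnet", 2_000, 0)
output_part = total - input_part
print(f"total ${total:.4f}, input ${input_part:.4f}, output ${output_part:.4f}")
print(f"output share of cost: {output_part / total:.0%}")
```

In this example the answer is a third of the tokens but roughly 71% of the cost, which is the pattern to look for in your own traffic.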

Second is context you resend. If every request ships the same long system prompt, the same few-shot examples, and the same reference document, you're paying full input price to reprocess identical text over and over. The big context windows make this worse, because a 1M-token ceiling is tempting to fill. You get billed per token whether the model needed all of it or not. benchr's look at context windows covers when that long window earns its keep and when it's just expensive retrieval done badly.

Third is the model itself. Defaulting all traffic to a flagship when half of it is classification or extraction is the most expensive habit in the list, and the easiest to fix. Fourth is effort. The newer Opus models expose effort levels above the default, "extra" and "max", that spend more thinking tokens. They're worth it on the hard problems and pure waste on easy ones.
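One way to run the diagnosis is to total spend by tier and by token direction before touching anything. The sketch below assumes you can export a usage log with a tier name and token counts per request; the record fields and the `spend_breakdown` helper are hypothetical, and the prices are the per-million-token figures quoted in this article.

```python
from collections import defaultdict

# Per-million-token list prices from this article (USD): (input, output).
PRICES = {"opus": (5.0, 25.0), "sonnet": (3.0, 15.0), "haiku": (1.0, 5.0)}

def spend_breakdown(records):
    """Sum USD spend per (tier, kind) bucket, where kind is 'input' or 'output'."""
    totals = defaultdict(float)
    for rec in records:
        price_in, price_out = PRICES[rec["tier"]]
        totals[(rec["tier"], "input")] += rec["input_tokens"] * price_in / 1e6
        totals[(rec["tier"], "output")] += rec["output_tokens"] * price_out / 1e6
    return dict(totals)

# Hypothetical usage log; in practice this comes from your own request logs.
log = [
    {"tier": "opus", "input_tokens": 12_000, "output_tokens": 800},
    {"tier": "opus", "input_tokens": 12_000, "output_tokens": 900},
    {"tier": "haiku", "input_tokens": 1_500, "output_tokens": 200},
]

breakdown = spend_breakdown(log)
total = sum(breakdown.values())
# Largest bucket first: that is the part of the bill to attack.
for (tier, kind), usd in sorted(breakdown.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tier:>6} {kind:<6} ${usd:.4f}  {usd / total:.0%}")
```

In this made-up log the largest bucket is Opus input, which points at resent context or routing rather than output length.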

The five levers, in order of payoff

Once you know where your spend concentrates, the fixes are straightforward. Here's the menu, roughly ordered by how much they tend to return.

Token-cost levers and when each one pays off, May 2026
Lever	Typical saving	Best when
Route to a cheaper model	Up to ~80% per task	The task is simpler than your default tier
Prompt caching	Up to 90% on cached input	You resend the same prefix or document
Batch API	50% on input and output	The job isn't time-sensitive
Shorten output	Scales with the cut	Output runs at 5× the input rate
Lower the effort level	Varies	The task doesn't need deep reasoning

Route first. Sending simple work to a cheaper tier is usually the biggest single win. A request that runs fine on Haiku 4.5 at $1/$5 doesn't belong on Opus at $5/$25. The Haiku 4.5 review lays out which tasks the cheap tier handles cleanly and which ones you should route up instead. A good router is the difference between a bill that scales with value and one that scales with vanity.
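A router can start as a lookup from task type to tier. This is a minimal sketch: the task labels, tier names, and both helpers are hypothetical, and the saving figure assumes the task uses the same number of tokens on either tier.

```python
# Hypothetical task labels; tune this set against your own traffic and evals.
CHEAP_TASKS = {"classification", "extraction", "short_summary"}

def pick_tier(task_type: str, default_tier: str = "opus") -> str:
    """Send known-simple task types to the cheap tier, everything else to the default."""
    return "haiku" if task_type in CHEAP_TASKS else default_tier

def routing_saving(cheap_price: float, default_price: float) -> float:
    """Fractional saving per token when a task moves from the default to the cheap tier."""
    return 1 - cheap_price / default_price

print(pick_tier("classification"))        # a simple task goes to the cheap tier
print(pick_tier("multi_step_refactor"))   # unknown or hard work stays on the default
print(f"input saving: {routing_saving(1.0, 5.0):.0%}")    # Haiku $1 vs Opus $5
print(f"output saving: {routing_saving(5.0, 25.0):.0%}")  # Haiku $5 vs Opus $25
```

Defaulting unknown task types to the expensive tier is a deliberate choice here: it keeps quality safe while you add task types to the cheap set one at a time.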

Then cache. Prompt caching reads previously processed input at about a tenth of the standard price, so up to 90% off the cached portion. It's built for the case where a fixed prefix rides along on every call. There's a small premium to write the cache the first time, recovered after a read or two, and on the newer Opus the minimum cacheable prompt dropped to 1,024 tokens, so shorter prompts qualify now too.
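The break-even can be worked out directly. The sketch below measures cost in units of one uncached send of the prefix. The 0.10 read multiplier comes from the "about a tenth" figure above; the 1.25 write multiplier is an assumption standing in for the "small premium", so check the current pricing page before relying on it. It also assumes every later call arrives while the cache entry is still alive.

```python
def prefix_cost(n_calls: int, cached: bool,
                write_multiplier: float = 1.25,  # assumed cache-write premium
                read_multiplier: float = 0.10) -> float:
    """Cost of sending one fixed prefix n_calls times, in units of one uncached send."""
    if not cached or n_calls == 0:
        return float(n_calls)
    # First call writes the cache at a premium; later calls read it at a discount.
    return write_multiplier + (n_calls - 1) * read_multiplier

for n in (1, 2, 10, 100):
    plain = prefix_cost(n, cached=False)
    with_cache = prefix_cost(n, cached=True)
    print(f"{n:>3} calls: uncached {plain:.2f}, cached {with_cache:.2f}")
```

Under these assumptions a single call costs more with caching than without, two calls already cost less, and the saving approaches 90% as the number of reads grows.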

90%: what prompt caching can cut from the cost of input you'd otherwise resend in full.

Batch what can wait. The Batch API takes 50% off both input and output for asynchronous jobs. It stacks with caching: a cached prefix read inside a batch costs about 0.5 × 0.1 = 5% of the standard input rate, so the repeated portion of a repetitive overnight workload can land near a 95% discount against the naive real-time rate. Fresh input and output still get the 50%. Anything that doesn't need an answer this second is a candidate.
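How close a whole job gets to that figure depends on how much of it is cached prefix. A rough sketch, assuming the multipliers quoted in this article (cache reads at a tenth, batch at half) and ignoring the one-time cache-write cost; the `job_cost` helper and the workload numbers are hypothetical.

```python
def job_cost(cached_in: int, fresh_in: int, out: int,
             price_in: float, price_out: float,
             use_cache: bool, use_batch: bool) -> float:
    """USD cost of one job; prices are per million tokens."""
    cache_mult = 0.10 if use_cache else 1.0   # cache reads at about a tenth
    batch_mult = 0.50 if use_batch else 1.0   # batch halves input and output
    input_usd = (cached_in * cache_mult + fresh_in) * price_in
    output_usd = out * price_out
    return batch_mult * (input_usd + output_usd) / 1_000_000

# Hypothetical nightly job on Sonnet ($3 in, $15 out): a 50k-token shared prefix,
# 1k tokens of fresh input, and 500 tokens of output per item.
args = dict(cached_in=50_000, fresh_in=1_000, out=500, price_in=3.0, price_out=15.0)
naive = job_cost(**args, use_cache=False, use_batch=False)
stacked = job_cost(**args, use_cache=True, use_batch=True)
print(f"naive ${naive:.4f}, stacked ${stacked:.5f}, saving {1 - stacked / naive:.0%}")
```

For this prefix-heavy job the combined saving comes out around 92% rather than 95%, because the fresh input and the output only receive the batch discount.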

Tighten output and effort last. These are smaller but free. A system instruction like "answer in one line, then stop" cuts the most expensive token category directly. And dropping the effort level on routine work, or leaning on adaptive thinking, which only reasons when the turn needs it, trims thinking tokens you were burning for no gain.

The biggest savings are structural, not clever. Route the work, cache the prefix, batch the rest.

Put it together

The order matters because the levers compound. Route a task to the right tier, cache the part of the prompt that repeats, batch it if it can wait, and keep the output tight. Each step multiplies against the others, which is how teams get from a scary bill to a boring one without touching quality.
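The compounding is easiest to see by applying the levers one at a time to the same task. This is a sketch under stated assumptions: the prices and multipliers are the ones quoted in this article, the task sizes are invented, the cache-write cost is ignored, and it presumes the cheap tier handles the task acceptably.

```python
# Per-million-token list prices from this article (USD): (input, output).
PRICES = {"opus": (5.0, 25.0), "haiku": (1.0, 5.0)}

def cost(tier, prefix, fresh, out, cached=False, batched=False):
    """USD cost of one task: repeated prefix, fresh input, and output tokens."""
    price_in, price_out = PRICES[tier]
    prefix_mult = 0.10 if cached else 1.0          # cache read on the repeated prefix
    usd = (prefix * prefix_mult + fresh) * price_in + out * price_out
    return usd * (0.50 if batched else 1.0) / 1_000_000  # batch halves everything

# Hypothetical task: 20k-token repeated prefix, 500 fresh tokens, 1,000-token answer.
steps = [
    ("naive: Opus, real time", cost("opus", 20_000, 500, 1_000)),
    ("+ route to Haiku", cost("haiku", 20_000, 500, 1_000)),
    ("+ cache the prefix", cost("haiku", 20_000, 500, 1_000, cached=True)),
    ("+ batch", cost("haiku", 20_000, 500, 1_000, cached=True, batched=True)),
    ("+ trim output to 300", cost("haiku", 20_000, 500, 300, cached=True, batched=True)),
]
baseline = steps[0][1]
for label, usd in steps:
    print(f"{label:<24} ${usd:.5f}  {usd / baseline:.1%} of naive")
```

Each line applies its discount to what the previous step left behind, so the final figure is a product of the individual multipliers, not a sum of the headline percentages.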

What this looks like priced out by workload (chat, RAG, agents, batch) is the subject of benchr's price-per-use-case breakdown, which puts real numbers on each pattern across several models. Pair that with a tiered routing setup and you've covered the large majority of what there is to save. The rest is rounding error.

Frequently asked

Why is my AI token bill so high?

Usually long outputs, which cost five times the input rate on every current Claude tier; resending the same large prompt on every request without caching; defaulting all traffic to an expensive model; or running a high effort level on tasks that don't need it. Find which one is driving your spend before you optimize.

How much does prompt caching save?

A cache hit costs about a tenth of the standard input price, so up to 90% on the cached portion of your prompt. It pays off whenever you resend the same prefix, like a long system prompt or a fixed document. There's a small write cost the first time, recovered after one or two reads.

Does the Batch API cut costs by half?

Yes. The Batch API applies a 50% discount on both input and output tokens for asynchronous jobs. It stacks with prompt caching, so combining the two on a repetitive batch workload can take the effective rate on the cached input down by roughly 95%.

Does a bigger context window cost more?

You're billed per token, so sending a large context on every call is expensive whether or not the model uses it. A 1M-token window is a ceiling, not a target. Send only the context a request needs, and use retrieval to pull in the rest on demand.

Can switching models lower my token costs?

It's usually the single biggest lever. Routing simple tasks to a cheaper tier like Haiku 4.5, instead of defaulting everything to a flagship, can cut per-task cost by up to about 80% with no quality loss on work the cheap model handles well. Reserve the expensive model for the calls that need it.

Changelog

May 30, 2026 — Originally published. Caching, batch, and tokenizer figures verified against Anthropic's pricing and model documentation.

References

Anthropic, "Pricing," platform.claude.com, accessed May 2026.
Anthropic, "Prompt caching," platform.claude.com, accessed May 2026.
Anthropic, "Batch processing," platform.claude.com, accessed May 2026.
Anthropic, "Models overview," platform.claude.com, accessed May 2026.

