What this guide covers
Three articles, one buying decision. The price-per-use-case table breaks down what each workload actually costs across the major commercial models. The context-windows piece explains why advertised window sizes aren't what you can actually use. The million-token marketing piece argues that most long-context spend is wasted when a properly built retrieval system would answer the same question for far less.
Pricing by workload
- The price-per-use-case table
Six workloads, three frontier models, the cheapest pick for each. Chat costs $0.014 per turn on Sonnet. RAG queries run $0.036. Document summaries climb to $0.18. Agent sessions can hit $50+ if you don't cap them.
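Capping an agent session can be as simple as tracking spend per call and stopping the loop once a limit is crossed. The sketch below is one minimal way to do that, not a prescribed implementation: the `SessionBudget` class, the $1 cap, and the per-million-token prices are all hypothetical values chosen for illustration.

```python
class BudgetExceeded(RuntimeError):
    """Raised once a session's cumulative spend passes its cap."""


class SessionBudget:
    """Tracks cumulative spend for one agent session against a hard cap."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               input_price_per_mtok: float,
               output_price_per_mtok: float) -> float:
        """Record the cost of one completed model call.

        The call has already happened by the time token counts are known,
        so the session can overshoot the cap by at most one call.
        """
        cost = (input_tokens * input_price_per_mtok
                + output_tokens * output_price_per_mtok) / 1_000_000
        self.spent_usd += cost
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f}, cap is ${self.cap_usd:.2f}"
            )
        return cost


# Hypothetical session: $1 cap, assumed prices of $3 / $15 per million tokens.
budget = SessionBudget(cap_usd=1.00)
calls_made = 0
try:
    while True:  # stands in for the agent loop
        budget.charge(100_000, 10_000, 3.0, 15.0)
        calls_made += 1
except BudgetExceeded as stop:
    print(f"stopped after {calls_made} full calls: {stop}")
```

The check runs after each call returns, so set the cap a little below the real ceiling if one oversized call would matter.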
Context-window economics
- Context windows compared, across four frontier models
Advertised window vs effective retrieval zone. Claude advertises 1M tokens and retrieves reliably to ~600K. Gemini advertises 2M and holds to ~800K. The gap matters: most teams price for the advertised window and pay for the effective one.
- The million-token context was always a marketing number
200K tokens on Claude Opus costs about $1 per query. The same answer via RAG costs $0.06. That's a 17× difference per question. At meaningful volume, the cost structure forces the architecture. Build for retrieval first.
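To see how the per-query gap compounds, here is a small worked calculation. The two per-query costs come from the figures above; the volume of 1,000 queries per day and the 30-day month are hypothetical assumptions picked only to make the arithmetic concrete.

```python
# Per-query costs quoted in the text.
LONG_CONTEXT_COST_PER_QUERY = 1.00  # ~200K tokens on Claude Opus
RAG_COST_PER_QUERY = 0.06           # same answer via retrieval

# Hypothetical volume, for illustration only.
QUERIES_PER_DAY = 1_000


def monthly_cost(cost_per_query: float, queries_per_day: int,
                 days: int = 30) -> float:
    """Total spend over `days` at a flat per-query cost."""
    return cost_per_query * queries_per_day * days


ratio = LONG_CONTEXT_COST_PER_QUERY / RAG_COST_PER_QUERY
long_context_monthly = monthly_cost(LONG_CONTEXT_COST_PER_QUERY, QUERIES_PER_DAY)
rag_monthly = monthly_cost(RAG_COST_PER_QUERY, QUERIES_PER_DAY)

print(f"cost ratio: {ratio:.1f}x")
print(f"long context: ${long_context_monthly:,.0f} per month")
print(f"RAG:          ${rag_monthly:,.0f} per month")
print(f"difference:   ${long_context_monthly - rag_monthly:,.0f} per month")
```

The ratio stays the same at any volume; what changes is whether the absolute difference is large enough to pay for building and maintaining the retrieval pipeline.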
When to skip the frontier entirely
- RAG vs fine-tuning, with the math
RAG wins almost every time. The three exceptions where fine-tuning earns its place, the math behind each, and the cost breakdown across approaches.
- Small language models, in working use
Phi-4 mini hits 94% classification accuracy at $0 marginal cost. The 2-point gap to Sonnet 4.6 isn't worth $16 a day in API spend at that volume.
The cost discipline that actually works
Three rules from a year of watching production AI bills run away.
One: constrain output. Cap max-tokens. Force structured formats. Tell the model "no preamble" and trim output everywhere else you can. Output is where the money goes. See the prompt-engineering piece for the techniques.
Two: cache the prefix. Anthropic, Google, and OpenAI all support prompt caching, with cached input billed at roughly 10% of the standard input rate. If your system prompt is the same on every call, you're paying about 10× too much for those prefix tokens by not caching.
Three: route by workload. Use the small local model for classification. Use Sonnet or Flash for routine generation. Save Opus and GPT-5 for the calls that actually justify the spend. The comparison tool helps you scope which model fits which workload.
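The three rules can be combined into one rough per-call cost estimate. The sketch below is illustrative only: the tier names, prices, output caps, and routing table are hypothetical placeholders rather than any provider's actual rates, and the 10% cache multiplier is taken from the approximate figure above. It also ignores any surcharge a provider may apply when the cache is first written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    input_price_per_mtok: float   # USD per million input tokens (assumed)
    output_price_per_mtok: float  # USD per million output tokens (assumed)
    max_output_tokens: int        # rule one: hard cap on output


# Hypothetical tiers and prices, for illustration only.
TIERS = {
    "small_local": ModelTier("small-local", 0.0, 0.0, 64),
    "mid": ModelTier("mid-tier", 3.0, 15.0, 512),
    "frontier": ModelTier("frontier", 15.0, 75.0, 2048),
}

# Rule three: route by workload. Unknown workloads fall back to the mid tier.
ROUTES = {
    "classification": "small_local",
    "routine_generation": "mid",
    "hard_reasoning": "frontier",
}

# Rule two: cached prefix tokens billed at ~10% of the standard input rate.
CACHE_READ_MULTIPLIER = 0.10


def route(workload: str) -> ModelTier:
    """Pick the tier for a workload."""
    return TIERS[ROUTES.get(workload, "mid")]


def worst_case_cost(workload: str, prefix_tokens: int, dynamic_tokens: int,
                    prefix_cached: bool) -> float:
    """Upper bound on one call's cost, assuming output hits the cap."""
    tier = route(workload)
    prefix_rate = tier.input_price_per_mtok * (
        CACHE_READ_MULTIPLIER if prefix_cached else 1.0
    )
    input_cost = (prefix_tokens * prefix_rate
                  + dynamic_tokens * tier.input_price_per_mtok) / 1_000_000
    output_cost = tier.max_output_tokens * tier.output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Same call with and without a cached 10K-token system prompt.
uncached = worst_case_cost("routine_generation", 10_000, 500, prefix_cached=False)
cached = worst_case_cost("routine_generation", 10_000, 500, prefix_cached=True)
print(f"uncached: ${uncached:.5f}  cached: ${cached:.5f}")
```

With these placeholder numbers the prefix dominates the input bill, so caching cuts the call's cost by more than half even though output is priced higher per token. Swap in real prices and caps before drawing conclusions for a specific workload.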