benchr Issue No. 07

RAG vs fine-tuning, with the math

Cost numbers across both approaches, and the three specific scenarios where fine-tuning still pays off.


RAG per query: $0.04 (Sonnet 4.6 with retrieval)
Long-context, same query: $3.00 (200K tokens, Opus)
First-token latency: 300ms (RAG round-trip + model)
Fine-tune cost: $180 (for a 1K-example dataset)

RAG won the architecture war in 2024. Most teams just haven't admitted it yet.

Consider a pipeline that produces structured changelog entries from pull request descriptions. Base-model RAG hits format-valid output about 95% of the time. A fine-tune on 800 examples pushes it to 99.4% and makes the residual failures predictable enough for a validator to catch. Training cost about $180, and the marginal cost per inference call is now close to zero. That's the kind of swap where fine-tuning earns its place, and it's one of only three places in 2026 where it does.

The rest of the time, RAG wins. The cost numbers favor it. The operational simplicity favors it. The auditability favors it. The pages that follow are an argument for that default, with the three exceptions named in full and the cost math behind each. For the broader cost picture across workloads, see the price-per-use-case table.

I expected this piece to argue more strongly for fine-tuning than it ended up arguing. Going in, I had three production cases where fine-tuning had paid off. I went looking for more cases during the research. The cases I found were either RAG cases in disguise or workloads where prompt engineering closed the gap. The three-case taxonomy below isn't maximum coverage — it's the honest count.

What each approach actually does

RAG, in its basic form, retrieves relevant information at query time and stuffs it into the prompt. The model's weights don't change. The capability change is contextual. The model gets new facts to work with on each query.
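A minimal sketch of that retrieve-and-stuff loop, under loud assumptions: the bag-of-words similarity stands in for a real embedding model and vector store, and the function names, sample documents, and prompt wording are all hypothetical. It only illustrates that the context changes per query while the weights stay put.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts; a toy stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Stuff the retrieved chunks into the prompt; no model weights are touched."""
    picked = retrieve(query, chunks, k)
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(picked))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base.
docs = [
    "Refunds are processed within 14 days of the return request.",
    "The office is closed on public holidays.",
    "Refund requests require an order number.",
]
prompt = build_prompt("How long do refunds take?", docs)
```

Swapping a document in `docs` changes what the next query sees immediately, which is the property the rest of this issue leans on.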

Fine-tuning adjusts the model's weights based on a training set of input-output pairs. The model permanently learns to produce outputs of a particular shape, in a particular style, or against a particular set of constraints. New facts taught via fine-tuning are baked in. New facts that come up after training are invisible to the fine-tune.
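For a sense of what "input-output pairs" looks like on disk, here is a hedged sketch that serializes pairs as JSON Lines. The field names `input` and `output` and the sample records are assumptions for illustration; each provider defines its own training-file schema, so check its documentation before uploading anything.

```python
import json

# Hypothetical input-output pairs; a real set would hold hundreds of examples.
pairs = [
    {
        "input": "Fix null check in parser",
        "output": "fixed: parser no longer crashes on empty input",
    },
    {
        "input": "Add retry to uploader",
        "output": "added: uploader retries failed requests",
    },
]


def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line, a common shape for training files."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


jsonl = to_jsonl(pairs)
```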

These two approaches usually get framed as alternatives. They're alternatives only in specific situations. For most workloads, RAG addresses a problem fine-tuning can't solve, and fine-tuning addresses a problem RAG can't solve.

Why RAG wins most of the time

Three reasons, in order of weight.

RAG handles updates gracefully. In most real businesses, the knowledge base changes weekly. Re-indexing a vector store on a fresh batch of documents is a 20-minute job. Re-training a fine-tuned model on new data is a several-hour job and costs hundreds of dollars per iteration. The operational asymmetry is big.

RAG is auditable. You can inspect the retrieved chunks for each query. When the model produces a wrong answer, the cause traces to either the retrieval step or the generation step, and you can debug accordingly. Fine-tuned models are opaque. When they're wrong, the cause is a guess, and your only response is more training, which may or may not fix the underlying problem.
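One way to make that audit trail concrete is to log a small record per query. The sketch below is an assumption-heavy illustration: the `RagTrace` fields, the chunk ID format, and the 0.5 score threshold are all hypothetical, and the triage rule is a crude first guess, not a diagnosis.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RagTrace:
    """One audit record per query: what was asked, retrieved, and answered."""

    query: str
    chunk_ids: list[str]
    scores: list[float]
    answer: str

    def suspect_stage(self, min_score: float = 0.5) -> str:
        """Crude triage: weak retrieval scores point at retrieval first."""
        if not self.scores or max(self.scores) < min_score:
            return "retrieval"
        return "generation"


# Hypothetical trace for a single query.
trace = RagTrace(
    query="What is the refund window?",
    chunk_ids=["policy.md#3", "faq.md#12"],
    scores=[0.82, 0.41],
    answer="Refunds are accepted within 14 days.",
)
log_line = json.dumps(asdict(trace))  # append this to whatever log store is in use
```

A fine-tuned model offers no equivalent record, because the knowledge it used never appears as an inspectable input.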

The cost math heavily favors RAG at sub-millions-of-queries-per-day volumes. The base model's per-token cost (verified against Anthropic's pricing page and OpenAI's API pricing) is real but stable. Retrieval adds a few hundred milliseconds of latency and a fraction of a cent in cost. Total per-query cost runs from just under a cent to about four cents ($0.008 to $0.04 in the cost breakdown in this issue). Fine-tuning has a real up-front cost that only amortizes at very high query volumes. For the long-context alternative, see context windows compared.

Chart: RAG vs Fine-tuning vs Long-context, five dimensions, scored out of 100. Higher is better. Lower-is-better dimensions (cost, latency) are inverted for the chart.

  • Cost score: RAG 95, fine-tune 80, long-context 20
  • Flexibility: RAG 95
  • Format compliance: fine-tune 98
  • Synthesis: long-context 92

RAG is 27× cheaper than long-context for precise lookup.

One honest admission: I'm not certain the three-case taxonomy below is complete. These are the three cases I've seen fine-tuning earn its keep in production work. Fourth and fifth cases probably exist — agent-routing models that need to hit a specific decision distribution, for example — but I haven't tested them carefully enough to write about with confidence.

The three cases where fine-tuning wins

Each of these is named because each is a real scenario where the answer is fine-tuning, and the cost of choosing RAG instead is real.

Case one: strict output format compliance. Your application needs the model to produce a precisely structured output every single time: a JSON schema with no deviation, a domain-specific markup format, or a structured table with exact column ordering. With prompting and few-shot examples, the major frontier models get this right around 95% of the time. The remaining 5% is unrecoverable for some applications. Fine-tuning on 500 to 2,000 examples can push compliance to 99%+ and make the residual failure modes predictable enough to handle with a simple validator.

A real example: a pipeline producing structured changelog entries from pull request descriptions. The base-model approach got the format wrong often enough to need downstream cleanup on roughly one in twenty entries. The fine-tuned approach reaches 99.4% schema-valid output with the residual 0.6% caught by a simple validator. Training cost was about $180. The operational simplification has been worth a lot more.
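The article doesn't show the validator, so the following is only a sketch of what a "simple validator" for such entries might look like. The schema (keys `type`, `summary`, `pr`) and the allowed type values are hypothetical, invented for illustration.

```python
import json

# Hypothetical schema: every entry has exactly these keys.
REQUIRED_KEYS = {"type", "summary", "pr"}
ALLOWED_TYPES = {"added", "changed", "fixed", "removed"}


def validate_entry(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the entry is schema-valid."""
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(entry, dict):
        return ["top-level value must be an object"]

    problems = []
    if set(entry) != REQUIRED_KEYS:
        problems.append(f"expected keys {sorted(REQUIRED_KEYS)}, got {sorted(entry)}")
    kind = entry.get("type")
    if not isinstance(kind, str) or kind not in ALLOWED_TYPES:
        problems.append("type must be one of: " + ", ".join(sorted(ALLOWED_TYPES)))
    summary = entry.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("summary must be a non-empty string")
    pr = entry.get("pr")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(pr, int) or isinstance(pr, bool) or pr <= 0:
        problems.append("pr must be a positive integer")
    return problems
```

Entries that come back with a non-empty problem list can be retried or routed to a human, which is how a predictable 0.6% residual stays cheap to handle.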

Case two: domain-locked voice or style. Your application needs the model to write in a specific voice no amount of prompting reliably enforces. Brand voice for marketing copy. A legal team's writing conventions. A code-comment style consistent across a large codebase. Fine-tuning on a curated set of examples of the desired voice produces output that drifts less and needs less editing than prompting alone.

The key word is reliably. Prompting can get the right voice 80% of the time. Fine-tuning reaches 95%+. If the cost of the residual gap is high (every output edited, every output reviewed), the math tips toward fine-tuning quickly.
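To see how quickly that math tips, here is a small worked example. The 80% and 95% hit rates come from the paragraph above; the $2.00 cost per manual fix is a made-up placeholder, so substitute your own editing cost.

```python
def fixes_per_1000(hit_pct: int) -> int:
    """Outputs out of 1,000 that miss the target voice and need an edit."""
    return (100 - hit_pct) * 10


COST_PER_FIX_USD = 2.00  # hypothetical editor cost per off-voice output

prompting_cost = fixes_per_1000(80) * COST_PER_FIX_USD   # 200 fixes per 1,000
fine_tuned_cost = fixes_per_1000(95) * COST_PER_FIX_USD  # 50 fixes per 1,000
saving_per_1000 = prompting_cost - fine_tuned_cost
```

Under these assumptions the gap is 150 fewer edits per 1,000 outputs. Whether that justifies a training run depends entirely on volume and on what an edit really costs.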

Case three: latency-critical hot paths. Your application has a query path with a strict latency budget (a few hundred milliseconds end-to-end) and the retrieval step in RAG eats too much of it. A fine-tuned model with the relevant knowledge in its weights can serve the request without the retrieval round-trip. For real-time apps — voice assistants, in-game NPCs, certain financial workflows — that's the only viable architecture.

The trade-off is real. The fine-tuned model is now a snapshot in time, and any knowledge update needs re-training. For latency-critical apps where the knowledge changes slowly, that's acceptable. For apps where the knowledge changes weekly, it isn't.

The three cases for fine-tuning are real. The mistake is to apply them to a problem that's actually a fourth case in disguise.

The case people keep asking about

The most-asked question is some variant of: I have a corpus of internal company documents. Should I fine-tune a model on them or build RAG? The answer is almost always RAG. The corpus isn't what determines the answer. The use case is.

If the use case is letting employees ask questions about the documents, go with RAG. The knowledge changes. You want an audit trail. Updates need to be easy.

If the use case is generating documents in the company's writing style, fine-tune. The style is the central requirement. The underlying knowledge can still be supplied via context.

If the use case is both, the answer is RAG plus a light fine-tune on style. The fine-tuned model handles voice. The retrieval layer handles facts. Each approach does what it's best at.

Cost breakdown, January 2026 prices

Approach | Setup cost | Per-query cost | Update cost
RAG on Claude Sonnet 4.7 | ~$200 (DB + dev time) | $0.04 / query | $10 (re-embed batch)
RAG on GPT-5 Mini | ~$200 | $0.008 / query | $10
Fine-tune of GPT-4o-mini (1k examples) | ~$25 + dev | $0.001 / query | ~$25 per re-train
Fine-tune of Llama 4 8B (1k examples) | ~$60 GPU time + dev | $0 (self-hosted) | ~$60 per re-train
Fine-tune of Claude (enterprise tier) | Several thousand | Variable | Several thousand

The interesting line is the second-from-bottom. A fine-tune of a small open-weight model gives you a zero-marginal-cost inference path on your own hardware. For high-volume, narrow workloads, that's the cost-optimal architecture in 2026. The trade-off is the operational burden of running the inference yourself, covered in running models on your own machine.
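The table's columns can be folded into a rough total-cost comparison. This is a sketch under stated assumptions: the dollar figures are the table's, but the 100,000 queries and four updates per month are hypothetical, and developer time, hosting hardware, and the ops burden of self-hosting are all left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approach:
    name: str
    setup: float       # one-time cost in USD (dev time excluded)
    per_query: float   # USD per query
    per_update: float  # USD per knowledge update or re-train


def total_cost(a: Approach, queries: int, updates: int) -> float:
    """Setup plus usage plus update cost over some period."""
    return a.setup + a.per_query * queries + a.per_update * updates


# Figures taken from the cost table; hardware for self-hosting is NOT included.
rag_sonnet = Approach("RAG on Claude Sonnet 4.7", 200, 0.04, 10)
rag_mini = Approach("RAG on GPT-5 Mini", 200, 0.008, 10)
ft_4o_mini = Approach("Fine-tune of GPT-4o-mini", 25, 0.001, 25)
ft_llama = Approach("Fine-tune of Llama 4 8B", 60, 0.0, 60)

# Hypothetical first month: 100,000 queries and four weekly knowledge updates.
QUERIES, UPDATES = 100_000, 4
for a in (rag_sonnet, rag_mini, ft_4o_mini, ft_llama):
    print(f"{a.name}: ${total_cost(a, QUERIES, UPDATES):,.2f}")
```

The per-query term dominates as volume rises and the update term dominates as the knowledge base churns, which is why the fine-tuned rows look best for high-volume, slow-changing workloads.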

The RAG pipeline, step by step:

  1. User query: a question or instruction.
  2. Embed → search: the vector store finds the K most relevant chunks.
  3. Retrieve top chunks: typically 3–5 chunks, 4K tokens total.
  4. Generate with context: a grounded, citable answer at $0.04 per query.

Which approach fits which requirement:

  • Knowledge changes weekly? RAG. Re-embed in 20 minutes.
  • Strict output format? Fine-tune. Push to 99%+ compliance.
  • Specific voice/style? Fine-tune. Prompting only gets 80%.
  • Sub-300ms latency? Fine-tune. No retrieval round-trip.
  • Cross-document synthesis? Long context. Worth the cost.
  • Auditability matters? RAG. Inspect retrieved chunks.

It's not pretty, but it works.

The default sequence

For a typical small team building a domain-specific AI feature, the recommended sequence:

  1. Start with base-model RAG on Claude Sonnet 4.7 or GPT-5 Mini. Measure failure modes.
  2. If failures concern facts or staleness, improve retrieval.
  3. If failures concern format compliance, try few-shot prompting first. If that doesn't close the gap, fine-tune.
  4. If failures concern style, prompt-engineer aggressively first. If that fails, fine-tune on a curated style corpus.
  5. If failures concern latency, profile the retrieval step before assuming a fine-tune is the answer.

This sequence ships faster, costs less, and produces a system you can debug. The mistake is starting with fine-tuning because it sounds more sophisticated. Sophistication isn't the goal. A system that works is the goal.
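For step 5, profiling can start as small as timing each stage separately. The sketch below uses `time.perf_counter` with a context manager. `fake_retrieve` and `fake_generate` are placeholders that only sleep, so swap in the real retrieval and model calls before drawing any conclusions.

```python
import time
from contextlib import contextmanager

timings_ms: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record wall-clock milliseconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0


def fake_retrieve() -> None:
    time.sleep(0.02)  # placeholder for the vector-store round-trip


def fake_generate() -> None:
    time.sleep(0.05)  # placeholder for the model call


with timed("retrieval"):
    fake_retrieve()
with timed("generation"):
    fake_generate()

total_ms = sum(timings_ms.values())
share = {stage: ms / total_ms for stage, ms in timings_ms.items()}
for stage, ms in timings_ms.items():
    print(f"{stage}: {ms:.1f} ms ({share[stage]:.0%} of total)")
```

If retrieval turns out to be a small share of the budget, removing it with a fine-tune buys little, and the time is better spent on the generation side.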

Two gaps to flag before the close. The distillation feature described in OpenAI's platform docs, meant to make it cheap to fine-tune a small model on the outputs of a larger one, wasn't stress-tested here. A controlled comparison of fine-tuning approaches (LoRA versus full versus prompt tuning versus distillation) is also pending. The working wisdom is that LoRA is enough and a lot cheaper, but that deserves its own piece.

RAG wins almost every time. If you're building a knowledge-grounded AI feature in 2026 and you haven't built the RAG version first, you're optimizing for the wrong thing. The operational simplicity, the auditability, and the per-query cost dynamics all favor RAG at sub-millions-of-queries-per-day scale.

Fine-tuning wins in three cases, and only three. When output format compliance has to hit 99%+ reliability. When voice or style is central and prompting can't reliably enforce it. When latency budgets forbid the retrieval round-trip. In each case, the answer is usually a combination: fine-tune for what fine-tuning does well, retrieve for what retrieval does well.

If your team is fine-tuning because someone said you should, stop. Audit the actual failure modes of the base model on the task, and pick the right tool for what's actually broken. Most of the time, what's broken is the retrieval, the prompt, or the evaluation. Fine-tuning is a real option, but it's a smaller fraction of real-world AI work than the discourse suggests.

Bottom line

Default to RAG, for the workloads I've evaluated. Fine-tune only for strict output format compliance, domain-locked voice, or sub-300ms latency requirements. Long context is for genuinely cross-cutting questions. Most teams that fine-tune should be running RAG. Most teams running RAG should still be paying attention to which model the retrieved chunks go into.

Frequently asked

RAG or fine-tuning — which should I use?

RAG almost every time. The cost dynamics, the operational simplicity, and the auditability all favor RAG. Fine-tuning wins in three specific cases: strict format compliance, domain-locked voice, and sub-300ms latency requirements.

How much cheaper is RAG vs fine-tuning?

Per-query, RAG runs about $0.04 on Sonnet 4.6 with retrieval. A fine-tuned small model runs essentially zero marginal cost after the ~$60-180 training run. Fine-tuning wins on per-query cost at scale — but only for the workloads it's actually right for.

When does fine-tuning beat RAG?

Three cases. One: strict output format where compliance must hit 99%+ (RAG hits ~95% on prompting alone). Two: a specific voice or style prompting can't reliably enforce. Three: latency hot paths where the retrieval round-trip is too slow.

Can I combine RAG and fine-tuning?

Yes, and you usually should. Fine-tune for voice and format. Use RAG for facts. Each approach handles what it's best at. Most production systems that use fine-tuning correctly are running it alongside RAG, not instead of it.

How long does it take to fine-tune a model?

Several hours on a typical 1,000-example dataset. Cost: $25-$180 depending on the base model. Re-training on updated data costs the same again. RAG re-indexing for the same knowledge update takes about 20 minutes and roughly $10 per re-embed batch.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Updated cost ratios with Q1 2026 pricing.
  • April 17, 2026 — Originally published.

References

  1. OpenAI, "Platform documentation," platform.openai.com/docs, accessed May 2026.
  2. OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
  3. Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
  4. Anthropic, "Pricing," anthropic.com/pricing, accessed May 2026.