1,200 support emails across 18 categories. Phi-4 mini classified them locally at 94% accuracy, and Claude Sonnet 4.6 over the API scored 96% on the same set: a two-point gap. The local side cost nothing. At production volume the API side would have run about $16 a day.
One comparison doesn't make the case for small language models. The dozens of similar comparisons that play out the same way in production every day do. If your workloads involve classification, extraction, or routing (anything with a tight latency budget and a forgiving accuracy requirement), Phi-4 mini and Gemma 2 9B deserve the careful look the frontier-model discourse rarely gives them.
This piece covers the best sub-10B-parameter models at the start of 2026, plus a worked example with the timing data. These are the models worth building your production pipeline on when cost, privacy, or latency dominate your constraints. Once your workload depends on multi-step reasoning or wide world knowledge, they stop being the right choice.
(A side note before the category boundaries: Mistral 7B still shows up in production at companies running self-hosted infrastructure. It's not in scope here because the current open-weight tier (Phi-4 mini, Gemma 2 9B, Qwen 3 7B) has materially better quality at the same hardware footprint. If you're still running Mistral 7B and your work is going well, you're fine. The upgrade path is there when you want it.)
What "small" means here
Anything under 10B parameters. Microsoft ships Phi-4 (Azure blog) at 14B and Phi-4 mini at 3.8B, with the mini model card on Hugging Face. Google's Gemma 2 has a 9B and a 27B. Qwen 3 (qwen.ai) has small variants down to 1.5B. The 4B-to-9B band is the sweet spot. It fits in 16GB of RAM with sensible quantization and runs fast enough on a recent laptop to feel interactive. For the hardware side of running this yourself, see running models on your own machine.
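The 16GB claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates memory for the weights alone, as parameter count times bits per weight. It ignores the KV cache, activations, and runtime overhead, so treat the results as a lower bound. The function name and the 16-, 8-, and 4-bit settings are illustrative assumptions, not vendor figures.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes.

    Assumption: every weight is stored at the same precision. Real
    quantization formats add some per-block overhead on top of this.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Parameter counts taken from the text; precisions are illustrative.
for name, params in [("3.8B model", 3.8), ("9B model", 9.0)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

At 16-bit precision a 9B model needs about 18 GB for weights alone, which is why quantization is what makes the 16GB figure work.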
On the frontier scale, a 4B model is a rounding error. Claude Opus 4.7 is several hundred times larger by parameter count and many orders of magnitude larger by training cost. The goal here is competence on a narrow band of tasks, not catching up to the frontier — tasks where frontier capability is overkill and where small means much faster, much cheaper, and easier for you to control.
The marketing framing for both products has Gemma 2 9B beating Phi-4 mini on multilingual work. Gemma does win on that axis in the community reports. The gap is smaller than the marketing implies; Phi-4 mini's performance on Spanish and French structured-extraction tasks is close enough that the size advantage matters more than the framing predicts.
Three workloads where small models beat the frontier
Classification and extraction. For your triage and structured-extraction work, small models land within a couple of points of frontier accuracy (94% against 96% in the worked example later in this piece) at roughly one-tenth the cost. A two-point gap rarely justifies ten times your API bill.
Routing and triage. The model decides where your request should go. Which API to call, which downstream model to invoke, which template to apply. Small models excel here because your task is simple, your latency budget is tight, and the cost of getting it wrong is recoverable. A small model in the router slot lets you save the frontier models for the requests that need them.
On-device or private inference. Anything where your data can't leave the device (health records, internal corporate documents, anything covered by a strict residency rule) and where the capability ceiling is acceptable. Gemma 2 9B running locally on your laptop is more useful than a frontier model your team isn't allowed to use.
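The router slot from the routing-and-triage workload can be sketched in a few lines. This is one possible design, not the only one: it assumes the small model returns a label plus a confidence score, and it escalates anything unfamiliar or uncertain. The names `route`, `fake_classifier`, `LOCAL_LABELS`, and `CONFIDENCE_THRESHOLD` are all hypothetical, and the threshold value is a placeholder to tune on your own traffic.

```python
from typing import Callable, Tuple

# Hypothetical labels the small model is trusted to handle on its own.
LOCAL_LABELS = {"billing", "password_reset", "shipping"}
# Hypothetical cutoff; below this the request is escalated.
CONFIDENCE_THRESHOLD = 0.80


def route(text: str, classify: Callable[[str], Tuple[str, float]]) -> str:
    """Return 'local' or 'frontier' for a request.

    `classify` stands in for a small-model call that returns
    (label, confidence). It is a placeholder, not a real library API.
    """
    label, confidence = classify(text)
    if label in LOCAL_LABELS and confidence >= CONFIDENCE_THRESHOLD:
        return "local"
    return "frontier"


def fake_classifier(text: str) -> Tuple[str, float]:
    # Toy stand-in so the sketch runs without a model.
    if "password" in text.lower():
        return ("password_reset", 0.95)
    return ("other", 0.40)


print(route("I forgot my password", fake_classifier))
print(route("Please review this contract clause", fake_classifier))
```

Defaulting to the frontier path keeps the failure mode cheap to reason about: a confused router costs an extra API call rather than a wrong answer.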
Where small loses
Multi-step reasoning. Ask a 4B model to chain three or four logical steps and the failure rate jumps sharply. The model can do each step individually. It loses coherence across the chain. Frontier models hold the chain together more reliably, which matters for any task in your stack that requires planning.
World knowledge. Small models simply know less. Ask Phi-4 mini an obscure question about regional tax regulations or the history of a niche programming language, and the answer will be smooth, confident, and often wrong. This is where parameter count maps most directly to knowledge breadth, and there's no clever workaround.
Long-context retrieval. Most small models advertise 128K-token windows, but retrieval quality at the high end of that window is far worse than what the frontier models manage. For any work in your stack that needs deep reasoning over a long document, a small model is the wrong tool. The context-window piece covers the long-context picture in detail.
The 4B-to-9B sweet spot is the workhorse tier. Save the frontier models for the problems no workhorse can carry.
The two worth picking
Phi-4 mini (Microsoft, released February 2025) at 3.8B parameters. The strongest small model on structured reasoning and instruction-following at that size. Microsoft's training-data strategy (synthetic data filtered for educational value) is a concrete edge on tasks where the input looks like a textbook problem or a structured business document. The license (MIT) is the cleanest available.
The larger Phi-4 at 14B isn't in the "small" bucket strictly. It sits on the boundary and is worth pairing with mini if your workload mixes simple and structured-reasoning tasks. Same MIT license.
Gemma 2 9B (Google, released June 2024). Best raw multilingual capability in the small-model class, including clearly better Arabic than expected. The Gemma Terms license is permissive enough for commercial use with sensible restrictions. The instruction-tuned variant follows specified formats more reliably than the base.
| Model | Params | License | Best at |
|---|---|---|---|
| Phi-4 mini | 3.8B | MIT | Classification, extraction, structured tasks |
| Phi-4 | 14B | MIT | Structured reasoning at the edge of "small" |
| Gemma 2 9B | 9B | Gemma Terms | Multilingual workhorse, on-device chat |
| Qwen 3 7B | 7B | Apache 2.0 | Code in small footprint, Chinese |
| Llama 3.1 8B | 8B | Llama 3.1 Community | General-purpose, ecosystem familiarity |
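Task-to-model quick reference:
| Task | Model | Typical use |
|---|---|---|
| Classification | Phi-4 mini | Email, support tickets |
| Extraction | Phi-4 mini | Structured fields from text |
| Routing | Phi-4 mini | Decide which API to call |
| Summarization | Phi-4 | Short docs, single pass |
| Multilingual | Gemma 2 9B | Arabic-decent |
| Code helper | Qwen 3 7B | Coding, small footprint |
The routing flow: an email, ticket, document, or query comes in. Phi-4 mini classifies it and decides the path. About 90% of cases stay on the small model at zero API cost. The remaining 10% or so go to a frontier model, so you pay for what matters.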
The small-model accuracy numbers are workload-specific. For inbox triage with 18 well-defined categories, Phi-4 mini hits 94%. For free-form sentiment analysis on social media text, where the labels turn on fine-grained nuance, the community has seen the same model drop to 78%. The 94% number is a ceiling, not a floor.
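Because the numbers move this much between workloads, it is worth measuring on a labeled sample of your own traffic before committing. The sketch below computes overall and per-category accuracy against human labels. The function name and the tiny example data are made up for illustration and are not results from this piece.

```python
from collections import defaultdict


def accuracy_report(predictions, labels):
    """Overall accuracy and per-category accuracy against human labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty lists of equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold in zip(predictions, labels):
        total[gold] += 1
        if pred == gold:
            correct[gold] += 1
    overall = sum(correct.values()) / len(labels)
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    return overall, per_category


# Made-up example data, not measurements.
gold = ["billing", "billing", "refund", "refund", "outage"]
pred = ["billing", "refund", "refund", "refund", "outage"]
overall, per_category = accuracy_report(pred, gold)
print(f"overall: {overall:.0%}")
for category, acc in sorted(per_category.items()):
    print(f"  {category}: {acc:.0%}")
```

The per-category breakdown matters more than the headline figure: a single weak category can hide behind a strong average.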
A concrete production scenario
To make the trade less abstract: an inbox-classification pipeline that previously ran through Claude Sonnet 4.6 was rebuilt to run on Phi-4 mini locally. The setup classifies incoming emails into 18 priority categories.
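A minimal sketch of what the classification step in such a pipeline might look like, leaving out the model call itself. It builds a constrained prompt and validates whatever text comes back, so a malformed answer never becomes a category. The category names, the `needs_human_review` fallback, and both function names are hypothetical. The pipeline described here used 18 categories of its own, which are not listed in this piece.

```python
# Hypothetical category names, for illustration only.
CATEGORIES = ["urgent_outage", "billing_question", "feature_request", "spam"]
# Hypothetical fallback for output that matches no known category.
FALLBACK = "needs_human_review"


def build_prompt(email_body: str) -> str:
    """Build a prompt that restricts the model to a fixed label set."""
    options = "\n".join(f"- {c}" for c in CATEGORIES)
    return (
        "Classify the email into exactly one category from this list.\n"
        f"{options}\n"
        "Reply with the category name only.\n\n"
        f"Email:\n{email_body}\n\nCategory:"
    )


def parse_label(model_output: str) -> str:
    """Map raw model text to a known category, or fall back."""
    cleaned = model_output.strip().lower().strip(".\"'` ")
    for category in CATEGORIES:
        if cleaned == category:
            return category
    return FALLBACK


print(parse_label(" Billing_Question.\n"))
print(parse_label("I think it's probably spam"))
```

Exact matching is deliberately strict. A looser substring match would accept more outputs but could also turn a hedged sentence into a confident label.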
Before-and-after numbers:
| Metric | Sonnet via API | Phi-4 mini local |
|---|---|---|
| Cost per email | ~$0.004 | ~$0 (electricity) |
| End-to-end latency | ~800 ms | ~60 ms |
| Accuracy vs. human labels | 96% | 94% |
| Data leaves premises | Yes | No |
Accuracy dropped two points. Cost dropped to basically zero. Latency dropped by more than an order of magnitude. The data residency story changed from "leaves the network" to "stays put."
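The table's figures turn into daily numbers with simple arithmetic. The sketch below uses the per-email cost and latencies from the table. The daily volumes passed in are hypothetical examples, and the local side is treated as free, which ignores hardware and electricity.

```python
COST_PER_EMAIL_API = 0.004   # dollars, from the table
LATENCY_API_MS = 800         # from the table
LATENCY_LOCAL_MS = 60        # from the table


def daily_api_cost(emails_per_day: int) -> float:
    """API spend per day at a given volume, in dollars."""
    return emails_per_day * COST_PER_EMAIL_API


speedup = LATENCY_API_MS / LATENCY_LOCAL_MS
print(f"latency ratio: {speedup:.1f}x")

# Hypothetical daily volumes, for illustration.
for volume in (1_200, 4_000, 100_000):
    print(f"{volume:>7,} emails/day: ${daily_api_cost(volume):,.2f}")
```

The latency ratio works out to roughly 13x, which is what "more than an order of magnitude" refers to.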
For this workload, the trade is obvious.
For a sales-lead routing system where a misclassification has dollar consequences, the trade would tip the other way and the frontier API would stay. Small models open a different operating point on the cost-accuracy curve. The right question isn't which model is better. It's which operating point fits the workload. For the broader pricing picture across workloads, see price per use case.
Two gaps remain. None of these small models were fine-tuned on workload-specific data, which would probably close part of the accuracy gap on the support-email task, possibly enough to recover the two-point drop. The multimodal variants weren't tested here either. Both are open questions for follow-up.
Small language models won't be the future of frontier capability. They will be the future of production AI infrastructure.
The workloads they're winning — classification, extraction, routing, on-device inference — are exactly the ones that account for most of the API spend in actual businesses. A company running millions of inference calls a day through a frontier model, when 90% of those calls could be served by a 4B model, is leaving money on the table.
For any organization with serious volume, the right architecture is two-tier: a small model (fine-tuned where useful) in front, handling everything it can, and a frontier model behind it for the requests that justify the cost. Once volume is real, the cost dynamics and the latency wins are too big to ignore.
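The two-tier economics reduce to a weighted average. The sketch below assumes the $0.004 per-call frontier cost from the inbox example applies to every call, that the small model's marginal cost is zero, and that 90% of traffic can stay on the small model. All three are simplifying assumptions, and the function name is hypothetical.

```python
def blended_cost_per_call(frontier_cost: float, small_cost: float,
                          small_share: float) -> float:
    """Average cost per call when `small_share` of traffic stays on the small model."""
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    return small_share * small_cost + (1.0 - small_share) * frontier_cost


frontier_only = blended_cost_per_call(0.004, 0.0, 0.0)
two_tier = blended_cost_per_call(0.004, 0.0, 0.9)
print(f"frontier only: ${frontier_only:.4f} per call")
print(f"two-tier:      ${two_tier:.4f} per call")
```

Under these assumptions the blended cost per call falls to one-tenth of the frontier-only figure. A nonzero `small_cost` for hardware and operations narrows that gap.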
For English-only structured work, go with Phi-4 mini. For multilingual work, go with Gemma 2 9B. Both are good enough that the real question is where in the stack to use them. The frontier models keep the prestige; the small models do the work.