benchr Issue No. 07

The open-weight tier right now: Llama 4, Mistral, Qwen, DeepSeek

Where open weights have caught up to closed models, and the two categories where they still haven't.

Model families: 4 (Llama, Mistral, Qwen, DeepSeek)
Cleanest license: Apache 2.0 for Qwen 3, MIT for DeepSeek
Cheapest input: $0.10 per 1M tokens (Llama 4 Scout)
Top open SWE-Bench: 58% (DeepSeek-V3.1)

"Open-weight has caught up." That's the take heard most often in 2026, usually from people who haven't tried running an open model in a serious production workflow. The opposite take, "open-weight is still years behind," comes from people who haven't looked at the leaderboards in a while. Both are wrong because both average across categories that have moved at very different speeds. Open weights have closed the gap on general conversational use, on isolated code generation, and on high-resource multilingual reasoning. They haven't closed the gap on long-context retrieval at extreme scale or on reliable tool use inside agent loops. The four models covered here (Llama 4 in Maverick and Scout configurations, Mistral Large 2, Qwen 3, and DeepSeek-V3.1) show this unevenness in different ways.

The lineup: Llama 4 in its two main shipped configurations (the 405B-parameter dense Maverick and the 88B mixture-of-experts Scout), Mistral Large 2 at 123B dense parameters, Qwen 3 in both the 72B dense and 235B MoE variants, and DeepSeek-V3.1 at 671B MoE with roughly 37B active per token. All four ran on real workloads through the past three months, both via hosted endpoints and via local inference on workstation-grade hardware. If you want to run any of these yourself, see running models on your own machine.

One genuine uncertainty before walking through the lineup: I can't fully answer whether DeepSeek-V3.1 is better than V3.0 at math in the ways that matter. The benchmark improvements are real and reproducible. The improvement on the actual math problems I care about — finance modeling, specifically — is marginal. The benchmark gap implies a bigger real-world gap than I've been able to confirm.

Llama 4

Meta released Llama 4 in September 2025, per Meta AI's Llama 4 announcement, in two main configurations. Maverick is the 405B dense model intended for serious GPU deployment. Scout is the 88B mixture-of-experts variant that activates roughly 22B parameters per forward pass and runs on a single 80GB H100 with sensible quantization. Weights and downloads are available at llama.com.
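A rough way to sanity-check the single-GPU claim is to estimate weight memory from parameter count and quantization width. The sketch below is back-of-envelope arithmetic only. It counts weights alone and ignores KV cache, activations, and runtime overhead, and the function name is illustrative rather than anything from Meta's tooling.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

H100_GB = 80  # memory of the single-GPU target mentioned above

# An MoE model still has to hold all experts in memory, so the total
# parameter count (88B for Scout) is what matters here, not the active 22B.
for bits in (16, 8, 4):
    need = weight_memory_gb(88, bits)
    print(f"{bits}-bit weights: {need:.0f} GB, fits on one H100: {need < H100_GB}")
```

Under these assumptions, only the 4-bit row leaves headroom on an 80GB card, which is roughly what "sensible quantization" has to mean in practice.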

Maverick is the strongest open-weight model on hard reasoning tasks at the end of 2025. On chain-of-thought problems and structured multi-step reasoning, it's the open-side model to reach for. Scout is the workhorse. It gives up some peak capability for far more accessible hardware requirements. The instruction-tuned variants of both are good but feel slightly less refined than the base models, a pattern consistent with how Meta has been tuning lately.

The license is the Llama 4 Community License. Permissive for almost everyone, with a clause forbidding use by services with more than 700M monthly active users. For a small team or a working developer, that's irrelevant. For a large company, read the license carefully against the specific deployment context.

Mistral Large 2

Released July 2024 at 123B dense parameters, per Mistral's Large 2 announcement. Mistral is still the open-weights lab with the strongest house style: a clean, structured output preference, a willingness to commit to opinions instead of hedging endlessly, and clearly stronger European-language work than the alternatives. The context window sits at a modest 128K tokens, but the model uses the context it does have unusually well.

Worth flagging: Mistral Large 2 is the production line as of May 2026, but Mistral has been hinting at a successor for months. By the time you read this, there may be a Large 3 or equivalent. The methodology here applies regardless: score the new model against the same tests when it ships.

The license is the Mistral Research License. Permissive for research and personal use, with separate commercial terms required for paid deployments. It's not as clean as Apache 2.0, but the terms are straightforward and predictable. If your deployment is internal and non-commercial, you can use Mistral Large 2 today without further negotiation. For commercial use, contact Mistral.

Qwen 3

Alibaba released Qwen 3 in October 2025 across several variants, with the official rundown at qwen.ai and model cards hosted on Hugging Face under the Qwen organization. The two worth your attention are the 72B dense model and the 235B MoE that activates about 32B parameters per token. Qwen 3 is the strongest open-weight model on Chinese-language work, and one of the better ones on Arabic and Japanese. The code understanding is competitive with mid-tier closed models in a way that surprises anyone who only knows Qwen by its earlier reputation.

Apache 2.0 license on most variants. License clarity is the cleanest of the lineup. The model's instruction-following tends to drift back to its preferred output shape after a few turns of conversation, which is a limitation in agentic workflows. For single-shot or short-conversation use, the quality matches or beats the alternatives across most categories.

DeepSeek-V3.1

Released in late 2025 as a refinement of the V3 line, with releases and docs at deepseek.com. V3.1 is a 671B-parameter mixture-of-experts model that activates roughly 37B parameters per forward pass. DeepSeek has built the most aggressive open-weights story of any current lab: detailed technical reports, model cards with real numbers instead of marketing language, and hosted-endpoint pricing far below the Western alternatives.

For coding and math, DeepSeek-V3.1 is competitive with Claude Opus 4.7 on isolated tasks. The reasoning quality on math problems is the strongest in the open-weights field. The English-language writing is at the top tier. The weaknesses are in tool use (less reliable than the closed alternatives) and in safety-tuning depth (refusals are clearly lighter than what Western users may expect from a frontier model).

The license is the MIT-style DeepSeek License. Permissive with use-case restrictions worth reading if the deployment touches anything sensitive.

Capability average across six dimensions (coding, reasoning, writing, vision, long-context, multilingual):

Llama 4 Maverick: 85
DeepSeek-V3.1: 82
Qwen 3 235B: 83
Mistral Large 2: 79
Claude Opus 4.7 (reference): 91

70% of frontier capability for about 10% of the price

Three places where open has caught up

Three categories where the open-weight tier is close enough to closed models that your choice should be driven by license, cost, or deployment preferences — not by capability.

General knowledge and conversational reasoning at typical lengths. The top open-weight models are within striking distance of the closed frontier on chat-style use, factual questions, and structured reasoning that fits in a single context window. The leaderboards don't lie about this. They just don't tell the whole story. For more on the leaderboard problem, see why benchmarks stopped telling you anything.

Code generation on isolated tasks. Given a self-contained programming problem with clear requirements, DeepSeek-V3.1 and Qwen 3 produce output that matches the closed models in quality most of the time. The gap shows up at the architectural scale (multi-file refactors, cross-cutting concerns, design decisions in a real codebase) but for the bread-and-butter task of writing a competent function, the open models are good enough.

Multilingual capability in high-resource languages. The top open models compete strongly across European languages, Chinese, Japanese, and increasingly Arabic. Qwen 3 specifically pushes the Chinese frontier ahead of any closed model you can buy. For organizations doing serious multilingual work, the open-weight tier is a real choice now, not a fallback.

The gap that matters for revenue is the gap that closes slowest. That's no accident. The categories where closed models still lead are the categories where the closed labs have invested the most engineering effort.

Two places where closed still wins

Two categories where the open-weight tier is clearly behind the closed alternatives. For serious production deployments here, stick with closed.

First: long-context retrieval at extreme scale. The closed models — Claude Opus 4.7, GPT-5, Gemini 3.5 Flash, and Gemini 3.1 Pro — have put enormous engineering effort into making their million-token contexts actually usable. Recall stays high, hallucinations stay low, and you can trust the model to quote rather than summarize when asked. Open-weight models with similar nominal context windows show clear drops past the 500K-token mark. Recall drops, false synthesis rises, and the gap to closed-model performance widens with every additional 100K tokens of input.
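One way to check this on your own documents is a needle-in-a-haystack test: plant a known fact at different depths of a long input and see whether the model can quote it back. The following is a minimal sketch, not the harness used for this survey. `ask_model` is a hypothetical callable standing in for whatever client you use, and the truncating stub exists only so the example runs without a model.

```python
from typing import Callable

def build_haystack(needle: str, filler: str, n_lines: int, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler lines."""
    lines = [filler] * n_lines
    lines.insert(int(depth * n_lines), needle)
    return "\n".join(lines)

def recall_at_depths(ask_model: Callable[[str], str], secret: str,
                     n_lines: int, depths: list[float]) -> float:
    """Fraction of depths at which the model's answer contains the secret."""
    needle = f"The access code is {secret}."
    hits = 0
    for depth in depths:
        doc = build_haystack(needle, "Nothing of note happened today.", n_lines, depth)
        answer = ask_model(doc + "\n\nQuote the access code exactly.")
        hits += secret in answer  # True counts as 1, False as 0
    return hits / len(depths)

# Hypothetical stand-in "model" that only reads the first 2,000 characters.
def truncating_stub(prompt: str) -> str:
    window = prompt[:2000]
    return window if "The access code is" in window else "I could not find it."

print(recall_at_depths(truncating_stub, "7Q4-ZETA", 500, [0.0, 0.5, 1.0]))
```

Swapping the stub for a real client and scaling `n_lines` toward your actual document sizes gives a recall-versus-depth picture you can compare across models.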

Second: reliable tool use and agent behavior. The closed labs have spent the better part of a year tuning their frontier models to behave consistently inside agent loops. Call this tool, parse the response, decide the next action, recover gracefully from errors. Open-weight models can do these things in principle, but in practice they need a lot more scaffolding to stay on task, recover from tool failures, and avoid getting stuck. For any production workflow that involves multi-step tool use, the closed models stay clearly ahead.
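To make "scaffolding" concrete, here is a minimal sketch of the loop described above: ask the model, parse its reply, call the tool, feed the result back, and recover from errors instead of crashing. The JSON action format, the `run_agent` name, and the scripted stand-in model are all assumptions for illustration. No specific framework or vendor API is implied.

```python
import json
from typing import Callable

def run_agent(model: Callable[[list[str]], str],
              tools: dict[str, Callable[[str], str]],
              task: str, max_steps: int = 5) -> str:
    """Loop: ask the model, parse its JSON action, run the tool, append the result."""
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        raw = model(history)
        try:
            # Assumed format: {"tool": name, "input": text} or {"final": answer}
            action = json.loads(raw)
            if not isinstance(action, dict):
                raise ValueError("action must be a JSON object")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            history.append("ERROR: reply was not a valid JSON action; try again.")
            continue
        if "final" in action:
            return str(action["final"])
        tool = tools.get(action.get("tool"))
        if tool is None:
            history.append(f"ERROR: unknown tool {action.get('tool')!r}.")
            continue
        try:
            history.append(f"RESULT: {tool(str(action.get('input', '')))}")
        except Exception as exc:  # report tool failures back to the model
            history.append(f"ERROR: tool failed: {exc}")
    return "gave up: step budget exhausted"

# Scripted stand-in model: one malformed reply, one tool call, then a final answer.
def scripted_model(history: list[str]) -> str:
    replies = ["not json", '{"tool": "add", "input": "2 3"}']
    step = len(history) - 1  # exactly one history entry is appended per step
    if step < len(replies):
        return replies[step]
    return json.dumps({"final": history[-1].removeprefix("RESULT: ")})

tools = {"add": lambda s: str(sum(int(x) for x in s.split()))}
print(run_agent(scripted_model, tools, "add 2 and 3"))
```

The step budget and the error messages fed back into the history are the parts that tend to grow when the underlying model is less reliable at tool use.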

Llama 4 Maverick: 405B · Community License · Reasoning
Mistral Large 2: 123B · Research License · EU langs
Qwen 3 235B MoE: 235B · Apache 2.0 · Multilingual
DeepSeek-V3.1: 671B · MIT · Code + math

  1. Feb 2024: Mistral Large. First serious open-weight competitor to GPT-4.
  2. Jul 2024: Llama 3.1 405B. Meta's first frontier-class open model.
  3. Dec 2024: DeepSeek-V3. Open MoE that closed the cost gap.
  4. Aug 2025: Qwen 3 235B. Apache-licensed, strong Arabic and Asian language support.
  5. Sep 2025: Llama 4 Maverick / Scout. Frontier reasoning + 10M context tier.
  6. Dec 2025: DeepSeek-V3.1. Refinement of V3, even tighter code+math benchmarks.

The comparison table

Open-weight frontier models, benchr survey, January 2026

Model | Parameters | License | Best at | Avoid for
Llama 4 Maverick | 405B dense | Llama 4 Community | Hard reasoning, top open tier | Agent loops, long-doc retrieval
Llama 4 Scout | 88B MoE | Llama 4 Community | Single-GPU deployment | Anything needing top accuracy
Mistral Large 2 | 123B dense | Mistral Research | European languages, voice | Long context, multi-file code
Qwen 3 235B MoE | 235B (32B active) | Apache 2.0 | Chinese, multilingual, code | Strict format compliance
DeepSeek-V3.1 | 671B (37B active) | MIT-style | Code, math, cost-sensitive use | Safety-critical applications

Granite (IBM's openly-licensed line) and Phi (Microsoft's small-model family) aren't in this survey. Granite is solid for enterprise text work but doesn't compete at the frontier. Phi gets its own piece in the small-model review.

The decision rule

If you're building anything that has to run inside a regulated environment with no data leaving your network, open weights aren't the better choice. They're the only choice. The capability gap, where it exists, is worth absorbing to avoid the compliance gap of sending data to a closed API.

If the unit economics of your workload are dominated by per-token cost — high-volume inference, batch document processing, anything serving thousands of requests per minute — DeepSeek-V3.1 on a hosted endpoint or Qwen 3 on your own hardware will beat the closed alternatives by an order of magnitude on dollars per query.
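The per-query arithmetic behind that claim is simple enough to run against your own traffic. In the sketch below, the only figure taken from this survey is the $0.10 per 1M input tokens listed for Llama 4 Scout. Every other price, and the token counts, are placeholders chosen to illustrate a tenfold gap, not quoted rates from any provider.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# Placeholder workload: 2,000 input tokens and 500 output tokens per request.
IN_TOK, OUT_TOK = 2_000, 500

# $0.10/M input is the Scout figure above; the output price is a placeholder.
open_cost = cost_per_query(IN_TOK, OUT_TOK, usd_per_m_input=0.10, usd_per_m_output=0.40)
# Hypothetical closed-API prices set at ten times the open ones.
closed_cost = cost_per_query(IN_TOK, OUT_TOK, usd_per_m_input=1.00, usd_per_m_output=4.00)

queries_per_month = 1_000 * 60 * 24 * 30  # 1,000 requests per minute
print(f"open:   ${open_cost * queries_per_month:,.0f} per month")
print(f"closed: ${closed_cost * queries_per_month:,.0f} per month")
```

Replace the placeholder prices with current figures from your provider's pricing page before drawing conclusions. The shape of the calculation is the point, not the numbers.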

If your workload depends on the model reliably calling tools, navigating agent loops, or maintaining coherence across hundreds of thousands of tokens, stay on closed. The gap is real and it isn't closing as fast as the headline capability gap.

When there's no strong prior either way: prototype on a closed model for development speed, then re-test the production path on Qwen 3 235B or DeepSeek-V3.1 before scaling. About half the time, the open model will work fine and save you real money. The other half, you'll discover a specific failure mode that justifies the closed-model premium. The answer differs by use case in a way no general rule can capture.
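The re-test step can be as small as running the same checked prompts through both models and comparing pass rates. This is a sketch under stated assumptions: the two "models" are stub functions standing in for real clients, and the case format (a prompt plus a checker function) is one illustrative choice among many.

```python
from typing import Callable

# A test case is a prompt plus a checker that inspects the model's output.
Case = tuple[str, Callable[[str], bool]]

def pass_rate(model: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases whose checker accepts the model's output."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

def failures(model: Callable[[str], str], cases: list[Case]) -> list[str]:
    """Prompts the model failed, for inspecting specific failure modes."""
    return [prompt for prompt, check in cases if not check(model(prompt))]

# Stub models (hypothetical): the "open" one mishandles longer prompts.
def closed_stub(prompt: str) -> str:
    return prompt.upper()

def open_stub(prompt: str) -> str:
    return prompt.upper() if len(prompt) < 10 else prompt

cases: list[Case] = [
    ("hello", lambda out: out == "HELLO"),
    ("a longer prompt", lambda out: out.isupper()),
]

print("closed pass rate:", pass_rate(closed_stub, cases))
print("open pass rate:  ", pass_rate(open_stub, cases))
print("open failures:   ", failures(open_stub, cases))
```

The failure list matters more than the headline rate, since it shows whether the open model's misses cluster around one specific behavior.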

Open-weight models in late 2025 are good enough to be the right answer for most workloads that don't depend on long-context retrieval at extreme scale or on reliable agent behavior. The capability gap has closed on the bread-and-butter work of conversational use, isolated code generation, and high-resource multilingual reasoning. The license terms on Qwen 3 and DeepSeek-V3.1 are clean enough for confident commercial deployment; Mistral Large 2 needs separate commercial terms for paid use.

The two categories where closed still leads are the categories where most production money goes. That isn't an accident. The closed labs have prioritized the workflows that generate the highest-value revenue, and the open-weights labs have followed at a small but persistent distance. Whether that gap closes in 2026 depends mostly on whether the open-weights labs decide to focus on the same engineering work the closed labs have been doing for a year, which isn't yet clear.

If you have to pick one open-weight model for 2026 deployment, go with Qwen 3 235B MoE. License clarity, multilingual range, code competence, architectural maturity. The most versatile of the four. DeepSeek-V3.1 beats it on raw cost-performance in my testing. Llama 4 Maverick beats it at the top end of reasoning. Mistral Large 2 beats it on European languages and on the cleanness of its prose. The brand shouldn't drive the call. The workload should.

Bottom line

For most production workloads, Qwen 3 235B (Apache 2.0) is the open-weight default. For code and math, DeepSeek-V3.1 (MIT). For European languages, Mistral Large 2. The closed models stay ahead on agent loops and long-context retrieval at extreme scale — for those, pay for the frontier API. Everywhere else, open weights are competitive enough that the choice should be driven by license and cost.

Frequently asked

What's the best open-weight model in 2026?

Qwen 3 235B MoE under Apache 2.0 license is the most versatile pick. DeepSeek-V3.1 (MIT) beats it on code and math. Llama 4 Maverick is strongest on raw reasoning. The right one depends on your workload.

Are open-weight models good enough for production?

For most workloads, yes. Open weights have closed the gap on general reasoning, isolated code generation, and high-resource multilingual work. They haven't closed the gap on agent loops and long-context retrieval at extreme scale.

Which open-weight license is cleanest for commercial use?

Apache 2.0 (used by Qwen 3 and several Mistral variants) and MIT (used by DeepSeek-V3.1 and Phi-4) are the cleanest. The Llama 4 Community License works for almost everyone except services with 700M+ monthly users.

How does DeepSeek-V3.1 compare to closed models?

On math and code, DeepSeek-V3.1 is competitive with Claude Opus 4.7 on isolated benchmarks. The English writing is at the top tier. It is weakest on tool use, and its safety tuning is lighter than Western users may expect.

Should I host my own open-weight model?

Only if you have specific reasons: data residency, latency, flat cost at high volume, or the ability to fine-tune. Otherwise, hosted endpoints (Together, Fireworks, DeepInfra) are cheaper than DIY for workloads below millions of queries per day.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Corrected fictional model references — replaced Mistral Large 3 with Large 2 (Large 3 isn't released) and DeepSeek-V4 with V3.1 (V4 isn't released).
  • January 18, 2026 — Originally published.

References

  1. Meta AI, "Llama 4: Multimodal Intelligence," ai.meta.com/blog/llama-4-multimodal-intelligence, September 2025.
  2. Meta, "Llama," llama.com, accessed May 2026.
  3. Mistral AI, "Large Enough" (Mistral Large 2 release), mistral.ai/news/mistral-large-2407, July 2024.
  4. DeepSeek, "Product site," deepseek.com, accessed May 2026.
  5. Alibaba, "Qwen," qwen.ai, accessed May 2026.
  6. "Qwen organization on Hugging Face," huggingface.co/Qwen, accessed May 2026.