"Open-weight has caught up." That's the take heard most often in 2026, usually from people who haven't tried running an open model in a serious production workflow. The opposite take, "open-weight is still years behind," comes from people who haven't looked at the leaderboards in a while. Both are wrong because both average across categories that have moved at very different speeds. Open weights have closed the gap on general conversational use, on isolated code generation, and on high-resource multilingual reasoning. They haven't closed the gap on long-context retrieval at extreme scale or on reliable tool use inside agent loops. The four models covered here (Llama 4 in Maverick and Scout configurations, Mistral Large 2, Qwen 3, and DeepSeek-V3.1) show this unevenness in different ways.
The lineup: Llama 4 in its two main shipped configurations (the 405B-parameter dense Maverick and the 88B mixture-of-experts Scout), Mistral Large 2 at 123B dense parameters, Qwen 3 in both the 72B dense and 235B MoE variants, and DeepSeek-V3.1 at 671B MoE with roughly 37B active per token. All four ran on real workloads over the past three months, both via hosted endpoints and via local inference on workstation-grade hardware. If you want to run any of these yourself, see running models on your own machine.
One genuine uncertainty before walking through the lineup: I can't fully answer whether DeepSeek-V3.1 is better than V3.0 at math in the ways that matter. The benchmark improvements are real and reproducible. The improvement on the actual math problems I care about — finance modeling, specifically — is marginal. The benchmark gap implies a bigger real-world gap than I've been able to confirm.
Llama 4
Meta released Llama 4 in September 2025, per Meta AI's Llama 4 announcement, in two main configurations. Maverick is the 405B dense model intended for serious GPU deployment. Scout is the 88B mixture-of-experts variant that activates roughly 22B parameters per forward pass and runs on a single 80GB H100 with sensible quantization. Weights and downloads are available at llama.com.
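The single-GPU claim for Scout comes down to arithmetic on weight memory. Here is a rough sketch that counts only the weights. The function names and the 10 GB headroom for KV cache and activations are illustrative assumptions, not measured figures. Note that a mixture-of-experts model has to hold all of its experts in memory even though only a fraction is active per token.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits_on_gpu(params_billions: float, bits_per_weight: float,
                gpu_gb: float = 80.0, headroom_gb: float = 10.0) -> bool:
    """True if the weights fit while leaving headroom for KV cache and activations.

    headroom_gb is an illustrative guess; real overhead depends on the
    runtime, batch size, and context length.
    """
    return weight_memory_gb(params_billions, bits_per_weight) + headroom_gb <= gpu_gb


if __name__ == "__main__":
    # Scout's 88B total parameters at three common precisions.
    for bits in (16, 8, 4):
        gb = weight_memory_gb(88, bits)
        print(f"88B at {bits}-bit: {gb:.0f} GB of weights, "
              f"fits on one 80GB card: {fits_on_gpu(88, bits)}")
```

By this estimate the 88B weights need about 176 GB at 16-bit and 88 GB at 8-bit, so "sensible quantization" in practice means something close to 4-bit, where the weights come to about 44 GB.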
Maverick is the strongest open-weight model on hard reasoning tasks at the end of 2025. On chain-of-thought problems and structured multi-step reasoning, it's the open-side model to reach for. Scout is the workhorse. It gives up some peak capability for way more accessible hardware requirements. The instruction-tuned variants of both are good but feel slightly less refined than the base models, a pattern consistent with how Meta has been tuning lately.
The license is the Llama 4 Community License. Permissive for almost everyone, with a clause forbidding use by services with more than 700M monthly active users. For a small team or a working developer, that's irrelevant. For a large company, read the license carefully against the specific deployment context.
Mistral Large 2
Mistral released Large 2 in July 2024 at 123B dense parameters, per Mistral's Large 2 announcement. Mistral is still the open-weights lab with the strongest house style: a clean, structured output preference, a willingness to commit to opinions instead of hedging endlessly, and clearly stronger European-language work than the alternatives. The context window is a comparatively modest 128K tokens, but the context the model does have is unusually well-used.
The license is the Mistral Research License. Permissive for research and personal use, with separate commercial terms required for paid deployments. It's not as clean as Apache 2.0, but the terms are straightforward and predictable. If your deployment is internal and non-commercial, you can use Mistral Large 2 today without further negotiation. For commercial use, contact Mistral.
Worth flagging: Mistral Large 2 is the production line in May 2026, but Mistral has been hinting at a successor for months. By the time you read this, there may be a Large 3 or equivalent. The methodology here applies regardless: score the new model against the same tests when it ships.
Qwen 3
Alibaba released Qwen 3 in October 2025 across several variants, with the official rundown at qwen.ai and model cards hosted on Hugging Face under the Qwen organization. The two worth your attention are the 72B dense model and the 235B MoE that activates about 32B parameters per token. Qwen 3 is the strongest open-weight model on Chinese-language work, and one of the better ones on Arabic and Japanese. The code understanding is competitive with mid-tier closed models in a way that surprises anyone who only knows Qwen by its earlier reputation.
Apache 2.0 license on most variants. License clarity is the cleanest of the lineup. The model's instruction-following tends to drift back to its preferred output shape after a few turns of conversation, which is a limitation in agentic workflows. For single-shot or short-conversation use, the quality matches or beats the alternatives across most categories.
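Format drift is easy to measure if you pin the expected output shape up front. The sketch below assumes you asked for a JSON object with fixed keys on every turn and collected the replies in order. The function names and the JSON convention are my own illustration, not part of any Qwen tooling.

```python
import json
from typing import Iterable, List, Optional


def follows_format(reply: str, required_keys: Iterable[str]) -> bool:
    """True if the reply parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


def first_drift_turn(replies: List[str], required_keys: Iterable[str]) -> Optional[int]:
    """Return the 1-based turn where the format first breaks, or None if it never does."""
    keys = list(required_keys)  # materialize once so the iterable can be reused
    for turn, reply in enumerate(replies, start=1):
        if not follows_format(reply, keys):
            return turn
    return None
```

Run the same multi-turn script against each candidate model and compare the turn numbers. A model that returns None across a long conversation is a safer bet for agentic use than one that breaks at turn three.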
DeepSeek-V3.1
Released in late 2025 as a refinement of the V3 line — releases and docs at deepseek.com. The V3.1 update sits at 671B-parameter mixture-of-experts that activates roughly 37B parameters per forward pass. DeepSeek has built the most aggressive open-weights story of any current lab. Detailed technical reports, model cards with real numbers instead of marketing language, and hosted-endpoint pricing way below the Western alternatives.
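The mixture-of-experts numbers are worth turning into arithmetic. This sketch uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. That is an approximation that ignores attention cost at long context, and the function names are illustrative.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of the model's weights used for any single token."""
    return active_params_b / total_params_b


def forward_flops_per_token(active_params_b: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active_params_b * 1e9


def compute_ratio(dense_params_b: float, moe_active_params_b: float) -> float:
    """How many times more per-token compute a dense model needs than an MoE."""
    return forward_flops_per_token(dense_params_b) / forward_flops_per_token(moe_active_params_b)


if __name__ == "__main__":
    print(f"Active share: {active_fraction(671, 37):.1%}")
    print(f"405B dense vs 37B active: {compute_ratio(405, 37):.1f}x per-token compute")
```

With 37B of 671B parameters active, about 5.5% of the weights do the work for each token, and by this rule a 405B dense model spends roughly 11 times the compute per token. Memory is a separate matter: all 671B parameters still have to be resident to serve the model.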
For coding and math, DeepSeek-V3.1 is competitive with Claude Opus 4.7 on isolated tasks. The reasoning quality on math problems is the strongest in the open-weights field. The English-language writing is at the top tier. The weaknesses are in tool use (less reliable than the closed alternatives) and in safety-tuning depth (refusals are clearly lighter than what Western users may expect from a frontier model).
The license is the MIT-style DeepSeek License. Permissive with use-case restrictions worth reading if the deployment touches anything sensitive.
Three places where open has caught up
Three categories where the open-weight tier is close enough to closed models that your choice should be driven by license, cost, or deployment preferences — not by capability.
General knowledge and conversational reasoning at typical lengths. The top open-weight models are within striking distance of the closed frontier on chat-style use, factual questions, and structured reasoning that fits in a single context window. The leaderboards don't lie about this. They just don't tell the whole story. For more on the leaderboard problem, see why benchmarks stopped telling you anything.
Code generation on isolated tasks. Given a self-contained programming problem with clear requirements, DeepSeek-V3.1 and Qwen 3 produce output that matches the closed models in quality most of the time. The gap shows up at the architectural scale (multi-file refactors, cross-cutting concerns, design decisions in a real codebase) but for the bread-and-butter task of writing a competent function, the open models are good enough.
Multilingual capability in high-resource languages. The top open models compete strongly across European languages, Chinese, Japanese, and increasingly Arabic. Qwen 3 specifically pushes the Chinese frontier ahead of any closed model you can buy. For organizations doing serious multilingual work, the open-weight tier is a real choice now, not a fallback.
The gap that matters for revenue is the gap that closes slowest. That's no accident. The categories where closed models still lead are the categories where the closed labs have invested the most engineering effort.
Two places where closed still wins
Two categories where the open-weight tier is clearly behind the closed alternatives. For serious production deployments here, stick with closed.
First: long-context retrieval at extreme scale. The closed labs have put enormous engineering effort into making the million-token contexts of Claude Opus 4.7, GPT-5, Gemini 3.5 Flash, and Gemini 3.1 Pro actually usable. Recall stays high, hallucinations stay low, and you can trust the model to quote rather than summarize when asked. Open-weight models with similar nominal context windows degrade clearly past the 500K-token mark: recall drops, false synthesis rises, and the gap to closed-model performance widens with every additional 100K tokens of input.
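If you want to check long-context recall on your own data, a needle-in-a-haystack probe is the usual starting point. The sketch below is a minimal version under stated assumptions: `ask` is a hypothetical callable wrapping whatever endpoint you use, the filler text is a placeholder, and `truncating_model` is a stand-in that only "sees" the tail of the prompt so the harness can be exercised without a real model.

```python
import re
from typing import Callable, Dict, List

FILLER = "The quarterly report was filed on time and nothing unusual was noted."


def build_haystack(needle: str, total_sentences: int, depth: float) -> str:
    """Place `needle` at relative position `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [FILLER] * total_sentences
    sentences.insert(round(depth * total_sentences), needle)
    return " ".join(sentences)


def recall_at_depths(ask: Callable[[str], str], secret: str,
                     total_sentences: int, depths: List[float]) -> Dict[float, bool]:
    """For each depth, check whether the model's reply contains the secret."""
    needle = f"The vault code is {secret}."
    results: Dict[float, bool] = {}
    for depth in depths:
        prompt = build_haystack(needle, total_sentences, depth)
        prompt += "\n\nQuote the vault code exactly."
        results[depth] = secret in ask(prompt)
    return results


def truncating_model(prompt: str, window_chars: int = 2000) -> str:
    """Stand-in model that can only read the last `window_chars` characters."""
    visible = prompt[-window_chars:]
    match = re.search(r"The vault code is (\w+)\.", visible)
    return match.group(1) if match else "unknown"
```

Swap `truncating_model` for a real client, scale `total_sentences` until the prompt reaches the token counts you care about, and plot recall against depth. A single needle is a weak test; it says nothing about synthesis across many passages, so treat a pass as necessary rather than sufficient.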
Second: reliable tool use and agent behavior. The closed labs have spent the better part of a year tuning their frontier models to behave consistently inside agent loops. Call this tool, parse the response, decide the next action, recover gracefully from errors. Open-weight models can do these things in principle, but in practice they need a lot more scaffolding to stay on task, recover from tool failures, and avoid getting stuck. For any production workflow that involves multi-step tool use, the closed models stay clearly ahead.
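The scaffolding in question looks roughly like the loop below: ask for an action, parse it, run the tool, feed the result back, and recover when any step fails. This is a sketch under my own assumptions. The JSON action format is an invented convention, not any vendor's tool-calling protocol, and `scripted_model` is a test double that replays canned replies.

```python
import json
from typing import Callable, Dict, List


def run_agent(model: Callable[[List[str]], str],
              tools: Dict[str, Callable[[str], str]],
              task: str, max_steps: int = 8) -> str:
    """Drive a model through a tool loop with a hard step budget.

    Assumed reply format (illustrative only):
      {"tool": "<name>", "input": "<text>"}  or  {"final": "<answer>"}
    """
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        raw = model(transcript)
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            transcript.append("ERROR: reply was not valid JSON; try again.")
            continue
        if not isinstance(action, dict):
            transcript.append("ERROR: reply must be a JSON object.")
            continue
        if "final" in action:
            return str(action["final"])
        tool = tools.get(action.get("tool"))
        if tool is None:
            transcript.append(f"ERROR: unknown tool {action.get('tool')!r}.")
            continue
        try:
            result = tool(str(action.get("input", "")))
        except Exception as exc:  # report tool failures back instead of crashing
            transcript.append(f"ERROR: tool failed: {exc}")
            continue
        transcript.append(f"RESULT: {result}")
    raise RuntimeError("step budget exhausted without a final answer")


def scripted_model(replies: List[str]) -> Callable[[List[str]], str]:
    """Test double: returns the canned replies one at a time."""
    iterator = iter(replies)
    return lambda transcript: next(iterator)
```

The step budget is what stops a stuck model from looping forever, and every error is written into the transcript so the model gets a chance to correct itself. In my experience the difference between tiers shows up in how many of those error lines accumulate before the task finishes.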
| Model | Parameters | License | Strength |
|---|---|---|---|
| Llama 4 Maverick | 405B | Community License | Reasoning |
| Mistral Large 2 | 123B | Research License | EU langs |
| Qwen 3 235B MoE | 235B | Apache 2.0 | Multilingual |
| DeepSeek-V3.1 | 671B | MIT | Code + math |

| Date | Release | Note |
|---|---|---|
| Feb 2024 | Mistral Large | First serious open-weight competitor to GPT-4. |
| Jul 2024 | Llama 3.1 405B | Meta's first frontier-class open model. |
| Dec 2024 | DeepSeek-V3 | Open MoE that closed the cost gap. |
| Aug 2025 | Qwen 3 235B | Apache-licensed, strong Arabic and Asian language support. |
| Sep 2025 | Llama 4 Maverick / Scout | Frontier reasoning + 10M context tier. |
| Dec 2025 | DeepSeek-V3.1 | Refinement of V3, even tighter code+math benchmarks. |
The comparison table
| Model | Parameters | License | Best at | Avoid for |
|---|---|---|---|---|
| Llama 4 Maverick | 405B dense | Llama 4 Community | Hard reasoning, top open tier | Agent loops, long-doc retrieval |
| Llama 4 Scout | 88B MoE | Llama 4 Community | Single-GPU deployment | Anything needing top accuracy |
| Mistral Large 2 | 123B dense | Mistral Research | European languages, prose style | Long context, multi-file code |
| Qwen 3 235B MoE | 235B (32B active) | Apache 2.0 | Chinese, multilingual, code | Strict format compliance |
| DeepSeek-V3.1 | 671B (37B active) | MIT-style | Code, math, cost-sensitive use | Safety-critical applications |
Granite (IBM's openly-licensed line) and Phi (Microsoft's small-model family) aren't in this survey. Granite is solid for enterprise text work but doesn't compete at the frontier. Phi gets its own piece in the small-model review.
The decision rule
If you're building anything that has to run inside a regulated environment with no data leaving your network, open weights aren't the better choice. They're the only choice. The capability gap, where it exists, is worth absorbing to avoid the compliance gap of sending data to a closed API.
If the unit economics of your workload are dominated by per-token cost — high-volume inference, batch document processing, anything serving thousands of requests per minute — DeepSeek-V3.1 on a hosted endpoint or Qwen 3 on your own hardware will beat the closed alternatives by an order of magnitude on dollars per query.
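Before trusting an "order of magnitude" claim for your own workload, it's worth running the numbers. A small sketch follows. The prices you plug in are yours to supply, and any figures used with it here are placeholders, not quotes from any provider.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def monthly_cost(queries_per_month: int, input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of a month of identical requests."""
    return queries_per_month * cost_per_query(
        input_tokens, output_tokens, usd_per_m_input, usd_per_m_output)


def savings_ratio(cost_a: float, cost_b: float) -> float:
    """How many times more expensive option A is than option B."""
    return cost_a / cost_b
```

For example, with placeholder prices of 10 and 30 dollars per million input and output tokens, a request with 2,000 input and 500 output tokens costs `cost_per_query(2000, 500, 10.0, 30.0)`, which is 0.035 dollars. For self-hosted Qwen 3, replace the token prices with your amortized hardware and power cost per million tokens, which is harder to pin down and easy to underestimate.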
If your workload depends on the model reliably calling tools, navigating agent loops, or maintaining coherence across hundreds of thousands of tokens, stay on closed. The gap is real and it isn't closing as fast as the headline capability gap.
When there's no strong prior either way: prototype on a closed model for development speed, then re-test the production path on Qwen 3 235B or DeepSeek-V3.1 before scaling. About half the time, the open model will work fine and save you real money. The other half, you'll discover a specific failure mode that justifies the closed-model premium. The answer differs by use case in a way no general rule can capture.
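Re-testing the production path only works if both models face exactly the same cases. A minimal harness for that might look like the sketch below. The model callables are hypothetical wrappers you would write around each endpoint, and the checkers are whatever pass/fail logic fits your task.

```python
from typing import Callable, Dict, List, Tuple

# A case pairs a prompt with a checker that judges the model's reply.
Case = Tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: List[Case]) -> Tuple[float, List[str]]:
    """Return the fraction of cases passed and the prompts that failed."""
    if not cases:
        raise ValueError("need at least one test case")
    failures = [prompt for prompt, check in cases if not check(model(prompt))]
    return 1 - len(failures) / len(cases), failures


def compare(models: Dict[str, Callable[[str], str]],
            cases: List[Case]) -> Dict[str, Tuple[float, List[str]]]:
    """Run every model against the same cases and collect the results."""
    return {name: pass_rate(model, cases) for name, model in models.items()}
```

The list of failed prompts matters more than the headline rate. A model that fails 10% of cases spread evenly is a different proposition from one that fails every case of a single type, and the second is the "specific failure mode" worth knowing about before you scale.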
Open-weight models in late 2025 are good enough to be the right answer for most workloads that don't depend on long-context retrieval at extreme scale or on reliable agent behavior. The capability gap has closed on the bread-and-butter work of conversational use, isolated code generation, and high-resource multilingual reasoning. The license terms on Qwen are clean enough for confident commercial deployment, and Mistral's are predictable once the separate commercial terms are in place.
The two categories where closed still leads are the categories where most production money goes. That isn't an accident. The closed labs have prioritized the workflows that generate the highest-value revenue, and the open-weights labs have followed at a small but persistent distance. Whether that gap closes in 2026 depends mostly on whether the open-weights labs decide to focus on the same engineering work the closed labs have been doing for a year, which isn't yet clear.
If you have to pick one open-weight model for 2026 deployment, go with Qwen 3 235B MoE. License clarity, multilingual range, code competence, architectural maturity. The most versatile of the four. DeepSeek-V3.1 beats it on raw cost-performance in my testing. Llama 4 Maverick beats it at the top end of reasoning. Mistral Large 2 beats it on European languages and on the cleanness of its prose. The brand shouldn't drive the call. The workload should.