Running large language models locally has gone from hobbyist project to legitimate production choice in the past eighteen months. The public writing about it has become unreliable, as writing about any fast-moving topic tends to. Most guides oversell the experience, the benchmarks lean on cherry-picked configurations, and the comparison numbers rarely include the operational cost of running your own infrastructure. What follows is the corrective: real numbers on a representative machine, an account of what actually works, and the conditions under which local inference pays off against the API.
The example rig used throughout is a current-generation Apple Silicon workstation with 64GB of unified memory, the kind of machine that costs about $3,200 and is within reach of a serious independent developer or a small team. The numbers will scale predictably to larger and smaller machines on the same architecture. For NVIDIA workstations or rack-mounted gear, the numbers differ, but the operational lessons hold. If you want to pick the actual model to run, see the open-weight tier and small language models.
This piece tests on Apple Silicon. NVIDIA workstations have different software (vLLM, TensorRT-LLM, exllama) and different throughput characteristics. The operational lessons carry over even though the specific tokens-per-second numbers won't, so if you're evaluating an NVIDIA setup, read this for orientation rather than exact figures.
The software worth using
Four runtimes were tested over six months. Two are worth keeping.
Ollama is the runtime to recommend to anyone starting today. A small Go binary that runs as a background service, manages model downloads via a registry-style command, and exposes a clean HTTP API on port 11434. The defaults are sensible and the model library is broad and current. Installation is a single command, and a competent user can go from cold start to talking with a local model in under five minutes.
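As a rough illustration of that HTTP API, the sketch below sends one non-streaming request to Ollama's `/api/generate` endpoint using only the Python standard library. The function names are illustrative, and any model tag you pass in is your own choice rather than something this piece prescribes.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the service running and a model already pulled, `generate("<your-model-tag>", "Say hello in five words.")` should return the reply as a plain string.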
llama.cpp is the runtime to use for production. A C++ project with hand-optimized kernels for Apple Silicon, NVIDIA, AMD, and CPU. It's way faster than Ollama on the same model in some configurations, and it exposes parameters Ollama hides. The cost is that it needs manual compilation, manual model file management, and more documentation reading than is strictly fun. Ollama is built on top of llama.cpp, so this isn't a rejection of Ollama so much as a different layer of the same stack.
LM Studio wraps llama.cpp in a desktop GUI. The GUI is good for browsing models and comparing them side by side. benchr didn't adopt it because server-style deployment was preferred and the permanent dock presence was distracting. For GUI-first users, this is the friendliest entry point.
MLX is Apple's first-party machine learning framework with full Apple Silicon support. It produces the fastest tokens-per-second numbers on some models on the test rig. The ecosystem is younger, though — fewer models, thinner integrations, and more rough edges to work around. If you're running Apple-Silicon-only and need the absolute fastest inference, MLX is worth the time; for a mixed stack it usually isn't.
Actual numbers, three quantizations
The benchmarks below were collected on the example rig under Ollama with default settings, generating 500 tokens of output from a 200-token prompt, averaged across five runs. Context length set to 8K. Reported values are tokens per second on the output stream.
| Model | Quantization | RAM used | Tokens/sec | Quality vs. fp16 |
|---|---|---|---|---|
| Llama 3.3 70B | Q4_K_M | 52 GB | 18.2 | Clear drop |
| Llama 3.3 70B | Q5_K_M | 60 GB | 15.7 | Subtle drop |
| Llama 3.3 70B | Q6_K | 61 GB (tight) | 14.1 | Basically none |
| Phi-4 mini 3.8B | Q4_K_M | 2.6 GB | 220.0 | Clear drop |
| Phi-4 mini 3.8B | Q5_K_M | 3.1 GB | 98.2 | Subtle drop |
| Phi-4 mini 3.8B | Q8_0 | 4.6 GB | 83.1 | None |
A few notes on what those numbers mean. The Llama 3.3 70B Q6_K configuration runs tight on a 64GB machine. Most other applications need to be closed, and serious foreground work on the same machine becomes impractical while inference is running. The Q4_K_M variant is way more comfortable, and the quality drop is genuine but not crippling for casual chat use. For production tasks where every token matters, default to Phi-4 mini at Q5_K_M, which holds 98 tokens per second with almost no measurable quality cost.
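One way to reproduce this kind of measurement: Ollama's non-streaming `/api/generate` response reports `eval_count` (output tokens) and `eval_duration` (nanoseconds spent generating), so output-stream throughput can be computed directly. The sketch below assumes you have already collected those response dicts yourself; the helper names are illustrative.

```python
from statistics import mean


def tokens_per_second(response: dict) -> float:
    """Output-stream throughput for one Ollama generate response."""
    # eval_duration is reported in nanoseconds, so convert to seconds first.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def average_throughput(responses: list[dict]) -> float:
    """Mean tokens/sec across runs, e.g. the five runs per configuration."""
    if not responses:
        raise ValueError("need at least one run")
    return mean(tokens_per_second(r) for r in responses)
```

To match the setup described here, the request's `options` field can carry `num_predict` (to cap output at 500 tokens) and `num_ctx` (to set the 8K context). Averaging per-run throughput, rather than dividing total tokens by total time, is a choice made for this sketch; the two differ slightly when run lengths vary.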
For reference, Claude Opus 4.7 through the API streams at roughly 70-80 tokens per second on a decent connection. The local Phi-4 mini beats that on raw throughput, but only because it's a much smaller model doing a much smaller job.
On raw capability the frontier API still wins. What local inference offers instead is what no API will sell you: your data never leaving the building, and a per-token cost that stays at zero no matter how hard you push it.
The break-even point with the API is genuinely hard to pin down. It depends on workload shape, electricity rates, and how aggressively you use the hardware off-hours. The $80-100/month threshold cited here is a rough estimate based on the public community discussion, and your own number could land at half that or double, depending on how you use the machine.
The cost-benefit, written down without selling anything
Hardware amortized over four years ($3,200 across 48 months) works out to about $67 per month. Electricity on industrial rates costs around $4 per month for sustained inference on this hardware. Total monthly baseline: $71 before a single token is generated.
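The arithmetic behind that baseline, as a small sketch. It uses the $3,200 price, four-year amortization, and $4 electricity figure from the text; the function names are illustrative, and the comparison covers direct cost only.

```python
def monthly_baseline(hardware_usd: float, months: int, electricity_usd: float) -> float:
    """Fixed monthly cost of owning the rig: amortized hardware plus power."""
    return hardware_usd / months + electricity_usd


def local_is_cheaper(api_spend_usd: float, baseline_usd: float) -> bool:
    """True when monthly API spend exceeds the local baseline (direct cost only)."""
    return api_spend_usd > baseline_usd
```

`monthly_baseline(3200, 48, 4)` comes to roughly 70.67, which rounds to the $71 figure. Swap in your own hardware price and electricity cost to see where your crossover lands.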
For comparison, a typical small-team API workload (call it $80 to $140 per month) puts the API and the local hardware at roughly break-even on direct cost. Break-even on its own doesn't justify the purchase. What does is the set of things the hardware buys you that the API can't, and there are three of them. For the full API-cost picture, see price per use case.
- Privacy on the data the model processes. Sensitive material (customer records, internal documents, anything covered by a residency rule) never leaves your network.
- Latency that's basically zero on the local network. First-token latency on a frontier API from a typical residential connection is 600 to 1,100 ms, against around 80 ms from the local rig. Across an interactive workflow that piles up over many turns, that's what separates a fluid session from a laggy one.
- Flat marginal cost regardless of volume. The model can be hammered with bulk classifications, fine-tuning experiments, and overnight batch processing, none of it costing a cent more per request. When experiments are free to run, you run more of them.
If those three properties matter to your work, the local setup pays for itself even at direct-cost break-even. Without them, there's little reason to take on the hardware, and the API stays the easier call.
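If you want to check the first-token latency gap on your own connection, a minimal sketch follows. It assumes `stream` is a lazy iterable of text chunks that only issues the request when iteration starts (a generator wrapping a streaming HTTP response, for example); that assumption matters, because otherwise the timer starts too late.

```python
import time
from typing import Iterable


def first_token_latency_ms(stream: Iterable[str]) -> float:
    """Milliseconds from starting to consume `stream` until the first chunk arrives."""
    start = time.perf_counter()
    for _ in stream:
        # Stop at the first chunk; the rest of the stream is not consumed.
        return (time.perf_counter() - start) * 1000.0
    raise ValueError("stream produced no chunks")
```

Run it several times against both the local endpoint and the API and compare medians, since a single measurement is noisy.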
One caveat worth stating plainly: this piece hasn't tested local models under sustained 24/7 load. The numbers reported here come from interactive use over weeks, not from a production workload running thousands of requests per hour. Fan noise and thermal throttling under that kind of sustained load are both real problems, and both out of scope here.
Where local pays off
Three workloads where local is the right pick in 2026, named specifically.
Structured extraction from inbound documents — support emails, contract drafts, application forms, and the like. Run them through Phi-4 mini at Q5_K_M and get structured JSON back at zero cost per document, with acceptable latency and the data never leaving your network. Accuracy lands about two percentage points below what the Sonnet API would produce on the same input, which is a fair trade for this workload.
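A minimal sketch of what that extraction loop might look like against Ollama, which accepts `"format": "json"` in the request body to constrain output to valid JSON. The field names in `REQUIRED_FIELDS` are a hypothetical schema invented for this example, and the prompt wording is only a starting point.

```python
import json

# Hypothetical schema for a support email; replace with your own fields.
REQUIRED_FIELDS = ("sender", "category", "summary")


def build_extraction_payload(model: str, document: str) -> dict:
    """Request body asking the model for one JSON object with the required keys."""
    prompt = (
        "Extract the following fields from the document and reply with a single "
        f"JSON object with keys {', '.join(REQUIRED_FIELDS)}.\n\n"
        f"Document:\n{document}"
    )
    # "format": "json" asks Ollama to emit syntactically valid JSON.
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}


def parse_extraction(response_text: str) -> dict:
    """Parse the model's reply and check that every required field is present."""
    record = json.loads(response_text)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record
```

Valid JSON is not the same as correct JSON, so the field check is worth keeping: documents that fail it can be retried or routed to a human.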
Bulk content rewriting against a fixed corpus — hundreds of feature blurbs and product descriptions, microcopy passes, anything where running every iteration through an API would quietly add up. The local model produces the drafts and you edit them, so the only cost is time you're already spending.
Speculative experimentation an API budget would discourage. Generating 5,000 synthetic examples to train a classifier costs about $25 on the API and $0 locally. At $25 the experiment quietly gets shelved; at $0 you just run it, and now and then it pays off.
Where local isn't the right answer
Anything that needs frontier capability. The local models, even the largest ones that fit on a 64GB machine, stay clearly behind Claude Opus 4.7 and GPT-5 on hard reasoning and multi-file code understanding, and on the kind of voice-sensitive writing where the model's tone matters. Push a local 70B model at work that belongs on the frontier and you spend the day fighting the gap.
Anything heavily multimodal. The local image-understanding story is way weaker than the closed APIs, and for vision tasks Gemini 3.1 Pro Preview through its API is simply where the work should go. See the multimodal ranking for the full picture.
Anything where your team has no appetite to maintain the setup. Local means handling updates and debugging memory pressure, and reading the changelog whenever llama.cpp ships a breaking change. A team that doesn't want to be its own ops team should stay on the API and skip the local setup entirely.
| Memory | Model | Tier |
|---|---|---|
| 16GB | Phi-4 mini | Edge tier, classification |
| 32GB | Gemma 2 9B | Mid tier, mixed workloads |
| 64GB | Llama 3.3 70B | Pro tier, serious work |
| 128GB+ | Maverick 400B | Workstation, frontier-class |

- Ollama for ease, llama.cpp for control.
- Q4_K_M for speed. Q5/Q6 for quality.
- Generate 500 tokens. Time it. Note the tok/s.
- OpenAI-compatible HTTP endpoint, port 11434.
A reference setup
A working local-inference stack on this kind of machine looks like this. Ollama runs as the foreground runtime for ad-hoc work. For production inference, llama.cpp is compiled from source and sits behind a small Go HTTP server that handles authentication and queue management. Three models are in rotation at any time: Phi-4 mini Q5_K_M for high-volume work, Llama 3.3 70B Q4_K_M for hard tasks, and Qwen 3 7B Q4_K_M for anything multilingual. Most quantized GGUF weights, the Phi-4 mini build included, come straight from Hugging Face model cards. Total disk used by the model library is about 280GB. Older models are kept around because they sometimes win on specific benchmarks worth caring about.
The whole setup gets driven over the local network from a laptop, and over a private tunnel when remote. A workstation running this load has, in practice, been more reliable than any cloud service of comparable cost. It has run for months at a time without an unplanned restart, and the few restarts that did happen were caused by an OS update applied at the wrong moment.
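The front server in this setup is written in Go; purely to illustrate the two decisions it makes per request (is the caller authorized, and which model should serve the task), here is a small Python sketch. The model tags, task categories, and function names are hypothetical placeholders, not the actual configuration.

```python
import hmac

# Hypothetical tags standing in for the three models in rotation.
ROTATION = {
    "high_volume": "phi-4-mini-q5_k_m",
    "hard": "llama-3.3-70b-q4_k_m",
    "multilingual": "qwen-3-7b-q4_k_m",
}


def authorized(header_value: str, expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header value in constant time."""
    scheme, _, token = header_value.partition(" ")
    return scheme == "Bearer" and hmac.compare_digest(
        token.encode("utf-8"), expected_token.encode("utf-8")
    )


def pick_model(task_kind: str) -> str:
    """Map a task category to a model; unknown categories fall back to the small one."""
    return ROTATION.get(task_kind, ROTATION["high_volume"])
```

Falling back to the small model for unknown task kinds is a choice made for this sketch: it keeps an unexpected request from tying up the 70B model.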
A 64GB M-series workstation is the cheapest serious local LLM machine you can buy in early 2026, and for most independent developers or small teams it's enough. It runs 70B-class models with usable quantization and small models at API-tier throughput, and even where it doesn't win on raw dollars it pays for itself in privacy and flat-cost experimentation.
If you can afford the capital cost and you have workloads that benefit from local — privacy-bound data, latency-sensitive paths, bulk processing — the math works. Pair the local rig with a frontier API account for the calls that genuinely need frontier capability. For serious solo or small-team work in 2026, that two-tier setup is hard to beat.
If you can't afford the hardware, or if your workloads don't fit the categories above, stick with the API. The convenience of not running your own infrastructure is genuine and worth paying for at small scale. The crossover sits at roughly $80 to $100 per month in API spend — under that it isn't worth the trouble, and once you're clearly past it, sit down and run the numbers for your own workload.