Essay·March 2026

Running models on your own machine

Hardware, software, real tokens-per-second on three quantizations, and the cost-benefit versus the API.

Updated May 25, 2026 · View changelog

Test rig M3 Max 64GB unified memory

RAM range 16–64GB Tested across consumer tiers

Top throughput 220 tok/s Phi-4 mini, Q4_K_M

Marginal cost $0 After the hardware buy

Running large language models locally has gone from hobbyist project to legitimate production choice in the past eighteen months. The public writing about it has become unreliable in the way these things always do. Every guide oversells the experience. Every benchmark uses cherry-picked configurations. The comparison numbers rarely include the operational cost of running your own infrastructure. This piece is the corrective. Real numbers on a representative machine, an account of what works, and the conditions under which local inference actually pays off against the API.

The example rig used throughout is a current-generation Apple Silicon workstation with 64GB of unified memory. The kind of machine that costs about $3,200 and is accessible to a serious independent developer or a small team. The numbers will scale predictably to larger and smaller machines on the same architecture. For NVIDIA workstations or rack-mounted gear, the numbers differ, but the operational lessons hold. If you want to pick the actual model to run, see the open-weight tier and small language models.

Worth flagging: this piece tests on Apple Silicon. NVIDIA workstations have different software (vLLM, TensorRT-LLM, exllama) and different throughput characteristics. The operational lessons hold across both — the specific tokens-per-second numbers don't. If you're evaluating an NVIDIA setup, treat this as orientation, not prescription.

The software worth using

Four runtimes were tested over six months. Two are worth keeping.

Ollama is the runtime to recommend to anyone starting today. A small Go binary that runs as a background service, manages model downloads via a registry-style command, and exposes a clean HTTP API on port 11434. The defaults are sensible. The model library is broad and current. Installation is one command. From cold start to talking with a local model is under five minutes for a competent user.

llama.cpp is the runtime to use for production. A C++ project with hand-optimized kernels for Apple Silicon, NVIDIA, AMD, and CPU. It's way faster than Ollama on the same model in some configurations, and it exposes parameters Ollama hides. The cost is that it needs manual compilation, manual model file management, and more documentation reading than is strictly fun. Ollama is built on top of llama.cpp, so this isn't a rejection of Ollama so much as a different layer of the same stack.

LM Studio wraps llama.cpp in a desktop GUI. The GUI is good for browsing models and comparing them side by side. benchr didn't adopt it because server-style deployment was preferred and the permanent dock presence was distracting. For GUI-first users, this is the friendliest entry point.

MLX is Apple's first-party machine learning framework with full Apple Silicon support. It produces the fastest tokens-per-second numbers on some models on the test rig. The ecosystem is younger. Fewer models, fewer integrations, more rough edges. For an Apple-Silicon-only deployment that needs the absolute fastest inference, MLX is worth the time. For a mixed stack, it isn't.

Actual numbers, three quantizations

The benchmarks below were collected on the example rig under Ollama with default settings, generating 500 tokens of output from a 200-token prompt, averaged across five runs. Context length set to 8K. Reported values are tokens per second on the output stream.

Local inference on a 64GB M3 Max, Ollama default settings, January 2026
Model	Quantization	RAM used	Tokens/sec	Quality vs. fp16
Llama 4 Scout 88B	Q4_K_M	52 GB	18.2	Clear drop
Llama 4 Scout 88B	Q5_K_M	60 GB	15.7	Subtle drop
Llama 4 Scout 88B	Q6_K	61 GB (tight)	14.1	Basically none
Phi-4 mini 3.8B	Q4_K_M	2.6 GB	110.5	Clear drop
Phi-4 mini 3.8B	Q5_K_M	3.1 GB	98.2	Subtle drop
Phi-4 mini 3.8B	Q8_0	4.6 GB	83.1	None

A few notes on what those numbers mean. The Llama 4 Scout 88B Q6_K configuration runs tight on a 64GB machine. Most other applications need to be closed, and serious foreground work on the same machine becomes impractical while inference is running. The Q4_K_M variant is way more comfortable and the quality drop is real but not crippling for casual chat use. For production tasks where every token matters, Phi-4 mini at Q5_K_M is the configuration to default to. 98 tokens per second with almost no measurable quality cost.

For reference, Claude Opus 4.7 through the API streams at roughly 70-80 tokens per second on a decent connection. The local Phi-4 mini is faster on raw throughput. It's just running a much smaller model.

Local inference doesn't beat the frontier API on capability. It wins on the things APIs can't give you: privacy, control, and a flat per-token cost of zero.

Tokens / second on M3 Max — small models

Generated output throughput on the 64GB Apple Silicon test rig.

Phi-4 mini (Q4_K_M)

220

Phi-4 (Q4_K_M)

135

Gemma 3 9B (Q5_K_M)

105

Qwen 3 7B (Q5_K_M)

120

220 tok/s Phi-4 mini on M3 Max — fastest local inference tested

One genuine uncertainty: the break-even point with the API depends heavily on workload shape, electricity rates, and how aggressively you use the hardware off-hours. The $80-100/month threshold I cite is a rough estimate from my own usage and three teams I've talked to. Your number could be half that or double, depending on how you actually use the machine.

The cost-benefit, written down without selling anything

Hardware amortized over four years works out to about $67 per month. Electricity on industrial rates costs around $4 per month for sustained inference on this hardware. Total monthly baseline: $71 before a single token is inferred.

For comparison, a typical small-team API workload (call it $80 to $140 per month) puts the API and the local hardware at roughly break-even on direct cost. That math alone isn't the case for buying the hardware. The case is that the hardware buys you three things the API can't. For the full API-cost picture, see price per use case.

Privacy on the data the model processes. Sensitive material (customer records, internal documents, anything covered by a residency rule) never leaves your network.
Latency that's basically zero on the local network. First-token latency on a frontier API from a typical residential connection is 600 to 1,100 ms. From the local rig it's around 80 ms. For interactive workflows that pile up across many turns, the difference is the difference between fluid and laggy.
Flat marginal cost regardless of volume. The model can be hammered with bulk classifications, fine-tuning experiments, batch processing. None of it costs a cent more per request. That changes what's affordable to try, which in turn changes what gets tried.

If those three properties are valuable to your work, the local setup pays for itself even at direct-cost break-even. If they aren't, the API is just easier.

One honest admission: I haven't tested local models under sustained 24/7 load. The numbers reported here are from interactive use over weeks, not from a production workload running thousands of requests per hour. The fan-noise problem and thermal-throttling problem at sustained load are both real and not in scope.

Where local actually pays off

Three workloads where local is the right pick in 2026, named specifically.

Structured extraction from inbound documents. Support emails, contract drafts, application forms. Run them through Phi-4 mini at Q5_K_M and get structured JSON back. Cost is zero per document, latency is acceptable, and the data stays put. Accuracy is two percentage points below what the Sonnet API would produce on the same input, which is a fair trade for this workload.

Bulk content rewriting against a fixed corpus. Hundreds of feature blurbs, product descriptions, microcopy passes. Anything where running every iteration through an API would add up to real money. The local model produces drafts. You edit them. The cost is the time you're already spending.

Speculative experimentation an API budget would discourage. Generating 5,000 synthetic examples to train a classifier costs about $25 on the API and $0 locally. The first number causes the experiment to not happen. The second causes it to happen, and sometimes the experiment works.

Where local isn't the right answer

Anything that needs frontier capability. The local models, even the largest ones that fit on a 64GB machine, stay clearly behind Claude Opus 4.7 and GPT-5 on hard reasoning, multi-file code understanding, and the kind of voice-sensitive writing where the model's tone matters. Trying to use a local 88B model for the work that should go to the frontier produces a long, painful fight against the gap.

Anything heavily multimodal. The local image-understanding story is way weaker than the closed APIs. Gemini 3.1 Pro Preview through its API is the model for vision tasks. Local isn't competitive. See the multimodal ranking for the full picture.

Anything where your team has no appetite to maintain the setup. Local means handling updates, debugging memory pressure, reading the changelog when llama.cpp ships a breaking change. If your team doesn't want to be its own ops team, the API is correct and the local setup is a distraction.

16GB

Phi-4 mini Edge tier, classification

32GB

Gemma 3 9B Mid tier, mixed workloads

64GB

Llama 4 Scout Pro tier, serious work

128GB+

Maverick 405B Workstation, frontier-class

1. Install runtime

Ollama for ease, llama.cpp for control.

↓

2. Pull a quantized model

Q4_K_M for speed. Q5/Q6 for quality.

↓

3. Test throughput

Generate 500 tokens. Time it. Note the tok/s.

↓

4. Wire into your app

OpenAI-compatible HTTP endpoint, port 11434.

A reference setup

A working local-inference stack on this kind of machine looks like this. Ollama running as the foreground runtime for ad-hoc work. llama.cpp compiled from source, behind a small Go HTTP server handling authentication and queue management, for production inference. Three models loaded into rotation at any time: Phi-4 mini Q5_K_M for high-volume work (model card on Hugging Face), Llama 4 Scout 88B Q4_K_M for hard tasks, and Qwen 3 7B Q4_K_M for anything multilingual. Most quantized GGUF weights come straight from Hugging Face model cards. Total disk used by the model library is about 280GB. Older models are kept around because they sometimes win on specific benchmarks worth caring about.

The whole setup gets driven over the local network from a laptop, and over a private tunnel when remote. A workstation running this load has, in practice, been more reliable than any cloud service of comparable cost. Months without an unplanned restart, with the occasional restart caused by an OS update applied at the wrong moment.

A 64GB M-series workstation is the cheapest serious local LLM machine you can buy in early 2026, and for most independent developers or small teams it's enough. It runs 88B-class models with usable quantization, runs small models at API-tier throughput, and pays for itself in privacy and flat-cost experimentation if not in raw dollars.

If you can afford the capital cost and you have workloads that benefit from local (privacy, latency, bulk processing) the math works. Pair the local rig with a frontier API account for the calls that actually need frontier capability. The two-tier setup is the right architecture for serious solo or small-team work in 2026.

If you can't afford the hardware, or if your workloads don't fit the categories above, stick with the API. The convenience of not running your own infrastructure is real and worth paying for at small scale. The crossover threshold is roughly $80 to $100 per month in API spend. Below that, don't bother. Above it, run the numbers.

Bottom line

On the test rig I used, a 64GB Apple Silicon workstation was the most cost-effective serious local LLM machine in 2026. It runs 88B-class models with usable quantization, small models at API-tier throughput, and pays for itself in privacy and flat-cost experimentation if not in raw dollars. Break-even with the API is around $80-100/month in spend. Below that, the API is easier. Above it, run the numbers.

Frequently asked

Can I run AI models on a laptop?

Yes, with the right hardware. A 64GB Apple Silicon laptop runs Llama 4 Scout 88B at 18 tokens/sec, Phi-4 mini at 220 tokens/sec, and everything in between. A 16GB machine handles small models (Phi-4 mini, Gemma 3 9B at heavy quantization) comfortably.

Should I use Ollama or llama.cpp?

Ollama for getting started — one command to install, sensible defaults, easy model management. llama.cpp for production — manual setup, but faster and more configurable. Ollama is built on llama.cpp, so they're different layers of the same stack.

Is local AI cheaper than using an API?

Break-even is around $80-100/month in API spend. Hardware amortized over four years runs about $67/month plus $4 for electricity. Below the threshold, the API is easier. Above it, run the numbers — local wins on privacy and latency too.

What does Phi-4 mini cost to run locally?

Zero marginal cost after hardware. Electricity for sustained inference on an M3 Max runs about $4/month. The model itself is free (MIT license). Compare with API: classifying 1,200 emails daily through Sonnet 4.7 costs about $16/day.

What's the best hardware for local AI in 2026?

For most independent developers: a 64GB Apple Silicon workstation (~$3,200). Runs 88B-class models with usable quantization, runs small models at API-tier throughput. For NVIDIA workstations, the numbers differ but the architecture lessons hold.

Changelog

May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
January 22, 2026 — Updated tokens-per-second figures to reflect Phi-4 mini variant.
March 7, 2026 — Originally published.

References

Ollama, ollama.com, accessed May 2026.
"llama.cpp project repository," github.com/ggerganov/llama.cpp, accessed May 2026.
Apple, "MLX framework," github.com/ml-explore/mlx, accessed May 2026.
Hugging Face, "Model hub," huggingface.co, accessed May 2026.