Guide · June 2026 · Published June 19, 2026

Your computer can't run the big open models. Here's what actually works.

Why DeepSeek and Llama 70B won't load on your laptop, and the four real fixes, from quantization to renting a cloud GPU.

By the benchr team · Updated June 19, 2026

70B at full precision: 140GB (FP16, 2 bytes per weight)

70B at Q4_K_M: 40GB (still past a 24GB card)

Typical gaming GPU: 8GB (holds a 7B model at Q4)

Rented A100 80GB: $1.39/hr (RunPod, June 2026)

You found a model you want to run. Maybe it's Llama 3.3 70B, maybe it's a DeepSeek variant everyone's talking about. You download the weights, point your runtime at them, and the load fails or the machine crawls to a halt. This isn't a bug, and it isn't your setup. The model simply needs more graphics memory than your computer has. The good news is the problem is well understood and the fixes are concrete, so let's walk through why it happens and then through every real way around it.

Why it won't load: the VRAM math

A model's weights have to sit in memory before the GPU can use them, and the rough rule is short: the VRAM you need in gigabytes is about the number of parameters in billions times the bytes per weight. Unquantized 16-bit weights (FP16, the "full precision" baseline in this guide) take 2 bytes each, so a 70B model in FP16 wants around 140GB. Compress it to 8-bit (Q8) and each weight drops to about 1 byte, which halves the figure to about 70GB. Go to 4-bit (Q4_K_M) and a weight costs a little over half a byte, because the format stores scaling data alongside the 4-bit values, so the same 70B model lands near 40GB of weights rather than the 35GB a flat half byte would give.
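The rule is simple enough to check in a few lines. This is a minimal sketch, not a precise sizing tool: the function name and the table of bytes per weight are illustrative, and the 0.57 figure for Q4_K_M is an assumption chosen so the result matches the roughly 40GB quoted above (a flat 0.5 would give 35GB).

```python
# Rule of thumb from the text: GB of weights ~= params (billions) x bytes per weight.
BYTES_PER_WEIGHT = {
    "FP16": 2.0,     # 16 bits per weight
    "Q8": 1.0,       # about 8 bits per weight
    "Q4_K_M": 0.57,  # assumed average: a little over 4 bits per weight
}


def weights_gb(params_billions: float, fmt: str) -> float:
    """Approximate size of the weights alone, in GB (no KV cache, no overhead)."""
    return params_billions * BYTES_PER_WEIGHT[fmt]


if __name__ == "__main__":
    for fmt in BYTES_PER_WEIGHT:
        print(f"70B at {fmt}: ~{weights_gb(70, fmt):.0f} GB")
```

Swap in your own parameter count to get a first estimate, then treat the result as a floor.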

Then add a little on top. The KV cache, which holds the running context, plus general overhead takes another 1 to 2GB at short context. It grows from there: the KV cache scales linearly with the number of tokens you keep in the window, so a long-context session can add several gigabytes that the math above doesn't show.
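To see how context length adds up, here is a rough sketch of the usual KV cache estimate: one key vector and one value vector per layer for every token in the window. The default layer count, KV head count, and head size are assumptions roughly in line with a 70B Llama-style model using grouped-query attention; check your model's config for the real values, and note that runtimes differ in how they allocate the cache.

```python
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for a given context length.

    Defaults are assumed values for a 70B-class model; they are not read
    from any real model file.
    """
    # Factor of 2: one key and one value vector per layer, per token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    for tokens in (2_048, 8_192, 32_768):
        print(f"{tokens:>6} tokens: ~{kv_cache_gb(tokens):.1f} GB")
```

Under these assumptions the cache stays under 1GB at a couple of thousand tokens and passes 10GB at 32K, which is the linear growth described above.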

The number on the model card is the floor, not the ceiling. Parameters set the weight size; context length quietly adds the rest.

That single rule explains every failed load. Your card has a fixed amount of VRAM, the model wants more than that, and there's no setting that makes the gap disappear. What you can change is the size of the model you ask it to hold.

What your GPU can actually hold

Here's the part the spec sheets bury. Consumer GPUs are sized for games, not 70B language models, and the jump from "fits" to "doesn't fit" is sharp. The table below pairs common VRAM tiers with the largest model that lands comfortably at Q4, the quantization most people run.

Approximate model VRAM at Q4_K_M and what fits on each GPU tier, June 2026. Figures are rough and grow with context length.
Model	Params	VRAM at Q4	Fits on
Llama / Qwen class	7B–8B	~5 GB	8GB card (RTX 3060/4060)
Mid-size	13B–14B	~9 GB	16GB card
DeepSeek-R1-Distill	32B	~18 GB	24GB card (RTX 3090/4090)
Llama 3.3 / Qwen 72B	70B–72B	~40 GB	48GB (2×4090 or M-series Max)
DeepSeek-V3 / R1 (full)	~671B (MoE)	130–250 GB+ (heavily quantized, below Q4; roughly 340–400 GB at Q4 by the rule above)	Enterprise multi-GPU only

Read it from the top. An 8GB card, an RTX 3060 or 4060, runs a 7B or 8B model at Q4 and stops there. A 16GB card gets you to roughly 13B or 14B. A 24GB card, the RTX 3090 or 4090 that anchors most serious home setups, tops out near a 32B model at Q4. Note where 70B is: not on the 24GB line. A 70B model at Q4 needs about 40GB, so the realistic local ceiling for it is 48GB, which means two 4090s wired together or an Apple Silicon Max chip with 48GB or more of unified memory.
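If you want to look up your own card, the tiers above reduce to a small table. This sketch only encodes the rows from this guide; the `TIERS` list and the function name are illustrative, and cards between tiers are rounded down to the tier below, which is a conservative assumption.

```python
from typing import Optional

# (VRAM tier in GB, largest comfortable model class at Q4), taken from the table above.
TIERS = [
    (8, "7B-8B"),
    (16, "13B-14B"),
    (24, "32B"),
    (48, "70B-72B"),
]


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest model class the table lists for this much VRAM."""
    best = None
    for tier_gb, model in TIERS:
        if vram_gb >= tier_gb:
            best = model
    return best


if __name__ == "__main__":
    for card in (6, 8, 12, 24, 48):
        print(f"{card}GB card: {largest_model_for(card) or 'below the smallest tier'}")
```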

So if your model won't load, find its row. If the model is a tier above what your card holds, the next four sections are the way through. For a fuller account of throughput, software, and the cost math when the hardware does fit, see running models on your own machine.

Fix 1: Quantize the model

Quantizing means storing each weight in fewer bits. Instead of 16 bits per weight you use 8, or 4, and the file shrinks in proportion. A 70B model that's 140GB in FP16 becomes about 40GB at Q4, which is the difference between needing a server and needing one good card or a rented pod. The formats you'll see are GGUF (used by llama.cpp and Ollama), plus AWQ and GPTQ on the GPU-server side.
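To make "fewer bits per weight" concrete, here is a deliberately simplified sketch of block-wise 4-bit quantization: each block of weights shares one scale, and every weight is stored as a small integer. This is not the actual GGUF Q4_K_M layout, which is more elaborate; the function names, the sample weights, and the 16-bit scale per block are assumptions for illustration.

```python
def quantize_block(weights: list[float]) -> tuple[float, list[int]]:
    """Map floats to signed 4-bit codes in [-7, 7] that share one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return scale, codes


def dequantize_block(scale: float, codes: list[int]) -> list[float]:
    """Recover approximate weights from the codes and the shared scale."""
    return [c * scale for c in codes]


def bits_per_weight(block_size: int, code_bits: int = 4, scale_bits: int = 16) -> float:
    """Average storage cost per weight once the per-block scale is included."""
    return (block_size * code_bits + scale_bits) / block_size


block = [0.12, -0.40, 0.03, 0.31, -0.22, 0.07, -0.05, 0.18]  # made-up sample weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

if __name__ == "__main__":
    print("codes:", codes)
    print(f"largest round-trip error: {max_error:.4f}")
    print(f"bits per weight at block size 32: {bits_per_weight(32)}")
```

The round-trip error is bounded by half the scale, and the shared scale is why a 4-bit format costs somewhat more than 4 bits per weight in practice.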

The fair question is what it costs you in quality, and the answer is less than the memory savings suggest. Q4_K_M, the variant most people land on, raises perplexity by less than about 1% against full FP16 on most tasks while cutting the footprint to under a third (about 40GB instead of 140GB for a 70B model). Q8 is close to lossless. The floor is real, though: at Q3 and below, reasoning quality drops in a way you'll feel on hard problems, so don't quantize past Q4 just to squeeze onto a card that's a size too small. Most quantized GGUF weights are published right on the Hugging Face model cards, so this fix is free and usually a single download.

A 70B model's weights, by quantization

Approximate VRAM for the weights alone, before KV cache and overhead.

FP16 (full): 140 GB
Q8: 70 GB
Q4_K_M: 40 GB

Fix 2: Drop to a smaller or distilled model

The model you picked may be bigger than the job needs. A distilled model is a smaller one trained to copy a larger one's behavior, and for many tasks the gap is narrower than the parameter count implies. DeepSeek is the clean example: you can't run the full 671B model at home, but DeepSeek-R1-Distill-32B fits in about 18GB at Q4, the 14B version in roughly 9GB, and the 7B distill in around 5GB on an 8GB card. Same family, same flavor of output, a fraction of the memory.

This is the fix to reach for when a 7B or 14B model would do the work and you were only running the 70B out of habit. It costs nothing, it loads on the card you already own, and the throughput is far better because the model is smaller. To pick a model that fits your hardware on purpose, see small language models for the sub-10B tier and the open-weight tier right now for where each family stands.

8GB: 7B distill, ~5GB at Q4
16GB: 14B, ~9GB at Q4
24GB: 32B, ~18GB at Q4
48GB: 70B, ~40GB at Q4

Fix 3: Use a hosted API

If you don't actually care about running the weights yourself, the least-effort answer is to not run them at all. A hosted API serves the model from someone else's hardware and bills you per token, with zero setup and no VRAM to think about. For the full DeepSeek model, this is often the sensible call: you skip the 130-to-250GB hardware problem entirely and pay only for what you send. The trade is that your data leaves your machine and the cost scales with usage rather than sitting flat.

This path makes sense when the model is too big to host, when you need it occasionally, or when buying and maintaining hardware isn't worth it for your volume. To compare per-token prices across the open and closed field, the model rankings list each one, and the cost calculator turns your token volume into a monthly number so you can weigh it against the other fixes.
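The monthly number is plain arithmetic once you know your volume. A minimal sketch follows; the function name is illustrative, and the $1.00 per million tokens in the example is a hypothetical placeholder, not a quoted price for any provider or model.

```python
def monthly_api_cost(tokens_per_day: int, usd_per_million_tokens: float,
                     days: int = 30) -> float:
    """Monthly spend for a per-token API at a steady daily volume."""
    return tokens_per_day * days / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # Hypothetical price and volume, for illustration only.
    cost = monthly_api_cost(tokens_per_day=500_000, usd_per_million_tokens=1.00)
    print(f"~${cost:.2f} per month")
```

Real pricing usually differs for input and output tokens, so run the calculation once for each and add the results.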

Fix 4: Rent a cloud GPU by the hour

Here's the option that solves the case the other three don't: you want the full-size model, on a real GPU you control, without spending thousands on a card. You rent the GPU by the hour. You get root on a machine with an A100 or H100 in it, load whatever weights you like, run your job, and shut it down when you're done. No purchase, no model swap, no per-token metering on someone else's terms.

This is the right fix when you need serious VRAM only now and then: a weekend fine-tuning run, a one-off batch job, or testing a 70B model before deciding whether to buy hardware at all. A card you use a few hours a week is mostly an idle expense, and renting turns that fixed cost into a small variable one. The catch is that you pay while the pod runs, so you start it for the work and stop it after, rather than leaving it on.

The pricing is the part that makes it work for occasional jobs. As of June 2026, RunPod lists an RTX 4090 24GB at roughly $0.69 an hour, an A100 80GB at about $1.39, and an H100 80GB at about $2.89, with on-demand pods that come up in around 30 seconds. The one-click vLLM and Ollama templates run any Hugging Face model you point them at, including Llama, Qwen, DeepSeek, Gemma, and Phi, so there's no environment to build from scratch.

Run the comparison for your own case. An A100 at $1.39 an hour is about $33 for a full day, or just $1.39 if one hour is all you needed. Against a $2,000-plus card that sits idle most of the week, renting wins for anything but constant, all-day use. If you do run inference around the clock, that's the point where owning the hardware starts to pay back, and the cost math in running models on your own machine shows where the line sits.
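The same comparison in code, as a rough sketch. The hourly rate is the June 2026 RunPod figure quoted above and the card price takes the "$2,000-plus" figure at its lower bound. The break-even ignores electricity, resale value, and the fact that a $2,000 consumer card is not an 80GB A100, so read it as an order-of-magnitude check.

```python
A100_USD_PER_HOUR = 1.39  # RunPod A100 80GB, June 2026, as quoted in the text
CARD_PRICE_USD = 2_000    # lower bound of the "$2,000-plus" card in the text


def rental_cost(hours: float, rate: float = A100_USD_PER_HOUR) -> float:
    """Total rental cost for a number of pod-hours."""
    return hours * rate


def break_even_hours(card_price: float = CARD_PRICE_USD,
                     rate: float = A100_USD_PER_HOUR) -> float:
    """Rented hours that add up to the purchase price of the card."""
    return card_price / rate


if __name__ == "__main__":
    print(f"Full day of rental: ${rental_cost(24):.2f}")
    print(f"Break-even: ~{break_even_hours():.0f} rented hours")
    print(f"At 5 hours a week: ~{break_even_hours() / 5 / 52:.1f} years")
```

At these numbers the break-even sits above 1,400 rented hours, which is several years of use at a few hours a week.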

1. Check the model's VRAM: params in billions × bytes per weight, plus 1–2GB.
2. Fits your card? Yes: quantize to Q4 and run it locally.
3. Close but over? Drop to a smaller or distilled model that fits.
4. Way over? Use a hosted API for no setup, or rent a GPU for full control.

So which fix is yours

Match the fix to the gap. If the model is just a little too big, quantize it to Q4 and run it on the card you have. If it's a tier or two over your card, drop to a smaller or distilled model that fits, since a 14B at Q4 beats a 70B you can't load at all. If the model is far past any single machine, like the full DeepSeek-V3 or R1, the question is control: reach for a hosted API when you want zero setup and per-token billing, and rent a cloud GPU when you want the actual weights on real hardware without buying it.
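The decision can be written down as a short function. This is a sketch under stated assumptions: the guide does not define exactly where "a tier or two over" ends and "far past" begins, so the `far_over_factor` of 2 and the 2GB overhead default are hypothetical thresholds you should adjust.

```python
def choose_fix(q4_weights_gb: float, card_gb: float,
               overhead_gb: float = 2.0, far_over_factor: float = 2.0) -> str:
    """Suggest a fix from the model's Q4 weight size and the card's VRAM.

    overhead_gb and far_over_factor are assumed thresholds, not values
    taken from the guide.
    """
    needed = q4_weights_gb + overhead_gb
    if needed <= card_gb:
        return "quantize to Q4 and run locally"
    if needed <= card_gb * far_over_factor:
        return "drop to a smaller or distilled model"
    return "hosted API or rented cloud GPU"


if __name__ == "__main__":
    print(choose_fix(18, 24))   # 32B on a 24GB card
    print(choose_fix(40, 24))   # 70B on a 24GB card
    print(choose_fix(40, 8))    # 70B on an 8GB card
```

Between the last two outcomes the choice is about control: the API for zero setup, the rented GPU when you want the weights on hardware you manage.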

The one thing not to do is keep trying to force a model onto a card that can't hold it. The VRAM rule doesn't bend. But every model you've heard of has a path to running, whether that's a quantized download, a smaller sibling, an API call, or a $1.39 hour on a rented A100, and now you know which one fits your situation.

Frequently asked

Can I run Llama 70B on a 24GB GPU?

No. A 70B model at Q4_K_M needs about 40GB of VRAM plus overhead, so it won't fit on a single 24GB card like an RTX 3090 or 4090. A 24GB GPU tops out around a 32B model at Q4. To run 70B locally you need roughly 48GB, which means two 4090s or an Apple Silicon Max chip with 48GB or more of unified memory.

Is renting a GPU cheaper than buying one?

For occasional use, yes. RunPod rents an A100 80GB for about $1.39 an hour and an H100 80GB for about $2.89 an hour as of June 2026, with per-second billing. If you only need the big GPU for a few hours a week, renting costs a fraction of a several-thousand-dollar card you'd otherwise leave idle. If you run inference all day every day, owning the hardware wins over time.

What's the cheapest way to run DeepSeek?

The full DeepSeek-V3 or R1 model is about 671B parameters and needs 130 to 250GB or more of VRAM, so it's enterprise multi-GPU territory, not a single-machine job. The cheap path for individuals is a distilled version: DeepSeek-R1-Distill-32B runs in about 18GB at Q4, and the 7B distill fits in roughly 5GB. If you need the full model, a hosted API or a rented multi-GPU pod is the realistic route.

Does quantization hurt quality?

A little, and less than people expect at the right level. Q4_K_M raises perplexity by less than about 1% versus full FP16 on most tasks while cutting the memory to under a third, which is why it's the common sweet spot. Q8 is near-lossless. Below Q4, at Q3 and lower, reasoning quality starts to degrade noticeably, so that's the floor for serious work.

Changelog

June 19, 2026 — Originally published.

References

RunPod, "GPU instance pricing," runpod.io/gpu-instance/pricing, accessed June 2026.
RunPod, "Get started with the vLLM worker," docs.runpod.io, accessed June 2026.
Hugging Face, "Model cards," huggingface.co, accessed June 2026.
Database Mart, "How much VRAM do you need for 7B–70B LLMs," databasemart.com, accessed June 2026.