You found a model you want to run. Maybe it's Llama 3.3 70B, maybe it's a DeepSeek variant everyone's talking about. You download the weights, point your runtime at them, and the load fails or the machine crawls to a halt. This isn't a bug, and it isn't your setup. The model simply needs more graphics memory than your computer has. The good news is the problem is well understood and the fixes are concrete, so let's walk through why it happens and then through every real way around it.
Why it won't load: the VRAM math
A model's weights have to sit in memory before the GPU can use them, and the rough rule is short: the VRAM you need in gigabytes is about the number of parameters in billions times the bytes per weight. Unquantized 16-bit weights (FP16) take 2 bytes per weight. So a 70B model in FP16 wants around 140GB. Compress it to 8-bit (Q8) and each weight drops to about 1 byte, which roughly halves the figure to about 70GB. Go to 4-bit (Q4_K_M) and a weight costs a little over half a byte, so the same 70B model lands near 40GB of weights.
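The rule of thumb is simple enough to write down as a function. This is a minimal sketch; the name `weight_gb` and the flat bytes-per-weight table are illustrative assumptions, not any runtime's API.

```python
# Approximate bytes per weight for each format, using the rule of thumb above.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def weight_gb(params_billions: float, fmt: str) -> float:
    """Return the approximate size of the weights alone, in GB."""
    return params_billions * BYTES_PER_WEIGHT[fmt]


for fmt in BYTES_PER_WEIGHT:
    print(f"70B at {fmt}: ~{weight_gb(70, fmt):.0f} GB")
# 70B at FP16: ~140 GB
# 70B at Q8: ~70 GB
# 70B at Q4: ~35 GB
```

The flat half-byte figure gives 35GB for Q4. Real Q4_K_M files come out somewhat larger, which is why the 70B figure quoted here is closer to 40GB.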
Then add a little on top. The KV cache, which holds the running context, plus general overhead, eats another 1 to 2GB at short context and grows from there. Longer context isn't free either: the KV cache scales up linearly with how many tokens you keep in the window, so a long-context session can add several gigabytes the math above doesn't show.
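To see the linear growth, here is a rough sketch of the KV cache size for a single sequence. The layer count, KV-head count, and head dimension below are assumptions chosen to resemble a 70B-class model with grouped-query attention, and the cache is assumed to be stored in 16-bit values. Check your model's config before trusting the numbers.

```python
def kv_cache_gb(tokens: int, n_layers: int, n_kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for one sequence.

    Each token stores one key vector and one value vector per layer,
    hence the leading factor of 2.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9


# Assumed shape for a 70B-class model (illustrative, not read from a real config).
MODEL_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for tokens in (4_096, 32_768, 131_072):
    print(f"{tokens:>7} tokens: ~{kv_cache_gb(tokens, **MODEL_70B):.1f} GB")
```

Under these assumptions the cache is about 1.3GB at 4,096 tokens and about 43GB at 131,072, so a very long window can cost as much as the Q4 weights themselves.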
The number on the model card is the floor, not the ceiling. Parameters set the weight size; context length quietly adds the rest.
That single rule explains every failed load. Your card has a fixed amount of VRAM, the model wants more than that, and there's no setting that makes the gap disappear. What you can change is the size of the model you ask it to hold.
What your GPU can actually hold
Here's the part the spec sheets bury. Consumer GPUs are sized for games, not 70B language models, and the jump from "fits" to "doesn't fit" is sharp. The table below pairs common VRAM tiers with the largest model that lands comfortably at Q4, the quantization most people run.
| Model | Params | VRAM at Q4 | Fits on |
|---|---|---|---|
| Llama / Qwen class | 7B–8B | ~5 GB | 8GB card (RTX 3060/4060) |
| Mid-size | 13B–14B | ~9 GB | 16GB card |
| DeepSeek-R1-Distill | 32B | ~18 GB | 24GB card (RTX 3090/4090) |
| Llama 3.3 / Qwen 72B | 70B–72B | ~40 GB | 48GB (2×4090 or M-series Max) |
| DeepSeek-V3 / R1 (full) | ~671B (MoE) | ~335 GB+ by the rule above (130–250 GB only below 4-bit) | Enterprise multi-GPU only |
Read it from the top. An 8GB card, an RTX 3060 or 4060, runs a 7B or 8B model at Q4 and stops there. A 16GB card gets you to roughly 13B or 14B. A 24GB card, the RTX 3090 or 4090 that anchors most serious home setups, tops out near a 32B model at Q4. Note where 70B is: not on the 24GB line. A 70B model at Q4 needs about 40GB, so the realistic local ceiling for it is 48GB, which means two 4090s wired together or an Apple Silicon Max chip with 48GB or more of unified memory.
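The same reading can be done as a lookup. The sketch below copies the first four rows of the table into a list; `TIERS` and `largest_model_for` are hypothetical names used only for illustration.

```python
# (model class, approx. VRAM at Q4 in GB, card tier in GB), copied from the table.
TIERS = [
    ("7B-8B", 5, 8),
    ("13B-14B", 9, 16),
    ("32B", 18, 24),
    ("70B-72B", 40, 48),
]


def largest_model_for(vram_gb: float):
    """Return the largest model class whose card tier fits in vram_gb, or None."""
    best = None
    for model_class, _q4_gb, card_gb in TIERS:
        if vram_gb >= card_gb:
            best = model_class  # tiers are sorted, so the last match is the largest
    return best


print(largest_model_for(24))  # 32B
print(largest_model_for(12))  # 7B-8B
```

A 12GB card falls back to the 8GB tier here because the table lists tiers, not exact fits. A 13B model at about 9GB of weights might still squeeze onto it at short context.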
So if your model won't load, find its row. If the model is a tier above what your card holds, the next four sections are the way through. For a fuller account of throughput, software, and the cost math when the hardware does fit, see running models on your own machine.
Fix 1: Quantize the model
Quantizing means storing each weight in fewer bits. Instead of 16 bits per weight you use 8, or 4, and the file shrinks in proportion. A 70B model that's 140GB in FP16 becomes about 40GB at Q4, which is the difference between needing a server and needing one good card or a rented pod. The formats you'll see are GGUF (used by llama.cpp and Ollama), plus AWQ and GPTQ on the GPU-server side.
The fair question is what it costs you in quality, and the answer is less than the memory savings suggest. Q4_K_M, the variant most people land on, raises perplexity by less than about 1% against FP16 on most tasks while cutting the footprint by roughly 70%. Q8 is close to lossless. The floor is real, though: at Q3 and below, reasoning quality drops in a way you'll feel on hard problems, so don't quantize past Q4 just to squeeze onto a card that's a size too small. Most quantized GGUF weights are published right on the Hugging Face model cards, so this fix is free and usually a single download.
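One way to apply that advice is to pick the highest-precision format that still fits and refuse to go below Q4. This is a sketch under simple assumptions: flat bytes-per-weight figures and a fixed 2GB allowance for KV cache and overhead at short context. The function name `best_format` is hypothetical.

```python
# Formats in order of preference, highest precision first.
FORMATS = [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]


def best_format(params_billions: float, vram_gb: float, overhead_gb: float = 2.0):
    """Return the highest-precision format whose weights plus overhead fit, or None."""
    for name, bytes_per_weight in FORMATS:
        if params_billions * bytes_per_weight + overhead_gb <= vram_gb:
            return name
    # Nothing at Q4 or above fits: pick another fix rather than quantizing lower.
    return None


print(best_format(32, 24))  # Q4
print(best_format(70, 24))  # None
```

Returning `None` instead of falling through to Q3 or Q2 is deliberate: it encodes the point above that a card a size too small calls for a different fix, not a harsher quantization.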
Fix 2: Drop to a smaller or distilled model
The model you picked may be bigger than the job needs. A distilled model is a smaller one trained to copy a larger one's behavior, and for many tasks the gap is narrower than the parameter count implies. DeepSeek is the clean example: you can't run the full 671B model at home, but DeepSeek-R1-Distill-32B fits in about 18GB at Q4, the 14B version in roughly 9GB, and the 7B distill in around 5GB on an 8GB card. Same family, same flavor of output, a fraction of the memory.
This is the fix to reach for when a 7B or 14B model would do the work and you were only running the 70B out of habit. It costs nothing, it loads on the card you already own, and the throughput is far better because the model is smaller. To pick a model that fits your hardware on purpose, see small language models for the sub-10B tier and the open-weight tier right now for where each family stands.
At a glance: an 8GB card fits a 7B distill (~5GB at Q4), a 16GB card fits a 14B (~9GB at Q4), a 24GB card fits a 32B (~18GB at Q4), and 48GB fits a 70B (~40GB at Q4).
Fix 3: Use a hosted API
If you don't actually care about running the weights yourself, the least-effort answer is to not run them at all. A hosted API serves the model from someone else's hardware and bills you per token, with zero setup and no VRAM to think about. For the full DeepSeek model, this is often the sensible call: you skip the 130-to-250GB hardware problem entirely and pay only for what you send. The trade is that your data leaves your machine and the cost scales with usage rather than sitting flat.
This path makes sense when the model is too big to host, when you need it occasionally, or when buying and maintaining hardware isn't worth it for your volume. To compare per-token prices across the open and closed field, the model rankings list each one, and the cost calculator turns your token volume into a monthly number so you can weigh it against the other fixes.
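The arithmetic behind a monthly estimate is a one-liner. The sketch below assumes a single blended price per million tokens. Real providers usually price input and output tokens separately, and the $1.00 figure is a placeholder, not a quoted price.

```python
def monthly_api_cost(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Return the monthly bill in USD for a given token volume and blended price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


# Hypothetical example: 50 million tokens a month at a placeholder $1.00 per million.
print(f"${monthly_api_cost(50_000_000, 1.00):.2f} per month")  # $50.00 per month
```

Because the cost is linear in volume, doubling your traffic doubles the bill. That is the "scales with usage" trade-off in concrete terms.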
Fix 4: Rent a cloud GPU by the hour
Here's the option that solves the case the other three don't: you want the full-size model, on a real GPU you control, without spending thousands on a card. You rent the GPU by the hour. You get root on a machine with an A100 or H100 in it, load whatever weights you like, run your job, and shut it down when you're done. No purchase, no model swap, no per-token metering on someone else's terms.
This is the right fix when you need serious VRAM only now and then: a weekend fine-tuning run, a one-off batch job, or testing a 70B model before deciding whether to buy hardware at all. A card you use a few hours a week is mostly an idle expense, and renting turns that fixed cost into a small variable one. The catch is that you pay while the pod runs, so you start it for the work and stop it after, rather than leaving it on.
The pricing is the part that makes it work for occasional jobs. As of June 2026, RunPod lists an RTX 4090 24GB at roughly $0.69 an hour, an A100 80GB at about $1.39, and an H100 80GB at about $2.89, with on-demand pods that come up in around 30 seconds. The one-click vLLM and Ollama templates run any Hugging Face model you point them at, including Llama, Qwen, DeepSeek, Gemma, and Phi, so there's no environment to build from scratch.
Run the comparison for your own case. An A100 at $1.39 an hour is about $33 for a full day, or just $1.39 for the one hour you actually needed. Against a $2,000-plus card that sits idle most of the week, renting wins for anything but constant, all-day use. If you do run inference around the clock, that's the point where owning the hardware starts to pay back, and the cost math in running models on your own machine shows where the line sits.
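Here is that comparison as a small sketch, using the $1.39 hourly rate and the $2,000 card price from the text. It ignores electricity, resale value, and the fact that the rented and purchased GPUs are different cards, so treat the break-even as a rough guide only.

```python
def rental_cost(hours: float, usd_per_hour: float) -> float:
    """Total rental bill in USD for the given number of hours."""
    return hours * usd_per_hour


def break_even_hours(card_price_usd: float, usd_per_hour: float) -> float:
    """Hours of rental that would cost as much as buying the card outright."""
    return card_price_usd / usd_per_hour


A100_RATE = 1.39  # USD per hour, the rate quoted above

print(f"One day:    ${rental_cost(24, A100_RATE):.2f}")                 # $33.36
print(f"Break-even: {break_even_hours(2000, A100_RATE):.0f} hours")     # 1439 hours
```

At five hours of use a week, 1,439 hours is more than five years of renting before the purchase price is matched.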
In short: estimate VRAM as params in billions × bytes per weight, plus 1–2GB. If the model fits at Q4, quantize it and run it locally. If it doesn't, drop to a smaller or distilled model that fits. If nothing local will do, use a hosted API for no setup, or rent a GPU for full control.
So which fix is yours
Match the fix to the gap. If the model is just a little too big, quantize it to Q4 and run it on the card you have. If it's a tier or two over your card, drop to a smaller or distilled model that fits, since a 14B at Q4 beats a 70B you can't load at all. If the model is far past any single machine, like the full DeepSeek-V3 or R1, the question is control: reach for a hosted API when you want zero setup and per-token billing, and rent a cloud GPU when you want the actual weights on real hardware without buying it.
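The matching logic can be sketched as a small decision function. The thresholds are assumptions layered on the text: a 2GB overhead allowance, flat bytes-per-weight figures, and 48GB as the practical ceiling for a single home machine. The name `recommend_fix` is hypothetical.

```python
def recommend_fix(params_billions: float, vram_gb: float, want_control: bool,
                  overhead_gb: float = 2.0, local_ceiling_gb: float = 48.0) -> str:
    """Map a model size and a card size to one of the fixes described above."""
    fp16_need = params_billions * 2.0 + overhead_gb
    q4_need = params_billions * 0.5 + overhead_gb

    if fp16_need <= vram_gb:
        return "run as is"
    if q4_need <= vram_gb:
        return "quantize to Q4"
    if q4_need <= local_ceiling_gb:
        # Too big for this card, but not past what a home machine could hold.
        return "drop to a smaller or distilled model"
    # Far past any single machine: the remaining question is control.
    return "rent a cloud GPU" if want_control else "use a hosted API"


print(recommend_fix(32, 24, want_control=False))   # quantize to Q4
print(recommend_fix(70, 24, want_control=False))   # drop to a smaller or distilled model
print(recommend_fix(671, 24, want_control=True))   # rent a cloud GPU
```

Renting is also a reasonable answer in the middle case, for example to test a 70B model before buying hardware. The function follows only the simplest reading of the paragraph above.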
The one thing not to do is keep trying to force a model onto a card that can't hold it. The VRAM rule doesn't bend. But every model you've heard of has a path to running, whether that's a quantized download, a smaller sibling, an API call, or a $1.39 hour on a rented A100, and now you know which one fits your situation.