“CUDA out of memory”: why it happens, and how to run a model that's too big for your card

The five real causes of the GPU OOM error, the fixes in order from free to last-resort, and when you simply need a bigger card.

By the benchr team

Usual causes: 5 (weights, KV cache, batch, fragmentation, no quantization)
KV cache example: ~16GB (Llama-3-8B, 8K context, batch 16), on top of the weights
Quantization saves: 50–75% (FP16 weights down to 4-bit)
First command: nvidia-smi (see what's actually using the card)

You tried to load a model or stand up a server, and the run died on a wall of red text ending in torch.cuda.OutOfMemoryError: CUDA out of memory. It's the most common failure people hit when they move from an API to running weights themselves, and the message is more honest than it looks. The GPU asked for memory it didn't have. The useful question isn't whether you're out of memory. You are. It's which piece of the workload ate it, because the fix depends entirely on that.

Before changing anything, run nvidia-smi. It shows total VRAM, what's used right now, and which processes hold it. That one command tells you whether another job is squatting on the card, whether you're a little over or wildly over, and whether the number even makes sense for the model you're loading. Everything below assumes you've looked.
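If you'd rather capture those numbers in a script than read them off the terminal, here is a minimal sketch. It assumes nvidia-smi is on the PATH and uses its CSV query mode; the function names are illustrative, not part of any library.

```python
import subprocess

# nvidia-smi's CSV query mode: one "total, used" row per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text):
    """Turn 'total, used' rows (MiB) into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        total_mib, used_mib = (int(field) for field in line.split(","))
        gpus.append({
            "total_mib": total_mib,
            "used_mib": used_mib,
            "free_mib": total_mib - used_mib,
        })
    return gpus

def read_gpu_memory():
    """Run nvidia-smi and parse its output; needs an NVIDIA driver installed."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling read_gpu_memory() before and after loading a model shows how much of the card the load actually claimed.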

The five things that eat your VRAM

GPU memory during inference goes to three buckets: the model weights, the KV cache that holds attention state for every token in flight, and the activations for the current forward pass. An out-of-memory error is one of those three growing past what the card has. Here's how that breaks down in practice, with the symptom that points at each one and the fix that actually targets it.

The five causes of a CUDA out-of-memory error, the symptom that fingers each, and the flag that fixes it. Flags shown are for vLLM.

| Cause | Symptom | Fix | Exact flag |
|---|---|---|---|
| Weights exceed VRAM | OOM the instant the model loads, before any request | Quantize the weights | --quantization awq |
| KV cache too big (long context) | Loads fine, then OOM on the first long prompt or under load | Cap the context length | --max-model-len 4096 |
| Batch size too large | OOM only when several requests run at once | Cut concurrency | --max-num-seqs 16 |
| Memory fragmentation | OOM even though nvidia-smi shows free memory | Disable CUDA graphs | --enforce-eager |
| No quantization | A model that should fit is running in FP16/FP32 | Load a 4-bit/8-bit build | --quantization awq |

Take the first one literally. A 70B model in FP16 needs about 140GB of VRAM for weights alone: two bytes per parameter, 70 billion of them. No single consumer card, and no single 80GB data-center card, holds that. A common version of this trap is subtler: you meant to load a 4-bit build and the runtime loaded the full BF16 weights instead, so a model you expected to fit at 40GB tries to claim 140GB and dies on load. If the OOM happens before you've sent a single request, it's the weights, and quantization is the answer.
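The weight arithmetic is simple enough to check before downloading anything. This is a rough sketch: the function name is illustrative, and it counts only raw parameter bytes. The raw 4-bit figure for 70B comes out at 35GB; the roughly 40GB quoted above presumably also covers quantization metadata and layers kept at higher precision, which this estimate ignores.

```python
def weight_gb(params_billion, bits_per_param):
    """Raw weight memory in GB (1e9 bytes): parameters times bytes per parameter."""
    return params_billion * bits_per_param / 8

fp16_gb = weight_gb(70, 16)  # 70B at two bytes per parameter
int4_gb = weight_gb(70, 4)   # the same model at half a byte per parameter

print(f"70B FP16: {fp16_gb:.0f}GB, 70B 4-bit: {int4_gb:.0f}GB")
```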

The KV cache is the part people miss

The second cause is the one that catches careful people, because the model loads cleanly and then falls over later. The culprit is the KV cache. A server like vLLM pre-allocates KV cache for the worst case it's been told to expect: max_model_len multiplied by the maximum batch it might serve. That reservation can rival the weights themselves.

Concrete numbers make it real. Llama-3-8B in FP16 is about 16GB of weights. Serve it at an 8K context with a batch of 16, and the KV cache for that worst case runs to roughly 16GB on its own — so the server needs about 32GB before it has answered anything, on a workload whose weights are only half that. On a 24GB card it loads and then dies the moment real traffic arrives. This is why the same model "works on my laptop" and OOMs in production: the batch and context are bigger in production.

~16GB KV cache for Llama-3-8B at 8K context, batch 16 — roughly equal to the weights, and reserved on top of them
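The 16GB figure can be reproduced from the model's shape. The sketch below assumes the published Llama-3-8B configuration (32 layers, 8 key/value heads, head dimension 128) and a 2-byte FP16 cache; those values are not stated in the text above, and the function name is illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, batch, bytes_per_value=2):
    """Worst-case KV cache size: every sequence in the batch filled to context_len."""
    # Each token stores one key and one value vector per KV head, per layer.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch

# Assumed Llama-3-8B shape; 8K context, batch of 16.
total = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                       context_len=8192, batch=16)
print(f"{total / 2**30:.1f} GiB")  # 16.0 GiB
```

Under those assumptions each token costs 128 KiB of cache, so the reservation grows linearly with both context length and batch size.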

The fix that targets this directly is --max-model-len. Set it to what you actually serve — --max-model-len 4096 if your prompts never exceed 4K — and the reservation shrinks in proportion. This matters for a reason that trips up almost everyone: quantization shrinks the weights but does nothing for the KV cache. If your OOM is KV-driven, you can quantize all day and still run out. Cutting the context is the more effective move. On Hopper-class cards you can also store the cache itself in FP8 with --kv-cache-dtype fp8, which trims KV memory by about 40–50% with little quality cost, as the vLLM KV-cache writeups document.
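Turned around, the same arithmetic tells you what context length a given cache budget can support. This sketch again assumes the Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128) and a hypothetical 8 GiB left over for the cache; it ignores activations and runtime overhead, so treat the result as an upper bound rather than a value to paste into --max-model-len.

```python
def max_context_len(kv_budget_bytes, batch, layers=32, kv_heads=8,
                    head_dim=128, bytes_per_value=2):
    """Longest context whose worst-case KV cache fits in the budget."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return kv_budget_bytes // (per_token * batch)

budget = 8 * 2**30  # hypothetical: 8 GiB left for the KV cache

fp16_ctx = max_context_len(budget, batch=16)                    # 2-byte cache
fp8_ctx = max_context_len(budget, batch=16, bytes_per_value=1)  # 1-byte cache

print(fp16_ctx, fp8_ctx)  # 4096 8192
```

Halving the bytes per cached value doubles the context that fits, which is the effect the FP8 cache option is after.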

The fixes, in order from free to last resort

Don't reach for the big hammer first. Work down this list and stop at the first step that fits the model on your card. Every step here costs nothing but a config change or a different model file — the only thing you spend is a little speed.

1. Quantize the weights

--quantization awq or a GGUF Q4 build. Cuts weight memory 50–75%.

2. Cap the context

--max-model-len 4096. Attacks the KV cache directly — usually the biggest win.

3. Cut concurrency

--max-num-seqs 16. Fewer simultaneous sequences, smaller activations and cache.

4. Free fragmentation

--enforce-eager drops CUDA graphs, frees ~1.5–2GB, ends fragmentation OOMs.

5. Reserve headroom, then offload

--gpu-memory-utilization 0.92, then --cpu-offload-gb 4 as the slow last resort.

Quantize first. Loading AWQ or GPTQ 4-bit weights, or a GGUF Q4 build, cuts weight memory by 50–75% against FP16 for a quality drop most workloads never notice. In vLLM that's --quantization awq pointed at a pre-quantized checkpoint; with llama.cpp it's just downloading the Q4 file instead of the full one. If a model is a touch too big, this alone usually fixes it. A smaller-but-good model is often a cleaner answer than fighting a large one, which is the case our small language models piece makes in detail.
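The 50–75% range falls straight out of the bit widths: 8-bit is half of FP16 and 4-bit is a quarter. A small sketch, using the 16GB Llama-3-8B figure from earlier; the function name is illustrative, and real quantized files run slightly larger than this because of the metadata they carry.

```python
def weight_saving(bits, baseline_bits=16):
    """Fraction of weight memory saved versus the baseline precision."""
    return 1 - bits / baseline_bits

FP16_WEIGHTS_GB = 16  # Llama-3-8B in FP16, as quoted in the article

for bits in (8, 4):
    saved = weight_saving(bits)
    remaining = FP16_WEIGHTS_GB * (1 - saved)
    print(f"{bits}-bit: saves {saved:.0%}, about {remaining:.0f}GB of weights left")
```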

Then cut the context. If the model loads and OOMs later, this is your lever, not quantization. --max-model-len 4096 sizes the KV reservation to a context you'll actually use. Pair it with --kv-cache-dtype fp8 on a Hopper card for another 40–50% off the cache.

Then cut concurrency. --max-num-seqs 16 caps how many sequences run at once. Activations and KV both scale with the batch, so a high default on a small card is a common, quiet cause of OOM under load.

Then deal with fragmentation. If nvidia-smi shows free memory and you still OOM, the free space isn't contiguous. CUDA graph capture is the usual cause, and --enforce-eager disables it, freeing roughly 1.5–2GB and removing the fragmentation. Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the environment helps PyTorch reuse fragmented blocks too. Eager mode is a little slower per token, which is the trade.
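The allocator setting has to be in the environment before the server process starts, so one way to apply it is to build the environment in a launcher script. A minimal sketch; the helper name is illustrative, and it overwrites any value PYTORCH_CUDA_ALLOC_CONF already holds rather than merging options.

```python
import os

def with_expandable_segments(env=None):
    """Return a copy of the environment with the PyTorch allocator option set."""
    env = dict(os.environ if env is None else env)
    # Overwrites an existing PYTORCH_CUDA_ALLOC_CONF; merge by hand if you set other options.
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    return env

server_env = with_expandable_segments({"PATH": "/usr/bin"})
print(server_env["PYTORCH_CUDA_ALLOC_CONF"])
```

Pass the returned dict as the env argument to subprocess.Popen or subprocess.run when starting the server.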

Then reserve headroom. vLLM claims 90% of VRAM by default. On a card that's also driving a display or another process, that overshoots. --gpu-memory-utilization sets the share: a value below 0.90 leaves headroom for whatever else needs the card, while a value such as 0.92 claims slightly more than the default and only makes sense when the GPU is dedicated.
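To see what a given utilization setting leaves for the cache, multiply it out. This is a rough sketch with illustrative function names; it treats the utilization share as a budget that weights and cache must both fit inside, and it ignores activations and runtime overhead, so real numbers will be somewhat lower.

```python
def vram_budget_gb(card_gb, utilization=0.90):
    """Memory the server may claim: total VRAM times the utilization share."""
    return card_gb * utilization

def left_for_kv_gb(card_gb, weights_gb, utilization=0.90):
    """What remains for the KV cache once the weights are loaded."""
    return vram_budget_gb(card_gb, utilization) - weights_gb

# A 24GB card holding 16GB of FP16 weights, at the default share and at 0.92.
for share in (0.90, 0.92):
    print(f"{share:.2f}: {left_for_kv_gb(24, 16, share):.2f}GB left for cache")
```

On these numbers the gap between 0.90 and 0.92 is under half a gigabyte, which is why the setting is a fine adjustment and not a fix for a model that is far too big.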

Offload only as a last resort. --cpu-offload-gb 4 spills that many gigabytes of weights to system RAM so a model that's slightly too big still runs. It works, but every offloaded layer crosses the PCIe bus on each token, so throughput drops hard. Use it to get unblocked, not as a steady state.
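The ladder above can be sketched as a small command builder that adds one flag per step, so you can stop at the first combination that fits. It assumes the vllm serve entry point and uses only the flags named in this article; the model name is a placeholder, not a real checkpoint.

```python
import shlex

def vllm_serve_command(model, quantization=None, max_model_len=None,
                       max_num_seqs=None, enforce_eager=False,
                       gpu_memory_utilization=None, cpu_offload_gb=None):
    """Build a vLLM server command, adding only the flags that were requested."""
    argv = ["vllm", "serve", model]
    if quantization is not None:
        argv += ["--quantization", quantization]
    if max_model_len is not None:
        argv += ["--max-model-len", str(max_model_len)]
    if max_num_seqs is not None:
        argv += ["--max-num-seqs", str(max_num_seqs)]
    if enforce_eager:
        argv.append("--enforce-eager")
    if gpu_memory_utilization is not None:
        argv += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if cpu_offload_gb is not None:
        argv += ["--cpu-offload-gb", str(cpu_offload_gb)]
    return argv

# Placeholder model name; steps 1 to 3 of the ladder applied.
cmd = vllm_serve_command("your-org/your-awq-model", quantization="awq",
                         max_model_len=4096, max_num_seqs=16)
print(shlex.join(cmd))
```

Keeping the later steps as opt-in arguments mirrors the advice in the list: add --enforce-eager or --cpu-offload-gb only once the cheaper flags have failed.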

When you just need a bigger GPU

Sometimes the model is simply bigger than any card you own, and no amount of tuning closes the gap. A 70B model at 4-bit still wants about 40GB for weights plus room for the cache; that won't fit on a 24GB card no matter how short you make the context. At that point you've left the realm of config fixes. The honest choice is between buying a bigger card and renting one.

For an occasional job (a fine-tune, a batch run, a week of evaluation), renting wins on the math. An 80GB A100 rents for about $1.39 an hour and an H100 for about $2.89, and an 80GB card holds a 70B model at 4-bit with room to spare for the KV cache. Compare that to thousands of dollars for hardware you'd use intermittently. If you'll be running heavy local inference daily and indefinitely, owning the card changes the equation; our guide to running models on your own machine works through where that crossover sits. To pick a model sized to the card you've got, the model index lists context windows and parameter counts side by side.
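A back-of-envelope way to find the crossover is to divide a purchase price by the hourly rental rate. The sketch below uses the rental prices quoted above and a purely hypothetical $10,000 purchase price, since the article gives no hardware price; it ignores power, hosting, and resale value, and the function name is illustrative.

```python
def break_even_hours(purchase_price_usd, hourly_rate_usd):
    """Rental hours that cost as much as buying the card outright."""
    return purchase_price_usd / hourly_rate_usd

HYPOTHETICAL_CARD_PRICE = 10_000  # assumption for illustration only

a100_hours = break_even_hours(HYPOTHETICAL_CARD_PRICE, 1.39)
h100_hours = break_even_hours(HYPOTHETICAL_CARD_PRICE, 2.89)

print(f"A100: about {a100_hours:.0f} hours, H100: about {h100_hours:.0f} hours")
```

A week of evaluation is 168 hours, far below either figure under this assumed price, which is the sense in which renting wins for an occasional job.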

The wider point: a CUDA out-of-memory error is a sizing problem, not a dead end. Read nvidia-smi to see which bucket overflowed, quantize and cut context to claw back the easy gigabytes, clear fragmentation with eager mode, and only rent a bigger card once the model genuinely doesn't fit anything you have. Most of the time you never get to that last step.

Frequently asked

How do I fix CUDA out of memory in vLLM?

Work down the list in order. Quantize the weights (--quantization awq, or load a GGUF Q4 build) to cut weight memory by 50–75%. Cap the context with --max-model-len 4096 to shrink the KV cache, which is often the bigger win. Cut concurrency with --max-num-seqs 16. Add --enforce-eager to free the CUDA-graph memory and stop fragmentation. Reserve headroom with --gpu-memory-utilization 0.92. Only after all of that, offload with --cpu-offload-gb 4 or rent a bigger card.

Does quantization fix CUDA out of memory?

Partly. Quantization shrinks the model weights, often by 50–75% going from FP16 to 4-bit, so it helps a lot when the weights are what overflow your card. But it does not shrink the KV cache, which scales with context length and batch size. If your OOM comes from a long max_model_len, lowering --max-model-len is a more effective fix than quantizing.

Why do I get CUDA out of memory when nvidia-smi shows free memory?

That's almost always fragmentation, usually from CUDA graph capture. The free memory exists but not as one contiguous block large enough for the allocation. Add --enforce-eager to disable CUDA graphs, which frees roughly 1.5–2GB and removes the fragmentation. Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True can also help PyTorch reuse fragmented space.

What GPU do I need to run a 70B model?

A 70B model in FP16 needs about 140GB just for weights, so it doesn't fit on a single card. At 4-bit quantization it drops to roughly 40GB, which fits on one 80GB A100 or H100 with room for the KV cache, or across two 24GB cards with only about 8GB left over for the cache. If you don't own an 80GB card, renting one by the hour (an A100 80GB runs about $1.39/hr, an H100 80GB about $2.89/hr) is cheaper than buying hardware for an occasional job.

Changelog

  • June 19, 2026 — Originally published.

References

  1. "How to fix vLLM out of memory errors," markaicode.com, accessed June 2026.
  2. "Fix vLLM out of memory: KV cache," gigagpu.com, accessed June 2026.
  3. RunPod, "GPU instance pricing," runpod.io/gpu-instance/pricing, accessed June 2026.
  4. RunPod, "Get started with the vLLM worker," docs.runpod.io, accessed June 2026.