The pitch for self-hosting an open model is seductive and mostly wrong. You read that a rented A100 costs a buck-something an hour, you read that an open model is free to download, and you do the arithmetic in your head: surely that beats paying an API a few dollars per million tokens. It rarely does. The reason it rarely does is a number almost nobody puts in the spreadsheet, and once you put it in, the whole decision flips.
That number is utilization. A GPU costs the same per hour whether it's pinned at full load or sitting idle waiting for the next request, so the real cost of self-hosting isn't the hourly rate. It's the hourly rate divided by how much work you actually squeezed out of that hour. This piece walks the math straight through, names current prices on both sides, and marks the volume where the answer changes.
The one equation that settles it
Strip the decision to its core and there's a single comparison. Self-hosting beats an API when your cost to produce a token on your own GPU drops below what the API charges per token:
GPU $/hour ÷ (tokens/sec × 3,600 × utilization) < API $/token
The left side is what one token costs you to make. Take the GPU's hourly price, then divide by how many tokens that GPU actually delivered in an hour, which is its tokens-per-second times 3,600 seconds times the fraction of that hour you kept it busy. The right side is the API's posted rate. Everything that follows is just plugging real numbers into that line.
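As a minimal sketch, the inequality can be written as two small Python functions. The function and parameter names are illustrative, and the sample inputs ($2.00 an hour, 2,000 tokens per second, 50 percent utilization, an API at $0.59 per million) are placeholders rather than measurements.

```python
def self_host_usd_per_token(gpu_usd_per_hour: float,
                            tokens_per_sec: float,
                            utilization: float) -> float:
    """Left side of the inequality: dollars per token on your own GPU."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be a fraction in (0, 1]")
    # Tokens actually delivered in one billed hour.
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour


def self_host_wins(gpu_usd_per_hour: float,
                   tokens_per_sec: float,
                   utilization: float,
                   api_usd_per_million: float) -> bool:
    """True when a self-hosted token is cheaper than the API's posted rate."""
    # APIs quote per million tokens, so convert to per-token first.
    api_usd_per_token = api_usd_per_million / 1_000_000
    own = self_host_usd_per_token(gpu_usd_per_hour, tokens_per_sec, utilization)
    return own < api_usd_per_token


# Placeholder inputs, not measurements.
own = self_host_usd_per_token(2.00, 2000, 0.5)
print(f"self-hosted: ${own * 1_000_000:.3f} per million tokens")
print("beats a $0.59/M API:", self_host_wins(2.00, 2000, 0.5, 0.59))
```

The only unit trap is the right-hand side: the left side comes out in dollars per single token, so the API's per-million rate has to be divided by a million before the two are compared.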
The throughput term is where people fool themselves first. A single A100 80GB serving a Llama-70B-class model through vLLM does roughly 1,000 to 3,000 output tokens per second when it's batching many requests together. Run one request on its own and that same card does about 40 to 60 tokens per second. The batched number is the one that makes self-hosting viable, and you only hit it when you have a steady stream of concurrent traffic to batch. No traffic, no batching, no economics.
The utilization term is where they fool themselves second, and worse. Drop the busy fraction to 10 percent and the per-token cost goes up tenfold, because you paid for the whole hour and used a tenth of it. A GPU at 10 percent load doesn't cost a little more per token than a GPU at full tilt. It costs about ten times as much.
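To see both effects side by side, here is a short sketch that sweeps throughput and utilization through the same formula. The $2.00 hourly rate is an assumed figure, and 50 and 2,000 tokens per second are simply midpoints of the single-request and batched ranges quoted above, not benchmarks.

```python
def usd_per_million(gpu_usd_per_hour: float,
                    tokens_per_sec: float,
                    utilization: float) -> float:
    """Self-hosted cost per million tokens for one GPU."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


GPU_USD_PER_HOUR = 2.00  # assumption: illustrative hourly rate

# Assumption: midpoints of the quoted throughput ranges.
SCENARIOS = [("single request", 50), ("batched", 2000)]

for label, tokens_per_sec in SCENARIOS:
    for utilization in (1.0, 0.5, 0.1):
        cost = usd_per_million(GPU_USD_PER_HOUR, tokens_per_sec, utilization)
        print(f"{label:>14} @ {utilization:>4.0%} busy: "
              f"${cost:,.2f} per million tokens")
```

Because utilization sits in the denominator, every row at 10 percent busy is ten times its 100 percent counterpart, whatever the throughput.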
What each side actually costs in June 2026
Start with the rented GPU. RunPod-class hourly rates right now sit at about $1.39 to $2 for an A100 80GB, $2.89 to $3.50 for an H100 80GB, and around $0.69 for a 24GB RTX 4090 (too little memory for a 70B model, but fine for smaller ones). Those are the on-demand numbers; committed and spot pricing run lower, and you can confirm the current board on RunPod's pricing page.
Now the API side, for the same class of open model. Cheap hosted providers like Groq, Together, and Fireworks run a Llama-70B-class model at roughly $0.59 to $0.90 per million tokens. Some go lower still: DeepInfra has landed around $0.12 to $0.23 per million. DeepSeek's own API sits near $0.27 to $0.55 per million. These are the prices self-hosting has to beat, and they're brutally low.
| Option | What it is | Posted price |
|---|---|---|
| RTX 4090 24GB | Rented GPU, small models only | ~$0.69/hr |
| A100 80GB | Rented GPU, serves a 70B model | $1.39–$2/hr |
| H100 80GB | Rented GPU, faster, pricier | $2.89–$3.50/hr |
| DeepInfra (open model) | Hosted API, per token | ~$0.12–$0.23/M |
| DeepSeek API | Hosted API, per token | ~$0.27–$0.55/M |
| Groq / Together / Fireworks | Hosted API, per token | ~$0.59–$0.90/M |
A worked example, at three volumes
Put real volumes through both sides. Take 1 billion tokens a month, a serious but not enormous workload. Through a cheap open API at $0.59 to $0.90 per million, that's about $590 to $900 a month, and you run no infrastructure for it. To serve that same load yourself with headroom and a bit of redundancy, you're running at least two A100s around the clock. Two A100s at $2 an hour, running 24/7 for a 30-day month, come to roughly $2,880 for the GPUs alone, and that's before you've paid anyone to set it up, watch it, and keep it patched.
So at 1 billion tokens a month, against a cheap open API, the rented GPU loses by a wide margin: about $2,880 against $900 at most, and the engineering bill hasn't even started. Self-hosting doesn't catch up against those cheap open APIs until you're pushing far more traffic through the same hardware, somewhere in the range of 10 to 50 billion tokens a month, and only if you keep that hardware genuinely busy. The fixed GPU cost stays flat for as long as the load fits on the same cards, while the API bill scales with usage, so there's a crossover. It just sits high.
| Monthly volume | Cheap open API (~$0.59–$0.90/M) | 2×A100, 24/7 (GPU only) | Cheaper option |
|---|---|---|---|
| 100M tokens | ~$59–$90 | ~$2,880 | API, by a mile |
| 1B tokens | ~$590–$900 | ~$2,880 | API |
| 10B tokens | ~$5,900–$9,000 | ~$2,880 | Self-host, if well-used |
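The table's arithmetic can be reproduced with a few lines of Python. This is a sketch under stated assumptions: a 30-day month and a $2.00 hourly A100 rate (which together give the $2,880 figure), plus a helper, `required_tokens_per_sec_per_gpu`, that is not in the original analysis and only checks how hard each card would have to work.

```python
HOURS_PER_MONTH = 24 * 30                  # assumption: 30-day month
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


def gpu_monthly_usd(num_gpus: int, usd_per_hour: float) -> float:
    """Fixed bill for always-on rented GPUs (GPU line only)."""
    return num_gpus * usd_per_hour * HOURS_PER_MONTH


def api_monthly_usd(monthly_tokens: int, usd_per_million: float) -> float:
    """Usage-based bill for a per-token API."""
    return monthly_tokens / 1_000_000 * usd_per_million


def required_tokens_per_sec_per_gpu(monthly_tokens: int, num_gpus: int) -> float:
    """Average throughput each GPU must sustain around the clock."""
    return monthly_tokens / (num_gpus * SECONDS_PER_MONTH)


gpu_bill = gpu_monthly_usd(2, 2.00)
for tokens in (100_000_000, 1_000_000_000, 10_000_000_000):
    low = api_monthly_usd(tokens, 0.59)
    high = api_monthly_usd(tokens, 0.90)
    need = required_tokens_per_sec_per_gpu(tokens, 2)
    print(f"{tokens / 1e9:>5.1f}B tokens: API ${low:,.0f}-${high:,.0f}, "
          f"2xA100 ${gpu_bill:,.0f}, needs {need:,.0f} tok/s per GPU")
```

The last column is why the 10B row carries the "if well-used" caveat: each card would have to average close to 2,000 tokens per second all month, which sits inside the batched range only when traffic keeps the batch full.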
The picture is completely different when you compare against a pricier proprietary API instead of a cheap open one. Frontier-tier closed models cost several dollars per million tokens rather than well under one, so the per-token side of the equation is much higher and the rented GPU breaks even far sooner, around 1 billion tokens a month rather than 10 to 50 billion. The same hardware that loses badly to DeepInfra can beat a premium proprietary tier at a tenth of the volume. What you're really comparing against decides almost everything. For the closed-model rates, see benchr's verified pricing; for the cheapest hosted APIs across providers, see the cheapest LLM API ranking.
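A rough way to locate that crossover is to divide the fixed monthly GPU bill by the API's per-million rate. The sketch below does that. The $3.00 rate is a hypothetical stand-in for "several dollars per million", not a quoted price, and the $2,880 is the GPU-only figure from the worked example.

```python
def break_even_tokens(fixed_monthly_usd: float,
                      api_usd_per_million: float) -> float:
    """Monthly token volume at which the API bill equals the fixed bill."""
    if api_usd_per_million <= 0:
        raise ValueError("api_usd_per_million must be positive")
    return fixed_monthly_usd / api_usd_per_million * 1_000_000


GPU_ONLY_MONTHLY_USD = 2880.0        # two A100s, 24/7, GPU line only
PREMIUM_API_USD_PER_MILLION = 3.00   # hypothetical proprietary-tier rate

volume = break_even_tokens(GPU_ONLY_MONTHLY_USD, PREMIUM_API_USD_PER_MILLION)
print(f"break-even vs. premium API: {volume / 1e9:.2f}B tokens/month")
```

Treat the result as a floor. It counts only the GPU line and assumes the two cards can actually carry that volume; engineering overhead and idle hours push the practical break-even higher.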
100M: API wins. Not close. ~$90 vs ~$2,880
1B: API wins. ~$880 vs ~$2,880 GPU-only
10–50B: Self-host, if utilization stays high
~1B: Self-host vs. a pricey proprietary tier
The costs that aren't on the rental invoice
The $2,880 is the easy part to count, and it's the part that flatters self-hosting, because it's only the GPU. The full bill includes the people and time around the GPU: DevOps to stand the cluster up, monitoring so you know when it falls over, updates when the model or the serving stack ships a breaking change, and the idle hours you pay for whenever traffic dips below capacity. Analysts who've tried to total all of it estimate the true cost runs about 3 to 5 times the raw GPU line.
Treat that 3-to-5× as a rule of thumb, not an audited figure. The real multiplier depends on how good your team already is at running GPU infrastructure and how steady your traffic is. But the direction is not in doubt: the sticker price on the GPU is the floor, not the number, and a break-even calculation that stops at the hourly rate is telling you a comforting story rather than a true one. The local-machine analysis walks the same hidden-cost trap on your own hardware rather than a rented one.
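To make the rule of thumb concrete, the sketch below applies the multiplier to the $2,880 GPU line and recomputes the break-even volume against the cheap open-API rates. The helper names are illustrative, the multipliers are the rule-of-thumb values rather than measured overhead, and the calculation assumes the hardware can carry whatever volume comes out.

```python
GPU_ONLY_MONTHLY_USD = 2880.0          # two A100s, 24/7, GPU line only
CHEAP_API_USD_PER_MILLION = (0.59, 0.90)


def loaded_monthly_usd(gpu_monthly_usd: float, overhead_multiplier: float) -> float:
    """GPU bill scaled by a rule-of-thumb overhead multiplier."""
    return gpu_monthly_usd * overhead_multiplier


def break_even_tokens(monthly_usd: float, api_usd_per_million: float) -> float:
    """Monthly volume at which the API bill equals the given monthly cost."""
    return monthly_usd / api_usd_per_million * 1_000_000


for multiplier in (1, 3, 5):
    total = loaded_monthly_usd(GPU_ONLY_MONTHLY_USD, multiplier)
    low_rate, high_rate = CHEAP_API_USD_PER_MILLION
    # A pricier API is beaten sooner, so the high rate gives the lower bound.
    sooner = break_even_tokens(total, high_rate)
    later = break_even_tokens(total, low_rate)
    print(f"{multiplier}x overhead: ${total:,.0f}/month, break-even "
          f"{sooner / 1e9:.1f}B-{later / 1e9:.1f}B tokens/month")
```

With the multiplier at 1 the break-even looks deceptively close; at 3 to 5 it moves into the tens of billions of tokens a month.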
The middle path: rent the GPU only when it runs
There's a real option between an always-on rented GPU and a per-token API, and it goes straight at the utilization problem. Serverless GPU bills per millisecond and scales to zero, so when nothing is running you pay nothing. For a bursty or low-volume workload, that's the whole game: instead of paying for an idle A100 for the 90 percent of the day it has nothing to do, you pay only for the milliseconds it's actually generating tokens.
The honest catch is cold starts. A large model has to load into GPU memory before it can answer the first request, and for a 70B-class model that load is not instant. You can erase the delay by keeping a warm worker always ready, but a warm worker is a GPU that's always on, which costs about what an always-on rental costs and gives back the idle-time savings. So serverless is the right tool when your traffic is spiky and you can tolerate the occasional slow first response, and the wrong tool when you need steady low latency at all hours.
Always-on rented GPU, batched. Utilization stays high, per-token cost drops.
Serverless GPU, scales to zero. Pay per millisecond, eat the cold start.
Per-token API. No infra, no idle bill, no cold start to manage.
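The trade between the first two options can be sketched in a few lines. Everything numeric here is an assumption: the $2.00 hourly rate is illustrative, and `premium` is a hypothetical ratio of the serverless per-busy-hour price to the always-on price, since no serverless rates are quoted above. The model also ignores any billed cold-start time.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def always_on_monthly_usd(usd_per_hour: float) -> float:
    """One always-on GPU: billed for every hour, busy or idle."""
    return usd_per_hour * HOURS_PER_MONTH


def serverless_monthly_usd(usd_per_hour: float,
                           busy_fraction: float,
                           premium: float) -> float:
    """Scale-to-zero GPU: billed only while busy, at `premium` x the hourly rate."""
    if not 0 <= busy_fraction <= 1:
        raise ValueError("busy_fraction must be in [0, 1]")
    return usd_per_hour * premium * HOURS_PER_MONTH * busy_fraction


def break_even_busy_fraction(premium: float) -> float:
    """Busy fraction above which always-on becomes the cheaper option."""
    return 1 / premium


RATE = 2.00     # assumption: illustrative always-on hourly rate
PREMIUM = 2.0   # hypothetical serverless markup

for busy in (0.05, 0.10, 0.50, 0.90):
    s = serverless_monthly_usd(RATE, busy, PREMIUM)
    a = always_on_monthly_usd(RATE)
    print(f"{busy:>4.0%} busy: serverless ${s:,.0f} vs always-on ${a:,.0f}")
print(f"crossover at {break_even_busy_fraction(PREMIUM):.0%} busy")
```

Under these assumptions the crossover is just the inverse of the markup, so the burstier the traffic, the more markup serverless can carry and still come out ahead.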
So which one
Default to the API, and stay there longer than your gut says. Under 1 billion tokens a month, against any cheap open-model API, self-hosting on rented GPUs is a worse deal on the GPU bill alone, and worse still once you count the engineering. The per-token price you'd have to beat is so low that even a fully utilized GPU struggles to match it.
Self-host when two things are both true: your volume is high enough to clear the break-even band (roughly 10 to 50 billion tokens a month against cheap open APIs, far less against premium proprietary ones), and your traffic is steady enough to keep the hardware genuinely busy. If volume is there but traffic is spiky, reach for serverless GPU before an always-on cluster, so the idle hours stop billing. And if you're comparing against an expensive proprietary tier rather than a cheap open one, run the numbers again from scratch, because that break-even sits much lower and self-hosting gets attractive much sooner. To put your own token counts and prices through it, the cost calculator does the arithmetic, and the price-per-use-case table shows how the bill shifts by workload shape.
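The decision rule above can be encoded as a small helper. This is a sketch, not a substitute for running your own numbers: the function name and return strings are made up for illustration, and the thresholds are the low ends of the rough break-even bands from this piece (about 10 billion tokens a month against cheap open APIs, about 1 billion against premium proprietary ones).

```python
CHEAP_API_BREAK_EVEN_TOKENS = 10_000_000_000   # low end of the 10-50B band
PREMIUM_API_BREAK_EVEN_TOKENS = 1_000_000_000  # rough figure vs. premium tiers


def recommend(monthly_tokens: int,
              traffic_is_steady: bool,
              api_is_premium: bool) -> str:
    """Rough first-pass recommendation; thresholds are rules of thumb."""
    threshold = (PREMIUM_API_BREAK_EVEN_TOKENS if api_is_premium
                 else CHEAP_API_BREAK_EVEN_TOKENS)
    if monthly_tokens < threshold:
        return "api"
    # Volume clears the band: steadiness decides how to rent the GPU.
    return "always-on gpu" if traffic_is_steady else "serverless gpu"


print(recommend(500_000_000, traffic_is_steady=True, api_is_premium=False))
print(recommend(20_000_000_000, traffic_is_steady=True, api_is_premium=False))
print(recommend(20_000_000_000, traffic_is_steady=False, api_is_premium=False))
print(recommend(2_000_000_000, traffic_is_steady=True, api_is_premium=True))
```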