Renting a GPU vs. paying per token: when self-hosting an open model is actually cheaper

The honest break-even math — GPU $/hour against API $/million-tokens, and why utilization, not the sticker price, decides it.

By the benchr team

  • 2×A100, 24/7: $2,880. GPU bill per month, before engineering.
  • 1B tokens, cheap API: ~$590–$900 per month, no infra to run.
  • Break-even band: 10–50B tokens/month vs. a cheap open API.
  • Cost of 10% load: 10× per-token penalty for an idle GPU.

The pitch for self-hosting an open model is seductive and mostly wrong. You read that a rented A100 costs a buck-something an hour, you read that an open model is free to download, and you do the arithmetic in your head: surely that beats paying an API a few dollars per million tokens. It rarely does. The reason it rarely does is a number almost nobody puts in the spreadsheet, and once you put it in, the whole decision flips.

That number is utilization. A GPU costs the same per hour whether it's pinned at full load or sitting idle waiting for the next request, so the real cost of self-hosting isn't the hourly rate. It's the hourly rate divided by how much work you actually squeezed out of that hour. This piece walks the math straight through, names current prices on both sides, and marks the volume where the answer changes.

The one equation that settles it

Strip the decision to its core and there's a single comparison. Self-hosting beats an API when your cost to produce a token on your own GPU drops below what the API charges per token:

GPU $/hour ÷ (tokens/sec × 3,600 × utilization) < API $/token

The left side is what one token costs you to make. Take the GPU's hourly price, then divide by how many tokens that GPU actually delivered in an hour, which is its tokens-per-second times 3,600 seconds times the fraction of that hour you kept it busy. The right side is the API's posted rate. Everything that follows is just plugging real numbers into that line.
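The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not a pricing tool: the function names are made up for this example, and the inputs are whatever hourly rate, throughput, and busy fraction you choose to assume.

```python
def self_host_usd_per_million(gpu_usd_per_hour: float,
                              tokens_per_sec: float,
                              utilization: float) -> float:
    """USD to produce one million tokens on a rented GPU.

    utilization is the busy fraction of the hour, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens the GPU actually delivered in one paid hour.
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


def self_host_wins(gpu_usd_per_hour: float,
                   tokens_per_sec: float,
                   utilization: float,
                   api_usd_per_million: float) -> bool:
    """True when the left side of the inequality is below the API rate."""
    cost = self_host_usd_per_million(gpu_usd_per_hour, tokens_per_sec, utilization)
    return cost < api_usd_per_million


if __name__ == "__main__":
    # Assumed inputs: $2/hour GPU, 1,000 tokens/sec, fully busy, $0.59/M API.
    print(round(self_host_usd_per_million(2.00, 1000, 1.0), 2))  # 0.56
    print(self_host_wins(2.00, 1000, 1.0, 0.59))
    print(self_host_wins(2.00, 1000, 0.5, 0.59))
```

Both sides are expressed per million tokens here so the result lines up with how APIs post their prices; dividing by a million gives the per-token form of the same inequality.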

The throughput term is where people fool themselves first. A single A100 80GB serving a Llama-70B-class model through vLLM does roughly 1,000 to 3,000 output tokens per second when it's batching many requests together. Run one request on its own and that same card does about 40 to 60 tokens per second. The batched number is the one that makes self-hosting viable, and you only hit it when you have a steady stream of concurrent traffic to batch. No traffic, no batching, no economics.

The utilization term is where they fool themselves second, and worse. Drop the busy fraction to 10 percent and the per-token cost goes up tenfold, because you paid for the whole hour and used a tenth of it. A GPU at 10 percent load doesn't cost a little more per token than a GPU at full tilt. It costs about ten times as much.
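To see how the two terms compound, here is a small sweep over throughput and utilization. It is a sketch under stated assumptions: a $2-an-hour A100 (the upper end of the range quoted in this piece), 50 tokens per second as a stand-in for the single-request case, and 1,000 and 3,000 tokens per second for the batched case.

```python
GPU_USD_PER_HOUR = 2.00  # assumption: upper end of the A100 range


def usd_per_million(tokens_per_sec: float, utilization: float,
                    gpu_usd_per_hour: float = GPU_USD_PER_HOUR) -> float:
    """Per-million-token cost from hourly price, throughput, busy fraction."""
    return gpu_usd_per_hour / (tokens_per_sec * 3600 * utilization) * 1_000_000


# Throughput scenarios taken from the ranges quoted in the text.
scenarios = [
    ("single request", 50),
    ("batched, low", 1000),
    ("batched, high", 3000),
]

for label, tps in scenarios:
    for utilization in (1.0, 0.5, 0.1):
        cost = usd_per_million(tps, utilization)
        print(f"{label:15s} utilization={utilization:4.0%}  ${cost:8.2f} per million")
```

Whatever the throughput row, the 10 percent column is ten times the 100 percent column, because utilization sits alone in the denominator.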

What each side actually costs in June 2026

Start with the rented GPU. RunPod-class hourly rates right now sit at about $1.39 to $2 for an A100 80GB, $2.89 to $3.50 for an H100 80GB, and around $0.69 for a 24GB RTX 4090 (too little memory for a 70B model, but fine for smaller ones). Those are the on-demand numbers; committed and spot pricing run lower, and you can confirm the current board on RunPod's pricing page.

Now the API side, for the same class of open model. Cheap hosted providers like Groq, Together, and Fireworks run a Llama-70B-class model at roughly $0.59 to $0.90 per million tokens. Some go lower still: DeepInfra has landed around $0.12 to $0.23 per million. DeepSeek's own API sits near $0.27 to $0.55 per million. These are the prices self-hosting has to beat, and they're brutally low.

What a rented GPU costs per hour vs. what a hosted open-model API charges per million tokens, June 2026. RunPod-class on-demand rates.
Option | What it is | Posted price
RTX 4090 24GB | Rented GPU, small models only | ~$0.69/hr
A100 80GB | Rented GPU, serves a 70B model | $1.39–$2/hr
H100 80GB | Rented GPU, faster, pricier | $2.89–$3.50/hr
DeepInfra (open model) | Hosted API, per token | ~$0.12–$0.23/M
DeepSeek API | Hosted API, per token | ~$0.27–$0.55/M
Groq / Together / Fireworks | Hosted API, per token | ~$0.59–$0.90/M

A worked example, at three volumes

Put real volumes through both sides. Take 1 billion tokens a month, a serious but not enormous workload. Through a cheap open API at $0.59 to $0.90 per million, that's about $590 to $900 a month, and you run no infrastructure for it. To serve that same load yourself with headroom and a bit of redundancy, you're running at least two A100s around the clock. At the $2-an-hour end of the range, two A100s running 24/7 come to roughly $2,880 a month (2 cards × 720 hours × $2) for the GPUs alone, and that's before you've paid anyone to set it up, watch it, and keep it patched.

$2,880: two A100s running 24/7 for a month. That is the GPU bill alone, against ~$880 for 1B tokens on a cheap API.

So at 1 billion tokens a month, against a cheap open API, the rented GPU loses by a wide margin: about $2,880 against roughly $880, and the engineering bill hasn't even started. Self-hosting doesn't catch up against those cheap open APIs until you're pushing far more traffic through the same hardware, somewhere in the range of 10 to 50 billion tokens a month, and only if you keep that hardware genuinely busy. The fixed GPU cost stays flat while the API bill scales with usage, so there's a crossover. It just sits high.

Monthly cost by volume: a cheap open-model API vs. two A100s run 24/7. The GPU bill is fixed; the API bill scales with tokens. Engineering and idle-time costs are extra and not shown here.
Monthly volume | Cheap open API (~$0.59–$0.90/M) | 2×A100, 24/7 (GPU only) | Cheaper option
100M tokens | ~$59–$90 | ~$2,880 | API, by a mile
1B tokens | ~$590–$900 | ~$2,880 | API
10B tokens | ~$5,900–$9,000 | ~$2,880 | Self-host, if well-used
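The table's arithmetic can be reproduced with a short script. This is a sketch, assuming a 720-hour month and the $2-an-hour A100 rate behind the $2,880 figure; the helper names, including `required_utilization`, are illustrative and not from any library.

```python
HOURS_PER_MONTH = 24 * 30                  # 720-hour month behind the $2,880 figure
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


def gpu_monthly_usd(num_gpus: int, usd_per_hour: float) -> float:
    """Fixed monthly bill for GPUs left running around the clock."""
    return num_gpus * usd_per_hour * HOURS_PER_MONTH


def api_monthly_usd(tokens_per_month: float, usd_per_million: float) -> float:
    """API bill, which scales linearly with tokens."""
    return tokens_per_month / 1_000_000 * usd_per_million


def break_even_tokens(fixed_monthly_usd: float, api_usd_per_million: float) -> float:
    """Monthly volume at which the API bill equals the fixed GPU bill."""
    return fixed_monthly_usd / api_usd_per_million * 1_000_000


def required_utilization(tokens_per_month: float, num_gpus: int,
                         tokens_per_sec_per_gpu: float) -> float:
    """Busy fraction needed to serve the volume; above 1.0 means it cannot fit."""
    capacity = num_gpus * tokens_per_sec_per_gpu * SECONDS_PER_MONTH
    return tokens_per_month / capacity


gpu_bill = gpu_monthly_usd(2, 2.00)
for volume in (100e6, 1e9, 10e9):
    low = api_monthly_usd(volume, 0.59)
    high = api_monthly_usd(volume, 0.90)
    print(f"{volume:>14,.0f} tokens  API ${low:,.0f}-${high:,.0f}  GPU ${gpu_bill:,.0f}")

print(f"GPU-only crossover at $0.90/M: {break_even_tokens(gpu_bill, 0.90):,.0f} tokens")
print(f"GPU-only crossover at $0.59/M: {break_even_tokens(gpu_bill, 0.59):,.0f} tokens")
print(f"Utilization for 10B on 2 cards at 3,000 tok/s: "
      f"{required_utilization(10e9, 2, 3000):.0%}")
```

Under these assumptions the GPU-only crossover comes out at a few billion tokens a month. The 10-to-50-billion band quoted above sits higher, which is consistent with the engineering and idle-time costs the table leaves out. The utilization helper is a reminder that the "if well-used" qualifier is doing real work: the hardware also has to be able to carry the volume.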

The picture is completely different when you compare against a pricier proprietary API instead of a cheap open one. Frontier-tier closed models cost several dollars per million tokens rather than well under one, so the per-token side of the equation is much higher and the rented GPU breaks even far sooner, around 1 billion tokens a month rather than 10 to 50 billion. The same hardware that loses badly to DeepInfra can beat a premium proprietary tier at a tenth of the volume. What you're really comparing against decides almost everything. For the closed-model rates, see benchr's verified pricing; for the cheapest hosted APIs across providers, see the cheapest LLM API ranking.

  • 100M tokens/month: API wins. Not close: ~$90 vs. ~$2,880.
  • 1B tokens/month: API wins. ~$880 vs. ~$2,880 GPU-only.
  • 10–50B tokens/month: Self-host, if utilization stays high.
  • ~1B tokens/month: Self-host, vs. a pricey proprietary tier.

The costs that aren't on the rental invoice

The $2,880 is the easy part to count, and it's the part that flatters self-hosting, because it's only the GPU. The full bill includes the people and time around the GPU: DevOps to stand the cluster up, monitoring so you know when it falls over, updates when the model or the serving stack ships a breaking change, and the idle hours you pay for whenever traffic dips below capacity. Analysts who've tried to total all of it estimate the true cost runs about 3 to 5 times the raw GPU line.

Treat that 3-to-5× as a rule of thumb, not an audited figure. The real multiplier depends on how good your team already is at running GPU infrastructure and how steady your traffic is. But the direction is not in doubt: the sticker price on the GPU is the floor, not the number, and a break-even calculation that stops at the hourly rate is telling you a comforting story rather than a true one. The local-machine analysis walks through the same hidden-cost trap on your own hardware rather than a rented one.
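To see what the multiplier does to the crossover, here is a rough sketch that scales the $2,880 GPU line by an assumed overhead factor and recomputes the break-even volume against the $0.59-to-$0.90 API rates. The function name is hypothetical, and the multiplier is the rule-of-thumb estimate above, not a measured value.

```python
GPU_ONLY_MONTHLY_USD = 2880.0  # two A100s, 24/7, from the worked example


def break_even_with_overhead(gpu_monthly_usd: float,
                             api_usd_per_million: float,
                             overhead_multiplier: float) -> float:
    """Monthly tokens at which the API bill matches the all-in self-host bill."""
    true_monthly_usd = gpu_monthly_usd * overhead_multiplier
    return true_monthly_usd / api_usd_per_million * 1_000_000


for multiplier in (1, 3, 5):
    for api_rate in (0.90, 0.59):
        tokens = break_even_with_overhead(GPU_ONLY_MONTHLY_USD, api_rate, multiplier)
        print(f"{multiplier}x overhead vs ${api_rate:.2f}/M: "
              f"{tokens / 1e9:.1f}B tokens/month")
```

With the 3× and 5× factors, the crossover against these rates moves from a few billion tokens a month to roughly 10 to 24 billion, which is the neighborhood of the break-even band quoted earlier. Cheaper comparators push it higher still.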

The middle path: rent the GPU only when it runs

There's a real option between an always-on rented GPU and a per-token API, and it goes straight at the utilization problem. Serverless GPU bills per millisecond and scales to zero, so when nothing is running you pay nothing. For a bursty or low-volume workload, that's the whole game: instead of paying for an idle A100 for the 90 percent of the day it has nothing to do, you pay only for the milliseconds it's actually generating tokens.

The honest catch is cold starts. A large model has to load into GPU memory before it can answer the first request, and for a 70B-class model that load is not instant. You can erase the delay by keeping a warm worker always ready, but a warm worker is a GPU that's always on, which costs about what an always-on rental costs and gives back the idle-time savings. So serverless is the right tool when your traffic is spiky and you can tolerate the occasional slow first response, and the wrong tool when you need steady low latency at all hours.
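The trade-off can be put in numbers with a toy model. Everything specific here is an assumption for illustration: the $4-per-active-hour serverless rate is hypothetical (check your provider's actual per-second pricing), and the model ignores any billed cold-start time.

```python
HOURS_PER_MONTH = 24 * 30


def always_on_monthly_usd(usd_per_hour: float) -> float:
    """An always-on rental bills every hour, busy or idle."""
    return usd_per_hour * HOURS_PER_MONTH


def serverless_monthly_usd(usd_per_active_hour: float, busy_fraction: float) -> float:
    """Scale-to-zero billing: pay only for the fraction of time spent working."""
    return usd_per_active_hour * HOURS_PER_MONTH * busy_fraction


def crossover_busy_fraction(always_on_usd_per_hour: float,
                            serverless_usd_per_active_hour: float) -> float:
    """Busy fraction above which the always-on rental becomes cheaper."""
    return always_on_usd_per_hour / serverless_usd_per_active_hour


ALWAYS_ON_RATE = 2.00   # assumption: one A100 at the upper end of the range
SERVERLESS_RATE = 4.00  # hypothetical premium rate per active hour

for busy in (0.05, 0.10, 0.50, 0.90):
    print(f"busy {busy:4.0%}: serverless ${serverless_monthly_usd(SERVERLESS_RATE, busy):,.0f} "
          f"vs always-on ${always_on_monthly_usd(ALWAYS_ON_RATE):,.0f}")

print(f"crossover at {crossover_busy_fraction(ALWAYS_ON_RATE, SERVERLESS_RATE):.0%} busy")
```

The useful output is the crossover: if serverless costs twice as much per active hour, it stays cheaper only while the GPU would have been busy less than half the time. A warm worker pushes the busy fraction to 100 percent for billing purposes, which is why it gives the savings back.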

  • Steady high traffic: Always-on rented GPU, batched. Utilization stays high, per-token cost drops.
  • Bursty or low traffic: Serverless GPU, scales to zero. Pay per millisecond, eat the cold start.
  • Low or unpredictable volume: Per-token API. No infra, no idle bill, no cold start to manage.

So which one

Default to the API, and stay there longer than your gut says. Under 1 billion tokens a month, against any cheap open-model API, self-hosting on rented GPUs is a worse deal once you count the engineering, and it's a worse deal even before you count it. The per-token price you'd have to beat is so low that a fully-utilized GPU struggles to match it.

Self-host when two things are both true: your volume is high enough to clear the break-even band (roughly 10 to 50 billion tokens a month against cheap open APIs, far less against premium proprietary ones), and your traffic is steady enough to keep the hardware genuinely busy. If volume is there but traffic is spiky, reach for serverless GPU before an always-on cluster, so the idle hours stop billing. And if you're comparing against an expensive proprietary tier rather than a cheap open one, run the numbers again from scratch, because that break-even sits much lower and self-hosting gets attractive much sooner. To put your own token counts and prices through it, the cost calculator does the arithmetic, and the price-per-use-case table shows how the bill shifts by workload shape.
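The rule of thumb above can be written as a tiny decision helper. It is a deliberately crude sketch: the function and its thresholds are illustrative, using the lower edge of the break-even band (10 billion tokens against cheap open APIs, about 1 billion against premium proprietary ones) as the cutoff, and it is no substitute for running your own prices through the equation.

```python
CHEAP_API_FLOOR_TOKENS = 10e9    # lower edge of the 10-50B band
PREMIUM_API_FLOOR_TOKENS = 1e9   # approximate break-even vs. a premium tier


def recommend(tokens_per_month: float,
              steady_traffic: bool,
              comparing_against_premium_api: bool) -> str:
    """Return a coarse recommendation following the article's rule of thumb."""
    if comparing_against_premium_api:
        floor = PREMIUM_API_FLOOR_TOKENS
    else:
        floor = CHEAP_API_FLOOR_TOKENS

    if tokens_per_month < floor:
        return "per-token API"        # volume too low to clear break-even
    if steady_traffic:
        return "always-on GPU"        # high volume and hardware stays busy
    return "serverless GPU"           # high volume, but idle hours would bill


if __name__ == "__main__":
    print(recommend(500e6, True, False))
    print(recommend(20e9, True, False))
    print(recommend(20e9, False, False))
    print(recommend(2e9, True, True))
```

Anywhere inside the 10-to-50-billion band the answer still hinges on utilization, so treat an "always-on GPU" result there as a prompt to do the detailed math, not as a verdict.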

Frequently asked

Is it cheaper to self-host Llama 70B or use an API?

For most workloads, the API. A cheap open-model API runs about $0.59 to $0.90 per million tokens, and some go as low as $0.12 to $0.23, so 1 billion tokens a month costs roughly $590 to $900 at the $0.59-to-$0.90 rates and less at the cheapest ones. Two A100s running around the clock cost about $2,880 a month for the GPUs alone, before any engineering. Self-hosting only catches up against cheap open APIs at high, well-used volume, somewhere around 10 to 50 billion tokens a month. Against pricier proprietary tiers the break-even arrives far sooner, near 1 billion a month.

What GPU do I need to serve a 70B model?

A single A100 80GB can serve a Llama-70B-class model through vLLM, doing roughly 1,000 to 3,000 output tokens per second when requests are batched, but only about 40 to 60 tokens per second for one request on its own. An H100 80GB is faster and costs more. For real throughput and redundancy, production setups usually run more than one card. June 2026 hourly rates run about $1.39 to $2 for an A100, $2.89 to $3.50 for an H100, and around $0.69 for a 24GB RTX 4090.

Does serverless GPU save money?

For bursty or low-utilization workloads, yes. Serverless GPU bills per millisecond and scales to zero, so you stop paying when nothing is running, which is exactly the idle-time problem that sinks an always-on rented GPU. The catch is cold starts: a large model has to load before it can answer, and the only way to remove that delay is to keep a warm worker, which costs about the same as running the GPU all the time. It's the honest middle path between an always-on rented GPU and a per-token API.

What's the break-even token volume for self-hosting?

It depends entirely on what you compare against. Versus a cheap open-model API at $0.12 to $0.90 per million tokens, a well-utilized self-host starts winning somewhere around 10 to 50 billion tokens a month. Versus a pricier proprietary API, the break-even can arrive near 1 billion tokens a month. The number that moves it most is utilization: a GPU running at 10 percent load costs about ten times as much per token as the same GPU run flat out.

Changelog

  • June 19, 2026 — Originally published. GPU hourly rates, hosted open-model API prices, and throughput figures verified against RunPod's pricing and serverless documentation and current provider rates; the 3-to-5× hidden-cost multiplier is flagged as an analyst estimate, not an audited figure.

References

  1. RunPod, "GPU instance pricing," runpod.io/gpu-instance/pricing, accessed June 2026.
  2. RunPod, "Serverless overview," docs.runpod.io/serverless/overview, accessed June 2026.
  3. TokenMix, "Self-host LLM vs API: a break-even analysis," tokenmix.ai/blog, accessed June 2026.
  4. AI Pricing Guru, "Together pricing reference," aipricing.guru/together-pricing, accessed June 2026.