Here's the situation that sends most people down a dead end. You've got a few thousand examples of how you want a model to answer: support replies in your house voice, structured extractions in your schema, a tone that the stock model never quite hits. You want to bake that in with a fine-tune. Then you check what a fine-tune needs and the GPU you own falls short by a mile. The usual conclusion is that you have to spend thousands on hardware first. You don't. The honest answer is that you rent a capable GPU for a few hours, pay a handful of dollars, and shut it down when the job's done.
The thing blocking you is memory, specifically VRAM. So it's worth being precise about why your machine can't do the job, because the same precision tells you exactly how much rented GPU the work needs.
Why your own machine usually can't do it
A full fine-tune updates every weight in the model, and that costs far more memory than just running the model. You're holding the weights, plus the gradients for every weight, plus the optimizer states. The Adam optimizer alone keeps two extra numbers per weight. Add it up and a full fine-tune wants roughly four times the memory that inference does. A 7B model that you can chat with in about 14GB needs on the order of 60GB to fully fine-tune. A 70B model blows past 480GB, which means a rack of GPUs, not a workstation.
Now look at what you can buy. Consumer GPUs top out at 24GB of VRAM. That gap, 24GB on hand against 60GB needed for the smallest serious model, is the whole problem. No amount of patience fixes it, because the run simply won't fit in memory and crashes before it starts. This is the same wall described in running models on your own machine, except training makes it taller, since inference is the cheap part and fine-tuning is the expensive one.
LoRA and QLoRA change the arithmetic
The fix isn't a bigger machine. It's training less of the model. LoRA, short for low-rank adaptation, freezes the entire base model and trains only small adapter layers bolted onto it. You update a tiny fraction of the parameters, so the gradients and optimizer states shrink to almost nothing. QLoRA goes one step further: it quantizes the frozen base down to 4-bit and trains those adapters in 16-bit on top. The base barely takes up room because it's compressed, and the trainable part is small to begin with.
The VRAM numbers tell the story. A 7B QLoRA run fits in about 5–6GB, which a consumer card can hold without trouble. A 13B run needs roughly 10–12GB. And a 70B QLoRA run comes in around 48GB, which fits on a single A100 80GB with headroom to spare. That last number is the one that matters: the model you'd need 480GB to fully fine-tune trains on one rented GPU once you switch to QLoRA.
There's a cost to compressing the base, and it's fair to name it. QLoRA lands somewhere around 80 to 90% of the quality of a full fine-tune on the same data. A 16-bit LoRA, which skips the 4-bit step, gets closer, roughly 90 to 95%, at the price of more VRAM. For most jobs, where you're teaching format, tone, or a domain rather than chasing a benchmark, QLoRA is the right starting point. Move up to 16-bit LoRA only if you've run a QLoRA pass and decided you need the last few points. The underlying technique is documented in the QLoRA paper and Hugging Face's PEFT library, the toolkit most of these training scripts call under the hood.
| Method | VRAM, 7B | VRAM, 70B | Realistic on a rented GPU? |
|---|---|---|---|
| Full fine-tune | ~60 GB | 480 GB+ | 7B yes on a big card; 70B needs a multi-GPU rack |
| 16-bit LoRA | ~16 GB | ~130 GB | 7B easily; 70B needs two 80GB cards |
| QLoRA (4-bit base) | 5–6 GB | ~48 GB | Yes, both fit one A100 80GB |
The model you'd need a 480GB rack to fully fine-tune trains on one rented GPU once you switch to QLoRA. That single fact is what turns "buy a server" into "rent a card for an afternoon," and it's why the hardware question mostly goes away.
Rent a GPU to run the job
Once the job fits in 48GB, you don't need to own anything. Several cloud providers rent GPUs by the hour, you spin one up only for the training run, and you tear it down the moment it finishes. The cost is just hours times the hourly rate, and for a fine-tune that's a small number.
Concrete figures, current as of June 2026. A 7B QLoRA run on roughly 5,000 samples for three epochs takes about three-quarters of an hour to an hour and a half, which works out to $1–$3 on an A100. Push that same 7B model on 50,000 samples and you're looking at 4–6 hours, or about $6–$12. A 70B QLoRA run on 5,000 samples takes 2–4 hours, around $7–$12 on an A100 80GB. The big one, a 70B model on 50,000 samples, runs 12–20 hours and lands somewhere between $25 and $60. Those times move with sequence length and dataset size, so treat them as the shape of the cost, not a quote.
7B · 5K
$1–$3 ~0.75–1.5 hrs, A1007B · 50K
$6–$12 ~4–6 hrs, A10070B · 5K
$7–$12 ~2–4 hrs, A100 80GB70B · 50K
$25–$60 ~12–20 hrs, A100 80GBWhat does the GPU itself cost per hour? On RunPod, one option for this, an on-demand A100 80GB runs about $1.39 an hour and an H100 about $2.89. For a QLoRA fine-tune the A100 is plenty: the job fits, and the extra speed of an H100 rarely earns back its higher rate on a run this short. The provider also documents a fine-tuning path built on Axolotl, an open-source trainer that ships ready QLoRA configs, including an 8B QLoRA example you can adapt. You point it at your dataset in a standard format (chat, alpaca, or sharegpt), let it run, and push the finished adapter to Hugging Face when it's done. New accounts get $5 in signup credit, which covers a small 7B run outright.
The steps, start to finish
The shape of the job is the same no matter which provider you rent from. Five steps, and none of them needs hardware you own.
A few thousand examples in chat, alpaca, or sharegpt format.
One A100 80GB by the hour. Pick a QLoRA template.
Axolotl with a ready config. A few hours for most jobs.
Save the adapter to Hugging Face, then kill the instance.
Step one is where the result is won or lost. A few thousand clean, consistent examples beat tens of thousands of sloppy ones, and the format just needs to match what the trainer expects. Steps two and three are mechanical once you've picked a QLoRA config, since most of the work is waiting. Step four matters for your wallet: an idle rented GPU still bills by the hour, so push the trained adapter to Hugging Face and tear the instance down the moment the run finishes. Forgetting to shut it off is the one way a cheap fine-tune turns into an expensive one.
A fine-tune isn't only a quality move, either. A model taught your format and your house rules needs far less instruction crammed into every prompt, which trims the tokens you send on each call. If your costs live in the prompt, that's a real saving, and the mechanics are in cutting your token bill.
When to skip the rented GPU entirely
Renting a card is the right call when you want control over the data, the method, and the resulting weights, and when you're comfortable running a training script. It isn't the only path. Some providers offer managed fine-tuning APIs: you upload your dataset, they handle the hardware and the training loop, and you get a fine-tuned model back through their API. That skips the infrastructure completely, which is genuinely nice if you never want to touch a GPU. The trade is that it costs more than renting raw compute, you're limited to the models that provider supports, and you give up the fine-grained control and portable open weights that QLoRA on your own rented box hands you.
So the decision is straightforward. If you want an open model, your own weights, and the lowest cost, rent a GPU and run QLoRA. If you'd rather pay a premium to never see a terminal, a managed API does the job. Either way, the old assumption that you must buy a GPU before you can fine-tune anything is just wrong. To choose which open model to start from, see the open-weight tier right now, and the full lineup with prices and benchmarks lives on the models index.