RAG vs Fine-Tuning: An Honest Decision Framework

Photo by Google DeepMind

Photo by Google DeepMind
Every few weeks someone asks me whether they should fine-tune a model for their product, and in most cases the honest answer is: you do not have a fine-tuning problem, you have a retrieval problem — or no problem at all that a better prompt would not fix. The opposite mistake exists too: teams building elaborate RAG pipelines to teach a model a tone of voice, which retrieval fundamentally cannot do.
I have already written a hands-on guide to building production RAG pipelines on this blog, so this post deliberately stays at the decision level: what each technique actually changes, the questions that decide between them, and where the hybrid answer is correct. The framing is opinionated because the costs of choosing wrong are asymmetric and real.
The cleanest mental model I know: RAG changes what the model can see, fine-tuning changes how the model behaves. Almost every decision falls out of that one sentence.
RAG: knowledge injection at request time
Retrieval fetches relevant documents and places them in the context window per request. Knowledge stays outside the model, so it can be updated in seconds, scoped per user or tenant, audited, deleted, and cited. The model itself is untouched. Anthropic's contextual retrieval work shows how far the technique scales: adding chunk context before embedding cut retrieval failures by 49 percent, and 67 percent with reranking.
Fine-tuning: behavior baked into weights
Fine-tuning adjusts model weights from example pairs. It is how you change format compliance, tone, domain jargon, and task-specific reflexes — effectively more examples than any context window could hold, per OpenAI's own framing. What it is bad at: storing facts. Weights are a lossy, unauditable, undeletable place to keep knowledge that changes.
Map your actual requirement to the left column and the answer mostly picks itself:
| You need the model to... | Reach for | Why |
|---|---|---|
| Answer from your documents, wikis, or database | RAG | Knowledge changes, needs citations, and must be tenant-scoped |
| Always answer with current information | RAG | Weights are frozen at training time; retrieval is live |
| Follow a strict output format or house style | Fine-tuning | Behavior and style live in weights, not in retrieved text |
| Master domain shorthand a base model fumbles | Fine-tuning | Consistent interpretation needs training examples, not lookups |
| Cut cost or latency on one narrow, high-volume task | Fine-tuning | A tuned small model can replace a large one with shorter prompts |
| Comply with right-to-be-forgotten or per-customer data isolation | RAG | You can delete a document from an index; you cannot delete it from weights |
When the table is not enough, these four questions settle it. I run every internal AI feature request through them:
The most expensive myth: fine-tuning as knowledge storage. Teams fine-tune on their docs expecting the model to know them. The result is a model that confidently sounds like the docs while hallucinating their contents — worse than RAG on day one and decaying from there as the docs evolve. If the requirement contains the word know, the answer is retrieval.
The decision is not only about capability — it is about what you are signing up to operate. I run infrastructure for a living, so this is the part I weigh hardest.
What each path costs you after launch:
The strongest production systems I have seen combine them with a clean division of labor: fine-tune for the consistent skin — output format, tone, domain conventions — and use RAG for every fact the answer depends on. A support assistant is the classic case: tuned to sound like your team and follow your escalation format, while every product detail comes from a retrieved, current knowledge base entry.
But hybrid is a graduation, not a starting point. Sequence it: prompt engineering with a proper eval set first, add RAG when knowledge is the gap, and fine-tune last, only when format or style failures persist on top of working retrieval. Each stage de-risks the next, and most products are genuinely done after the second.
RAG changes what the model sees; fine-tuning changes how it behaves. Knowledge that changes, needs citing, or must be deletable lives in retrieval. Behavior you want every single time lives in weights — once you have the data to prove it and the evals to protect it. Start with the prompt, add retrieval when knowledge is the gap, and treat fine-tuning as the specialized tool it is rather than the default it gets marketed as.
Sources and further reading