What is the core difference between RAG and fine-tuning?

RAG changes what the model can see by injecting relevant documents into the context window at request time, while fine-tuning changes how the model behaves by adjusting its weights from example pairs. Knowledge stored via RAG can be updated in seconds, audited, and deleted; knowledge baked into weights is lossy and cannot be removed after training.

When should I choose RAG over fine-tuning?

Choose RAG whenever the model needs to answer from documents, wikis, or a database that changes frequently, requires citations, or must be scoped per user or tenant. Because model weights are frozen at training time, any knowledge that updates daily or weekly must live in retrieval — a fine-tuned model goes stale the moment your underlying data changes.

Why is using fine-tuning to store facts a mistake?

Weights are a lossy, unauditable, and undeletable place to keep knowledge. Teams that fine-tune on their documents often end up with a model that confidently sounds like those docs while hallucinating their actual contents — worse than RAG from day one and increasingly unreliable as the documents evolve.

What does the post recommend as the correct sequencing for building an AI feature?

The post recommends starting with prompt engineering against a proper eval set, then adding RAG when a knowledge gap is the bottleneck, and only fine-tuning last — and only when format or style failures persist on top of already-working retrieval. Most products are genuinely done after the retrieval stage.

What operational costs does each approach carry after launch?

RAG requires maintaining an embedding pipeline, keeping an index in sync with source documents, monitoring retrieval quality, and revisiting chunking decisions — all manageable with normal engineering. Fine-tuning adds dataset curation and versioning, training runs, an eval harness to catch regressions and catastrophic forgetting, redeployment for every knowledge or behavior update, and the need to re-run everything when the base model is deprecated.

RAG vs Fine-Tuning: An Honest Decision Framework

Every few weeks someone asks me whether they should fine-tune a model for their product, and in most cases the honest answer is: you do not have a fine-tuning problem, you have a retrieval problem — or no problem at all that a better prompt would not fix. The opposite mistake exists too: teams building elaborate RAG pipelines to teach a model a tone of voice, which retrieval fundamentally cannot do.

I have already written a hands-on guide to building production RAG pipelines on this blog, so this post deliberately stays at the decision level: what each technique actually changes, the questions that decide between them, and where the hybrid answer is correct. The framing is opinionated because the costs of choosing wrong are asymmetric and real.

What Each Technique Actually Changes

The cleanest mental model I know: RAG changes what the model can see, fine-tuning changes how the model behaves. Almost every decision falls out of that one sentence.

RAG: knowledge injection at request time

Retrieval fetches relevant documents and places them in the context window per request. Knowledge stays outside the model, so it can be updated in seconds, scoped per user or tenant, audited, deleted, and cited. The model itself is untouched. Anthropic's contextual retrieval work shows how far the technique scales: adding chunk context before embedding cut retrieval failures by 49 percent, and 67 percent with reranking.

Fine-tuning: behavior baked into weights

Fine-tuning adjusts model weights from example pairs. It is how you change format compliance, tone, domain jargon, and task-specific reflexes — effectively more examples than any context window could hold, per OpenAI's own framing. What it is bad at: storing facts. Weights are a lossy, unauditable, undeletable place to keep knowledge that changes.

The Decision Table

Map your actual requirement to the left column and the answer mostly picks itself:

You need the model to...	Reach for	Why
Answer from your documents, wikis, or database	RAG	Knowledge changes, needs citations, and must be tenant-scoped
Always answer with current information	RAG	Weights are frozen at training time; retrieval is live
Follow a strict output format or house style	Fine-tuning	Behavior and style live in weights, not in retrieved text
Master domain shorthand a base model fumbles	Fine-tuning	Consistent interpretation needs training examples, not lookups
Cut cost or latency on one narrow, high-volume task	Fine-tuning	A tuned small model can replace a large one with shorter prompts
Comply with right-to-be-forgotten or per-customer data isolation	RAG	You can delete a document from an index; you cannot delete it from weights

Four Questions Before You Decide

When the table is not enough, these four questions settle it. I run every internal AI feature request through them:

Does the failure come from missing knowledge or wrong behavior? Ask the model the question with the relevant document pasted in. If it answers well, you have a retrieval problem. If it still rambles or misformats, you have a behavior problem.
How often does the underlying information change? Daily or weekly means RAG, full stop. A fine-tune is stale the moment your price list changes.
Do you actually have training data? A useful fine-tune wants hundreds to thousands of high-quality examples. If you would need to manufacture them, your first project is an eval set and a better prompt, not a training run.
Have you exhausted prompting? Frontier models with a few-shot prompt and good context cover a shocking share of cases that teams assume need tuning. OpenAI's own fine-tuning guide tells you to optimize prompts against evals first — vendors selling training compute rarely lead with that.

The most expensive myth: fine-tuning as knowledge storage. Teams fine-tune on their docs expecting the model to know them. The result is a model that confidently sounds like the docs while hallucinating their contents — worse than RAG on day one and decaying from there as the docs evolve. If the requirement contains the word know, the answer is retrieval.

The Operational Bill Nobody Quotes

The decision is not only about capability — it is about what you are signing up to operate. I run infrastructure for a living, so this is the part I weigh hardest.

What each path costs you after launch:

RAG: an embedding pipeline, an index to keep in sync with sources, retrieval quality monitoring, and chunking decisions you will revisit. All of it is observable and fixable in production with normal engineering, and it works with the frontier model you already use.
Fine-tuning: dataset curation and versioning, training runs to manage, an eval harness to catch regressions and forgetting, redeployment for every knowledge or behavior update, and re-running everything when the base model is deprecated. You also typically tune a smaller model, trading away frontier reasoning.

When the Answer Is Both

The strongest production systems I have seen combine them with a clean division of labor: fine-tune for the consistent skin — output format, tone, domain conventions — and use RAG for every fact the answer depends on. A support assistant is the classic case: tuned to sound like your team and follow your escalation format, while every product detail comes from a retrieved, current knowledge base entry.

But hybrid is a graduation, not a starting point. Sequence it: prompt engineering with a proper eval set first, add RAG when knowledge is the gap, and fine-tune last, only when format or style failures persist on top of working retrieval. Each stage de-risks the next, and most products are genuinely done after the second.

The Bottom Line

RAG changes what the model sees; fine-tuning changes how it behaves. Knowledge that changes, needs citing, or must be deletable lives in retrieval. Behavior you want every single time lives in weights — once you have the data to prove it and the evals to protect it. Start with the prompt, add retrieval when knowledge is the gap, and treat fine-tuning as the specialized tool it is rather than the default it gets marketed as.

Sources and further reading

Frequently Asked Questions

RAG vs Fine-Tuning: An Honest Decision Framework

Frequently Asked Questions

RAG vs Fine-Tuning: An Honest Decision Framework

What Each Technique Actually Changes

The Decision Table

Four Questions Before You Decide

The Operational Bill Nobody Quotes

When the Answer Is Both

The Bottom Line

What Each Technique Actually Changes

The Decision Table

Four Questions Before You Decide

The Operational Bill Nobody Quotes

When the Answer Is Both

The Bottom Line