LLM Prompt Caching: The Cheapest Cost Optimization You Are Not Using

Photo by Google DeepMind

When I audited the token bill for one of my AI features earlier this year, the single biggest line item was not the clever part of the product. It was re-sending the same 40K-token system prompt and document context on every single request, at full price, hundreds of times a day. Prompt caching fixed that in an afternoon and cut the input bill by roughly 85 percent.
Every major provider now ships some form of prompt caching — Anthropic, OpenAI, and Google Gemini all do — but the economics and the developer ergonomics differ enough that the same prompt architecture can be nearly free on one platform and full price on another. This post covers how the three models of caching work, the actual numbers, and the prompt-architecture rules that make caching reliable instead of accidental.
All three providers implement the same core idea: a prefix match. The provider hashes the rendered prompt from the first byte forward, and if an earlier request already processed an identical prefix, the attention computation for that span is reused instead of recomputed. The consequence that drives everything else: any byte change anywhere in the prefix invalidates everything after it. A timestamp interpolated into your system prompt, a reordered JSON key in a tool definition, a per-user ID near the top — each one silently turns the cache off.
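To make the prefix rule concrete, here is a minimal sketch that compares two rendered prompts character by character. It is an illustration only: real providers match on tokens rather than characters, and `sharedPrefixLength`, `INSTRUCTIONS`, and the two builder functions are hypothetical names invented for this example.

```typescript
// Length of the shared prefix between two rendered prompts, in UTF-16 code units.
// Providers match on tokens, so treat this as a rough proxy, not their algorithm.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Hypothetical stable instructions, standing in for a large system prompt.
const INSTRUCTIONS =
  "You are a support assistant. Answer from the attached policy document.";

// Volatile value first: requests diverge inside the timestamp.
const timestampFirst = (now: string, question: string): string =>
  `Current time: ${now}\n${INSTRUCTIONS}\n${question}`;

// Stable text first: requests diverge only in the volatile tail.
const stableFirst = (now: string, question: string): string =>
  `${INSTRUCTIONS}\nCurrent time: ${now}\n${question}`;

const badShared = sharedPrefixLength(
  timestampFirst("2026-05-01T10:00:00Z", "Can I return shoes?"),
  timestampFirst("2026-05-01T10:00:07Z", "Where is my order?"),
);
const goodShared = sharedPrefixLength(
  stableFirst("2026-05-01T10:00:00Z", "Can I return shoes?"),
  stableFirst("2026-05-01T10:00:07Z", "Where is my order?"),
);

console.log({ badShared, goodShared });
```

With the timestamp at the top, the shared prefix ends inside the timestamp and none of the instructions are reusable. With the stable text first, the whole instruction block is shared between the two requests.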
Anthropic makes caching explicit: you place a cache_control breakpoint on the last stable block, and content renders in the order tools, then system, then messages. OpenAI caches automatically with no code changes for prompts of 1,024 tokens or more. Gemini gives you both: implicit caching on Gemini 2.5 and newer models with no guarantees, and explicit cache objects you create and pay storage for when you want guaranteed savings.
```typescript
import Anthropic from "@anthropic-ai/sdk";

// Anthropic: explicit breakpoint on the stable prefix
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 4096,
  system: [
    {
      type: "text",
      text: BIG_STABLE_SYSTEM_PROMPT, // frozen: no timestamps in here
      cache_control: { type: "ephemeral" }, // 5-minute TTL
    },
  ],
  messages: [{ role: "user", content: userQuestion }], // volatile part last
});
```

Here is the comparison that matters when you are deciding where a high-volume feature lives. Figures are from each provider's official docs as of mid-2026, linked in the sources below:
| Aspect | Anthropic Claude | OpenAI | Google Gemini |
|---|---|---|---|
| Activation | Explicit cache_control breakpoints, max 4 per request | Fully automatic, no code changes | Implicit (2.5 and newer) plus explicit cache objects |
| Cached read price | 0.1x base input price — a 90 percent discount | Up to 90 percent cheaper on cached tokens | Reduced rate per cached token, model-dependent |
| Write premium | 1.25x for 5-minute TTL, 2x for 1-hour TTL | None — caching is free and automatic | Storage billed per token-hour on explicit caches |
| Minimum cacheable size | 512 to 4,096 tokens depending on model | 1,024 tokens | 2,048 to 4,096 tokens depending on model |
| Lifetime | 5 minutes default, optional 1 hour | 5 to 10 minutes of inactivity, up to 1 hour; extended retention up to 24 hours on newer models | Configurable TTL on explicit caches |
Do the break-even math before reaching for long TTLs. On Anthropic, a 5-minute cache write costs 1.25x and reads cost 0.1x, so two requests cost 1.35x instead of 2x and the second request already puts you ahead. The 1-hour TTL raises the write cost to 2x, so against uncached pricing it only pulls ahead on the third request (2.2x versus 3x). It wins for bursty traffic whose gaps are longer than 5 minutes but shorter than an hour. If your traffic arrives more often than every 5 minutes, the default TTL self-refreshes and the cheap option is also the right one.
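The break-even arithmetic is easy to check in a few lines. This is a minimal sketch using the Anthropic multipliers quoted above. It assumes one cache write followed by hits that all land inside the TTL window, and the function names are illustrative rather than part of any SDK.

```typescript
// Costs are expressed in multiples of one full-price read of the prefix.
const CACHED_READ = 0.1;

function uncachedCost(requests: number): number {
  return requests;
}

// Assumes one cache write, then (requests - 1) cache hits before the TTL expires.
function cachedCost(requests: number, writeMultiplier: number): number {
  if (requests <= 0) return 0;
  return writeMultiplier + (requests - 1) * CACHED_READ;
}

// Smallest request count at which caching is strictly cheaper than not caching.
function breakEvenRequests(writeMultiplier: number): number {
  let n = 1;
  while (cachedCost(n, writeMultiplier) >= uncachedCost(n)) n++;
  return n;
}

console.log("5-minute TTL:", breakEvenRequests(1.25));
console.log("1-hour TTL:", breakEvenRequests(2.0));
```

Under these assumptions caching is ahead from the second request at the 5-minute write price and from the third request at the 1-hour write price. If the cache expires between requests, every request pays the write premium again and the numbers get worse.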
Caching rewards a discipline you should have anyway: separating what is stable from what is volatile. The rules I now apply to every LLM feature, in order:
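As a minimal sketch of that stable-versus-volatile split, the helper below puts everything that never changes into cached system blocks and pushes every per-request value into the final user message. `buildRequest` and its parameters are hypothetical names for illustration. The types are small local stand-ins for the request shape used in the Anthropic example earlier, not the SDK's own types.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface MessageParams {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: { role: "user" | "assistant"; content: string }[];
}

// Stable content goes first and carries the breakpoint.
// Per-request values (user, time, question) go last, outside the cached prefix.
function buildRequest(
  stableSystemPrompt: string,
  stableDocumentContext: string,
  volatile: { userId: string; now: string; question: string },
): MessageParams {
  return {
    model: "claude-opus-4-8", // model name copied from the earlier example
    max_tokens: 4096,
    system: [
      { type: "text", text: stableSystemPrompt },
      // Breakpoint on the last stable block: everything up to here is cacheable.
      { type: "text", text: stableDocumentContext, cache_control: { type: "ephemeral" } },
    ],
    messages: [
      {
        role: "user",
        content: `User: ${volatile.userId}\nTime: ${volatile.now}\n\n${volatile.question}`,
      },
    ],
  };
}

const first = buildRequest("SYSTEM RULES", "POLICY DOCUMENT", {
  userId: "u_123",
  now: "2026-05-01T10:00:00Z",
  question: "Can I return shoes?",
});
const second = buildRequest("SYSTEM RULES", "POLICY DOCUMENT", {
  userId: "u_456",
  now: "2026-05-01T10:03:12Z",
  question: "Where is my order?",
});

// The cacheable prefix is identical across users and timestamps.
console.log(JSON.stringify(first.system) === JSON.stringify(second.system));
```

The design choice is that nothing request-specific can reach the system blocks by construction, so a new timestamp or user ID cannot invalidate the prefix.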
When cache reads flatline, one of these is almost always hiding in the prompt-assembly path:
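One invalidator named earlier, a reordered JSON key in a tool definition, can be ruled out with deterministic serialization. The sketch below is one possible approach, assuming you either embed JSON in prompt text or want a stable fingerprint of the prefix to compare across deploys. `stableStringify` and the sample tool objects are hypothetical, not part of any provider SDK.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted recursively, so logically equal values
// always produce the same string. Array order is preserved because it is meaningful.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  const entries = Object.keys(value)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
  return `{${entries.join(",")}}`;
}

// The same tool definition, built with different key insertion order.
const toolA: Json = {
  name: "get_weather",
  input_schema: { type: "object", properties: { city: { type: "string" } } },
};
const toolB: Json = {
  input_schema: { properties: { city: { type: "string" } }, type: "object" },
  name: "get_weather",
};

const naiveEqual = JSON.stringify(toolA) === JSON.stringify(toolB);
const stableEqual = stableStringify(toolA) === stableStringify(toolB);
console.log({ naiveEqual, stableEqual });
```

Plain `JSON.stringify` follows insertion order, so the two definitions serialize differently even though they mean the same thing. Sorting keys removes that source of drift, and logging a hash of the stable string per deploy is a cheap way to notice when the prefix has changed.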
Every Anthropic response reports exactly what happened, and OpenAI and Gemini expose equivalent fields (cached_tokens in OpenAI's usage details, cached_content_token_count in Gemini's usage metadata). Make a habit of logging them per route:
```jsonc
// Response usage block: your caching dashboard
{
  "usage": {
    "cache_creation_input_tokens": 5120, // wrote to cache (1.25x price)
    "cache_read_input_tokens": 48200,    // served from cache (0.1x price)
    "input_tokens": 310,                 // uncached remainder (full price)
    "output_tokens": 850
  }
}
// total prompt = creation + read + input_tokens.
// read == 0 on repeated identical requests? You have a silent invalidator.
```

I graph these three fields in Grafana next to request counts for every LLM-powered endpoint I run. The day a deploy accidentally introduces a timestamp into a system prompt, cache reads drop to zero on the chart and the bill explains itself before the invoice does.
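Turning those fields into two numbers per request makes them easier to chart. This is a minimal sketch, assuming the Anthropic usage field names shown above and the 5-minute TTL multipliers quoted in this post. `summarizeCache` and the constant names are illustrative.

```typescript
// Field names follow the Anthropic usage block.
interface Usage {
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
  input_tokens: number;
  output_tokens: number;
}

// Multipliers relative to base input price (5-minute TTL). Adjust for the 1-hour TTL.
const CACHE_WRITE_PRICE = 1.25;
const CACHE_READ_PRICE = 0.1;

function summarizeCache(usage: Usage) {
  const totalPromptTokens =
    usage.cache_creation_input_tokens + usage.cache_read_input_tokens + usage.input_tokens;
  // Input cost expressed in full-price token equivalents.
  const billedEquivalent =
    usage.cache_creation_input_tokens * CACHE_WRITE_PRICE +
    usage.cache_read_input_tokens * CACHE_READ_PRICE +
    usage.input_tokens;
  return {
    totalPromptTokens,
    // Share of the prompt served from cache.
    cacheReadRatio:
      totalPromptTokens === 0 ? 0 : usage.cache_read_input_tokens / totalPromptTokens,
    // Fraction of input spend saved compared with paying full price for every token.
    inputSavings: totalPromptTokens === 0 ? 0 : 1 - billedEquivalent / totalPromptTokens,
  };
}

const summary = summarizeCache({
  cache_creation_input_tokens: 5120,
  cache_read_input_tokens: 48200,
  input_tokens: 310,
  output_tokens: 850,
});
console.log(summary);
```

A `cacheReadRatio` that drops toward zero on a route with repeated prefixes is the signal to go looking for a silent invalidator. A negative `inputSavings` means the route is paying write premiums without ever reading the cache back.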
Prompt caching is the rare optimization that is both massive and boring: no quality trade-off, no model change, often no architecture change beyond moving volatile strings out of the prefix. For RAG apps, agents with big tool lists, and anything with a fat system prompt, it routinely cuts input spend by 80 to 90 percent.
Start with the audit: log cached versus uncached tokens for a day, find the routes re-paying for the same prefix, and fix the prompt assembly order. It is an afternoon of work with a recurring payoff every month.
Sources and further reading