How much can prompt caching actually reduce LLM input costs?

Prompt caching can cut input costs by up to 90 percent on cached token reads across Anthropic, OpenAI, and Google Gemini. In practice, the author reduced one feature's input bill by roughly 85 percent by caching a 40K-token system prompt and document context that was previously re-sent at full price on every request.

What is the most common reason a prompt cache stops working?

Any byte change anywhere in the shared prefix silently invalidates everything after it. The most common culprits are timestamps or request UUIDs interpolated near the top of the system prompt, non-deterministic JSON key ordering in tool definitions, and per-user tool sets or conditional system sections that give every user a unique, cold cache.

When does it make sense to use a longer TTL on Anthropic instead of the default 5-minute window?

The 1-hour TTL raises the write cost to 2x the base input price, versus 1.25x for the default, so it needs three requests against the same entry (one write plus two reads) before it beats sending the prompt uncached. It is the right choice for bursty traffic with gaps of more than 5 minutes but less than an hour between requests; for traffic that arrives more often than every 5 minutes, the default TTL refreshes on every hit and is both the cheapest and the most practical option.

How should you structure a prompt to maximize cache hits?

Order content from most stable to most volatile: tool definitions and the system prompt first, session context next, and the per-request question last. The system prompt must contain no dynamic values such as timestamps, user names, or feature flags. JSON keys in tool definitions should be sorted deterministically so the rendered bytes are identical across every request, and the cache breakpoint should mark the end of the shared stable portion rather than the volatile suffix.

How do Anthropic, OpenAI, and Gemini differ in their caching mechanics?

Anthropic requires explicit cache_control breakpoints (up to 4 per request) and charges a 1.25x write premium for a 5-minute TTL or 2x for 1 hour. OpenAI caches automatically with no code changes for prompts of 1,024 tokens or more at no write cost. Gemini offers both implicit caching on 2.5 and newer models and explicit cache objects with configurable TTL, billing storage per token-hour for explicit caches.

LLM Prompt Caching: The Cheapest Cost Optimization You Are Not Using

When I audited the token bill for one of my AI features earlier this year, the single biggest line item was not the clever part of the product. It was re-sending the same 40K-token system prompt and document context on every single request, at full price, hundreds of times a day. Prompt caching fixed that in an afternoon and cut the input bill by roughly 85 percent.

Every major provider now ships some form of prompt caching — Anthropic, OpenAI, and Google Gemini all do — but the economics and the developer ergonomics differ enough that the same prompt architecture can be nearly free on one platform and full price on another. This post covers how the three models of caching work, the actual numbers, and the prompt-architecture rules that make caching reliable instead of accidental.

How Prompt Caching Actually Works

All three providers implement the same core idea: a prefix match. The provider hashes the rendered prompt from the first byte forward, and if an earlier request already processed an identical prefix, the attention computation for that span is reused instead of recomputed. The consequence that drives everything else: any byte change anywhere in the prefix invalidates everything after it. A timestamp interpolated into your system prompt, a reordered JSON key in a tool definition, a per-user ID near the top — each one silently turns the cache off.
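
To make the prefix rule concrete, here is a minimal sketch that measures how much of two rendered prompts could be shared. It is not any provider's real implementation: providers match on tokens and internal hashes, while this compares characters. The function name sharedPrefixLength and the sample prompts are hypothetical.

```typescript
// Illustrative only: counts how many leading characters two prompts share.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const stablePrompt =
  "You are a support agent.\nPolicy: refunds within 30 days.\n";

// Volatile value at the end: the whole stable part remains a shared prefix.
const goodA = stablePrompt + "Question: where is my order?";
const goodB = stablePrompt + "Question: how do I reset my password?";

// Volatile value at the top: the prompts diverge inside the first line.
const badA = "Now: 2026-01-01T10:00:00Z\n" + stablePrompt + "Question: where is my order?";
const badB = "Now: 2026-01-01T10:00:07Z\n" + stablePrompt + "Question: where is my order?";

console.log("stable length:", stablePrompt.length);
console.log("volatile last:", sharedPrefixLength(goodA, goodB));
console.log("volatile first:", sharedPrefixLength(badA, badB));
```

With the timestamp at the top, the shared span ends before the stable prompt even begins, which is the situation where a real cache would have nothing to reuse.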

Anthropic makes caching explicit: you place a cache_control breakpoint on the last stable block, and content renders in the order tools, then system, then messages. OpenAI caches automatically with no code changes for prompts of 1,024 tokens or more. Gemini gives you both: implicit caching on Gemini 2.5 and newer models with no guarantees, and explicit cache objects you create and pay storage for when you want guaranteed savings.

// Anthropic: explicit breakpoint on the stable prefix
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default
const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 4096,
  system: [
    {
      type: "text",
      text: BIG_STABLE_SYSTEM_PROMPT, // frozen: no timestamps in here
      cache_control: { type: "ephemeral" }, // default 5-minute TTL
    },
  ],
  messages: [{ role: "user", content: userQuestion }], // volatile part last
});

Caching Economics Across Providers

Here is the comparison that matters when you are deciding where a high-volume feature lives. Figures are from each provider's official docs as of mid-2026, linked in the sources below:

Aspect	Anthropic Claude	OpenAI	Google Gemini
Activation	Explicit cache_control breakpoints, max 4 per request	Fully automatic, no code changes	Implicit (2.5 and newer) plus explicit cache objects
Cached read price	0.1x base input price — a 90 percent discount	Up to 90 percent cheaper on cached tokens	Reduced rate per cached token, model-dependent
Write premium	1.25x for 5-minute TTL, 2x for 1-hour TTL	None — caching is free and automatic	Storage billed per token-hour on explicit caches
Minimum cacheable size	512 to 4,096 tokens depending on model	1,024 tokens	2,048 to 4,096 tokens depending on model
Lifetime	5 minutes default, optional 1 hour	5 to 10 minutes of inactivity, up to 1 hour; extended retention up to 24 hours on newer models	Configurable TTL on explicit caches

Do the break-even math before reaching for long TTLs. On Anthropic, a 5-minute cache write costs 1.25x and reads cost 0.1x, so the second request already puts you ahead (1.35x for two requests versus 2x uncached). The 1-hour TTL raises the write cost to 2x, so it takes three requests (one write plus two reads, 2.2x versus 3x uncached) to come out ahead; it only wins for bursty traffic with gaps longer than 5 minutes. If your traffic arrives more often than every 5 minutes, the default TTL self-refreshes and the cheap option is also the right one.
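
The break-even arithmetic is easy to check in a few lines. This is a sketch using the Anthropic multipliers quoted above, with cost expressed in units of one uncached prompt at base input price. It assumes every request after the first arrives before the entry expires, and the function names are my own.

```typescript
// Multipliers relative to the base input price (figures quoted in the text).
const READ_MULTIPLIER = 0.1;
const WRITE_MULTIPLIER = { fiveMinutes: 1.25, oneHour: 2.0 } as const;

type Ttl = keyof typeof WRITE_MULTIPLIER;

// Cost of `requests` calls sharing one prefix: one cache write, then reads.
function cachedCost(requests: number, ttl: Ttl): number {
  if (requests <= 0) return 0;
  return WRITE_MULTIPLIER[ttl] + (requests - 1) * READ_MULTIPLIER;
}

// Without caching, every request pays full price for the prefix.
function uncachedCost(requests: number): number {
  return Math.max(0, requests);
}

// Smallest number of requests at which caching is strictly cheaper.
function breakEvenRequests(ttl: Ttl): number {
  let n = 1;
  while (cachedCost(n, ttl) >= uncachedCost(n)) n++;
  return n;
}

console.log(breakEvenRequests("fiveMinutes")); // 2
console.log(breakEvenRequests("oneHour"));     // 3
```

Both counts include the initial write as the first request, so the 1-hour TTL needs two cache reads before it beats sending the prefix uncached.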

Architecting Prompts to Be Cache-Friendly

Caching rewards a discipline you should have anyway: separating what is stable from what is volatile. The rules I now apply to every LLM feature, in order:

Freeze the system prompt. No timestamps, no user names, no feature flags interpolated into it. Dynamic context goes into a later message where it invalidates nothing above it.
Order content by stability: tool definitions and system prompt first, session context next, the per-request question dead last.
Serialize deterministically. Sort JSON keys, never iterate sets into the prompt, and keep the tool list byte-identical across requests — adding or reordering one tool rebuilds the entire cache.
Put the breakpoint at the end of the shared portion, not the end of the prompt. If you mark the varying suffix, every request writes a new cache entry and reads nothing.
For multi-turn agents, move the breakpoint forward to the most recently appended turn each request, so the whole conversation prefix accrues incrementally.
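
The serialization and ordering rules above can be sketched as a small prompt-assembly helper. Everything here is hypothetical: stableStringify, renderTools, and assemblePrompt are names I made up, sorting tools by name is just one way to pin their order, and a real API call sends tools as structured JSON rather than one string. The point is only that equivalent inputs should render to identical bytes.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted recursively, so output does not depend
// on insertion order. Arrays keep their order because it is meaningful.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => stableStringify(v)).join(",") + "]";
  const keys = Object.keys(value).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + stableStringify(value[k])).join(",") + "}";
}

type ToolDef = { name: string; description: string; input_schema: Json };

// Render tools in a fixed order (code-point order of the name, which is
// locale-independent) so every process produces the same text.
function renderTools(tools: ToolDef[]): string {
  const sorted = [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  return stableStringify(sorted);
}

// Stable parts first, volatile parts last. `stableEnd` marks where a cache
// breakpoint belongs: the end of the shared portion, not the end of the prompt.
function assemblePrompt(tools: ToolDef[], systemPrompt: string, sessionContext: string, question: string) {
  const stable = renderTools(tools) + "\n" + systemPrompt + "\n";
  return { text: stable + sessionContext + "\n" + question, stableEnd: stable.length };
}

const SYSTEM_PROMPT = "You are a support agent. Answer from the policy.";

// Same tools, different tool order and different key order.
const toolsA: ToolDef[] = [
  { name: "search", description: "Search docs", input_schema: { type: "object", properties: { q: { type: "string" } } } },
  { name: "lookup", description: "Look up an order", input_schema: { type: "object", properties: { id: { type: "string" } } } },
];
const toolsB: ToolDef[] = [
  { description: "Look up an order", input_schema: { properties: { id: { type: "string" } }, type: "object" }, name: "lookup" },
  { input_schema: { properties: { q: { type: "string" } }, type: "object" }, name: "search", description: "Search docs" },
];

const a = assemblePrompt(toolsA, SYSTEM_PROMPT, "", "Where is my order?");
const b = assemblePrompt(toolsB, SYSTEM_PROMPT, "", "How do refunds work?");
console.log(a.text.slice(0, a.stableEnd) === b.text.slice(0, b.stableEnd)); // true
```

The two requests differ only after stableEnd, so a breakpoint placed there would cover the same bytes on both.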

The Silent Invalidators

When cache reads flatline, one of these is almost always hiding in the prompt-assembly path:

A datetime call or request UUID rendered near the top of the system prompt — the prefix changes every request.
Non-deterministic JSON serialization, where key order differs between processes or deploys.
Per-user tool sets or conditional system sections, which give every user or flag combination its own cold cache.
Model or provider switches mid-conversation — caches are scoped per model, so the first request after a switch is always a full-price write.
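
The first invalidator on this list is simple enough to catch with a lint in the prompt-assembly path. The sketch below is a hypothetical check, not part of any SDK: it scans a supposedly frozen prompt for substrings that look like ISO-8601 timestamps or UUIDs. The two patterns are examples and far from exhaustive.

```typescript
// Example patterns for values that typically change on every request.
const VOLATILE_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "ISO-8601 timestamp", pattern: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/g },
  { label: "UUID", pattern: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi },
];

interface Finding {
  label: string;
  match: string;
  index: number;
}

// Returns every suspicious substring, ordered by position in the prompt.
function findVolatileValues(prompt: string): Finding[] {
  const findings: Finding[] = [];
  for (const { label, pattern } of VOLATILE_PATTERNS) {
    // Fresh RegExp per scan so lastIndex state is never shared between calls.
    const re = new RegExp(pattern.source, pattern.flags);
    let m: RegExpExecArray | null;
    while ((m = re.exec(prompt)) !== null) {
      findings.push({ label, match: m[0], index: m.index });
    }
  }
  return findings.sort((x, y) => x.index - y.index);
}

const frozen = "You are a support agent. Answer from the policy below.";
const leaky =
  "Request 3f2b8c1e-9a4d-4e7b-8f21-0c5d6e7f8a9b at 2026-01-01T10:00:00Z\n" + frozen;

console.log(findVolatileValues(frozen).length); // 0
console.log(findVolatileValues(leaky).map((f) => `${f.label} at ${f.index}`));
```

A check like this could run in a unit test against the rendered system prompt, so a stray timestamp fails the build instead of showing up on the bill.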

Verify It: Read the Usage Block, Not Your Assumptions

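Every Anthropic response reports exactly what happened in its usage block. OpenAI and Gemini expose equivalent counters (cached_tokens under the usage details on OpenAI, cachedContentTokenCount in Gemini's usage metadata). Make a habit of logging them per route: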

// Response usage block — your caching dashboard
{
  "usage": {
    "cache_creation_input_tokens": 5120, // wrote to cache (1.25x price)
    "cache_read_input_tokens": 48200,    // served from cache (0.1x price)
    "input_tokens": 310,                 // uncached remainder (full price)
    "output_tokens": 850
  }
}
// total prompt = creation + read + input_tokens.
// read == 0 on repeated identical requests? You have a silent invalidator.

I graph these three fields in Grafana next to request counts for every LLM-powered endpoint I run. The day a deploy accidentally introduces a timestamp into a system prompt, cache reads drop to zero on the chart and the bill explains itself before the invoice does.
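
For the dashboard itself, two derived numbers are more useful than the raw counters: the cache hit ratio and the share of the uncached input bill you actually paid. Here is a sketch of that calculation, assuming the Anthropic field names shown above and the default 5-minute write multiplier. The summarizeUsage helper is hypothetical.

```typescript
interface Usage {
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
  input_tokens: number;
  output_tokens: number;
}

// Multipliers relative to base input price for the default 5-minute TTL.
// Swap WRITE_X for 2.0 if the route uses the 1-hour TTL.
const WRITE_X = 1.25;
const READ_X = 0.1;

function summarizeUsage(u: Usage) {
  const promptTokens =
    u.cache_creation_input_tokens + u.cache_read_input_tokens + u.input_tokens;
  // Input tokens weighted by what each class costs relative to base price.
  const billedEquivalent =
    u.cache_creation_input_tokens * WRITE_X +
    u.cache_read_input_tokens * READ_X +
    u.input_tokens;
  return {
    promptTokens,
    // Share of the prompt served from cache.
    cacheHitRatio: promptTokens === 0 ? 0 : u.cache_read_input_tokens / promptTokens,
    // Share of the uncached input bill actually paid (1.0 means no savings).
    relativeInputCost: promptTokens === 0 ? 0 : billedEquivalent / promptTokens,
  };
}

const summary = summarizeUsage({
  cache_creation_input_tokens: 5120,
  cache_read_input_tokens: 48200,
  input_tokens: 310,
  output_tokens: 850,
});

console.log(summary.promptTokens);                 // 53630
console.log(summary.cacheHitRatio.toFixed(2));     // "0.90"
console.log(summary.relativeInputCost.toFixed(2)); // "0.21"
```

For the sample usage block, about 90 percent of the prompt came from cache and the request cost roughly a fifth of its uncached input price. A relativeInputCost that creeps toward 1.0 or above is the signal that something in the prefix has started changing.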

The Takeaway

Prompt caching is the rare optimization that is both massive and boring: no quality trade-off, no model change, often no architecture change beyond moving volatile strings out of the prefix. For RAG apps, agents with big tool lists, and anything with a fat system prompt, it routinely cuts input spend by 80 to 90 percent.

Start with the audit: log cached versus uncached tokens for a day, find the routes re-paying for the same prefix, and fix the prompt assembly order. It is an afternoon of work with a recurring payoff every month.

Sources and further reading
