My first month running AI Gymbro with Claude API in production cost me more than my monthly rent. I was making the classic beginner mistakes — sending full conversation history every request, using the most powerful model for every task, no caching whatsoever. After two months of optimization, I cut the bill by 60% while actually improving response quality. Here's exactly what I did, with real token counts and cost comparisons.
Claude 3.5 Haiku costs $0.80 per million input tokens and $4 per million output tokens as of early 2025. Claude 3.5 Sonnet is $3/$15. Claude Opus 4 is $15/$75. OpenAI's GPT-4o is $2.50/$10, while GPT-4o-mini is $0.15/$0.60. Google Gemini 1.5 Flash is $0.075/$0.30, the cheapest model I've used. The ratio that matters: for every model listed here, output tokens cost 4-5x more per token than input tokens. The dirty secret: most applications still spend 60-80% of their token budget on input tokens (system prompts, context, conversation history), not on output, because each request carries far more input than it generates. This is where optimization has the most leverage.
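To make the input-versus-output split concrete, here is a minimal sketch that prices a single request from the per-million-token rates quoted above. The `PRICES` table keys are plain labels (not API model IDs), and `requestCost` and the example token counts are hypothetical, chosen only for illustration.

```typescript
// USD per million tokens, copied from the prices quoted above.
// Keys are human-readable labels, not API model IDs.
type Price = { input: number; output: number };

const PRICES: Record<string, Price> = {
  "claude-3.5-haiku": { input: 0.8, output: 4 },
  "claude-3.5-sonnet": { input: 3, output: 15 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

// Cost of one request, split into input and output components.
function requestCost(model: string, inputTokens: number, outputTokens: number) {
  const price = PRICES[model];
  if (!price) throw new Error(`No price listed for ${model}`);
  const input = (inputTokens / 1_000_000) * price.input;
  const output = (outputTokens / 1_000_000) * price.output;
  const total = input + output;
  return { input, output, total, inputShare: input / total };
}

// Hypothetical request: 3,000 input tokens (system prompt + history), 300 output tokens.
const example = requestCost("claude-3.5-haiku", 3000, 300);
console.log(example.total.toFixed(4), `${Math.round(example.inputShare * 100)}% input`);
```

Even with output priced at 5x input, a request shaped like this one spends about two thirds of its cost on input, which is why the optimizations below mostly target the prompt side.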
Before touching any code, spend a week logging token usage per request type. Anthropic's API returns usage.input_tokens and usage.output_tokens in every response. Aggregate these by endpoint, user tier, and time of day. I found that my 'chat with your workout plan' feature used 4x more tokens than my 'log a workout' feature but generated 10x less revenue per session.
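A minimal sketch of that logging step, assuming you call a recorder after each API response. The `recordUsage` and `averageTokensPerRequest` helpers, the request-type names, and the sample numbers are all hypothetical; only the `input_tokens` and `output_tokens` field names come from the API. A real setup would persist these rows to a database rather than an in-memory map.

```typescript
// The two fields of the response `usage` object used here.
type Usage = { input_tokens: number; output_tokens: number };

type Totals = { requests: number; inputTokens: number; outputTokens: number };

// Running totals keyed by request type (hypothetical names such as "chat_with_plan").
const totalsByType = new Map<string, Totals>();

function recordUsage(requestType: string, usage: Usage): void {
  const t = totalsByType.get(requestType) ?? { requests: 0, inputTokens: 0, outputTokens: 0 };
  t.requests += 1;
  t.inputTokens += usage.input_tokens;
  t.outputTokens += usage.output_tokens;
  totalsByType.set(requestType, t);
}

function averageTokensPerRequest(requestType: string): number {
  const t = totalsByType.get(requestType);
  return t ? (t.inputTokens + t.outputTokens) / t.requests : 0;
}

// In production: recordUsage("chat_with_plan", response.usage) after each call.
recordUsage("chat_with_plan", { input_tokens: 4000, output_tokens: 400 });
recordUsage("log_workout", { input_tokens: 900, output_tokens: 200 });
console.log(averageTokensPerRequest("chat_with_plan"), averageTokensPerRequest("log_workout"));
```

Extending the key to include user tier or hour of day is a small change once the per-request-type view exists.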
Anthropic's prompt caching lets you cache static portions of your prompt and pay 10% of the normal input token price for cache reads. Cache writes cost 25% extra. If your system prompt is 2,000 tokens and you handle 1,000 requests per day, without caching that's 2 million tokens at $0.80/M = $1.60/day. With caching (90% hit rate), you pay around $0.34/day — a 79% reduction on system prompt costs alone.
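The arithmetic above can be checked with a short sketch. The `dailySystemPromptCost` helper is hypothetical, and it assumes every cache miss is billed as a cache write at 125% of the base input price while every hit is billed at 10%.

```typescript
// Billing multipliers from the paragraph above.
const CACHE_READ_MULTIPLIER = 0.1; // cache reads cost 10% of base input price
const CACHE_WRITE_MULTIPLIER = 1.25; // cache writes cost 25% extra

function dailySystemPromptCost(opts: {
  promptTokens: number;
  requestsPerDay: number;
  pricePerMillion: number; // base input price in USD
  hitRate: number; // fraction of requests served from cache, 0..1
}) {
  const { promptTokens, requestsPerDay, pricePerMillion, hitRate } = opts;
  const perRequest = (promptTokens / 1_000_000) * pricePerMillion;
  const uncached = perRequest * requestsPerDay;
  const hits = requestsPerDay * hitRate;
  const misses = requestsPerDay - hits;
  const cached =
    hits * perRequest * CACHE_READ_MULTIPLIER + misses * perRequest * CACHE_WRITE_MULTIPLIER;
  return { uncached, cached, savings: 1 - cached / uncached };
}

const result = dailySystemPromptCost({
  promptTokens: 2000,
  requestsPerDay: 1000,
  pricePerMillion: 0.8,
  hitRate: 0.9,
});
// prints: 1.60 0.34 78.5%
console.log(result.uncached.toFixed(2), result.cached.toFixed(2), `${(result.savings * 100).toFixed(1)}%`);
```

The 900 hits cost $0.144 and the 100 misses cost $0.20, which is where the roughly $0.34/day figure comes from. Note how the misses dominate the cached bill: the hit rate matters more than the read discount.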
LLM API Cost Optimization Stack
┌─────────────────────────────────────────────────────┐
│                   Request arrives                   │
└─────────────────────────┬───────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│  1. Model Router                                    │
│     Simple task? → Claude Haiku ($0.80/M)           │
│     Medium task? → Claude Sonnet ($3.00/M)          │
│     Complex task? → Claude Opus ($15.00/M)          │
└─────────────────────────┬───────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│  2. Prompt Cache Check                              │
│     System prompt cached? → Pay 10% of input cost   │
│     Cache miss? → Pay 25% extra to cache            │
└─────────────────────────┬───────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│  3. History Compressor (every 5 turns)              │
│     Full history → 200-token summary                │
│     Saves ~45% on input tokens per session          │
└─────────────────────────┬───────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│  4. Output Mode                                     │
│     Real-time? → Sync API (full price)              │
│     Async task? → Batch API (50% discount)          │
└─────────────────────────────────────────────────────┘
I reduced my Claude API bill by putting the static system prompt first, as a text block in the system parameter marked with cache_control: { type: 'ephemeral' }, and sending the dynamic conversation history after it in the messages array. The default cache TTL is 5 minutes and is refreshed every time the cached content is read, so as long as users are actively chatting, cache hits stay near 100%.
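Rather than assuming the hit rate, it can be measured from the cache-related fields the API reports in `usage`. This is a minimal sketch: the `cacheHitRate` helper is hypothetical, and it defines the hit rate as cached tokens read divided by all cacheable tokens (read plus written), which is one reasonable definition among several.

```typescript
// Cache-related fields on the Messages API `usage` object.
type CacheUsage = {
  input_tokens: number; // input tokens not covered by the cache
  cache_creation_input_tokens?: number | null; // tokens written to the cache
  cache_read_input_tokens?: number | null; // tokens read from the cache
};

// Share of cacheable prompt tokens that were served from cache.
function cacheHitRate(usages: CacheUsage[]): number {
  let read = 0;
  let written = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens ?? 0;
    written += u.cache_creation_input_tokens ?? 0;
  }
  return read + written === 0 ? 0 : read / (read + written);
}
```

If this number sits at zero, the cache is not being used at all, which usually means the cached prefix changes between requests or is shorter than the model's minimum cacheable length.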
The most impactful optimization I made was routing different request types to different models. Before routing: everything went to Claude 3.5 Sonnet. After routing: tier 1 tasks go to Claude 3.5 Haiku, tier 2 stays on Sonnet, tier 3 (under 5% of requests) goes to Opus. Average cost per request dropped from $0.0043 to $0.0018 — a 58% reduction.
Every multi-turn conversation accumulates history. Compression strategy: after every 5 turns, call a cheap model with the prompt 'Summarize the key facts and decisions from this conversation in under 200 tokens.' Replace the raw history with this summary. This alone reduced my average conversation input tokens by 45%.
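A minimal sketch of that compression step. It assumes one "turn" means one user message plus its reply, and it takes the summarizer as an injected function so any cheap model can be plugged in. `ChatMessage`, `Summarizer`, `shouldCompress`, `toCompressedHistory`, and `compressHistory` are hypothetical names, not part of any SDK.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Any function that sends a prompt to a cheap model and returns its text reply.
type Summarizer = (prompt: string) => Promise<string>;

const SUMMARY_INSTRUCTION =
  "Summarize the key facts and decisions from this conversation in under 200 tokens.";

// Assumption: one turn = one user message (plus the reply that follows it).
function shouldCompress(history: ChatMessage[], turnLimit = 5): boolean {
  return history.filter((m) => m.role === "user").length >= turnLimit;
}

// Replace the raw history with a two-message stub that carries the summary,
// so the next user message still alternates roles correctly.
function toCompressedHistory(summary: string): ChatMessage[] {
  return [
    { role: "user", content: `Summary of our conversation so far: ${summary}` },
    { role: "assistant", content: "Got it. Continuing from that summary." },
  ];
}

async function compressHistory(
  history: ChatMessage[],
  summarize: Summarizer,
): Promise<ChatMessage[]> {
  if (!shouldCompress(history)) return history;
  const transcript = history.map((m) => `${m.role}: ${m.content}`).join("\n");
  const summary = await summarize(`${SUMMARY_INSTRUCTION}\n\n${transcript}`);
  return toCompressedHistory(summary);
}
```

The 200-token limit in the prompt is only a request to the model. Setting max_tokens on the summarizer call is what actually caps the summary length.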
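// Anthropic prompt caching (TypeScript)
import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

// Static instructions shared by every request. Anthropic documents a minimum
// cacheable prompt length (1,024 tokens for Sonnet and Opus models, 2,048 for
// Claude 3.5 Haiku at the time of writing); shorter prompts are processed
// without caching, so check the current docs against your prompt size.
const SYSTEM_PROMPT = `You are a fitness AI assistant for AI Gymbro...
[... 600+ tokens of static instructions ...]`

// userId is not used in this snippet; it is kept for logging and rate limiting.
async function chat(userId: string, history: Anthropic.MessageParam[], userMessage: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-3-5-haiku-20241022",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }, // cache the prompt up to and including this block
      },
    ],
    messages: [
      ...history, // dynamic, not cached
      { role: "user", content: userMessage },
    ],
  })
  // content is a union of block types, so narrow to a text block before reading .text
  const first = response.content[0]
  return first?.type === "text" ? first.text : ""
}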
// Model routing by complexity (prices are per million input tokens)
type TaskType = "classify" | "plan" | "analyze"

function selectModel(taskType: TaskType): string {
  const models: Record<TaskType, string> = {
    classify: "claude-3-5-haiku-20241022", // $0.80/M
    plan: "claude-3-5-sonnet-20241022", // $3.00/M
    analyze: "claude-opus-4-20250514", // $15.00/M (Claude Opus 4, matching the pricing above)
  }
  return models[taskType]
}
Not every LLM call needs to happen in real time. By moving weekly summaries to async jobs submitted every 15 minutes through Anthropic's Message Batches API, I pay 50% less for those calls. Results come back asynchronously (Anthropic documents completion within 24 hours), which is fine for a weekly summary. This change alone saved 15% of my monthly API costs.
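A minimal sketch of preparing such a batch. Each batch entry pairs a caller-chosen `custom_id` with ordinary Messages API parameters. The `buildWeeklySummaryBatch` helper, the input shape, and the prompt wording are hypothetical, and the sketch assumes user IDs contain only letters, digits, hyphens, and underscores so they are safe to embed in `custom_id`.

```typescript
// One entry in a Message Batches request.
type BatchRequest = {
  custom_id: string;
  params: {
    model: string;
    max_tokens: number;
    messages: { role: "user" | "assistant"; content: string }[];
  };
};

// Hypothetical input: one weekly workout log per user, already rendered as text.
function buildWeeklySummaryBatch(logs: { userId: string; weekLog: string }[]): BatchRequest[] {
  return logs.map(({ userId, weekLog }): BatchRequest => ({
    custom_id: `weekly-summary-${userId}`, // used to match results back to users
    params: {
      model: "claude-3-5-haiku-20241022",
      max_tokens: 512,
      messages: [{ role: "user", content: `Summarize this week's training:\n${weekLog}` }],
    },
  }));
}

// Submitting (not executed here), assuming the official SDK client:
//   const batch = await client.messages.batches.create({ requests })
//   then poll until batch.processing_status === "ended" and read the results.
```

Results are not guaranteed to come back in submission order, so the `custom_id` is what ties each summary to its user.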
Routing simple queries to a cheap model sounds obvious, but the definition of 'simple' drifts. I switched all workout logging to Haiku and noticed within two weeks that it was hallucinating exercise names. Always A/B test quality metrics (not just cost) when routing to a cheaper model. Set up a small human evaluation pipeline before committing to model changes.
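One way to set up that A/B split is to assign each user deterministically to the baseline or the cheaper candidate model, so the same user always sees the same variant and quality metrics can be compared per group. This is a sketch under assumptions: `fnv1a` and `assignModel` are hypothetical helpers, and the hash mixes UTF-16 code units rather than bytes, which is fine for bucketing but is not canonical FNV-1a for non-ASCII IDs.

```typescript
// FNV-1a style 32-bit hash: small, dependency-free, stable across processes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Send a fixed share of users to the candidate model; everyone else stays on the baseline.
function assignModel(
  userId: string,
  candidateShare: number, // 0..1
  baseline: string,
  candidate: string,
): string {
  const bucket = fnv1a(userId) / 0x100000000; // in [0, 1)
  return bucket < candidateShare ? candidate : baseline;
}
```

Logging the assigned variant next to each response is what makes the later human evaluation possible: reviewers can score samples from both groups without knowing which model produced them.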
Forcing structured output via JSON mode or tool use eliminates the conversational wrapper. A response that used to be 'Sure! Here's your workout plan: {...json...}' becomes just the JSON. For my structured outputs, this reduced output tokens by 30-40% per request.
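With Claude, one way to do this is to define a tool and force the model to call it via `tool_choice`, so the reply arrives as a `tool_use` block carrying JSON instead of prose. The sketch below only builds the request parameters and reads the result. The tool name, schema fields, and the `buildPlanRequest` and `extractToolInput` helpers are hypothetical.

```typescript
// Tool definition: a name, a description, and a JSON Schema for the input.
const workoutPlanTool = {
  name: "record_workout_plan",
  description: "Return a workout plan as structured data.",
  input_schema: {
    type: "object" as const,
    properties: {
      days: {
        type: "array",
        items: {
          type: "object",
          properties: {
            day: { type: "string" },
            exercises: { type: "array", items: { type: "string" } },
          },
          required: ["day", "exercises"],
        },
      },
    },
    required: ["days"],
  },
};

// Request parameters that force a call to the tool instead of a prose reply.
function buildPlanRequest(userMessage: string) {
  return {
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 1024,
    tools: [workoutPlanTool],
    tool_choice: { type: "tool" as const, name: workoutPlanTool.name },
    messages: [{ role: "user" as const, content: userMessage }],
  };
}

// Minimal view of a response content block, enough to find the tool call.
type ContentBlock = { type: string; input?: unknown };

function extractToolInput(content: ContentBlock[]): unknown {
  const block = content.find((b) => b.type === "tool_use");
  if (!block) throw new Error("Model did not return a tool_use block");
  return block.input;
}
```

The tool definition itself counts as input tokens on every request, so the net saving depends on how large the schema is relative to the wrapper text it removes.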
The complete stack I'm running now: prompt caching on all static system prompts (saves 60-70% on cached input tokens), model routing by task complexity (cut average cost per request by 58%), conversation compression after every 5 turns (saves about 45% on conversation input tokens), batch processing for async tasks (saves 50% on batch-eligible calls), and structured outputs where applicable (saves 30-40% on output tokens). Each saving applies to a different, partly overlapping slice of the bill, so the percentages don't add up. Combined effect: 60% overall cost reduction.