What was the single biggest cost-saving technique used in this project?

Prompt caching had the largest individual impact. By placing static system prompts first in the messages array with cache_control set to ephemeral, subsequent requests paid only 10% of the normal input token price on cache reads. At a 90% hit rate on a 2,000-token system prompt with 1,000 daily requests, this reduced system prompt costs by 79%.

How does model routing work and how much did it save?

Model routing means directing different task types to different-tier models instead of sending everything to the most capable one. After categorizing tasks into three complexity tiers — simple classification, moderate reasoning, and complex personalization — the average cost per request dropped from $0.0043 to $0.0018, a 58% reduction, without degrading overall quality.

What is conversation history compression and why does it matter?

As multi-turn conversations grow, the full history is re-sent with every request, inflating input token counts. After every 5 turns, a cheap model is called to summarize the conversation into under 200 tokens, and that summary replaces the raw history. This technique alone reduced average conversation input tokens by 45%.

What quality risk should you watch for when routing to cheaper models?

The definition of a 'simple' task can drift over time, leading cheaper models to handle requests they weren't validated for. In this project, routing all workout logging to Claude 3.5 Haiku caused it to hallucinate exercise names within two weeks. Always A/B test quality metrics — not just cost — and set up a small human evaluation pipeline before committing to model changes.

Which tasks are good candidates for the batch API and how much does it save?

Tasks that do not need to happen in real time — such as weekly workout summaries — are ideal for batch processing. Using Anthropic's batch API for these async jobs comes at a 50% cost reduction compared to synchronous calls, and accounted for 15% of the total monthly API savings in this project.

How I Cut My LLM API Bill by 60% Without Touching Model Quality

My first month running AI Gymbro with Claude API in production cost me more than my monthly rent. I was making the classic beginner mistakes — sending full conversation history every request, using the most powerful model for every task, no caching whatsoever. After two months of optimization, I cut the bill by 60% while actually improving response quality. Here's exactly what I did, with real token counts and cost comparisons.

Understanding LLM Pricing: Where Your Money Actually Goes

Claude 3.5 Haiku costs $0.80 per million input tokens and $4 per million output tokens as of early 2025. Claude 3.5 Sonnet is $3/$15. Claude Opus 4 is $15/$75. The ratio that matters: output tokens are almost always 3-5x more expensive per token than input tokens. OpenAI's GPT-4o is $2.50/$10, while GPT-4o-mini is $0.15/$0.60. Google Gemini 1.5 Flash is $0.075/$0.30 — the cheapest frontier model I've used. The dirty secret: most applications spend 60-80% of their token budget on input tokens (system prompts, context, conversation history), not on output. This is where optimization has the most leverage.

Token Counting Before You Optimize

Before touching any code, spend a week logging token usage per request type. Anthropic's API returns usage.input_tokens and usage.output_tokens in every response. Aggregate these by endpoint, user tier, and time of day. I found that my 'chat with your workout plan' feature used 4x more tokens than my 'log a workout' feature but generated 10x less revenue per session.

Prompt Caching: The Single Biggest Win

Anthropic's prompt caching lets you cache static portions of your prompt and pay 10% of the normal input token price for cache reads. Cache writes cost 25% extra. If your system prompt is 2,000 tokens and you handle 1,000 requests per day, without caching that's 2 million tokens at $0.80/M = $1.60/day. With caching (90% hit rate), you pay around $0.34/day — a 79% reduction on system prompt costs alone.

LLM API Cost Optimization Stack

  ┌─────────────────────────────────────────────────────┐
  │  Request arrives                                    │
  └─────────────────────────┬───────────────────────────┘
                            │
                            ▼
  ┌─────────────────────────────────────────────────────┐
  │  1. Model Router                                    │
  │     Simple task?  → Claude Haiku  ($0.80/M)        │
  │     Medium task?  → Claude Sonnet ($3.00/M)        │
  │     Complex task? → Claude Opus   ($15.00/M)       │
  └─────────────────────────┬───────────────────────────┘
                            │
                            ▼
  ┌─────────────────────────────────────────────────────┐
  │  2. Prompt Cache Check                              │
  │     System prompt cached? → Pay 10% of input cost  │
  │     Cache miss?           → Pay 25% extra to cache │
  └─────────────────────────┬───────────────────────────┘
                            │
                            ▼
  ┌─────────────────────────────────────────────────────┐
  │  3. History Compressor (every 5 turns)              │
  │     Full history → 200-token summary               │
  │     Saves ~45% on input tokens per session         │
  └─────────────────────────┬───────────────────────────┘
                            │
                            ▼
  ┌─────────────────────────────────────────────────────┐
  │  4. Output Mode                                     │
  │     Real-time?  → Sync API  (full price)           │
  │     Async task? → Batch API (50% discount)         │
  └─────────────────────────────────────────────────────┘

I reduced my Claude API bill by putting the static system prompt first in the messages array with cache_control: { type: 'ephemeral' }, then appending the dynamic conversation history after. The cache TTL is 5 minutes, so as long as users are actively chatting, cache hits are near 100%.

Model Routing: Use the Right Model for the Right Task

The most impactful optimization I made was routing different request types to different models. Before routing: everything went to Claude 3.5 Sonnet. After routing: tier 1 tasks go to Claude 3.5 Haiku, tier 2 stays on Sonnet, tier 3 (under 5% of requests) goes to Opus. Average cost per request dropped from $0.0043 to $0.0018 — a 58% reduction.

Conversation History Compression

Every multi-turn conversation accumulates history. Compression strategy: after every 5 turns, call a cheap model with the prompt 'Summarize the key facts and decisions from this conversation in under 200 tokens.' Replace the raw history with this summary. This alone reduced my average conversation input tokens by 45%.

// Anthropic prompt caching — TypeScript
import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic()

const SYSTEM_PROMPT = `You are a fitness AI assistant for AI Gymbro...
[... 600+ tokens of static instructions ...]`

async function chat(userId: string, history: Message[], userMessage: string) {
  const response = await client.messages.create({
    model: "claude-3-5-haiku-20241022",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }, // Cache this!
      },
    ],
    messages: [
      ...history,  // dynamic — not cached
      { role: "user", content: userMessage },
    ],
  })
  return response.content[0].text
}

// Model routing by complexity
function selectModel(taskType: "classify" | "plan" | "analyze") {
  const models = {
    classify: "claude-3-5-haiku-20241022",   // $0.80/M
    plan:     "claude-3-5-sonnet-20241022",  // $3.00/M
    analyze:  "claude-opus-4-5",             // $15.00/M
  }
  return models[taskType]
}

Batching and Async Processing

Not every LLM call needs to happen in real time. By moving weekly summaries to async batch jobs processed every 15 minutes, I could use Anthropic's batch API at 50% cost reduction. For my weekly summaries, this alone saved 15% of monthly API costs.

Routing simple queries to a cheap model sounds obvious, but the definition of 'simple' drifts. I switched all workout logging to Haiku and noticed within two weeks that it was hallucinating exercise names. Always A/B test quality metrics (not just cost) when routing to a cheaper model. Set up a small human evaluation pipeline before committing to model changes.

Structured Outputs and Output Token Reduction

Forcing structured output via JSON mode or tool use eliminates the conversational wrapper. A response that used to be 'Sure! Here's your workout plan: {...json...}' becomes just the JSON. For my structured outputs, this reduced output tokens by 30-40% per request.

My Full Optimization Stack

The complete stack I'm running now: prompt caching on all static system prompts (saves 60-70% on cached input tokens), model routing by task complexity (saves 40-50% on model cost), conversation compression after 5 turns (saves 40% on history tokens), batch processing for async tasks (saves 50% on batch-eligible calls), and structured outputs where applicable (saves 30% on output tokens). Combined effect: 60% overall cost reduction.

Frequently Asked Questions

How I Cut My LLM API Bill by 60% Without Touching Model Quality

Frequently Asked Questions

How I Cut My LLM API Bill by 60% Without Touching Model Quality

Understanding LLM Pricing: Where Your Money Actually Goes

Token Counting Before You Optimize

Prompt Caching: The Single Biggest Win

Model Routing: Use the Right Model for the Right Task

Conversation History Compression

Batching and Async Processing

Structured Outputs and Output Token Reduction

My Full Optimization Stack

Sources & Further Reading

Related Articles

Understanding LLM Pricing: Where Your Money Actually Goes

Token Counting Before You Optimize

Prompt Caching: The Single Biggest Win

Model Routing: Use the Right Model for the Right Task

Conversation History Compression

Batching and Async Processing

Structured Outputs and Output Token Reduction

My Full Optimization Stack

Sources & Further Reading

Related Articles