Why isn't traditional observability (latency, error rate, throughput) enough for LLM applications?

LLM applications can fail in qualitative ways that standard metrics miss entirely: an API call returns 200 OK, but the response is factually wrong, off-topic, or ignores a critical instruction. To catch these failures you also need prompt and response logging, token consumption tracking, quality signals like user feedback and hallucination detection, and anomaly detection — none of which are covered by traditional infrastructure metrics.

What is Langfuse and why does the post recommend self-hosting it?

Langfuse is an open-source LLM observability platform that provides trace logging, automatic token cost calculation, session grouping, evaluation workflows, and dataset management. The post recommends self-hosting it on Docker Compose backed by PostgreSQL because it gives full control over data retention and privacy, which matters when traces contain user personal data. The hosted tier is free up to 50K observations per month for smaller projects.

Why should cost alerting be set up from day one, and what level does the post suggest?

Token costs can spike dramatically from a single bug: in one incident, a retry loop consumed 400K tokens in just 10 minutes. The post recommends creating a daily cost alert in Langfuse at 120% of expected daily spend so spikes are caught before they become multi-hundred-dollar incidents, rather than discovered on a surprise bill.

How should LLM call costs be attributed, and what pitfall does the post highlight?

The post recommends attributing costs at the feature level (e.g., invoice_summary, chat_support, document_extraction) rather than only at the user level, so you can identify which features are driving spend or are inefficient. In one audit, a document_extraction feature was using 3x more tokens than necessary because the prompt included an entire document when only the first two pages were relevant — the code was functionally correct but needlessly expensive.

What privacy obligations apply to LLM trace logs, and what does the post recommend?

LLM trace logs contain the full text of user inputs and model outputs, making them personal data subject to GDPR, Indonesia's UU PDP, and other privacy regulations whenever the application processes names, contact details, or financial records. The post recommends implementing PII redaction before logging — using regex or a dedicated PII detection model to scrub emails, phone numbers, identity numbers, and financial data — and enforcing appropriate retention limits and access controls on stored traces.

AI Observability: How to Log, Trace, and Monitor LLM Applications

Shipping an LLM integration without observability is like deploying a web app without logs: you will discover problems from user complaints, not from your monitoring. The challenge is that LLM applications fail in qualitative ways that traditional metrics miss: the API call succeeds with 200 OK, but the response is factually wrong, off-topic, or ignores a critical instruction. Detecting these failures requires logging the full prompt and response, tracking quality metrics, and building anomaly detection on top of the telemetry. After running production LLM integrations for 18 months, this is the observability stack I have standardized on.

What to Observe in LLM Applications

Traditional observability (latency, error rate, throughput) is necessary but not sufficient for LLM apps. You also need: prompt/response logging (the full text, not just metadata), token consumption per call and per user, model version tracking (which model and which version generated each response), quality signals (user feedback, task completion rates, hallucination detection), cost attribution (which features or users are driving API spend), and anomaly detection for injection attempts, unusual output lengths, or format violations.

Langfuse: The LLM Observability Layer

Langfuse is an open-source LLM observability platform that I self-host on our DigitalOcean infrastructure. It provides: trace logging (full prompt, response, model, latency per call), cost tracking with automatic token price calculation, session grouping (linking multiple LLM calls in one user interaction), evaluation workflows (human or automated quality scoring), and dataset management for testing. The hosted version is free up to 50K observations/month — enough for most projects. Self-hosted runs with Docker Compose and connects to a PostgreSQL database.

┌─────────────────────────────────────────────────────────────┐
│              LLM Observability Stack                         │
│                                                             │
│  Application                                                │
│  ┌─────────────────────────────────────────────┐           │
│  │  NestJS API  →  LLM Service  →  Langfuse    │           │
│  │                    │            SDK          │           │
│  │                    ▼                         │           │
│  │             Anthropic / OpenAI               │           │
│  └─────────────────────────────────────────────┘           │
│                    │                                        │
│                    ▼                                        │
│  ┌──────────────────────────────────────────────┐          │
│  │  Langfuse (self-hosted)                      │          │
│  │  - Traces: prompt + response + latency       │          │
│  │  - Costs: tokens × price per model           │          │
│  │  - Sessions: linked conversation turns       │          │
│  │  - Evaluations: quality scores               │          │
│  └──────────────────────────────────────────────┘          │
│                    │                                        │
│                    ▼                                        │
│  ┌──────────────────────────────────────────────┐          │
│  │  Grafana Dashboard + PagerDuty Alerts        │          │
│  │  - Cost per feature per day                  │          │
│  │  - Latency p50/p95/p99                       │          │
│  │  - Error rate by model                       │          │
│  └──────────────────────────────────────────────┘          │
└─────────────────────────────────────────────────────────────┘

From my experience monitoring LLM integrations: set up cost alerting from day one, not after you get a surprise bill. In Langfuse, create a daily cost alert at 120% of your expected daily spend. Token costs can spike dramatically if a bug causes the system to send massive prompts or enter an LLM call loop. I had one incident where a retry loop consumed 400K tokens in 10 minutes. The alert caught it before it became a multi-hundred-dollar incident.
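The 120% rule can be written down as a small check. This is a minimal sketch and not a Langfuse API: `checkDailyCost`, `spendTodayUsd`, and `expectedDailyUsd` are hypothetical names, and how you obtain today's spend depends on whatever cost data your tracing backend exposes.

```typescript
// Hypothetical helper: decides whether today's LLM spend warrants an alert.
// The default multiplier of 1.2 mirrors the 120% threshold described above.
interface CostAlertResult {
  alert: boolean
  thresholdUsd: number
  ratio: number // today's spend divided by the expected daily spend
}

function checkDailyCost(
  spendTodayUsd: number,
  expectedDailyUsd: number,
  thresholdMultiplier: number = 1.2,
): CostAlertResult {
  if (expectedDailyUsd <= 0) {
    throw new Error("expectedDailyUsd must be positive")
  }
  const thresholdUsd = expectedDailyUsd * thresholdMultiplier
  return {
    alert: spendTodayUsd > thresholdUsd,
    thresholdUsd,
    ratio: spendTodayUsd / expectedDailyUsd,
  }
}
```

Running a check like this on a short interval, rather than once at the end of the day, is what lets a runaway loop surface within minutes.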

Implementing Structured LLM Tracing

The key is to instrument every LLM call with consistent metadata: user ID, session ID, feature name, model parameters, and the full prompt/response. Use a tracing library that handles serialization and batching to avoid adding latency to your application. Both the Langfuse SDK and OpenTelemetry-based approaches work well. The important thing is consistency — every call must have the same metadata structure so you can slice and dice by user, feature, or time range in your observability dashboard.
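One way to enforce that consistency is to build the metadata through a single function that rejects incomplete input. The sketch below is illustrative only: `LLMCallMetadata` and `buildCallMetadata` are hypothetical names, and the set of required fields is an assumption based on the list above.

```typescript
// Hypothetical shape for the metadata attached to every LLM call.
interface LLMCallMetadata {
  userId: string
  sessionId: string
  feature: string
  model: string
  temperature?: number
  maxTokens?: number
}

const REQUIRED_FIELDS = ["userId", "sessionId", "feature", "model"] as const

// Returns a complete metadata object or throws, so a call can never be
// traced with a missing or blank user, session, feature, or model.
function buildCallMetadata(input: Partial<LLMCallMetadata>): LLMCallMetadata {
  const missing = REQUIRED_FIELDS.filter((key) => {
    const value = input[key]
    return typeof value !== "string" || value.trim() === ""
  })
  if (missing.length > 0) {
    throw new Error("Missing trace metadata: " + missing.join(", "))
  }
  return {
    userId: input.userId as string,
    sessionId: input.sessionId as string,
    feature: input.feature as string,
    model: input.model as string,
    temperature: input.temperature,
    maxTokens: input.maxTokens,
  }
}
```

Failing loudly at the call site is a design choice: a trace with no feature tag is hard to attribute later, so it is cheaper to catch the omission during development.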

Token Cost Attribution Pattern

Implement cost attribution at the feature level, not just the user level. Tag each LLM call with the feature that triggered it (e.g., invoice_summary, chat_support, document_extraction). This lets you identify which features are driving cost and which are inefficient. In one audit, I found that our document_extraction feature was using 3x more tokens than necessary because the prompt included the full document when only the first 2 pages were relevant. The code was correct, just expensive.

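import { randomUUID } from "node:crypto"
import Anthropic from "@anthropic-ai/sdk"
import { Langfuse } from "langfuse"

// The Anthropic client reads ANTHROPIC_API_KEY from the environment by default
const anthropic = new Anthropic()
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_HOST, // your self-hosted instance
})

// Single constant so the traced model always matches the model actually called
const MODEL = "claude-opus-4-5"

async function callLLM(params: {
  userId: string
  sessionId: string
  feature: string
  systemPrompt: string
  userMessage: string
}) {
  // One trace per call; userId, sessionId, and feature make it filterable later
  const trace = langfuse.trace({
    id: randomUUID(),
    userId: params.userId,
    sessionId: params.sessionId,
    tags: [params.feature],
    metadata: { feature: params.feature },
  })

  // Record the exact prompt before the request is sent
  const generation = trace.generation({
    name: "claude-call",
    model: MODEL,
    input: [
      { role: "system", content: params.systemPrompt },
      { role: "user", content: params.userMessage },
    ],
    startTime: new Date(),
  })

  try {
    const response = await anthropic.messages.create({
      model: MODEL,
      max_tokens: 1024,
      system: params.systemPrompt,
      messages: [{ role: "user", content: params.userMessage }],
    })

    // Attach the output and token usage so cost can be calculated per call
    generation.end({
      output: response.content,
      usage: {
        input: response.usage.input_tokens,
        output: response.usage.output_tokens,
      },
      endTime: new Date(),
    })

    // Guard against an empty content array before reading the first block
    const first = response.content[0]
    return first?.type === "text" ? first.text : ""
  } catch (err) {
    // Mark the generation as failed so the error shows up in the trace
    generation.end({ level: "ERROR", statusMessage: String(err) })
    throw err
  } finally {
    // Flushing per call guarantees delivery (useful in short-lived processes)
    // but waits on a network round trip; a long-running server can instead
    // rely on the SDK's background batching and flush once on shutdown
    await langfuse.flushAsync()
  }
}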
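Once every call carries a feature tag, rolling usage up into cost per feature is a simple aggregation. The following is a sketch under stated assumptions: `UsageRecord` and `costByFeature` are hypothetical names, and the prices in `PRICE_PER_MILLION` are placeholders for illustration, not real vendor pricing.

```typescript
// Hypothetical record of one traced LLM call.
interface UsageRecord {
  feature: string
  model: string
  inputTokens: number
  outputTokens: number
}

interface ModelPrice {
  input: number // USD per million input tokens
  output: number // USD per million output tokens
}

// Placeholder prices for illustration only; substitute your provider's rates.
const PRICE_PER_MILLION: Record<string, ModelPrice> = {
  "example-model": { input: 3, output: 15 },
}

const TOKENS_PER_MILLION = 1000000

// Sums the USD cost of all records, grouped by feature name.
function costByFeature(
  records: UsageRecord[],
  prices: Record<string, ModelPrice> = PRICE_PER_MILLION,
): Record<string, number> {
  const totals: Record<string, number> = {}
  for (const r of records) {
    const price = prices[r.model]
    if (!price) {
      throw new Error("No price configured for model " + r.model)
    }
    const cost =
      (r.inputTokens * price.input + r.outputTokens * price.output) /
      TOKENS_PER_MILLION
    totals[r.feature] = (totals[r.feature] ?? 0) + cost
  }
  return totals
}
```

Sorting the result by cost is usually enough to spot a feature like the document_extraction case, where spend is far out of proportion to the value delivered.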

Quality Metrics and Evaluation

Binary success/failure metrics do not capture LLM output quality. Build a quality evaluation pipeline: (1) User feedback collection — thumbs up/down on responses, surfaced in the UI. (2) Automated format checks — verify that structured outputs match expected schemas. (3) Faithfulness checks for RAG — verify that responses only reference information from the retrieved context. (4) Latency percentiles — track p50, p95, p99 separately; p99 latency often reveals timeout issues not visible in average metrics.
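To make the percentile point concrete, here is a minimal sketch that computes p50, p95, and p99 from a list of latency samples using the nearest-rank method. The function names are hypothetical, and in production these figures would normally come from your metrics system rather than hand-rolled code.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile requires at least one sample")
  }
  if (p <= 0 || p > 100) {
    throw new Error("p must be in the range (0, 100]")
  }
  const sorted = samples.slice().sort((a, b) => a - b)
  const rank = Math.ceil((p * sorted.length) / 100)
  return sorted[rank - 1]
}

// Convenience wrapper for the three percentiles named above.
function latencySummary(samplesMs: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(samplesMs, 50),
    p95: percentile(samplesMs, 95),
    p99: percentile(samplesMs, 99),
  }
}
```

For the samples `[120, 80, 200, 95, 3000, 110, 105, 90, 130, 100]` the mean is 403 ms, which describes no real request, while p50 is 105 ms and p95 is 3000 ms, which exposes the single slow call.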

LLM trace logs contain the full text of user inputs and model outputs. If your application processes personal data (names, contact details, financial records), your trace logs are personal data subject to GDPR, Indonesia's UU PDP, or other privacy regulations. Implement PII redaction before logging: use a regex or a dedicated PII detection model to scrub emails, phone numbers, identity numbers, and financial data from traces. Store traces with appropriate retention limits and access controls — not every developer needs access to production traces.
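A regex-based redaction pass might look like the sketch below. It is deliberately simple and makes several assumptions: the rule set and the name `redactPII` are hypothetical, the patterns are loose approximations rather than validated detectors, and the phone pattern will also match other long digit runs such as dates. Over-redaction is the safer failure mode here, but a dedicated PII detection model would be more precise.

```typescript
// Illustrative rules, applied in order. Emails and 16-digit identifiers run
// first so the looser phone pattern does not consume them.
const REDACTION_RULES: Array<{ label: string; pattern: RegExp }> = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  // Exactly 16 consecutive digits, e.g. some identity and card numbers
  { label: "ID_NUMBER", pattern: /\b\d{16}\b/g },
  // Optional "+", then digits separated by spaces or dashes (9 to 20 characters)
  { label: "PHONE", pattern: /\+?\d[\d\s-]{7,18}\d/g },
]

// Replaces every match with a bracketed label such as [EMAIL].
function redactPII(text: string): string {
  return REDACTION_RULES.reduce(
    (acc, rule) => acc.replace(rule.pattern, "[" + rule.label + "]"),
    text,
  )
}
```

Apply the function to both the prompt and the model output before they are handed to the tracing SDK, so the unredacted text never leaves the application process.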

Anomaly Detection for LLM Systems

Build anomaly detectors specific to LLM failure modes: output length distribution (sudden change in average output length may indicate prompt injection or model behavior change), tool call frequency (spike in tool calls per session suggests a loop or injection), error rate by model (model version changes can cause quality regressions), and user-specific anomalies (one user generating 10x more tokens than average may indicate abuse or a bot). Alert on these in your observability stack alongside standard infrastructure metrics.
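Two of these detectors can be sketched with basic statistics. The code below is an illustration under assumptions, not a production detector: the function names and thresholds are hypothetical, the output-length check uses a plain z-score against a recent baseline, and the heavy-user check compares against the median rather than the mean so that one extreme user does not inflate its own baseline.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length
}

function stdDev(xs: number[]): number {
  const m = mean(xs)
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))))
}

function median(xs: number[]): number {
  const sorted = xs.slice().sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

// Flags an output whose length is more than zThreshold standard deviations
// away from the mean of recent output lengths.
function isOutputLengthAnomalous(
  baselineLengths: number[],
  observedLength: number,
  zThreshold: number = 3,
): boolean {
  if (baselineLengths.length < 2) return false // not enough history to judge
  const sd = stdDev(baselineLengths)
  if (sd === 0) return observedLength !== baselineLengths[0]
  return Math.abs(observedLength - mean(baselineLengths)) / sd > zThreshold
}

// Returns the users whose token total exceeds `multiplier` times the median.
function findHeavyUsers(
  tokensByUser: Record<string, number>,
  multiplier: number = 10,
): string[] {
  const users = Object.keys(tokensByUser)
  if (users.length === 0) return []
  const typical = median(users.map((u) => tokensByUser[u]))
  return users.filter((u) => tokensByUser[u] > multiplier * typical)
}
```

Both thresholds are starting points; tune them against your own traffic before wiring them to a pager.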

My Production Observability Setup

The full stack I run: Langfuse (self-hosted, Docker Compose, PostgreSQL) for LLM traces and cost tracking. Custom Prometheus metrics for infrastructure (API latency, queue depth, error rates). Grafana dashboards combining Prometheus and Langfuse data via API. PagerDuty alerts for cost spikes, error rate increases, and latency p99 breaches. Retention: 90 days for full traces, aggregated metrics kept indefinitely. Monthly review: cost by feature, quality score trends, top error categories. This setup catches every significant LLM issue within minutes and provides the data needed for continuous prompt optimization.
