Shipping an LLM integration without observability is like deploying a web app without logs: you will discover problems from user complaints, not from your monitoring. The challenge is that LLM applications fail in qualitative ways that traditional metrics miss: the API call succeeds with 200 OK, but the response is factually wrong, off-topic, or ignores a critical instruction. Detecting these failures requires logging the full prompt and response, tracking quality metrics, and building anomaly detection on top of the telemetry. After running production LLM integrations for 18 months, this is the observability stack I have standardized on.
Traditional observability (latency, error rate, throughput) is necessary but not sufficient for LLM apps. You also need: prompt/response logging (the full text, not just metadata), token consumption per call and per user, model version tracking (which model and which version generated each response), quality signals (user feedback, task completion rates, hallucination detection), cost attribution (which features or users are driving API spend), and anomaly detection for injection attempts, unusual output lengths, or format violations.
Langfuse is an open-source LLM observability platform that I self-host on our DigitalOcean infrastructure. It provides: trace logging (full prompt, response, model, latency per call), cost tracking with automatic token price calculation, session grouping (linking multiple LLM calls in one user interaction), evaluation workflows (human or automated quality scoring), and dataset management for testing. The hosted version is free up to 50K observations/month — enough for most projects. Self-hosted runs with Docker Compose and connects to a PostgreSQL database.
┌─────────────────────────────────────────────────────────────┐
│ LLM Observability Stack │
│ │
│ Application │
│ ┌─────────────────────────────────────────────┐ │
│ │ NestJS API → LLM Service → Langfuse │ │
│ │ │ SDK │ │
│ │ ▼ │ │
│ │ Anthropic / OpenAI │ │
│ └─────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Langfuse (self-hosted) │ │
│ │ - Traces: prompt + response + latency │ │
│ │ - Costs: tokens × price per model │ │
│ │ - Sessions: linked conversation turns │ │
│ │ - Evaluations: quality scores │ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Grafana Dashboard + PagerDuty Alerts │ │
│ │ - Cost per feature per day │ │
│ │ - Latency p50/p95/p99 │ │
│ │ - Error rate by model │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
From my experience monitoring LLM integrations: set up cost alerting from day one, not after you get a surprise bill. In Langfuse, create a daily cost alert at 120% of your expected daily spend. Token costs can spike dramatically if a bug causes the system to send massive prompts or enter an LLM call loop. I had one incident where a retry loop caused 400K tokens in 10 minutes; the alert caught it before it became a multi-hundred-dollar incident.
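The 120% threshold described above can also be checked in your own code, independent of any platform feature. The following is a minimal sketch under stated assumptions: the `ModelPrice` and `DailyUsage` shapes, the function names, and the idea of a per-million-token price table are all hypothetical, and any prices you plug in should come from your provider's current price list.

```typescript
// Hypothetical price entry in USD per million tokens; supply real values yourself.
interface ModelPrice {
  inputPerMTok: number
  outputPerMTok: number
}

// Hypothetical aggregate of one model's token usage so far today.
interface DailyUsage {
  model: string
  inputTokens: number
  outputTokens: number
}

// Convert token counts to dollars using the price table.
function estimateCostUsd(
  usage: DailyUsage[],
  prices: Record<string, ModelPrice>,
): number {
  let total = 0
  for (const u of usage) {
    const price = prices[u.model]
    if (!price) throw new Error(`No price configured for model ${u.model}`)
    total += (u.inputTokens / 1_000_000) * price.inputPerMTok
    total += (u.outputTokens / 1_000_000) * price.outputPerMTok
  }
  return total
}

// True when actual spend exceeds expected daily spend by the given factor
// (1.2 corresponds to the 120% threshold).
function shouldAlert(actualUsd: number, expectedDailyUsd: number, factor = 1.2): boolean {
  return actualUsd > expectedDailyUsd * factor
}
```

Running this check on a short interval rather than once per day is what lets a runaway retry loop surface within minutes.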
The key is to instrument every LLM call with consistent metadata: user ID, session ID, feature name, model parameters, and the full prompt/response. Use a tracing library that handles serialization and batching to avoid adding latency to your application. Both the Langfuse SDK and OpenTelemetry-based approaches work well. The important thing is consistency — every call must have the same metadata structure so you can slice and dice by user, feature, or time range in your observability dashboard.
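One way to enforce that consistency is to build the metadata in a single place and reject incomplete records. This is a sketch, not a Langfuse or OpenTelemetry API: the `LlmCallMetadata` type, its field names, and `buildCallMetadata` are assumptions chosen to mirror the fields listed above.

```typescript
// Hypothetical shape for the metadata attached to every LLM call.
interface LlmCallMetadata {
  userId: string
  sessionId: string
  feature: string
  model: string
  temperature: number
  maxTokens: number
}

// Build the metadata in one place and fail loudly on missing fields,
// so no call site can silently omit something you later want to filter by.
function buildCallMetadata(input: Partial<LlmCallMetadata>): LlmCallMetadata {
  const required: (keyof LlmCallMetadata)[] = [
    "userId",
    "sessionId",
    "feature",
    "model",
    "temperature",
    "maxTokens",
  ]
  const missing = required.filter(
    (key) => input[key] === undefined || input[key] === "",
  )
  if (missing.length > 0) {
    throw new Error(`Missing LLM call metadata: ${missing.join(", ")}`)
  }
  return input as LlmCallMetadata
}
```

Throwing at the call site is a deliberate choice here: a trace with a missing `feature` tag is hard to find later, whereas an exception in development is hard to miss.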
Implement cost attribution at the feature level, not just the user level. Tag each LLM call with the feature that triggered it (e.g., invoice_summary, chat_support, document_extraction). This lets you identify which features are driving cost and which are inefficient. In one audit, I found that our document_extraction feature was using 3x more tokens than necessary because the prompt included the full document when only the first 2 pages were relevant. This was not a bug in the usual sense: the code was correct, just expensive.
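Once calls carry a feature tag, the audit described above reduces to a group-by. The sketch below assumes you have exported call records into a hypothetical `CallRecord` shape; the type and function names are illustrative and not part of any SDK.

```typescript
// Hypothetical exported record for one LLM call.
interface CallRecord {
  feature: string
  inputTokens: number
  outputTokens: number
}

interface FeatureTotals {
  calls: number
  inputTokens: number
  outputTokens: number
}

// Sum calls and tokens per feature tag.
function totalsByFeature(records: CallRecord[]): Map<string, FeatureTotals> {
  const totals = new Map<string, FeatureTotals>()
  for (const r of records) {
    const t = totals.get(r.feature) ?? { calls: 0, inputTokens: 0, outputTokens: 0 }
    t.calls += 1
    t.inputTokens += r.inputTokens
    t.outputTokens += r.outputTokens
    totals.set(r.feature, t)
  }
  return totals
}

// Rank features by average input tokens per call, highest first.
// Oversized prompts show up at the top of this list.
function rankByAvgInputTokens(totals: Map<string, FeatureTotals>): [string, number][] {
  return Array.from(totals.entries())
    .map(([feature, t]): [string, number] => [feature, t.inputTokens / t.calls])
    .sort((a, b) => b[1] - a[1])
}
```

Average input tokens per call is the useful signal for this kind of audit, because total spend alone cannot distinguish a popular feature from a wasteful one.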
import { randomUUID } from "node:crypto"
import Anthropic from "@anthropic-ai/sdk"
import { Langfuse } from "langfuse"

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_HOST, // your self-hosted instance
})

// Single source of truth so the logged model always matches the called model.
const MODEL = "claude-opus-4-5"

async function callLLM(params: {
  userId: string
  sessionId: string
  feature: string
  systemPrompt: string
  userMessage: string
}) {
  // One trace per user-facing request, tagged for per-feature cost attribution.
  const trace = langfuse.trace({
    id: randomUUID(),
    userId: params.userId,
    sessionId: params.sessionId,
    tags: [params.feature],
    metadata: { feature: params.feature },
  })

  // The generation records the exact prompt sent to the model.
  const generation = trace.generation({
    name: "claude-call",
    model: MODEL,
    input: [
      { role: "system", content: params.systemPrompt },
      { role: "user", content: params.userMessage },
    ],
    startTime: new Date(),
  })

  try {
    const response = await anthropic.messages.create({
      model: MODEL,
      max_tokens: 1024,
      system: params.systemPrompt,
      messages: [{ role: "user", content: params.userMessage }],
    })

    // Record the output and token usage so Langfuse can compute cost.
    generation.end({
      output: response.content,
      usage: {
        input: response.usage.input_tokens,
        output: response.usage.output_tokens,
      },
      endTime: new Date(),
    })

    return response.content[0].type === "text" ? response.content[0].text : ""
  } catch (err) {
    // Mark the generation as failed so it shows up in error-rate views.
    generation.end({ level: "ERROR", statusMessage: String(err) })
    throw err
  } finally {
    // Flushing on every call guarantees delivery but makes each request wait on it.
    // In a long-running server you can rely on the SDK's background batching
    // and flush once at shutdown instead.
    await langfuse.flushAsync()
  }
}
Binary success/failure metrics do not capture LLM output quality. Build a quality evaluation pipeline: (1) User feedback collection: thumbs up/down on responses, surfaced in the UI. (2) Automated format checks: verify that structured outputs match expected schemas. (3) Faithfulness checks for RAG: verify that responses only reference information from the retrieved context. (4) Latency percentiles: track p50, p95, p99 separately; p99 latency often reveals timeout issues not visible in average metrics.
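Items (2) and (4) above are the easiest to automate, so here is a minimal sketch of each. The `InvoiceSummary` shape is a hypothetical example of a structured output, and the percentile function uses the nearest-rank method, which is one of several common definitions. In a real codebase a schema library would likely replace the hand-written checks.

```typescript
// Hypothetical expected shape for a structured model output.
interface InvoiceSummary {
  vendor: string
  total: number
  currency: string
}

// Format check: parse the raw model output and verify field types.
// Returns null when the output is not valid JSON or does not match the shape.
function parseInvoiceSummary(raw: string): InvoiceSummary | null {
  let value: unknown
  try {
    value = JSON.parse(raw)
  } catch {
    return null
  }
  if (typeof value !== "object" || value === null) return null
  const v = value as Record<string, unknown>
  if (typeof v.vendor !== "string") return null
  if (typeof v.total !== "number" || !Number.isFinite(v.total)) return null
  if (typeof v.currency !== "string") return null
  return { vendor: v.vendor, total: v.total, currency: v.currency }
}

// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples")
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}
```

A null result from the format check can be recorded as a quality score on the trace, which turns schema violations into a trend you can chart rather than a log line you have to search for.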
LLM trace logs contain the full text of user inputs and model outputs. If your application processes personal data (names, contact details, financial records), your trace logs are personal data subject to GDPR, Indonesia's UU PDP, or other privacy regulations. Implement PII redaction before logging: use a regex or a dedicated PII detection model to scrub emails, phone numbers, identity numbers, and financial data from traces. Store traces with appropriate retention limits and access controls — not every developer needs access to production traces.
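A regex-based redaction pass, applied before the text reaches the tracing SDK, might look like the sketch below. The patterns are illustrative assumptions, not a complete PII detector: they cover emails, 16-digit identity numbers (the length used by Indonesian NIK numbers), and phone-like digit runs, and the phone pattern deliberately over-matches, so it can also catch other long digit sequences such as dates.

```typescript
// Illustrative patterns only. Order matters: identity numbers are replaced
// before the broader phone pattern can claim them.
const PII_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "ID_NUMBER", pattern: /\b\d{16}\b/g },
  { label: "PHONE", pattern: /\+?\d[\d\s-]{8,}\d/g },
]

// Replace each match with a bracketed label so traces stay readable
// without carrying the original value.
function redactPii(text: string): string {
  let result = text
  for (const { label, pattern } of PII_PATTERNS) {
    result = result.replace(pattern, `[${label}]`)
  }
  return result
}
```

Applying the function to both the prompt and the response before they are logged keeps raw values out of the trace store entirely, which is simpler to defend in an audit than redacting at read time.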
Build anomaly detectors specific to LLM failure modes: output length distribution (sudden change in average output length may indicate prompt injection or model behavior change), tool call frequency (spike in tool calls per session suggests a loop or injection), error rate by model (model version changes can cause quality regressions), and user-specific anomalies (one user generating 10x more tokens than average may indicate abuse or a bot). Alert on these in your observability stack alongside standard infrastructure metrics.
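Two of these detectors can be sketched with basic statistics. The example below is a simplified illustration, assuming you can pull output lengths and per-user token totals from your trace store; the function names and default thresholds are hypothetical, and a z-score on the window mean is only one of many possible drift tests.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length
}

// Population standard deviation.
function stdDev(xs: number[]): number {
  const m = mean(xs)
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)))
}

// Flags a shift in average output length by comparing the mean of a recent
// window against a baseline, scaled by the standard error of the window mean.
function outputLengthShifted(
  baseline: number[],
  recent: number[],
  zThreshold = 3,
): boolean {
  if (baseline.length < 2 || recent.length === 0) return false
  const sd = stdDev(baseline)
  if (sd === 0) return mean(recent) !== mean(baseline)
  const standardError = sd / Math.sqrt(recent.length)
  const z = Math.abs(mean(recent) - mean(baseline)) / standardError
  return z > zThreshold
}

// Users whose token total exceeds `multiple` times the average across all users.
function heavyUsers(tokensByUser: Map<string, number>, multiple = 10): string[] {
  const values = Array.from(tokensByUser.values())
  if (values.length === 0) return []
  const avg = mean(values)
  return Array.from(tokensByUser.entries())
    .filter(([, tokens]) => tokens > avg * multiple)
    .map(([user]) => user)
}
```

Because the heavy user is included in the average, the 10x rule only fires when there are enough ordinary users to keep the average low; comparing against the median instead is a reasonable variation for small user bases.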
The full stack I run: Langfuse (self-hosted, Docker Compose, PostgreSQL) for LLM traces and cost tracking. Custom Prometheus metrics for infrastructure (API latency, queue depth, error rates). Grafana dashboards combining Prometheus and Langfuse data via API. PagerDuty alerts for cost spikes, error rate increases, and latency p99 breaches. Retention: 90 days for full traces, aggregated metrics kept indefinitely. Monthly review: cost by feature, quality score trends, top error categories. In my experience, this setup surfaces significant LLM issues within minutes and provides the data needed for continuous prompt optimization.