Why should I start with auto-instrumentation rather than writing custom spans from the beginning?

Auto-instrumentation via @opentelemetry/auto-instrumentations-node gives you immediate visibility into HTTP calls, database queries, and Redis operations without writing a single line of span code. Starting there lets you understand the actual trace topology of your service-to-service interactions before deciding where custom spans add real value. Adding custom spans prematurely — before you know the topology — risks creating noise and unnecessary storage cost.

How does context propagation work when Service A calls Service B?

When Service A makes an outbound HTTP call to Service B, OpenTelemetry automatically injects the current trace ID and span ID into the outgoing headers using the W3C Trace Context standard. Service B extracts those headers and creates its own child spans under the same trace, so the full request journey appears as a single connected tree in your tracing backend. For Kafka and RabbitMQ this propagation is not automatic — you must manually carry context via message headers using OpenTelemetry messaging semantic conventions.

What sampling strategy balances cost and signal quality on high-traffic systems?

The post recommends combining head-based sampling (sample roughly 10% of normal traffic) with tail-based sampling applied at the OpenTelemetry Collector level. Tail sampling ensures that traces containing errors or latency above a threshold are always retained, regardless of the head-based decision. Without tail sampling, the traces you need most during an incident are dropped at the same rate as routine traffic.

When should I choose Grafana Tempo over Jaeger for self-hosted tracing?

For new self-hosted deployments the post recommends Grafana Tempo because it stores traces in cheap object storage (S3 or GCS) rather than Jaeger's Cassandra or Elasticsearch, which significantly lowers storage cost at scale. Tempo also integrates natively with Grafana dashboards, giving a unified observability view if you already run Grafana. The OpenTelemetry Collector makes switching between any backends — including Jaeger, Tempo, Honeycomb, or Datadog — a configuration change with no application code changes required.

How does distributed tracing speed up incident investigation?

Instead of correlating logs across 5–10 individual services manually, you open the relevant operation in your tracing backend and the waterfall view shows exactly which service was called, in what order, how long each span took, and where errors occurred. The post describes this workflow as turning a 30-minute incident investigation into a 2-minute one. Setting up P99 latency, error rate, and trace count alerts per service lets you reach the right trace before users file a ticket.

Distributed Tracing with OpenTelemetry: A Practical NestJS Implementation

OpenTelemetry is now the second-highest-velocity CNCF project with over 24,000 contributors and nearly 50% adoption among cloud-native companies. It graduated to CNCF's top maturity level in 2026, cementing its position as the standard observability framework. The project saw a 45% year-over-year increase in code commits in 2024 and a 100% increase in search volume — it's clear the industry is converging on OpenTelemetry as the single observability layer. For NestJS developers building microservices, this means you have one standardized way to instrument your applications and send telemetry to any backend: Jaeger, Grafana Tempo, Honeycomb, Datadog, or your own collector.

What Distributed Tracing Solves

In a microservices system, a single user request might touch 5-10 services before returning a response. When that request fails or is slow, logs from individual services show what happened inside each box — but not how the services interacted. Distributed tracing captures the full journey: which service was called, in what order, how long each call took, and where errors occurred. A trace is a tree of spans — each span represents a unit of work (an HTTP call, a database query, a message queue consume). OpenTelemetry standardizes how spans are created, annotated, and propagated across service boundaries.
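To make the "tree of spans" idea concrete, here is a minimal sketch of how a backend could rebuild a waterfall from a flat list of spans. The `SpanRecord` shape, the helper names, and the span IDs and start offsets are assumptions made for illustration; the span names and most durations are borrowed from the architecture diagram in this post. Real OpenTelemetry spans carry many more fields.

```typescript
// Minimal span record: only the fields needed to rebuild the tree.
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  name: string;
  startMs: number; // offset from the start of the trace
  durationMs: number;
}

interface SpanNode extends SpanRecord {
  children: SpanNode[];
}

// Link every span to its parent and return the root spans.
function buildSpanTree(spans: SpanRecord[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const span of spans) {
    nodes.set(span.spanId, { ...span, children: [] });
  }
  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      roots.push(node); // no known parent in this batch: treat as a root
    }
  });
  return roots;
}

// Render the tree as indented waterfall lines, siblings ordered by start time.
function renderWaterfall(nodes: SpanNode[], depth = 0): string[] {
  const lines: string[] = [];
  const ordered = [...nodes].sort((a, b) => a.startMs - b.startMs);
  for (const node of ordered) {
    lines.push(`${"  ".repeat(depth)}${node.name} [${node.durationMs}ms]`);
    lines.push(...renderWaterfall(node.children, depth + 1));
  }
  return lines;
}

// Hypothetical spans for one POST /invoices request.
const exampleSpans: SpanRecord[] = [
  { traceId: "t1", spanId: "a", name: "gateway.route", startMs: 0, durationMs: 20 },
  { traceId: "t1", spanId: "b", parentSpanId: "a", name: "jwt.verify", startMs: 1, durationMs: 2 },
  { traceId: "t1", spanId: "c", parentSpanId: "a", name: "invoice.create", startMs: 5, durationMs: 12 },
  { traceId: "t1", spanId: "d", parentSpanId: "c", name: "db.postgresql", startMs: 6, durationMs: 8 },
  { traceId: "t1", spanId: "e", parentSpanId: "c", name: "kafka.send", startMs: 14, durationMs: 3 },
];

const waterfall = renderWaterfall(buildSpanTree(exampleSpans));
```

Here `waterfall` holds five lines: `gateway.route` at the top, `jwt.verify` and `invoice.create` indented beneath it, and `db.postgresql` and `kafka.send` indented under `invoice.create`. The parent span ID is the only link needed to reconstruct the whole journey.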

The Three Pillars and OpenTelemetry's Role

Observability has three pillars: metrics (are things running?), logs (what happened?), and traces (why is it slow or broken?). OpenTelemetry provides a unified SDK for all three. The OpenTelemetry Collector is a vendor-agnostic pipeline: it receives telemetry from your apps, processes and transforms it, and exports it to any backend. This means you can switch from Jaeger to Grafana Tempo without changing a single line of application code. You only reconfigure the Collector.

OpenTelemetry Architecture in NestJS Microservices

  ┌─────────────────────────────────────────────────────────┐
  │  User Request: POST /invoices                           │
  │  trace_id: 4bf92f3577b34da6 (W3C Trace Context header) │
  └──────────────────────────┬──────────────────────────────┘
                             │
  ┌──────────────────────────▼──────────────────────────────┐
  │  API Gateway (span: gateway.route)          [0ms]       │
  │  ├── Auth validation (span: jwt.verify)     [2ms]       │
  │  └── Route to InvoiceService               [5ms]       │
  └──────────────────────────┬──────────────────────────────┘
                             │ gRPC + trace context
  ┌──────────────────────────▼──────────────────────────────┐
  │  Invoice Service (span: invoice.create)    [12ms]       │
  │  ├── DB insert (span: db.postgresql)        [8ms]       │
  │  ├── Publish Kafka event (span: kafka.send) [3ms]       │
  │  └── Return response                                    │
  └─────────────────────────────────────────────────────────┘

  OpenTelemetry Collector Pipeline:
  App (OTLP) → Collector → [batch, filter] → Grafana Tempo
                                           → Prometheus (metrics)
                                           → Loki (logs)

  All backends from ONE instrumentation layer

From instrumenting the Commsult ERP services: start with auto-instrumentation only — the @opentelemetry/auto-instrumentations-node package instruments HTTP, database, and Redis calls without writing a single span. Once you understand the trace topology, add custom spans for your business-critical paths (e.g., the invoice generation flow, the multi-level approval chain). Don't instrument everything — too many spans create noise and storage cost. Instrument the paths where you need to answer 'why is this slow?' most often.

Setting Up OpenTelemetry in NestJS

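OpenTelemetry setup in NestJS requires initializing the SDK before the application bootstraps. Create a tracing.ts file that configures the OTLP exporter, sets the service name, and registers the Node.js auto-instrumentations. Import this file as the first line in your main.ts, before the NestJS bootstrap call, so the instrumentations can patch modules such as http and pg before anything else loads them. The auto-instrumentation package wraps HTTP calls (via http/https), database queries made through supported drivers such as pg (which also covers TypeORM when it runs on that driver), and Redis operations, giving you traces for these without decorator code. Prisma is the exception: it is traced through its own @prisma/instrumentation package rather than through the auto-instrumentations bundle.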

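// tracing.ts: initialize BEFORE main.ts bootstraps the application
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
// `new Resource(...)` and SEMRESATTRS_SERVICE_NAME match the 1.x SDK packages;
// the 2.x packages expose resourceFromAttributes() and ATTR_SERVICE_NAME instead.
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({ [SEMRESATTRS_SERVICE_NAME]: 'invoice-service' }),
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },      // PostgreSQL
      '@opentelemetry/instrumentation-ioredis': { enabled: true }, // Redis
    }),
  ],
});
sdk.start();

// Flush buffered spans before the process exits.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});

// main.ts: tracing.ts must be imported FIRST
import './tracing'; // before any other import, so libraries are patched before they load
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module'; // assumed: the standard NestJS root module

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3000); // port is an assumption
}
bootstrap();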
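// invoice.service.ts: custom span for business logic
import { Injectable } from '@nestjs/common';
import { trace, SpanStatusCode } from '@opentelemetry/api';
// CreateInvoiceDto and the injected `repo` belong to the application and are not shown here.

@Injectable()
export class InvoiceService {
  async createInvoice(dto: CreateInvoiceDto) {
    const tracer = trace.getTracer('invoice-service');
    // startActiveSpan makes this span the parent of the auto-instrumented DB span below.
    return tracer.startActiveSpan('invoice.create', async (span) => {
      span.setAttribute('invoice.tenant_id', dto.tenantId);
      span.setAttribute('invoice.amount', dto.amount);
      try {
        const invoice = await this.repo.save(dto);
        span.setStatus({ code: SpanStatusCode.OK });
        return invoice;
      } catch (err) {
        // err is `unknown` under strict TypeScript, so narrow it before reading .message.
        const error = err instanceof Error ? err : new Error(String(err));
        span.recordException(error);
        span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
        throw err;
      } finally {
        span.end(); // always end the span, or it is never exported
      }
    });
  }
}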

Context Propagation Across Services

The power of distributed tracing comes from context propagation — the trace ID and span ID flowing from service to service via HTTP headers (W3C Trace Context standard). When Service A calls Service B over HTTP, it injects the current span context into the outgoing headers. Service B extracts the context and creates child spans under the same trace. OpenTelemetry's W3C Trace Context propagator does this automatically for HTTP calls when auto-instrumented. For Kafka and RabbitMQ, you need to manually propagate context via message headers — use the OpenTelemetry messaging semantic conventions.
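The following sketch shows what "propagating context via message headers" amounts to on the wire. It hand-rolls the W3C `traceparent` header (`version-traceid-parentid-flags`, with a 32-hex-character trace ID and a 16-hex-character span ID) purely to illustrate the mechanism. The `TraceContext` type and the function names are assumptions for this example, the sketch accepts only version `00`, and it ignores `tracestate`. In a real service you would call `propagation.inject()` and `propagation.extract()` from `@opentelemetry/api` on the message headers instead of writing this yourself.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side: write the current span's identity into the outgoing headers.
function injectContext(ctx: TraceContext, headers: Record<string, string>): void {
  const flags = ctx.sampled ? "01" : "00";
  headers["traceparent"] = `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// Consumer side: read the caller's identity back out, or undefined if absent or malformed.
function extractContext(headers: Record<string, string>): TraceContext | undefined {
  const match = TRACEPARENT_PATTERN.exec(headers["traceparent"] ?? "");
  if (!match) return undefined;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid in W3C Trace Context.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Illustrative ID generator; not cryptographically strong.
function randomSpanId(): string {
  let id = "";
  for (let i = 0; i < 16; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return id;
}

// A child span keeps the trace ID, gets a fresh span ID, and points at its parent.
function startChildSpan(parent: TraceContext): TraceContext & { parentSpanId: string } {
  return {
    traceId: parent.traceId,
    spanId: randomSpanId(),
    parentSpanId: parent.spanId,
    sampled: parent.sampled,
  };
}

// Service A publishes a message; Service B consumes it.
const producerSpan: TraceContext = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
  sampled: true,
};
const messageHeaders: Record<string, string> = {};
injectContext(producerSpan, messageHeaders);

const incoming = extractContext(messageHeaders);
const consumerSpan = incoming ? startChildSpan(incoming) : undefined;
```

After this runs, `messageHeaders.traceparent` is `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`, and `consumerSpan` shares the producer's trace ID while naming the producer's span as its parent. That shared trace ID is what lets the backend draw both services in one tree.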

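Recording 100% of traces on a high-traffic system is expensive, and storage costs multiply quickly. But sampling too aggressively means missing the traces that matter (errors, slow requests). Use head-based sampling as your baseline (sample 10% of normal traffic) combined with tail-based sampling at the Collector level: always keep traces with errors or latency above threshold regardless of the head-based decision. The OpenTelemetry Collector's tail sampling processor handles this. Keep in mind that the Collector can only keep what it receives: a trace dropped by a sampler inside the SDK never reaches the tail sampling processor, so for the error and latency guarantee to hold, services must export those spans and the 10% baseline has to be applied in the Collector. Without tail sampling, your most important traces are dropped at the same rate as routine traffic.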
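The decision logic described above can be sketched in a few lines. This is an illustration of the policy, not the Collector's implementation: the types, function names, and the way the trace ID is mapped to a ratio are assumptions. In a real deployment the same rules are expressed as configuration for the Collector's tail sampling processor, which offers `status_code`, `latency`, and `probabilistic` policy types.

```typescript
interface FinishedSpan {
  durationMs: number;
  isError: boolean;
}

interface FinishedTrace {
  traceId: string; // 32 hex characters
  spans: FinishedSpan[];
}

interface SamplingPolicy {
  baselineRatio: number;      // e.g. 0.1 keeps roughly 10% of ordinary traces
  latencyThresholdMs: number; // traces containing a slower span are always kept
}

// Deterministic ratio decision: the same trace ID always gives the same answer,
// so every component that evaluates it agrees on whether the trace is sampled.
function sampledByRatio(traceId: string, ratio: number): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  return bucket < ratio * 0x100000000;
}

// Tail decision, made once the whole trace is available:
// errors and slow traces are always kept, the rest fall back to the baseline ratio.
function shouldKeepTrace(trace: FinishedTrace, policy: SamplingPolicy): boolean {
  const hasError = trace.spans.some((span) => span.isError);
  const isSlow = trace.spans.some((span) => span.durationMs > policy.latencyThresholdMs);
  return hasError || isSlow || sampledByRatio(trace.traceId, policy.baselineRatio);
}

// Hypothetical policy: 10% baseline, always keep anything slower than 500 ms.
const policy: SamplingPolicy = { baselineRatio: 0.1, latencyThresholdMs: 500 };
```

The point of the structure is the order of the checks: the ratio only decides the fate of traces that are neither failing nor slow, which is why the decision has to be made after the trace has finished rather than when it starts.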

Custom Spans and Semantic Conventions

Auto-instrumentation covers infrastructure calls, but your business logic spans aren't automatically visible. Add custom spans for critical business operations using the OpenTelemetry API. Use OpenTelemetry's semantic conventions for span attributes — these are standardized attribute names (http.method, db.system, messaging.system) that make your traces queryable across tools. Custom attributes should describe business context: invoice.id, approval.level, user.role. These make traces searchable by business entity, not just by technical identifiers.
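One way to keep custom spans consistent is a small wrapper that sets business attributes, records failures, and always ends the span. The sketch below is an assumption about how such a helper could look, not part of OpenTelemetry: `withSpan`, `SpanLike`, and `TracerLike` are hypothetical names, and the two interfaces are local stand-ins shaped like the subset of `@opentelemetry/api` used in this post, so the example stays self-contained. The status constants use the same numeric values as `SpanStatusCode.OK` and `SpanStatusCode.ERROR`.

```typescript
type AttributeValue = string | number | boolean;

// Local stand-ins for the span and tracer methods this helper relies on.
interface SpanLike {
  setAttribute(key: string, value: AttributeValue): unknown;
  setStatus(status: { code: number; message?: string }): unknown;
  recordException(error: Error): void;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

const STATUS_OK = 1;    // numeric value of SpanStatusCode.OK
const STATUS_ERROR = 2; // numeric value of SpanStatusCode.ERROR

// Run `work` inside a span that carries the given business attributes.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  attributes: Record<string, AttributeValue>,
  work: () => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    for (const key of Object.keys(attributes)) {
      span.setAttribute(key, attributes[key]);
    }
    try {
      const result = await work();
      span.setStatus({ code: STATUS_OK });
      return result;
    } catch (err) {
      // Normalize non-Error throwables so the span always gets a message.
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: STATUS_ERROR, message: error.message });
      throw err; // the caller still sees the original failure
    } finally {
      span.end();
    }
  });
}
```

A call could then look like `withSpan(tracer, 'approval.review', { 'invoice.id': id, 'approval.level': level, 'user.role': role }, () => this.review(id))`, where the span name and the `review` method are hypothetical. Keeping attribute names in one place like this makes it easier to search traces by business entity later.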

Choosing a Backend: Jaeger, Grafana Tempo, or Managed

For self-hosted: Grafana Tempo is the better choice over Jaeger for new deployments — Tempo uses object storage (S3, GCS) for trace storage (much cheaper than Jaeger's Cassandra or Elasticsearch) and integrates natively with Grafana dashboards. For managed: Honeycomb, Datadog APM, and Grafana Cloud all accept OTLP. For a small team or a startup, Grafana Cloud's free tier covers moderate trace volumes. For an enterprise with compliance requirements, managed services remove the operational burden but increase cost. The OpenTelemetry Collector makes switching between backends a configuration change.

From Traces to Actionable Insights

Distributed tracing is only valuable if you use it. Set up three alerting queries: P99 latency by service, error rate by service, and trace count by operation — alert when any crosses a threshold. Create a dashboard showing your critical user journeys as waterfall views — the invoice creation flow, the approval workflow, the report generation. When an incident occurs, start with the relevant operation in your tracing backend, find the slowest or erroring span, and drill down to the service responsible. This workflow turns a 30-minute incident investigation into a 2-minute one.
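To show what those alerting queries compute, here is a small sketch that derives P99 latency, error rate, and span count per service from a batch of spans, and picks the span to look at first during an incident. In practice your tracing backend or metrics store runs these queries for you. The `ObservedSpan` shape, the function names, and the nearest-rank percentile method are assumptions chosen for illustration.

```typescript
interface ObservedSpan {
  service: string;
  operation: string;
  durationMs: number;
  isError: boolean;
}

interface ServiceHealth {
  p99Ms: number;
  errorRate: number; // 0..1
  spanCount: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Group spans by service and compute the three alerting signals.
function summarizeByService(spans: ObservedSpan[]): Map<string, ServiceHealth> {
  const grouped = new Map<string, ObservedSpan[]>();
  for (const span of spans) {
    const bucket = grouped.get(span.service) ?? [];
    bucket.push(span);
    grouped.set(span.service, bucket);
  }
  const summary = new Map<string, ServiceHealth>();
  grouped.forEach((group, service) => {
    summary.set(service, {
      p99Ms: percentile(group.map((span) => span.durationMs), 99),
      errorRate: group.filter((span) => span.isError).length / group.length,
      spanCount: group.length,
    });
  });
  return summary;
}

// Incident triage: prefer an erroring span, otherwise take the slowest one.
function findSuspectSpan(spans: ObservedSpan[]): ObservedSpan | undefined {
  const errors = spans.filter((span) => span.isError);
  const pool = errors.length > 0 ? errors : spans;
  return pool.reduce<ObservedSpan | undefined>(
    (worst, span) => (worst === undefined || span.durationMs > worst.durationMs ? span : worst),
    undefined,
  );
}
```

An alert rule is then a comparison against a threshold you choose, for example firing when a service's `p99Ms` or `errorRate` exceeds its limit. `findSuspectSpan` mirrors the manual workflow: open the trace, jump to the erroring or slowest span, and start from the service that owns it.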

Sources & Further Reading

CNCF — OpenTelemetry Project Journey Report — https://www.cncf.io/reports/opentelemetry-project-journey-report/
OpenTelemetry Docs — Getting Started with Node.js — https://opentelemetry.io/docs/languages/js/getting-started/nodejs/
Grafana Labs — OpenTelemetry and Grafana Labs: What's New in 2025 — https://grafana.com/blog/opentelemetry-and-grafana-labs-whats-new-and-whats-next-in-2025/
