When should I choose Kafka over a simpler queue like BullMQ or pg-boss?

Kafka earns its complexity when multiple independent services need to consume the same event stream, when you need durable event replay, or when you're processing millions of events daily. For systems with fewer than 3 services consuming events or throughput under 10K events per day, BullMQ with Redis handles queuing, retries, and priority with far less operational overhead. The author uses simple database-based queues for ERP workloads like email triggers and invoice generation at Commsult Indonesia.

What is a Dead Letter Queue (DLQ) and why is it critical in a production Kafka setup?

A DLQ is a separate topic where messages are routed when a consumer fails to process them after a bounded number of retries. Without a DLQ, a consumer that keeps failing on the same message never commits past it, so the message is redelivered indefinitely and that partition stops advancing. In NestJS, the error handler should catch failures, publish the message to a DLQ topic (e.g., 'erp.v1.invoice.created.dlq'), and let the original offset commit, which enables alerting, manual inspection, and replay after a bug fix.

How do consumer groups enable horizontal scaling in NestJS Kafka services?

Each partition in a Kafka topic is assigned to exactly one consumer within a group at a time. When you add a NestJS consumer instance, Kafka automatically rebalances partitions across all instances in the group. A key caveat is that during a rebalance under the classic eager protocol, all consumers in the group pause processing while partitions are reassigned. To minimize rebalance impact, use Kafka's static membership feature so that rolling restarts don't trigger unnecessary rebalances, and keep partition count close to your expected consumer count.

How does Schema Registry address schema evolution in event-driven systems?

When a new required field is added to an existing event schema, there is a window where old producers emit events without that field while new consumers expect it. The Confluent Schema Registry (and open-source alternatives like Apicurio) addresses this by enforcing schema versioning and compatibility checks. Enforcing 'BACKWARD' compatibility means a consumer using the new schema can read messages written with the previous schema, so consumers can be updated before producers without a coordinated deployment. With Avro, for example, a newly added field must carry a default value to pass that check.

Event-Driven Architecture with Kafka and NestJS: A Production Guide

How should I design Kafka topic names in production?

Use descriptive, versioned topic names that include the domain, version, entity, and event verb — for example 'erp.v1.invoice.created'. Establishing this convention from the start matters because renaming topics in production requires coordinating all producers and consumers simultaneously, which is a painful migration. This naming scheme also makes adding a v2 schema later straightforward without breaking existing consumers.

Apache Kafka can sustain 2 million writes per second on three commodity machines — a benchmark LinkedIn published that still holds up today. That number tends to make developers reach for Kafka immediately. But I've learned that throughput isn't the question you should be asking first. The first question is: does your problem actually need an event bus? I've built ERP systems at Commsult Indonesia that use simple database-based queues for email triggers and invoice generation — and that's the right call for those workloads. Kafka earns its complexity when you have multiple services that need to react to the same events independently, when you need durable event replay, or when you're processing millions of events daily. This guide covers when Kafka genuinely adds value, and how to implement it correctly with NestJS.

Understanding Kafka's Core Model

Kafka's architecture is built around topics, partitions, and consumer groups. A topic is a category of events, such as 'order.created' or 'invoice.paid'. Topics are divided into partitions, which are the unit of parallelism. Each partition is an ordered, immutable sequence of records that Kafka retains for a configurable period (default 7 days). Consumer groups allow multiple service instances to read from a topic in parallel: each partition is assigned to exactly one consumer in a group at any given time. This means you scale consumers horizontally by adding instances to the group, up to the number of partitions; any consumers beyond that count sit idle.

Designing Topics and Partitions

Topic design is where most teams make mistakes. Too few partitions limit parallelism: with one partition, only one consumer in a group can read from it at a time. Too many partitions increase memory overhead on the broker side and complicate rebalancing. A reasonable starting point: set partition count equal to the peak consumer concurrency you expect. For a payment processing service with 8 consumer instances, start with 8 partitions. Use a meaningful partition key, typically a tenant ID, user ID, or entity ID, so related events land on the same partition and preserve ordering for that entity.
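To make the partition-key point concrete, here is a minimal sketch of key-based partitioning. It assumes a simple FNV-1a hash as a stand-in; Kafka's Java client hashes keys with murmur2, so the partition numbers below will not match a real cluster. The names hashKey, partitionFor, and the sample events are hypothetical.

```typescript
// Simplified stand-in for a key-based partitioner (illustration only).
// FNV-1a is used here for brevity; it is NOT the hash Kafka clients use.
function hashKey(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Same key always maps to the same partition for a fixed partition count.
function partitionFor(key: string, partitionCount: number): number {
  return hashKey(key) % partitionCount;
}

const partitionCount = 8;
const events = [
  { tenantId: 'tenant-a', type: 'invoice.created' },
  { tenantId: 'tenant-b', type: 'invoice.created' },
  { tenantId: 'tenant-a', type: 'invoice.paid' },
];

const placements = events.map((event) => ({
  ...event,
  partition: partitionFor(event.tenantId, partitionCount),
}));

for (const p of placements) {
  console.log(`${p.tenantId} ${p.type} -> P${p.partition}`);
}
```

Both tenant-a events land on the same partition, which is what preserves their relative order. Note that the mapping depends on the partition count, so changing it later moves keys to different partitions.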

Kafka Event-Driven Architecture (NestJS)

  Producers (NestJS Services)
  ┌───────────────────────────┐
  │  InvoiceService           │──── erp.v1.invoice.created ──┐
  │  OrderService             │──── erp.v1.order.placed   ──┤
  │  PaymentService           │──── erp.v1.payment.received─┘
  └───────────────────────────┘
                                        │
                               ┌────────▼────────┐
                               │   Kafka Broker   │
                               │  (3 nodes, HA)   │
                               │                  │
                               │ Topic Partitions │
                               │  P0 | P1 | P2    │
                               └────────┬────────┘
                                        │
        ┌───────────────────────────────┼──────────────────────────┐
        │                               │                          │
  ┌─────▼────────┐               ┌──────▼──────┐          ┌───────▼──────┐
  │ Notification │               │ PDF Report  │          │  Audit Log   │
  │   Consumer   │               │   Consumer  │          │   Consumer   │
  │  (Group: A)  │               │  (Group: B) │          │  (Group: C)  │
  └──────────────┘               └─────────────┘          └──────────────┘

  Dead Letter Queue: erp.v1.invoice.created.dlq
  (Failed messages routed here for inspection & replay)

From my experience building event-driven modules for our ERP at Commsult: use descriptive, versioned topic names like 'erp.v1.invoice.created' from the start. Renaming topics in production requires coordinating all producers and consumers simultaneously — a painful migration. Include the domain, version, entity, and event verb in the name, and you'll thank yourself when you add a v2 schema six months later.
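One way to keep the naming convention from drifting is to build topic names through a small helper instead of typing strings by hand. The sketch below is one possible shape, not an established library: buildTopic, parseTopic, dlqTopic, and the segment rules (lowercase letters, digits, hyphens) are assumptions made for illustration.

```typescript
interface TopicParts {
  domain: string;
  version: number;
  entity: string;
  verb: string;
}

// Assumed segment rule: lowercase, starts with a letter, digits and hyphens allowed.
const SEGMENT = /^[a-z][a-z0-9-]*$/;
const TOPIC = /^([a-z][a-z0-9-]*)\.v(\d+)\.([a-z][a-z0-9-]*)\.([a-z][a-z0-9-]*)$/;

// Builds a name in the form domain.vN.entity.verb, rejecting malformed segments.
function buildTopic({ domain, version, entity, verb }: TopicParts): string {
  for (const segment of [domain, entity, verb]) {
    if (!SEGMENT.test(segment)) {
      throw new Error(`Invalid topic segment: "${segment}"`);
    }
  }
  if (!Number.isInteger(version) || version < 1) {
    throw new Error(`Invalid version: ${version}`);
  }
  return `${domain}.v${version}.${entity}.${verb}`;
}

// Splits a topic name back into its parts, or throws if it breaks the convention.
function parseTopic(topic: string): TopicParts {
  const match = TOPIC.exec(topic);
  if (!match) {
    throw new Error(`Topic does not follow domain.vN.entity.verb: "${topic}"`);
  }
  return { domain: match[1], version: Number(match[2]), entity: match[3], verb: match[4] };
}

// Derives the dead letter topic name used elsewhere in this guide.
function dlqTopic(topic: string): string {
  return `${topic}.dlq`;
}

const created = buildTopic({ domain: 'erp', version: 1, entity: 'invoice', verb: 'created' });
const createdV2 = buildTopic({ ...parseTopic(created), version: 2 });

console.log(created);           // erp.v1.invoice.created
console.log(createdV2);         // erp.v2.invoice.created
console.log(dlqTopic(created)); // erp.v1.invoice.created.dlq
```

Because the version is a separate field, publishing a v2 stream is a one-argument change, and existing consumers of the v1 topic are untouched.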

Implementing Kafka in NestJS with @nestjs/microservices

NestJS has first-class Kafka support through the @nestjs/microservices package. You configure a Kafka microservice transport on your application, then use decorators to define message handlers. The setup requires three pieces: the producer service (for publishing events), the consumer module (with @EventPattern or @MessagePattern decorators), and the Kafka client configuration with proper retry and error handling settings.
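The producer piece can be sketched without a running broker. The example below is an illustration under stated assumptions: EventClient is a hypothetical narrow interface standing in for whatever Kafka client your NestJS service injects, InMemoryClient exists only so the sketch runs on its own, and the tenantId and total fields on the event are invented for the example. The entity ID is used as the message key so that events for one invoice stay on one partition.

```typescript
// Hypothetical narrow interface for the publishing client.
// In a real NestJS service, the injected Kafka client would play this role.
interface EventClient {
  emit(topic: string, message: { key: string; value: string }): Promise<void>;
}

// Illustrative event shape; only invoiceId appears in the guide's own code.
interface InvoiceCreatedEvent {
  invoiceId: string;
  tenantId: string;
  total: number;
}

class InvoiceEventPublisher {
  constructor(private readonly client: EventClient) {}

  // Publishes the event keyed by invoice ID so per-invoice ordering is preserved.
  async invoiceCreated(event: InvoiceCreatedEvent): Promise<void> {
    await this.client.emit('erp.v1.invoice.created', {
      key: event.invoiceId,
      value: JSON.stringify(event),
    });
  }
}

// In-memory client so the sketch runs without Kafka.
class InMemoryClient implements EventClient {
  readonly sent: Array<{ topic: string; key: string; value: string }> = [];

  async emit(topic: string, message: { key: string; value: string }): Promise<void> {
    this.sent.push({ topic, ...message });
  }
}

const client = new InMemoryClient();
const publisher = new InvoiceEventPublisher(client);
const pending = publisher.invoiceCreated({ invoiceId: 'inv-1001', tenantId: 'tenant-a', total: 250 });

pending.then(() => console.log(client.sent));
```

Keeping the publisher behind a small interface also makes it easy to unit test event emission without starting Kafka.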

Error Handling and Dead Letter Queues

Production Kafka without a dead letter queue (DLQ) is a trap. When a consumer throws an unhandled error, the offset for that message is never committed, so the same message is redelivered indefinitely and that partition stops advancing. Set a bounded retry policy and route failed messages to a DLQ topic (e.g., 'erp.v1.invoice.created.dlq'). A separate DLQ consumer can alert on failures, store them for manual inspection, and replay them after a bug fix. In NestJS, implement this as an error handler in your consumer that catches the failure, publishes the message to the DLQ topic, and returns normally so the original offset can be committed.

// main.ts: Kafka microservice setup
import { NestFactory } from '@nestjs/core';
import { MicroserviceOptions, Transport } from '@nestjs/microservices';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.createMicroservice<MicroserviceOptions>(AppModule, {
    transport: Transport.KAFKA,
    options: {
      client: {
        clientId: 'erp-service',
        brokers: ['kafka-1:9092', 'kafka-2:9092'],
      },
      consumer: {
        groupId: 'erp-consumer-group',
        // Bound client-level retries instead of relying on defaults
        retry: { retries: 3 },
      },
    },
  });
  await app.listen();
}

// Start the microservice
bootstrap();

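// invoice.consumer.ts: event handler
import { Controller } from '@nestjs/common';
import { EventPattern, Payload } from '@nestjs/microservices';
// App-specific providers and types; these import paths are assumed.
import { PdfService } from './pdf.service';
import { EmailService } from './email.service';
import { DlqService } from './dlq.service';
import { InvoiceCreatedEvent } from './invoice-created.event';

@Controller()
export class InvoiceConsumer {
  constructor(
    private readonly pdfService: PdfService,
    private readonly emailService: EmailService,
    private readonly dlqService: DlqService,
  ) {}

  @EventPattern('erp.v1.invoice.created')
  async handleInvoiceCreated(@Payload() data: InvoiceCreatedEvent) {
    try {
      await this.pdfService.generate(data.invoiceId);
      await this.emailService.sendInvoice(data);
    } catch (error) {
      // Route to DLQ and return normally; never retry indefinitely
      await this.dlqService.publish('erp.v1.invoice.created.dlq', data, error);
    }
  }
}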
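The handler above sends a message to the DLQ on the first failure. A bounded retry policy in front of the DLQ could look like the following sketch. It is framework-independent and assumes hypothetical names throughout (processWithDlq, DeadLetter, the in-memory dlq array); it also retries immediately, whereas a production version would likely add a backoff delay between attempts.

```typescript
interface DeadLetter<T> {
  topic: string;
  payload: T;
  error: string;
  attempts: number;
}

// Runs the handler up to maxAttempts times, then hands the message to the DLQ publisher.
// Resolves either way, so the caller can let the original offset advance.
async function processWithDlq<T>(
  payload: T,
  handler: (payload: T) => Promise<void>,
  publishToDlq: (entry: DeadLetter<T>) => Promise<void>,
  options: { dlqTopic: string; maxAttempts: number },
): Promise<'processed' | 'dead-lettered'> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      await handler(payload);
      return 'processed';
    } catch (error) {
      lastError = error; // remember the most recent failure for the DLQ record
    }
  }
  await publishToDlq({
    topic: options.dlqTopic,
    payload,
    error: lastError instanceof Error ? lastError.message : String(lastError),
    attempts: options.maxAttempts,
  });
  return 'dead-lettered';
}

// Usage with an in-memory DLQ and a handler that always fails.
const dlq: DeadLetter<{ invoiceId: string }>[] = [];
let calls = 0;

const outcome = processWithDlq(
  { invoiceId: 'inv-1001' },
  async () => {
    calls++;
    throw new Error('PDF service unavailable');
  },
  async (entry) => {
    dlq.push(entry);
  },
  { dlqTopic: 'erp.v1.invoice.created.dlq', maxAttempts: 3 },
);

outcome.then((result) => console.log(result, dlq));
```

Storing the error message and attempt count alongside the payload gives the DLQ consumer enough context to alert on the failure and to decide whether a replay is safe.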

Consumer Groups and Scaling

Consumer groups are Kafka's horizontal scaling mechanism. Add a new instance of your NestJS consumer service, and Kafka automatically rebalances partitions across all instances. The catch: during a rebalance under the classic eager protocol, all consumers in the group stop processing while partitions are reassigned. For low-latency systems, keep rebalances short and infrequent by using the static membership feature (assign each consumer a stable ID so rolling restarts don't trigger rebalances) and by sizing partitions close to your expected consumer count.
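The effect of adding instances can be illustrated with a toy assignment function. This is a simplified round-robin model, not Kafka's actual assignor logic, and the names assignPartitions and erp-1 through erp-5 are hypothetical. It shows why sizing partitions close to the consumer count matters: once instances outnumber partitions, the extras receive nothing.

```typescript
// Toy round-robin assignment: each partition goes to exactly one consumer.
// Real Kafka assignors are more involved; this only models the invariant.
function assignPartitions(
  partitions: number[],
  consumers: string[],
): Record<string, number[]> {
  const assignment: Record<string, number[]> = {};
  for (const consumer of consumers) {
    assignment[consumer] = [];
  }
  if (consumers.length === 0) {
    return assignment;
  }
  partitions.forEach((partition, index) => {
    assignment[consumers[index % consumers.length]].push(partition);
  });
  return assignment;
}

const partitions = [0, 1, 2, 3];

// Two instances share four partitions.
const before = assignPartitions(partitions, ['erp-1', 'erp-2']);

// Scaling to five instances triggers a rebalance; one instance ends up idle.
const after = assignPartitions(partitions, ['erp-1', 'erp-2', 'erp-3', 'erp-4', 'erp-5']);

console.log(before); // erp-1 gets [0, 2], erp-2 gets [1, 3]
console.log(after);  // erp-1..erp-4 get one partition each, erp-5 gets []
```

In this model every partition changes hands or is re-confirmed on each membership change, which mirrors why frequent restarts are costly when each one triggers a full rebalance.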

Kafka requires cluster coordination (ZooKeeper on older releases, KRaft on current ones), proper partition tuning, consumer group rebalancing strategies, monitoring, and a schema registry if you want schema evolution. For an ERP handling 500 events per day, a PostgreSQL-backed queue with pg-boss or a simple BullMQ Redis queue gives you 90% of the benefits with 10% of the operational complexity. I use Kafka only when multiple independent services must consume the same event stream; for a monolith or a two-service system, it's overkill.

Schema Management with Schema Registry

As your event-driven system grows, schema evolution becomes the hardest problem. When the 'order.created' event gains a new required field, you have a window where old producers emit events without that field and new consumers expect it. The Confluent Schema Registry (or open-source alternatives like Apicurio) addresses this with schema versioning and compatibility checks. Enforce 'BACKWARD' compatibility: a consumer using the new schema must be able to read messages written with the previous schema. With Avro, for example, that means a newly added field needs a default value, so the registry rejects the breaking change at registration time. This lets you update consumers before producers without a coordinated deploy.
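The idea behind a backward-compatibility check can be shown with a deliberately simplified model. The sketch below is not the Schema Registry's algorithm: it assumes flat schemas, ignores type promotion and nested records, and uses hypothetical names (FieldSpec, Schema, backwardIssues). It only captures the core rule that a field added in the new schema needs a default so that old messages can still be read.

```typescript
interface FieldSpec {
  type: string;
  default?: unknown;
}

type Schema = Record<string, FieldSpec>;

// Returns the reasons a reader on `next` could not decode data written with `previous`.
// An empty list means the change is backward compatible under this simplified rule.
function backwardIssues(previous: Schema, next: Schema): string[] {
  const issues: string[] = [];
  for (const name of Object.keys(next)) {
    const field = next[name];
    const old = previous[name];
    if (old === undefined) {
      // Old messages lack this field, so the reader needs a default to fill in.
      if (!('default' in field)) {
        issues.push(`new field "${name}" has no default`);
      }
    } else if (old.type !== field.type) {
      issues.push(`field "${name}" changed type from ${old.type} to ${field.type}`);
    }
  }
  return issues;
}

const v1: Schema = {
  orderId: { type: 'string' },
  total: { type: 'double' },
};

// Adding a required field breaks readers of old data.
const v2Required: Schema = { ...v1, currency: { type: 'string' } };

// Adding the same field with a default keeps old data readable.
const v2Defaulted: Schema = { ...v1, currency: { type: 'string', default: 'IDR' } };

console.log(backwardIssues(v1, v2Required));  // one issue: missing default
console.log(backwardIssues(v1, v2Defaulted)); // no issues
```

Fields removed in the new schema are not flagged here, because a reader on the new schema simply ignores them in old messages.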

My Honest Assessment: When to Choose Kafka

Use Kafka when: (1) multiple independent services need to consume the same event stream; (2) you need durable event replay for debugging or building read models; (3) you're processing millions of events daily and need horizontal scale; (4) you have a team with capacity to operate the Kafka cluster. Skip Kafka when: your system has fewer than 3 services consuming events, your throughput is under 10K events/day, or your team doesn't have Kafka operational experience. For mid-sized NestJS applications, BullMQ with Redis is my first choice — it handles queuing, delayed jobs, retries, and priority without the Kafka overhead.

Sources & Further Reading

LinkedIn Engineering — Benchmarking Apache Kafka: 2 Million Writes Per Second — https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Confluent — Kafka Performance: Throughput, Latency, and Test Results — https://developer.confluent.io/learn/kafka-performance/
NestJS Docs — Microservices: Kafka Transport — https://docs.nestjs.com/microservices/kafka
