Apache Kafka can sustain 2 million writes per second on three commodity machines, according to a benchmark LinkedIn published in 2014 that is still widely cited. That number tends to make developers reach for Kafka immediately. But I've learned that throughput isn't the first question to ask. The first question is: does your problem actually need an event bus? I've built ERP systems at Commsult Indonesia that use simple database-backed queues for email triggers and invoice generation, and that's the right call for those workloads. Kafka earns its complexity when multiple services need to react to the same events independently, when you need durable event replay, or when you're processing millions of events daily. This guide covers when Kafka genuinely adds value and how to implement it correctly with NestJS.
Kafka's architecture is built around topics, partitions, and consumer groups. A topic is a category of events, such as 'order.created' or 'invoice.paid'. Topics are divided into partitions, which are the unit of parallelism. Each partition is an ordered, immutable sequence of records that Kafka retains for a configurable period (7 days by default). Consumer groups allow multiple service instances to read from a topic in parallel: each partition is assigned to exactly one consumer in a group at any given time. This means consumers scale horizontally only up to the partition count; any instance beyond that number sits idle until you add partitions.
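To make the one-owner-per-partition rule concrete, here is a minimal sketch of how partitions might be spread across the consumers in a group. It uses plain round-robin, which is a simplification and not Kafka's actual assignor logic; the function name `assignPartitions` and the consumer IDs are made up for illustration.

```typescript
// Simplified round-robin assignment of partitions to the consumers in one group.
// Real Kafka assignors are more involved; this only illustrates the invariant
// that each partition has exactly one owner within a group.
function assignPartitions(partitionCount: number, consumers: string[]): Map<string, number[]> {
  const assignment = new Map<string, number[]>();
  for (const consumer of consumers) {
    assignment.set(consumer, []);
  }
  if (consumers.length === 0) {
    return assignment;
  }
  for (let partition = 0; partition < partitionCount; partition++) {
    const owner = consumers[partition % consumers.length];
    assignment.get(owner)!.push(partition);
  }
  return assignment;
}

// Three partitions, four consumers: the fourth consumer gets nothing to read.
const threePartitions = assignPartitions(3, ['c1', 'c2', 'c3', 'c4']);
console.log(threePartitions.get('c4')); // []
```

With 8 partitions and 2 consumers the same function gives each consumer four partitions, which is why adding instances helps only while there are unowned partitions to hand out.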
Topic design is where most teams make mistakes. Too few partitions limit parallelism: with one partition, only one consumer in a group can read from the topic at a time. Too many partitions increase memory overhead on the brokers and make rebalancing slower. A reasonable starting point is to set the partition count equal to the peak consumer concurrency you expect. For a payment processing service with 8 consumer instances, start with 8 partitions. Use a meaningful partition key, typically a tenant ID, user ID, or entity ID, so that related events land on the same partition and ordering is preserved for that entity.
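The effect of a partition key can be sketched as "hash the key, take it modulo the partition count". The hash below is a simple stand-in chosen for readability; it is not the algorithm Kafka clients actually use (the Java client uses murmur2), and `hashKey` and `partitionFor` are illustrative names.

```typescript
// Illustrative stand-in for a partitioner: hash the key, then take it modulo
// the partition count. This simple 32-bit hash is NOT Kafka's real algorithm.
function hashKey(key: string): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    // ">>> 0" keeps the running value an unsigned 32-bit integer.
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash;
}

function partitionFor(key: string, partitionCount: number): number {
  return hashKey(key) % partitionCount;
}

const partitionCount = 8;
const firstEvent = partitionFor('tenant-42', partitionCount);
const secondEvent = partitionFor('tenant-42', partitionCount);
console.log(firstEvent === secondEvent); // true: same key, same partition
```

Because the partition depends on the partition count, changing that count later remaps keys to different partitions. That is one more reason to pick the count deliberately up front.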
Kafka Event-Driven Architecture (NestJS)

Producers (NestJS Services)
┌──────────────────┐
│ InvoiceService   │──── erp.v1.invoice.created ────┐
│ OrderService     │──── erp.v1.order.placed ───────┤
│ PaymentService   │──── erp.v1.payment.received ───┤
└──────────────────┘                                │
                                          ┌─────────▼─────────┐
                                          │   Kafka Broker    │
                                          │   (3 nodes, HA)   │
                                          │                   │
                                          │ Topic Partitions  │
                                          │   P0 | P1 | P2    │
                                          └─────────┬─────────┘
                                                    │
                              ┌─────────────────────┼─────────────────────┐
                              │                     │                     │
                     ┌────────▼────────┐   ┌────────▼────────┐   ┌────────▼────────┐
                     │  Notification   │   │   PDF Report    │   │    Audit Log    │
                     │    Consumer     │   │    Consumer     │   │    Consumer     │
                     │   (Group: A)    │   │   (Group: B)    │   │   (Group: C)    │
                     └─────────────────┘   └─────────────────┘   └─────────────────┘

Dead Letter Queue: erp.v1.invoice.created.dlq
(Failed messages routed here for inspection & replay)

From my experience building event-driven modules for our ERP at Commsult: use descriptive, versioned topic names like 'erp.v1.invoice.created' from the start. Renaming topics in production requires coordinating all producers and consumers simultaneously, which is a painful migration. Include the domain, version, entity, and event verb in the name, and you'll thank yourself when you add a v2 schema six months later.
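One way to keep the naming convention from drifting is to build and validate topic names in code instead of typing them by hand. The helpers below are a minimal sketch; `TopicParts`, `buildTopicName`, `parseTopicName`, and `dlqTopicFor` are hypothetical names, and the lowercase-only pattern is an assumption about the convention, not a Kafka requirement.

```typescript
interface TopicParts {
  domain: string;
  version: number;
  entity: string;
  event: string;
}

// Builds names of the form <domain>.v<version>.<entity>.<event>.
function buildTopicName({ domain, version, entity, event }: TopicParts): string {
  return `${domain}.v${version}.${entity}.${event}`;
}

// Assumed convention: lowercase segments, digits and hyphens allowed after the first letter.
const TOPIC_PATTERN = /^([a-z][a-z0-9-]*)\.v(\d+)\.([a-z][a-z0-9-]*)\.([a-z][a-z0-9-]*)$/;

// Returns the parsed parts, or null when the name does not follow the convention.
function parseTopicName(topic: string): TopicParts | null {
  const match = TOPIC_PATTERN.exec(topic);
  if (!match) {
    return null;
  }
  return { domain: match[1], version: Number(match[2]), entity: match[3], event: match[4] };
}

// Dead letter topics reuse the source topic name with a ".dlq" suffix.
function dlqTopicFor(topic: string): string {
  return `${topic}.dlq`;
}

const invoiceCreated = buildTopicName({ domain: 'erp', version: 1, entity: 'invoice', event: 'created' });
console.log(invoiceCreated); // erp.v1.invoice.created
console.log(dlqTopicFor(invoiceCreated)); // erp.v1.invoice.created.dlq
```

A check like `parseTopicName(name) !== null` could run in a unit test or at startup to reject unversioned names such as 'order.created' before they reach production.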
NestJS has first-class Kafka support through the @nestjs/microservices package. You configure a Kafka microservice transport on your application, then use decorators to define message handlers. The setup requires three pieces: the producer service (for publishing events), the consumer module (with @EventPattern for fire-and-forget events or @MessagePattern for request-response handlers), and the Kafka client configuration with proper retry and error handling settings.
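The producer side can be sketched as a small class that emits a keyed event. To keep the example runnable on its own, it depends on a minimal `EventClient` interface rather than on NestJS; in a real application the injected ClientKafka from @nestjs/microservices would play that role, and its emit method returns an Observable rather than void. The `tenantId` and `total` fields are assumptions about the event shape.

```typescript
// Minimal shape of the client this sketch needs (a simplification of ClientKafka).
interface EventClient {
  emit(topic: string, message: { key: string; value: string }): void;
}

// Hypothetical event shape; only invoiceId appears in the consumer code.
interface InvoiceCreatedEvent {
  invoiceId: string;
  tenantId: string;
  total: number;
}

class InvoiceEventPublisher {
  private readonly client: EventClient;

  constructor(client: EventClient) {
    this.client = client;
  }

  publishCreated(event: InvoiceCreatedEvent): void {
    // Keying by tenant keeps each tenant's invoice events in order on one partition.
    this.client.emit('erp.v1.invoice.created', {
      key: event.tenantId,
      value: JSON.stringify(event),
    });
  }
}

// In-memory client standing in for Kafka so the sketch runs without a broker.
const sent: Array<{ topic: string; key: string; value: string }> = [];
const inMemoryClient: EventClient = {
  emit: (topic, message) => {
    sent.push({ topic, key: message.key, value: message.value });
  },
};

new InvoiceEventPublisher(inMemoryClient).publishCreated({
  invoiceId: 'inv-1',
  tenantId: 'tenant-42',
  total: 100,
});
console.log(sent[0].topic, sent[0].key); // erp.v1.invoice.created tenant-42
```

Depending on a narrow interface also makes the publisher easy to unit test, since no broker or NestJS testing module is needed.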
Production Kafka without a dead letter queue (DLQ) is a trap. When a handler keeps throwing on the same message, the offset for that partition never advances: the message is redelivered after every retry or consumer restart, and everything queued behind it is blocked. Set a bounded retry policy and route failed messages to a DLQ topic (e.g., 'erp.v1.invoice.created.dlq'). A separate DLQ consumer can alert on failures, store them for manual inspection, and replay them after a bug fix. In NestJS, implement this as an error handler in your consumer that catches the error, publishes the message to the DLQ topic, and returns normally so the offset moves past the original message.
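The bounded-retry-then-DLQ policy can be sketched independently of any Kafka client. `processWithDlq` and its option names are hypothetical, and the sketch retries immediately; a production version would likely add a backoff delay between attempts.

```typescript
type Handler<T> = (message: T) => Promise<void>;
type DlqPublisher<T> = (topic: string, message: T, error: unknown) => Promise<void>;

// Tries the handler up to maxAttempts times, then hands the message to the DLQ.
// It always resolves, so the caller can let the offset advance.
async function processWithDlq<T>(
  message: T,
  handler: Handler<T>,
  publishToDlq: DlqPublisher<T>,
  options: { dlqTopic: string; maxAttempts: number },
): Promise<'processed' | 'dead-lettered'> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      await handler(message);
      return 'processed';
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  await publishToDlq(options.dlqTopic, message, lastError);
  return 'dead-lettered';
}

// Demo with a handler that always fails and an in-memory DLQ.
async function demo(): Promise<void> {
  const deadLetters: string[] = [];
  const result = await processWithDlq(
    { invoiceId: 'inv-1' },
    async () => {
      throw new Error('PDF renderer unavailable');
    },
    async (topic) => {
      deadLetters.push(topic);
    },
    { dlqTopic: 'erp.v1.invoice.created.dlq', maxAttempts: 3 },
  );
  console.log(result, deadLetters);
}
demo();
```

If the DLQ publish itself fails, the function rejects. That is deliberate in this sketch: losing a message silently is worse than blocking the partition until the DLQ is reachable again.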
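// main.ts: Kafka microservice setup
import { NestFactory } from '@nestjs/core';
import { MicroserviceOptions, Transport } from '@nestjs/microservices';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.createMicroservice<MicroserviceOptions>(AppModule, {
    transport: Transport.KAFKA,
    options: {
      client: {
        clientId: 'erp-service',
        brokers: ['kafka-1:9092', 'kafka-2:9092'],
      },
      consumer: {
        groupId: 'erp-consumer-group',
        // Bound the client-level retries instead of relying on defaults
        retry: { retries: 3 },
      },
    },
  });
  await app.listen();
}
bootstrap();

// invoice.consumer.ts: event handler
import { Controller } from '@nestjs/common';
import { EventPattern, Payload } from '@nestjs/microservices';
// Application-specific pieces; these import paths are placeholders for your own modules.
import { PdfService } from './pdf.service';
import { EmailService } from './email.service';
import { DlqService } from './dlq.service';
import { InvoiceCreatedEvent } from './invoice-created.event';

@Controller()
export class InvoiceConsumer {
  // Services are injected by NestJS; they must be registered as providers in the module.
  constructor(
    private readonly pdfService: PdfService,
    private readonly emailService: EmailService,
    private readonly dlqService: DlqService,
  ) {}

  @EventPattern('erp.v1.invoice.created')
  async handleInvoiceCreated(@Payload() data: InvoiceCreatedEvent) {
    try {
      await this.pdfService.generate(data.invoiceId);
      await this.emailService.sendInvoice(data);
    } catch (error) {
      // Route to the DLQ instead of rethrowing, so the partition keeps advancing
      await this.dlqService.publish('erp.v1.invoice.created.dlq', data, error);
    }
  }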
}

Consumer groups are Kafka's horizontal scaling mechanism. Add a new instance of your NestJS consumer service, and Kafka automatically rebalances partitions across all instances. The catch: under the default eager rebalance protocol, all consumers in the group stop processing while partitions are reassigned. For low-latency systems, keep rebalances short and rare. Static membership helps where your client library supports it (group.instance.id in the Java client): each consumer gets a stable ID, so a rolling restart does not trigger a rebalance. It also helps to size the partition count close to your expected consumer count.
Kafka brings real operational weight: cluster coordination (ZooKeeper on older versions, KRaft on current ones), partition tuning, consumer group rebalancing strategies, monitoring, and a schema registry if you want schema evolution. For an ERP handling 500 events per day, a PostgreSQL-backed queue with pg-boss or a simple Redis-backed BullMQ queue gives you 90% of the benefits with 10% of the operational complexity. I use Kafka only when multiple independent services must consume the same event stream. For a monolith or a two-service system, it's overkill.
As your event-driven system grows, schema evolution becomes the hardest problem. When the 'order.created' event gains a new field, there is a window where old producers emit events without that field and new consumers expect it. The Confluent Schema Registry (or open-source alternatives such as Apicurio) addresses this with schema versioning and compatibility checks. Enforce 'BACKWARD' compatibility: consumers using the new schema must be able to read messages written with the previous one, which in practice means a newly added field needs a default value. This lets you update consumers before producers without a coordinated deploy.
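What backward compatibility buys you can be shown with a hand-rolled reader. This is not how Schema Registry works internally; it is a minimal sketch of a v2 consumer that still accepts v1 messages by applying a default. The field names `amount` and `currency` and the default value are assumptions made for the example.

```typescript
interface OrderCreatedV1 {
  orderId: string;
  amount: number;
}

// v2 adds a field; old producers do not send it yet.
interface OrderCreatedV2 extends OrderCreatedV1 {
  currency: string;
}

const DEFAULT_CURRENCY = 'IDR'; // hypothetical default for events written before v2

// A v2 reader that accepts both v1 and v2 payloads.
function readOrderCreated(raw: string): OrderCreatedV2 {
  const parsed = JSON.parse(raw) as Partial<OrderCreatedV2>;
  if (typeof parsed.orderId !== 'string' || typeof parsed.amount !== 'number') {
    throw new Error('order.created event is missing required v1 fields');
  }
  return {
    orderId: parsed.orderId,
    amount: parsed.amount,
    currency: parsed.currency ?? DEFAULT_CURRENCY, // fill the gap for old producers
  };
}

const fromOldProducer = readOrderCreated('{"orderId":"o-1","amount":250000}');
const fromNewProducer = readOrderCreated('{"orderId":"o-2","amount":40,"currency":"USD"}');
console.log(fromOldProducer.currency, fromNewProducer.currency); // IDR USD
```

A registry enforces the same rule at publish time instead of read time, so an incompatible schema is rejected before any message is produced.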
Use Kafka when: (1) multiple independent services need to consume the same event stream; (2) you need durable event replay for debugging or building read models; (3) you're processing millions of events daily and need horizontal scale; (4) you have a team with capacity to operate the Kafka cluster. Skip Kafka when: your system has fewer than 3 services consuming events, your throughput is under 10K events/day, or your team doesn't have Kafka operational experience. For mid-sized NestJS applications, BullMQ with Redis is my first choice — it handles queuing, delayed jobs, retries, and priority without the Kafka overhead.
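The "skip Kafka" conditions above can be written down as a small check, which is sometimes handy in an architecture decision record. This is a literal encoding of those three thresholds and nothing more; `WorkloadProfile` and `recommendTransport` are made-up names, and a real decision would weigh factors this sketch ignores, such as the need for replay.

```typescript
interface WorkloadProfile {
  consumingServices: number; // independent services consuming the event stream
  eventsPerDay: number;
  teamHasKafkaOpsExperience: boolean;
}

// Returns 'simple-queue' when any of the skip conditions applies, otherwise 'kafka'.
function recommendTransport(profile: WorkloadProfile): 'kafka' | 'simple-queue' {
  const skipKafka =
    profile.consumingServices < 3 ||
    profile.eventsPerDay < 10000 ||
    !profile.teamHasKafkaOpsExperience;
  return skipKafka ? 'simple-queue' : 'kafka';
}

// The 500-events-per-day ERP from earlier falls well under the threshold.
console.log(
  recommendTransport({ consumingServices: 2, eventsPerDay: 500, teamHasKafkaOpsExperience: false }),
); // simple-queue
```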