When should you avoid event-driven architecture entirely?

Avoid events when the producer and consumer are owned by the same team and deploy together, when the business flow needs an immediate authoritative answer (such as stock reservation or payment authorization), or when your entire write load fits comfortably in a single PostgreSQL instance. If your system fits in one database and your team fits in one room, a synchronous call or database transaction almost always serves you better.

What are the hidden operational costs of adopting a message broker?

A broker is an additional stateful service that must be deployed, patched, monitored, and backed up. Debugging shifts from reading a single stack trace to chasing correlation IDs across multiple services and a broker dashboard. You also inherit a zoo of async failure modes — retries, duplicate delivery, out-of-order messages, poison messages, and dead-letter queues — each requiring its own design, code, and monitoring from day one.

How did replacing a Kafka event with a database transaction improve a real ERP system?

In one ERP build, a POApproved event triggered an inventory consumer to reserve stock, even though both services were maintained by the same three engineers and deployed from the same monorepo. When a malformed payload crashed the consumer, retries delayed stock reservation by 40 seconds. Replacing the event with a single database transaction eliminated that incident class entirely, removed the latency, and reduced the codebase by a third. Events were kept only where delay was acceptable: notifying the reporting warehouse and sending emails.

What is the recommended stepping stone before reaching for a full message broker?

Start with a transactional outbox pattern on top of PostgreSQL: a jobs table plus a polling worker gives you async processing and retries with exactly one database to operate. You can later graduate to BullMQ or Kafka by swapping the transport layer, because the outbox pattern already forces you to design idempotent handlers — the hardest part of any event-driven system.

When do events genuinely earn their place in a system?

Events are justified when there is a hard latency or availability mismatch between producer and consumer (such as spiky uploads feeding slow processing), when producer and consumer are owned by different teams with different deploy cadences, and when the business outcome can tolerate minutes of delay and occasional reordering. Notifying a reporting warehouse or sending emails are classic examples where a 40-second delay costs nothing and decoupling genuinely pays off.

Event-Driven Architecture: When NOT to Use It

Event-driven architecture is the pattern everyone wants on their CV and almost nobody interrogates. The pitch is seductive: decoupled services, infinite scalability, systems that react to the world instead of polling it. I have shipped event-driven flows in production and I will keep doing so. But I have also watched teams, including a past version of me, take a perfectly understandable CRUD system and turn it into a distributed murder mystery where nobody can answer the simplest operational question: what happened, in what order, and why.

This post is the decision framework I wish someone had handed me before I wired my first message broker into an ERP. The short answer up front: if your system fits in one database, your team fits in one room, and your workflows need answers now rather than eventually, you almost certainly do not need events yet. Here is how to know for sure.

What Events Actually Buy You

Let us be fair to the pattern first. Martin Fowler's classic breakdown distinguishes event notification, event-carried state transfer, event sourcing, and CQRS, and each solves a specific coupling problem. Event notification lets a producer announce that something happened without knowing or caring who listens. Event-carried state transfer lets consumers keep their own copy of data so they can answer queries without calling the source system. These are real, valuable properties when you have many teams and many services stepping on each other.

AWS frames the same idea operationally: producers, routers, and consumers that scale, fail, and deploy independently. If your image-processing pipeline gets 100x traffic spikes while your billing service does not, putting a queue between them is obviously right. The question is never whether events work. The question is whether the problems they solve are problems you actually have today.

The Four Costs Nobody Puts on the Slide

Every architectural pattern is a trade. Here is what you pay for events, and you pay it on day one, not at scale.

Debugging becomes archaeology

A synchronous call gives you a stack trace. An event gives you a correlation ID, if you remembered to propagate one, scattered across three services and a broker. Your median time-to-diagnosis for production incidents goes up the day you go async, and it never fully comes back down.
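To make the correlation ID point concrete, here is a minimal sketch of an event envelope that carries one across hops. Everything in it is an assumption for illustration: the `EventEnvelope` shape, the `startChain` and `follow` helpers, the `StockReserved` event name, and the counter-based IDs (real code would more likely use UUIDs).

```typescript
// Hypothetical event envelope; none of these names come from a real library.
interface EventEnvelope<T> {
  eventId: string;
  correlationId: string; // copied unchanged across every hop in the flow
  causationId: string | null; // id of the event that directly caused this one
  type: string;
  payload: T;
}

let idCounter = 0;

// Deterministic id generator for the sketch only.
function newId(): string {
  idCounter += 1;
  return `evt-${idCounter}`;
}

// First event in a flow: its own id doubles as the correlation id.
function startChain<T>(type: string, payload: T): EventEnvelope<T> {
  const eventId = newId();
  return { eventId, correlationId: eventId, causationId: null, type, payload };
}

// Follow-up event: inherits the correlation id and records its direct parent.
function follow<T, U>(parent: EventEnvelope<T>, type: string, payload: U): EventEnvelope<U> {
  return {
    eventId: newId(),
    correlationId: parent.correlationId,
    causationId: parent.eventId,
    type,
    payload,
  };
}

const approved = startChain("POApproved", { poId: 42 });
const reserved = follow(approved, "StockReserved", { poId: 42, qty: 10 });
```

If every log line in every service includes `correlationId`, one search reconstructs the flow. Forget to call something like `follow` in a single consumer and the trail goes cold at that hop, which is exactly the archaeology problem.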

Eventual consistency leaks into UX

Users do not think in eventual consistency. They click approve, refresh the page, and expect the world to have changed. Every async boundary you add is a place where the UI can show stale data, and you will spend real engineering time papering over it with optimistic updates and polling.

Failure handling multiplies

Synchronous failure is one code path: the call failed, show an error. Async failure is a zoo: redelivery, out-of-order delivery, duplicate delivery, poison messages, dead-letter queues, consumers that silently stop. Each needs design, code, and monitoring.
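As a rough illustration of how much code even the simplest slice of that zoo needs, the sketch below retries each message a fixed number of times and parks anything that never succeeds in a dead-letter list. The `Message` shape, the `consume` function, and the in-memory arrays are all hypothetical stand-ins for a real broker; backoff, logging, and alerting are deliberately left out.

```typescript
interface Message {
  id: string;
  body: string;
}

interface ConsumeResult {
  processed: string[];
  deadLetters: Message[];
}

type Handler = (msg: Message) => void;

// Try each message up to maxAttempts times; park it in deadLetters if it never succeeds.
function consume(queue: Message[], handler: Handler, maxAttempts: number): ConsumeResult {
  const processed: string[] = [];
  const deadLetters: Message[] = [];
  for (const msg of queue) {
    let done = false;
    for (let attempt = 1; attempt <= maxAttempts && !done; attempt++) {
      try {
        handler(msg);
        done = true;
      } catch {
        // Swallow and retry; a real consumer would back off and log here.
      }
    }
    if (done) {
      processed.push(msg.id);
    } else {
      deadLetters.push(msg);
    }
  }
  return { processed, deadLetters };
}

// A handler that fails on malformed payloads: JSON.parse throws on invalid JSON.
const parseHandler: Handler = (msg) => {
  JSON.parse(msg.body);
};

const result = consume(
  [
    { id: "m1", body: '{"poId":1}' },
    { id: "m2", body: "{not json" }, // poison message: fails on every attempt
    { id: "m3", body: '{"poId":3}' },
  ],
  parseHandler,
  3,
);
```

Here `m2` ends up in `deadLetters` while `m1` and `m3` go through. Note what the sketch still does not handle: duplicates, ordering, and a consumer that stops without anyone noticing.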

Operational surface grows

A broker is another stateful service to deploy, patch, monitor, and back up. For a small team on a VPS budget, Kafka or even a managed queue is a standing tax on attention that a Postgres table simply is not.

Five Signs Events Will Make Your System Worse

From my own scar tissue building NestJS and PostgreSQL systems for Indonesian SMBs, these are the red flags that an event-driven design is about to subtract value:

The producer and consumer are the same team and deploy together. You are paying the full decoupling tax to decouple nothing. A function call or a database transaction would do.
The business flow needs an immediate, authoritative answer. Stock reservation, payment authorization, seat booking: if the user must know the outcome before moving on, an event in the middle only adds latency and failure modes.
You need strict ordering across entities. Brokers generally guarantee ordering per partition or per queue at best. If your domain logic breaks when message B arrives before message A, you will end up rebuilding a serialization layer that your database gave you for free.
Your entire write load fits comfortably in one PostgreSQL instance. Most SMB systems peak below a few hundred writes per second. Postgres handles that on a 4 vCPU box with room to spare. Scalability you do not need is not an asset, it is inventory.
Nobody on the team has operated a broker in production before. The first poison-message incident at 2 AM is not the time to learn how dead-letter queues work.

The most expensive version of this mistake is event sourcing as the system of record. If you adopt it without a hard requirement like a full audit-by-replay mandate, you are signing up for schema evolution of immutable history. Reversing that decision later means a data migration project, not a refactor.
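The ordering sign above is easier to see in code. This is a minimal sketch, assuming a toy hash partitioner; `partitionFor`, `route`, and the `KeyedMessage` shape are hypothetical, and real brokers use their own partitioning functions. It shows why messages sharing a key stay in order relative to each other, while nothing is promised across keys that land in different partitions.

```typescript
interface KeyedMessage {
  key: string; // e.g. a purchase order id
  seq: number; // producer-side sequence number for that key
}

// Toy string hash for the sketch; not the algorithm any particular broker uses.
function partitionFor(key: string, partitionCount: number): number {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep h an unsigned 32-bit integer
  }
  return h % partitionCount;
}

// Append each message to the partition chosen by its key, preserving arrival order.
function route(messages: KeyedMessage[], partitionCount: number): Map<number, KeyedMessage[]> {
  const partitions = new Map<number, KeyedMessage[]>();
  for (const msg of messages) {
    const p = partitionFor(msg.key, partitionCount);
    const bucket = partitions.get(p) ?? [];
    bucket.push(msg);
    partitions.set(p, bucket);
  }
  return partitions;
}

const routed = route(
  [
    { key: "po-1", seq: 1 },
    { key: "po-2", seq: 1 },
    { key: "po-1", seq: 2 },
    { key: "po-2", seq: 2 },
  ],
  3,
);
```

Within one partition, the events for `po-1` are read in the order they were written. If your logic needs `po-1` and `po-2` processed in a single global order, partitioning by key cannot give you that, and that is the serialization layer you end up rebuilding.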

Synchronous Calls vs Events: an Honest Comparison

This is the table I draw on the whiteboard when a team is deciding. Neither column is the good column; they are different bills to pay.

Dimension	Synchronous (REST/transaction)	Event-driven (broker)
Consistency	Immediate, transactional where it counts	Eventual; UX and reporting must tolerate lag
Debugging	One stack trace, one log stream	Correlation IDs across services, broker dashboards, replay tooling
Failure modes	Call fails, caller decides; timeouts cascade under load	Retries, duplicates, ordering, poison messages, silent consumer death
Team coupling	Caller must know callee API and uptime	Teams ship independently; contract moves into event schemas
Ops cost for a small team	Whatever your app already costs	Broker hosting, monitoring, upgrades, plus on-call knowledge

A Real Example: the Approval Flow That Did Not Need Kafka

In one ERP build, purchase order approval originally published a POApproved event that an inventory consumer picked up to reserve stock. On paper, beautiful decoupling. In practice, both services were maintained by the same three engineers and deployed from the same monorepo. When a malformed payload crashed the consumer, retries delayed stock reservation by 40 seconds while procurement staff stared at free stock that should have been locked.

We replaced the event with a single database transaction that updated the purchase order and the stock reservation together. The diagram below is the before and after. Incident class deleted, latency gone, and the code shrank by a third. We kept events where they genuinely earned their place: notifying the reporting warehouse and sending emails, where a 40-second delay costs nothing.

-- What "eventually consistent" looked like in practice:
-- the PO was approved, but stock reservation lagged 40s behind
-- because a consumer was retrying a poison message.

Approval service     Inventory service        User sees
────────────────     ─────────────────        ─────────
10:00:00 PO approved
10:00:00 emit POApproved
                     10:00:01 crash (bad payload)
                     10:00:05 retry 1 → crash
                     10:00:15 retry 2 → crash
                     10:00:40 retry 3 → ok     "Why is stock
                     10:00:40 reserve stock     still free?!"

-- The synchronous version is one transaction:
BEGIN;
UPDATE purchase_orders SET status = 'approved' WHERE id = $1;
UPDATE stock_items SET reserved = reserved + qty WHERE ...;
COMMIT;  -- consistent at 10:00:00, every time

The Decision Checklist I Actually Use

Before any queue or topic enters a design of mine, it has to clear these five questions, in order:

Is there a hard latency or availability mismatch between producer and consumer, like spiky uploads feeding slow processing? If not, stop here and use a call or a transaction.
Can the business outcome tolerate minutes of delay and occasional reordering? If the answer is no, the step stays synchronous no matter how elegant the topic diagram looks.
Are producer and consumer owned by different teams with different deploy cadences? Decoupling teams is the strongest legitimate reason for events. Decoupling your own code from itself is not.
Do we have idempotent consumers designed before the first message flows? At-least-once delivery means duplicates are a certainty, not an edge case.
Do we have the operational budget: monitoring for consumer lag, a dead-letter strategy, and someone who can drain a queue under pressure? If the honest answer is no, defer the pattern, not the feature.
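The fourth question, idempotent consumers, is the one that most benefits from a concrete picture. Below is a minimal sketch, assuming each event carries a unique `eventId`; the `StockEvent` shape and `ReservationHandler` class are hypothetical, and the in-memory `Set` and `Map` stand in for database tables.

```typescript
interface StockEvent {
  eventId: string; // unique per event; redeliveries reuse the same id
  sku: string;
  qty: number;
}

class ReservationHandler {
  private readonly seen = new Set<string>();
  private readonly reserved = new Map<string, number>();

  // Apply the event once; a redelivery with the same eventId is ignored.
  // Returns true if the event changed state, false if it was a duplicate.
  handle(event: StockEvent): boolean {
    if (this.seen.has(event.eventId)) {
      return false;
    }
    this.seen.add(event.eventId);
    const current = this.reserved.get(event.sku) ?? 0;
    this.reserved.set(event.sku, current + event.qty);
    return true;
  }

  reservedFor(sku: string): number {
    return this.reserved.get(sku) ?? 0;
  }
}

const reservations = new ReservationHandler();
const event: StockEvent = { eventId: "evt-7", sku: "WIDGET", qty: 5 };
const firstDelivery = reservations.handle(event); // applied
const secondDelivery = reservations.handle(event); // duplicate, skipped
```

After both deliveries, `WIDGET` has 5 units reserved, not 10. In a database-backed version, the record of the processed ID and the stock update would need to commit in the same transaction, otherwise a crash between the two reintroduces the duplicate.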

Start with a transactional outbox on top of PostgreSQL before reaching for a broker. A jobs table plus a polling worker gives you async processing and retries with exactly one database to operate. You can graduate to BullMQ or Kafka later by swapping the transport, because the outbox pattern has already forced you to design idempotent handlers.
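The shape of a jobs table plus a polling worker can be sketched in a few lines. This is an in-memory model under stated assumptions, not a production implementation: the `Outbox` class, the `Job` fields, and the `send-email` job type are hypothetical, and the array stands in for a PostgreSQL table whose inserts would run in the same transaction as the business write.

```typescript
type JobStatus = "pending" | "done" | "failed";

interface Job {
  id: number;
  type: string;
  payload: string;
  status: JobStatus;
  attempts: number;
}

// In-memory stand-in for a jobs table.
class Outbox {
  private readonly jobs: Job[] = [];
  private nextId = 1;

  // In a real system this insert shares a transaction with the business write.
  enqueue(type: string, payload: string): Job {
    const job: Job = { id: this.nextId++, type, payload, status: "pending", attempts: 0 };
    this.jobs.push(job);
    return job;
  }

  // One polling pass: run each pending job; mark it done, or failed after maxAttempts.
  pollOnce(handler: (job: Job) => void, maxAttempts: number): void {
    for (const job of this.jobs) {
      if (job.status !== "pending") continue;
      job.attempts += 1;
      try {
        handler(job);
        job.status = "done";
      } catch {
        if (job.attempts >= maxAttempts) job.status = "failed";
      }
    }
  }

  statusOf(id: number): JobStatus | undefined {
    return this.jobs.find((j) => j.id === id)?.status;
  }
}

const outbox = new Outbox();
const emailJob = outbox.enqueue("send-email", JSON.stringify({ poId: 42 }));

let calls = 0;
// Handler that fails on its first call only, to exercise the retry path.
const flakyHandler = (_job: Job): void => {
  calls += 1;
  if (calls === 1) throw new Error("transient failure");
};

outbox.pollOnce(flakyHandler, 3); // first pass fails; the job stays pending
outbox.pollOnce(flakyHandler, 3); // second pass succeeds
```

Because a failed job simply stays `pending` and runs again on the next pass, the handler may be invoked more than once for the same job. That is why the pattern pushes you toward idempotent handlers before any broker is involved.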

Where Events Genuinely Win

So that this is not a hit piece: I run event-driven flows happily in production today, and they earn their keep in three places.

Fan-out side effects: one business fact, many independent reactions. An invoice gets paid; accounting, email, analytics, and the customer portal all care, and none should block the others.
Load leveling: bursty producers feeding rate-limited consumers, like bulk imports trickling into a third-party API that allows five requests per second.
Cross-system integration: when the consumer is another company's system or a legacy ERP you cannot call synchronously, an event log is the cleanest contract you can offer.
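For the load-leveling case, the arithmetic is simple enough to show directly. This is a minimal sketch, assuming evenly spaced dispatch; `scheduleDispatch` is a hypothetical helper, and a real worker would also need to handle retries and responses from the third-party API.

```typescript
// Return the dispatch offset in milliseconds for each job, spaced evenly so that
// no more than ratePerSecond jobs start within any one-second window.
function scheduleDispatch(jobCount: number, ratePerSecond: number): number[] {
  const intervalMs = 1000 / ratePerSecond;
  const offsets: number[] = [];
  for (let i = 0; i < jobCount; i++) {
    offsets.push(Math.round(i * intervalMs));
  }
  return offsets;
}

// 12 queued imports against an API that allows five requests per second.
const offsets = scheduleDispatch(12, 5);
```

At five requests per second the jobs go out 200 ms apart, so a burst of 12 imports drains in a little over two seconds instead of hitting the API all at once. The queue absorbs the spike; the consumer sets the pace.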

Boring First, Events When They Earn It

Event-driven architecture is a tool for managing organizational and load-profile complexity, not a maturity badge. The systems I am proudest of started as monoliths with one PostgreSQL database, added a transactional outbox when async work appeared, and only introduced a real broker when a concrete fan-out or load-leveling need showed up with numbers attached.

The takeaway: every async boundary you add must pay rent in a real decoupling or scaling problem it solves. If it cannot, keep the call synchronous, keep the transaction, and enjoy being able to read your own system's behavior from a single stack trace. Your 2 AM self will thank you.
