Event-Driven Architecture: When NOT to Use It

Photo by Josh Sorenson

Event-driven architecture is the pattern everyone wants on their CV and almost nobody interrogates. The pitch is seductive: decoupled services, infinite scalability, systems that react to the world instead of polling it. I have shipped event-driven flows in production and I will keep doing so. But I have also watched teams, including a past version of me, take a perfectly understandable CRUD system and turn it into a distributed murder mystery where nobody can answer the simplest operational question: what happened, in what order, and why.
This post is the decision framework I wish someone had handed me before I wired my first message broker into an ERP. The short answer up front: if your system fits in one database, your team fits in one room, and your workflows need answers now rather than eventually, you almost certainly do not need events yet. Here is how to know for sure.
Let us be fair to the pattern first. Martin Fowler's classic breakdown distinguishes event notification, event-carried state transfer, event sourcing, and CQRS, and each solves a specific coupling problem. Event notification lets a producer announce that something happened without knowing or caring who listens. Event-carried state transfer lets consumers keep their own copy of data so they can answer queries without calling the source system. These are real, valuable properties when you have many teams and many services stepping on each other.
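To make the difference between the first two styles concrete, here is a minimal sketch. Every name in it (`PoApprovedNotification`, `PoApprovedWithState`, `fetchLines`, the in-memory `purchaseOrders` map) is hypothetical and stands in for a real source system; it is an illustration of the two payload shapes, not a reference implementation.

```typescript
// Hypothetical event shapes; field names are illustrative only.
interface PoLine {
  sku: string;
  qty: number;
}

// Event notification: "something happened", plus just enough to look it up.
interface PoApprovedNotification {
  type: "POApproved";
  purchaseOrderId: string;
}

// Event-carried state transfer: the event carries the data consumers need.
interface PoApprovedWithState {
  type: "POApproved";
  purchaseOrderId: string;
  lines: PoLine[];
}

// In-memory stand-in for the source system (an assumption for this sketch).
const purchaseOrders = new Map<string, PoLine[]>([
  ["po-1", [{ sku: "A", qty: 2 }, { sku: "B", qty: 5 }]],
]);

// Counts how often a consumer had to call back to the source system.
let sourceCalls = 0;

function fetchLines(purchaseOrderId: string): PoLine[] {
  sourceCalls += 1;
  return purchaseOrders.get(purchaseOrderId) ?? [];
}

function totalQty(lines: PoLine[]): number {
  return lines.reduce((sum, line) => sum + line.qty, 0);
}

// Notification consumer: has to query the source to learn the details.
function onNotification(evt: PoApprovedNotification): number {
  return totalQty(fetchLines(evt.purchaseOrderId));
}

// State-transfer consumer: answers from the payload alone.
function onStateTransfer(evt: PoApprovedWithState): number {
  return totalQty(evt.lines);
}
```

The trade is visible in `sourceCalls`: the notification consumer stays coupled to the source system's uptime, while the state-transfer consumer trades that coupling for a fatter event contract and its own, possibly stale, copy of the data.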
AWS frames the same idea operationally: producers, routers, and consumers that scale, fail, and deploy independently. If your image-processing pipeline gets 100x traffic spikes while your billing service does not, putting a queue between them is obviously right. The question is never whether events work. The question is whether the problems they solve are problems you actually have today.
Every architectural pattern is a trade. Here is what you pay for events, and you pay it on day one, not at scale.
Debugging becomes archaeology
A synchronous call gives you a stack trace. An event gives you a correlation ID, if you remembered to propagate one, scattered across three services and a broker. Your median time-to-diagnosis for production incidents goes up the day you go async, and it never fully comes back down.
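If you do go async, propagating that correlation ID is not optional. A rough sketch of what it involves, assuming a hypothetical `TracedEnvelope` wrapper and helper names (`startTrace`, `follow`, `logHop`) that I made up for this example:

```typescript
// Hypothetical envelope; the field names are assumptions for this sketch.
interface TracedEnvelope<T> {
  correlationId: string;      // set once at the edge, copied on every hop
  causationId: string | null; // id of the message that caused this one
  messageId: string;
  payload: T;
}

// In-memory stand-in for a log aggregator.
const traceLog: string[] = [];

let nextMessageNumber = 1;
function newMessageId(): string {
  return `msg-${nextMessageNumber++}`;
}

// Called where the request first enters the system.
function startTrace<T>(correlationId: string, payload: T): TracedEnvelope<T> {
  return { correlationId, causationId: null, messageId: newMessageId(), payload };
}

// Called by every consumer that emits a follow-up message.
function follow<T, U>(parent: TracedEnvelope<T>, payload: U): TracedEnvelope<U> {
  return {
    correlationId: parent.correlationId, // never regenerate this
    causationId: parent.messageId,
    messageId: newMessageId(),
    payload,
  };
}

function logHop(service: string, env: TracedEnvelope<unknown>): void {
  traceLog.push(`${env.correlationId} ${service} ${env.messageId}`);
}

// Usage: two hops, one correlation ID to search for.
const approvedEnvelope = startTrace("req-42", { purchaseOrderId: "po-1" });
logHop("approval", approvedEnvelope);
const reservedEnvelope = follow(approvedEnvelope, { sku: "A", qty: 2 });
logHop("inventory", reservedEnvelope);
```

Filtering `traceLog` by `req-42` is the closest thing to a stack trace you get, and it only works if every single handler remembers to call something like `follow`.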
Eventual consistency leaks into UX
Users do not think in eventual consistency. They click approve, refresh the page, and expect the world to have changed. Every async boundary you add is a place where the UI can show stale data, and you will spend real engineering time papering over it with optimistic updates and polling.
Failure handling multiplies
Synchronous failure is one code path: the call failed, show an error. Async failure is a zoo: redelivery, out-of-order delivery, duplicate delivery, poison messages, dead-letter queues, consumers that silently stop. Each needs design, code, and monitoring.
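To give a feel for the minimum a consumer needs, here is a sketch that handles only three of those animals: duplicates, bounded retries, and a dead-letter list for poison messages. `SafeConsumer` and its types are hypothetical names, and the in-memory `Set` is a simplifying assumption; a real consumer would persist processed IDs alongside the side effect and add backoff between attempts.

```typescript
interface BrokerMessage {
  id: string;
  body: string;
}

type BrokerMessageHandler = (msg: BrokerMessage) => void;
type DeliveryResult = "done" | "duplicate" | "dead-lettered";

class SafeConsumer {
  // Assumption: in-memory dedup; production code needs durable storage.
  private readonly processedIds = new Set<string>();
  readonly deadLetters: BrokerMessage[] = [];
  private readonly handler: BrokerMessageHandler;
  private readonly maxAttempts: number;

  constructor(handler: BrokerMessageHandler, maxAttempts: number) {
    this.handler = handler;
    this.maxAttempts = maxAttempts;
  }

  deliver(msg: BrokerMessage): DeliveryResult {
    // Duplicate delivery: skip work that already succeeded.
    if (this.processedIds.has(msg.id)) return "duplicate";

    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        this.handler(msg);
        this.processedIds.add(msg.id);
        return "done";
      } catch {
        // A real consumer would wait with backoff before the next attempt.
      }
    }

    // Poison message: park it instead of retrying forever.
    this.deadLetters.push(msg);
    return "dead-lettered";
  }
}

// Usage: one good message delivered twice, then one malformed message.
let reservations = 0;
const stockConsumer = new SafeConsumer((msg) => {
  if (msg.body === "malformed") throw new Error("bad payload");
  reservations += 1;
}, 3);

const firstDelivery = stockConsumer.deliver({ id: "m1", body: "ok" });
const redelivery = stockConsumer.deliver({ id: "m1", body: "ok" });
const poisonDelivery = stockConsumer.deliver({ id: "m2", body: "malformed" });
```

Even this toy leaves out ordering, monitoring of the dead-letter list, and detecting a consumer that has silently stopped, which is the point: each item in the zoo is its own piece of design work.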
Operational surface grows
A broker is another stateful service to deploy, patch, monitor, and back up. For a small team on a VPS budget, Kafka or even a managed queue is a standing tax on attention that a Postgres table simply is not.
From my own scar tissue building NestJS and PostgreSQL systems for Indonesian SMBs, these are the red flags that an event-driven design is about to subtract value:
The most expensive version of this mistake is event sourcing as the system of record. If you adopt it without a hard requirement like a full audit-by-replay mandate, you are signing up for schema evolution of immutable history. Reversing that decision later means a data migration project, not a refactor.
This is the table I draw on the whiteboard when a team is deciding. Neither column is the good column; they are different bills to pay.
| Dimension | Synchronous (REST/transaction) | Event-driven (broker) |
|---|---|---|
| Consistency | Immediate, transactional where it counts | Eventual; UX and reporting must tolerate lag |
| Debugging | One stack trace, one log stream | Correlation IDs across services, broker dashboards, replay tooling |
| Failure modes | Call fails, caller decides; timeouts cascade under load | Retries, duplicates, ordering, poison messages, silent consumer death |
| Team coupling | Caller must know callee API and uptime | Teams ship independently; contract moves into event schemas |
| Ops cost for a small team | Whatever your app already costs | Broker hosting, monitoring, upgrades, plus on-call knowledge |
In one ERP build, purchase order approval originally published a POApproved event that an inventory consumer picked up to reserve stock. On paper, beautiful decoupling. In practice, both services were maintained by the same three engineers and deployed from the same monorepo. When a malformed payload crashed the consumer, retries delayed stock reservation by 40 seconds while procurement staff stared at free stock that should have been locked.
We replaced the event with a single database transaction that updated the purchase order and the stock reservation together. The diagram below is the before and after. Incident class deleted, latency gone, and the code shrank by a third. We kept events where they genuinely earned their place: notifying the reporting warehouse and sending emails, where a 40-second delay costs nothing.
-- What "eventually consistent" looked like in practice:
-- the PO was approved, but stock reservation lagged 40s behind
-- because a consumer was retrying a poison message.
Approval service          Inventory service              User sees
────────────────          ─────────────────              ─────────
10:00:00 PO approved
10:00:00 emit POApproved
                          10:00:01 crash (bad payload)
                          10:00:05 retry 1 → crash
                          10:00:15 retry 2 → crash
                          10:00:40 retry 3 → ok          "Why is stock
                          10:00:40 reserve stock          still free?!"
-- The synchronous version is one transaction:
BEGIN;
UPDATE purchase_orders SET status = 'approved' WHERE id = $1;
UPDATE stock_items SET reserved = reserved + qty WHERE ...;
COMMIT; -- consistent at 10:00:00, every time
Before any queue or topic enters a design of mine, it has to clear these five questions, in order:
Start with a transactional outbox on top of PostgreSQL before reaching for a broker. A jobs table plus a polling worker gives you async processing, retries, and exactly one database to operate. You can graduate to BullMQ or Kafka later by swapping the transport, because the outbox pattern already forced you to design idempotent handlers.
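The shape of the outbox idea, sketched with an in-memory stand-in for the database so it runs without PostgreSQL. `TinyDb`, `OutboxRow`, `approvePurchaseOrder`, and `pollOutbox` are all hypothetical names, and the snapshot-and-restore `transaction` is only an approximation of `BEGIN ... COMMIT`. In a real system both writes would be SQL statements inside one PostgreSQL transaction.

```typescript
interface OutboxRow {
  id: number;
  topic: string;
  payload: string;
  attempts: number;
  done: boolean;
}

// In-memory stand-in for two tables: purchase_orders and an outbox/jobs table.
class TinyDb {
  poStatus = new Map<string, string>();
  outbox: OutboxRow[] = [];

  // Approximates a transaction: either all writes in `work` stick, or none do.
  transaction(work: (tx: TinyDb) => void): void {
    const statusSnapshot = new Map(this.poStatus);
    const outboxSnapshot = this.outbox.map((row) => ({ ...row }));
    try {
      work(this);
    } catch (err) {
      this.poStatus = statusSnapshot;
      this.outbox = outboxSnapshot;
      throw err;
    }
  }
}

// The business write and the "event" are committed together.
function approvePurchaseOrder(db: TinyDb, poId: string): void {
  db.transaction((tx) => {
    tx.poStatus.set(poId, "approved");
    tx.outbox.push({
      id: tx.outbox.length + 1,
      topic: "POApproved",
      payload: poId,
      attempts: 0,
      done: false,
    });
  });
}

// One pass of the polling worker. Returns how many rows were delivered.
function pollOutbox(
  db: TinyDb,
  send: (row: OutboxRow) => void,
  maxAttempts = 5,
): number {
  let delivered = 0;
  for (const row of db.outbox) {
    if (row.done || row.attempts >= maxAttempts) continue;
    row.attempts += 1;
    try {
      send(row); // may run more than once per row, so it must be idempotent
      row.done = true;
      delivered += 1;
    } catch {
      // Leave the row pending; the next poll retries it.
    }
  }
  return delivered;
}

// Usage: the transport fails once, then recovers.
const outboxDb = new TinyDb();
approvePurchaseOrder(outboxDb, "po-1");

let transportDown = true;
const sentPayloads: string[] = [];
const flakySend = (row: OutboxRow): void => {
  if (transportDown) {
    transportDown = false;
    throw new Error("transport unavailable");
  }
  sentPayloads.push(row.payload);
};

const firstPoll = pollOutbox(outboxDb, flakySend);
const secondPoll = pollOutbox(outboxDb, flakySend);
const thirdPoll = pollOutbox(outboxDb, flakySend);
```

The `send` callback is the only part that knows about the transport, which is what makes the later move to a queue or broker a swap rather than a redesign.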
So that this is not a hit piece: I run event-driven flows happily in production today, and they earn their keep in three places.
Event-driven architecture is a tool for managing organizational and load-profile complexity, not a maturity badge. The systems I am proudest of started as monoliths with one PostgreSQL database, added a transactional outbox when async work appeared, and only introduced a real broker when a concrete fan-out or load-leveling need showed up with numbers attached.
The takeaway: every async boundary you add must pay rent in a real decoupling or scaling problem it solves. If it cannot, keep the call synchronous, keep the transaction, and enjoy being able to read your own system's behavior from a single stack trace. Your 2 AM self will thank you.
Sources and further reading