Event-driven architecture looks simple on whiteboards, but in production, it can fail quietly and expensively.
At one fintech company, a misconfigured event topic caused thousands of duplicate transactions to be recorded. The issue went unnoticed until customers began reporting incorrect balances, triggering a week-long investigation. The root cause wasn’t the technology itself—it was an assumption about how events behave in the real world.
Most teams don’t lose systems because they chose events.
They lose systems because they assume events behave like function calls.
In theory, an event is published, consumed, and processed.
In reality, events are duplicated, delayed, reordered, partially processed, or replayed long after the context that created them has changed.
If your system isn’t designed for those realities, it will eventually corrupt state, confuse users, or wake someone up at 2 a.m.
This post bridges the gap between theory and practice by outlining the most common failure modes I’ve observed in event-driven systems—and what engineers often underestimate when migrating from synchronous to asynchronous architectures.
1. Events Are Not Delivered Exactly Once
One of the most dangerous assumptions is that an event will be processed only once.
In production:
- Brokers retry
- Consumers crash mid-processing
- Networks fail
- Deployments restart services
The same event can be processed multiple times—or not at all.
Systems that rely on “exactly once” delivery often depend on fragile workarounds or hidden state that collapses under retries. Production-safe systems assume at-least-once delivery and design for it explicitly.
A useful self-check:
- Can your consumer safely process the same event twice?
- Do you detect duplicates?
- Do retries change system state?
If the answer isn’t clear, that’s already a risk.
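A common way to make those answers safe is an idempotency key: record each event’s unique ID alongside its effects, and skip IDs you have already seen. Here is a minimal sketch; the names (`DomainEvent`, `DedupStore`, `applyEffects`) are illustrative, not from any specific broker or library.

```ts
// Sketch: an at-least-once-safe consumer that deduplicates by event ID.
// `DomainEvent`, `DedupStore`, and `applyEffects` are illustrative names.
interface DomainEvent {
  eventId: string; // unique per logical event, assigned by the producer
  payload: unknown;
}

interface DedupStore {
  // Records the ID and returns true; returns false if it was already recorded.
  recordIfNew(eventId: string): Promise<boolean>;
}

async function handleEvent(event: DomainEvent, dedup: DedupStore): Promise<void> {
  // Ideally this check and the state change run in one transaction,
  // so a crash between them cannot leave the two halves disagreeing.
  if (!(await dedup.recordIfNew(event.eventId))) {
    return; // duplicate delivery: already processed, safe to skip
  }
  await applyEffects(event.payload); // your actual business logic
}

declare function applyEffects(payload: unknown): Promise<void>;
```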
2. Ordering Is Not Guaranteed (Even When You Think It Is)
Another common failure mode is relying on event order.
Events that logically happen in sequence can arrive:
- Out of order
- On different consumers
- With partial or missing context
If your logic assumes “this event always comes after that one,” you’re encoding a hidden dependency that will eventually break—during backfills, replays, or incident recovery.
Robust consumers treat each event as an independent input, for example:

```ts
// Check the event against current state instead of assuming arrival order.
if (validateState(event)) {
  processEvent(event);
} else {
  handleInvalidState(event); // e.g. park it, retry later, or dead-letter it
}
```
They validate state before acting and handle late or missing data gracefully.
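What that validation looks like is domain-specific, but one common shape is a per-aggregate version check: accept only the next expected version and treat anything else as a duplicate, a straggler, or a gap. A sketch, where the `version` field and `VersionStore` are assumptions for illustration:

```ts
// Sketch: reject out-of-order events via a monotonically increasing version.
// The `version` field and `VersionStore` are assumptions, not a specific API.
interface OrderEvent {
  orderId: string;
  version: number; // incremented by the producer for each change to the order
}

interface VersionStore {
  lastAppliedVersion(orderId: string): Promise<number>;
}

declare const store: VersionStore; // e.g. backed by the consumer's own database

async function validateState(event: OrderEvent): Promise<boolean> {
  const last = await store.lastAppliedVersion(event.orderId);
  // Only the next version is valid: an older one is a duplicate or straggler,
  // a gap means we missed something and should park the event instead.
  return event.version === last + 1;
}
```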
3. Partial Failure Is the Default State
In synchronous systems, failures are obvious.
In event-driven systems, failures are often invisible until damage is done.
Examples:
- A database write succeeds, but a downstream call fails
- A retry reprocesses an event with changed business rules
- One consumer succeeds while another silently fails
These partial failures create inconsistent state across services and are among the hardest bugs to diagnose—often surfacing days or weeks later.
Production systems need:
- Explicit retry boundaries
- Clear failure classification
- Compensating actions where needed
Failure isn’t the exception in async systems—it’s the baseline.
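To make that concrete, here is one possible shape for an explicit retry boundary; `RetryableError`, `applyEffects`, `compensate`, and `deadLetter` are hypothetical names standing in for whatever your stack provides:

```ts
// Sketch: classify failures at the consumer boundary instead of retrying blindly.
// `RetryableError`, `applyEffects`, `compensate`, and `deadLetter` are illustrative.
class RetryableError extends Error {}

type Msg = { eventId: string; payload: unknown };

async function consume(event: Msg): Promise<void> {
  try {
    await applyEffects(event.payload);
  } catch (err) {
    if (err instanceof RetryableError) {
      throw err; // transient: let the broker redeliver (the handler is idempotent)
    }
    await compensate(event);      // undo any partial effects we know about
    await deadLetter(event, err); // park it for humans instead of retrying forever
  }
}

declare function applyEffects(payload: unknown): Promise<void>;
declare function compensate(event: Msg): Promise<void>;
declare function deadLetter(event: Msg, cause: unknown): Promise<void>;
```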
4. Observability Is Usually Added Too Late
Logging strategies that work for REST APIs often fail completely in event-driven systems.
By the time something looks wrong:
- The triggering event is hours old
- The consumer has retried multiple times
- The original request context is gone
In production, effective async observability requires:
- Correlation IDs flowing across services
- Structured logs per processing stage
- Metrics that reflect retries, delays, and dead-lettering
Observability isn’t optional plumbing—it’s a core feature.
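For example, a consumer might emit one structured log line per processing stage, tied together by a correlation ID assumed to travel in the event’s metadata (`correlationId` and `attempt` below are illustrative fields):

```ts
// Sketch: one structured log line per stage, linked by a correlation ID.
// `correlationId` and `attempt` are assumed to be carried in event metadata.
type EventMeta = { correlationId: string; attempt: number };

function logStage(stage: string, meta: EventMeta): void {
  console.log(JSON.stringify({
    stage,                             // e.g. "received", "validated", "applied"
    correlationId: meta.correlationId, // links this line to the whole business flow
    attempt: meta.attempt,             // makes retries visible instead of silent
    ts: new Date().toISOString(),
  }));
}

// Usage: logStage("received", meta); ... logStage("applied", meta);
```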
5. Event-Driven Systems Fail Quietly
The most dangerous failures don’t throw errors.
They:
- Process events successfully—but incorrectly
- Update state in subtle ways
- Surface weeks later as data or business inconsistencies
This is why event consumers must be:
- Idempotent
- Defensive
- Explicit about assumptions
If correctness depends on timing, ordering, or “this should never happen”, it eventually will.
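One small defensive habit helps here (a sketch; `assertInvariant` is a hypothetical helper, not a library function): turn every “this should never happen” into a check that fails loudly, so the violation surfaces immediately instead of weeks later as bad data.

```ts
// Sketch: make hidden assumptions explicit and loud.
// `assertInvariant` is a hypothetical helper, not a library function.
function assertInvariant(condition: boolean, message: string): asserts condition {
  if (!condition) {
    // Failing now is cheaper than silently writing corrupt state.
    throw new Error(`Invariant violated: ${message}`);
  }
}

// Usage inside a consumer:
// assertInvariant(order.status !== "settled", "settled orders must not be re-priced");
```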
Closing Thought
Event-driven architecture isn’t hard because the tools are complex.
It’s hard because it forces engineers to confront uncertainty, partial truth, time as a variable, and failure as a normal state.
A simple first step:
Audit one consumer this week and ask, “What happens if this event is replayed?”
In the next post, I’ll dive deeper into designing idempotent consumers—essential for handling retries, duplication, and replays in production systems.
If you’ve faced these issues in real systems, I’d love to hear your experience.
