Event-driven architecture looks simple on whiteboards, but in production, it can fail quietly and expensively.
At one fintech company, a misconfigured event topic caused thousands of duplicate transactions to be recorded. The issue went unnoticed until customers began reporting incorrect balances, triggering a week-long investigation. The root cause wasn’t the technology itself—it was an assumption about how events behave in the real world.
Most teams don’t lose systems because they chose events.
They lose systems because they assume events behave like function calls.
In theory, an event is published, consumed, and processed.
In reality, events are duplicated, delayed, reordered, partially processed, or replayed long after the context that created them has changed.
If your system isn’t designed for those realities, it will eventually corrupt state, confuse users, or wake someone up at 2 a.m.
This post bridges the gap between theory and practice by outlining the most common failure modes I’ve observed in event-driven systems—and what engineers often underestimate when migrating from synchronous to asynchronous architectures.
1. Events Are Not Delivered Exactly Once
One of the most dangerous assumptions is that an event will be processed only once.
In production:
- Brokers retry
- Consumers crash mid-processing
- Networks fail
- Deployments restart services
The same event can be processed multiple times—or not at all.
Systems that rely on “exactly once” delivery often depend on fragile workarounds or hidden state that collapses under retries. Production-safe systems assume at-least-once delivery and design for it explicitly.
A useful self-check:
- Can your consumer safely process the same event twice?
- Do you detect duplicates?
- Do retries change system state?
If the answer isn’t clear, that’s already a risk.
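A common way to make those answers safe is an idempotency key: record each event’s unique ID alongside its effects, and skip IDs you have already seen. Here is a minimal sketch; the names (`DomainEvent`, `DedupStore`, `applyEffects`) are illustrative, not from any specific broker or library.

```ts
// Sketch: an at-least-once-safe consumer that deduplicates by event ID.
// `DomainEvent`, `DedupStore`, and `applyEffects` are illustrative names.
interface DomainEvent {
  eventId: string; // unique per logical event, assigned by the producer
  payload: unknown;
}

interface DedupStore {
  // Records the ID and returns true; returns false if it was already recorded.
  recordIfNew(eventId: string): Promise<boolean>;
}

async function handleEvent(event: DomainEvent, dedup: DedupStore): Promise<void> {
  // Ideally this check and the state change run in one transaction,
  // so a crash between them cannot leave the two halves disagreeing.
  if (!(await dedup.recordIfNew(event.eventId))) {
    return; // duplicate delivery: already processed, safe to skip
  }
  await applyEffects(event.payload); // your actual business logic
}

declare function applyEffects(payload: unknown): Promise<void>;
```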
2. Ordering Is Not Guaranteed (Even When You Think It Is)
Another common failure mode is relying on event order.
Events that logically happen in sequence can arrive:
- Out of order
- On different consumers
- With partial or missing context
If your logic assumes “this event always comes after that one,” you’re encoding a hidden dependency that will eventually break—during backfills, replays, or incident recovery.
Robust consumers treat each event as an independent input, for example:

```ts
// Check the event against current state instead of assuming arrival order.
if (validateState(event)) {
  processEvent(event);
} else {
  handleInvalidState(event); // e.g. park it, retry later, or dead-letter it
}
```
They validate state before acting and handle late or missing data gracefully.
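What that validation looks like is domain-specific, but one common shape is a per-aggregate version check: accept only the next expected version and treat anything else as a duplicate, a straggler, or a gap. A sketch, where the `version` field and `VersionStore` are assumptions for illustration:

```ts
// Sketch: reject out-of-order events via a monotonically increasing version.
// The `version` field and `VersionStore` are assumptions, not a specific API.
interface OrderEvent {
  orderId: string;
  version: number; // incremented by the producer for each change to the order
}

interface VersionStore {
  lastAppliedVersion(orderId: string): Promise<number>;
}

declare const store: VersionStore; // e.g. backed by the consumer's own database

async function validateState(event: OrderEvent): Promise<boolean> {
  const last = await store.lastAppliedVersion(event.orderId);
  // Only the next version is valid: an older one is a duplicate or straggler,
  // a gap means we missed something and should park the event instead.
  return event.version === last + 1;
}
```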
3. Partial Failure Is the Default State
In synchronous systems, failures are obvious.
In event-driven systems, failures are often invisible until damage is done.
Examples:
- A database write succeeds, but a downstream call fails
- A retry reprocesses an event with changed business rules
- One consumer succeeds while another silently fails
These partial failures create inconsistent state across services and are among the hardest bugs to diagnose—often surfacing days or weeks later.
Production systems need:
- Explicit retry boundaries
- Clear failure classification
- Compensating actions where needed
Failure isn’t the exception in async systems—it’s the baseline.
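To make that concrete, here is one possible shape for an explicit retry boundary; `RetryableError`, `applyEffects`, `compensate`, and `deadLetter` are hypothetical names standing in for whatever your stack provides:

```ts
// Sketch: classify failures at the consumer boundary instead of retrying blindly.
// `RetryableError`, `applyEffects`, `compensate`, and `deadLetter` are illustrative.
class RetryableError extends Error {}

type Msg = { eventId: string; payload: unknown };

async function consume(event: Msg): Promise<void> {
  try {
    await applyEffects(event.payload);
  } catch (err) {
    if (err instanceof RetryableError) {
      throw err; // transient: let the broker redeliver (the handler is idempotent)
    }
    await compensate(event);      // undo any partial effects we know about
    await deadLetter(event, err); // park it for humans instead of retrying forever
  }
}

declare function applyEffects(payload: unknown): Promise<void>;
declare function compensate(event: Msg): Promise<void>;
declare function deadLetter(event: Msg, cause: unknown): Promise<void>;
```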
4. Observability Is Usually Added Too Late
Logging strategies that work for REST APIs often fail completely in event-driven systems.
By the time something looks wrong:
- The triggering event is hours old
- The consumer has retried multiple times
- The original request context is gone
In production, effective async observability requires:
- Correlation IDs flowing across services
- Structured logs per processing stage
- Metrics that reflect retries, delays, and dead-lettering
Observability isn’t optional plumbing—it’s a core feature.
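For example, a consumer might emit one structured log line per processing stage, tied together by a correlation ID assumed to travel in the event’s metadata (`correlationId` and `attempt` below are illustrative fields):

```ts
// Sketch: one structured log line per stage, linked by a correlation ID.
// `correlationId` and `attempt` are assumed to be carried in event metadata.
type EventMeta = { correlationId: string; attempt: number };

function logStage(stage: string, meta: EventMeta): void {
  console.log(JSON.stringify({
    stage,                             // e.g. "received", "validated", "applied"
    correlationId: meta.correlationId, // links this line to the whole business flow
    attempt: meta.attempt,             // makes retries visible instead of silent
    ts: new Date().toISOString(),
  }));
}

// Usage: logStage("received", meta); ... logStage("applied", meta);
```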
5. Event-Driven Systems Fail Quietly
The most dangerous failures don’t throw errors.
They:
- Process events successfully—but incorrectly
- Update state in subtle ways
- Surface weeks later as data or business inconsistencies
This is why event consumers must be:
- Idempotent
- Defensive
- Explicit about assumptions
If correctness depends on timing, ordering, or “this should never happen”, it eventually will.
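One small defensive habit helps here (a sketch; `assertInvariant` is a hypothetical helper, not a library function): turn every “this should never happen” into a check that fails loudly, so the violation surfaces immediately instead of weeks later as bad data.

```ts
// Sketch: make hidden assumptions explicit and loud.
// `assertInvariant` is a hypothetical helper, not a library function.
function assertInvariant(condition: boolean, message: string): asserts condition {
  if (!condition) {
    // Failing now is cheaper than silently writing corrupt state.
    throw new Error(`Invariant violated: ${message}`);
  }
}

// Usage inside a consumer:
// assertInvariant(order.status !== "settled", "settled orders must not be re-priced");
```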
Closing Thought
Event-driven architecture isn’t hard because the tools are complex.
It’s hard because it forces engineers to confront uncertainty, partial truth, time as a variable, and failure as a normal state.
A simple first step:
Audit one consumer this week and ask, “What happens if this event is replayed?”
In the next post, I’ll dive deeper into designing idempotent consumers—essential for handling retries, duplication, and replays in production systems.
If you’ve faced these issues in real systems, I’d love to hear your experience.
