Imagine it’s 3:00 AM. Your pager goes off. The dashboard shows a 100% CPU spike on your primary database, followed by a total service outage. You look at the logs and see a weird pattern: the traffic didn’t actually increase, but suddenly every single request started failing at the exact same millisecond.
You’ve just been trampled by the Thundering Herd.
In this post, we’re going to dive into one of the most common yet misunderstood performance bottlenecks in distributed systems: the Thundering Herd. We’ll look at what it is, and how to combine Request Collapsing and Jitter to build systems that don’t collapse under their own weight.
What is the Thundering Herd?
At its core, the Thundering Herd occurs when a large number of processes are waiting on the same event: when it fires, they all wake up at once, even though only one of them can actually “handle” it.
While the term originated in OS kernel scheduling, modern web engineers most frequently encounter it in the form of a Cache Stampede.
The Anatomy of a Crash:
- The Golden State: You have a high-traffic endpoint cached in Redis. Everything is fast.
- The Expiry: The cache TTL (Time-to-Live) hits zero.
- The Stampede: 5,000 concurrent users refresh the page. They all see a cache miss.
- The Collapse: All 5,000 requests hit your database simultaneously to re-generate the same data. Your database experiences a surge in load, latency skyrockets, and the service goes down.
Note: 5,000 is an arbitrary number; the actual threshold depends on your system’s capacity.
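To make the failure mode concrete, here is a minimal (and deliberately naive) cache-aside read, sketched with a hypothetical async cache client and a fetch_from_db helper. Every request that sees a miss goes straight to the database, which is exactly what the stampede exploits.

```python
# Naive cache-aside read: every concurrent miss triggers its own DB query.
# `cache` and `fetch_from_db` are placeholders for your cache client and query logic.
async def get_product(key):
    data = await cache.get(key)
    if data is None:                      # thousands of requests can hit this branch at once...
        data = await fetch_from_db(key)   # ...and every one of them queries the database
        await cache.set(key, data, ttl=3600)
    return data
```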
Beyond the Cache: Other Thundering Herd Scenarios
While cache stampedes are the most common, the Thundering Herd can manifest across your entire stack:
1. The “Welcome Back” Surge (Downstream Recovery)
Imagine your primary Auth service goes down for 5 minutes. During this time, every other service in your cluster is failing and retrying. When the Auth service finally comes back up, it is immediately hit by a huge number of requests per second from all the other services trying to “catch up.” This often knocks the service right back down again, a phenomenon known as a Retry Storm.
2. The Auth Token Expiry
In microservices, many internal services might share a common access token (like a machine-to-machine JWT). If that token has a hard expiry and 50 different microservices all see it expire at the exact same second, they will all “thunder” toward the Identity Provider to get a new one.
3. “Top of the Hour” Scheduled Tasks
A classic ops mistake is scheduling a heavy cleanup cron job to run at 0 0 * * * (midnight) across 100 different server nodes. At precisely 12:00:00 AM, your database or shared storage is hit by 100 heavy processes simultaneously.
4. CDN “Warm-up” and Deployment Surge
When you deploy a new version of a 500MB mobile app binary, it isn’t in any CDN edge caches yet. If you immediately notify 1 million users to download it, the first few thousand requests will all miss the edge and hit your origin server at once, potentially melting your storage layer.
How to Detect the Herd (Monitoring & Metrics)
You don’t want your first notification of a thundering herd to be a total outage. Look for these “herd signatures” in your dashboard:
- Correlation of Cache Misses and Latency: A sudden spike in cache miss rates that perfectly aligns with a surge in p99 database latency.
- Connection Pool Exhaustion: If you see your database connection pool hitting its max limit within milliseconds, you likely have a stampede.
- CPU Context Switching: On your application servers, a massive spike in “System CPU” or context switches indicates that thousands of threads are waking up and fighting for the same locks.
- Error Logs: Thousands of “lock wait timeout” or “connection refused” errors occurring in a tight cluster.
Strategy 1: Request Collapsing (The “Wait in Line” Approach)
Request collapsing (also known as Promise Memoization) is the practice of ensuring that for any given resource, only one upstream request is active at a time.
If Request A is already fetching user_data_123 from the database, Requests B, C, and D shouldn’t start their own fetches. Instead, they should “subscribe” to the result of Request A.
The Problem with Naive Collapsing
If you implement a simple lock, you often run into a secondary issue: Busy-Waiting. If 4,999 requests are waiting for that one database call to finish, how do they know when it’s done? If they all check “Is it ready yet?” every 10ms, you’ve just created a new herd in your application memory.
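For contrast, here is roughly what that busy-waiting anti-pattern looks like, sketched with an in-process cache dict and inflight set (both hypothetical): thousands of waiters wake up every 10ms just to check a flag.

```python
import asyncio

cache = {}        # shared in-process cache (illustrative)
inflight = set()  # keys currently being fetched by a "leader"

# Anti-pattern: followers poll on a timer, creating a second herd in memory.
async def wait_for_key_polling(key):
    while key in inflight:
        await asyncio.sleep(0.01)  # thousands of waiters wake up 100x per second just to check
    return cache.get(key)
```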
The Solution: Event-Based Notification
To fix this, we need to move from a polling model to a notification model. Instead of repeatedly asking “Is it done?”, the waiting requests should simply go to sleep and be woken up when the data is ready.
In Python or Node.js, this is often handled natively by Promises or Futures. In other languages, you might use Condition Variables or Channels.
Here is a Python example using asyncio. Notice how we use a shared Event object. The “followers” simply await the event, consuming zero CPU while they wait for the “leader” to finish the work.
```python
import asyncio

class RequestCollapser:
    def __init__(self):
        # Stores the events for keys currently being fetched
        self.inflight_events = {}
        self.cache = {}

    async def get_data(self, key):
        # 1. Check if data is already in cache
        if key in self.cache:
            return self.cache[key]

        # 2. Check if someone else is already fetching it
        if key in self.inflight_events:
            print(f"Request for {key} joining the herd (waiting)...")
            event = self.inflight_events[key]
            await event.wait()  # <--- Crucial: zero CPU usage while waiting
            return self.cache.get(key)

        # 3. Be the "Leader"
        print(f"Request for {key} is the LEADER. Fetching from DB...")
        event = asyncio.Event()
        self.inflight_events[key] = event
        try:
            # Simulate DB fetch
            await asyncio.sleep(1)
            data = "Fresh Data"
            self.cache[key] = data
            return data
        finally:
            # 4. Notify the herd
            event.set()  # Wakes up all waiters instantly
            del self.inflight_events[key]
```
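To see the collapsing in action, here is a small driver (not part of the original snippet, purely illustrative): fire a batch of concurrent requests at the same key and watch a single leader do the DB work while everyone else waits on the event.

```python
async def main():
    collapser = RequestCollapser()
    # Scale the range up to simulate a real herd; only one "LEADER" line will print.
    results = await asyncio.gather(*(collapser.get_data("user_data_123") for _ in range(5)))
    print(results)

asyncio.run(main())
```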
The Giant Herd: Distributed Collapsing
The Python example above works perfectly for a single server. But what if you have 100 app servers? You still have 100 “leaders” hitting your database at once. That may or may not be a problem, depending on your database’s capacity. If you want to protect against this edge case, the leader election has to be coordinated across the whole cluster, not just within one process.
To solve this at scale, you can use:
- Distributed Locks (Redis/Etcd): Use a library like Redlock to ensure only one node in the entire cluster becomes the leader for a specific key (a minimal sketch follows below).
- The “Singleflight” Pattern: In Go, the golang.org/x/sync/singleflight package is the gold standard for this. It handles the local collapsing logic efficiently, and when combined with a distributed lock, it protects both your app memory and your database.
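Here is a minimal sketch of the distributed variant, assuming redis-py’s asyncio client and the same hypothetical fetch_from_db helper. Only the node that wins the SET NX “lock” queries the database; everyone else backs off briefly and re-reads the cache. A production version would bound the retries and use a proper Redlock implementation for multi-node Redis.

```python
import asyncio
import json
import redis.asyncio as redis

r = redis.Redis()

async def get_data_distributed(key, fetch_from_db):
    cached = await r.get(key)
    if cached is not None:
        return json.loads(cached)

    # Try to become the cluster-wide leader for this key (lock auto-expires after 30s).
    if await r.set(f"lock:{key}", "1", nx=True, ex=30):
        try:
            data = await fetch_from_db(key)
            await r.set(key, json.dumps(data), ex=3600)
            return data
        finally:
            await r.delete(f"lock:{key}")

    # Follower path: someone else is fetching; back off briefly, then check the cache again.
    await asyncio.sleep(0.05)
    return await get_data_distributed(key, fetch_from_db)
```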
Strategy 2: Jitter (The “Social Distancing” for Data)
Collapsing controls how many requests do the work, but it doesn’t stop them from lining up at the same instant. This is where Jitter comes in: the introduction of intentional, controlled randomness to stagger execution.
Staggered Retries
When a request finds that a resource is being “collapsed” (someone else is already fetching it), don’t let it retry on a fixed interval.
- Bad: Retry every 50ms.
- Good: Retry every 50ms + random(0, 20ms).
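A minimal sketch of the “Good” variant above (the helper name and limits are illustrative, not a specific library API):

```python
import asyncio
import random

async def retry_with_jitter(operation, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 50ms base interval plus up to 20ms of jitter, so retries never re-synchronize.
            await asyncio.sleep(0.050 + random.uniform(0, 0.020))
```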
Staggered Expirations
Never set a hard TTL on a batch of keys. If you update 10,000 products and set them all to expire in exactly 1 hour, you are scheduling a disaster for exactly 60 minutes from now. Instead, use: TTL = 3600 + (rand() * 120). This spreads the “thundering” over a 2-minute window, which your database can likely handle.
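In practice this is a one-line change wherever you write the keys; here it is sketched with a redis-py client (any cache client with a per-key TTL works the same way):

```python
import random

def set_with_jittered_ttl(redis_client, key, value, base_ttl=3600, jitter=120):
    # Spread expirations across a 2-minute window instead of one hard edge.
    redis_client.set(key, value, ex=base_ttl + random.randint(0, jitter))
```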
The Pro Move: Probabilistic Early Refresh
The most resilient systems I’ve built use a technique called X-Fetch. Instead of waiting for the cache to expire, we use jitter to trigger a refresh slightly before expiration.
As the TTL approaches zero, each request performs a “dice roll.” If the roll is low, that specific request takes the lead, re-fetches the data, and resets the cache. Because the roll is random for every request, in practice only one request (or a small handful) triggers the refresh at any given moment, while everyone else keeps getting the “stale but safe” data.
```python
import random
import time

# Assumes an async `cache` client whose entries expose `.data` and `.expiry`,
# plus `collapse_request` and `fetch_from_db` helpers like the ones shown earlier.
async def get_resilient_data(key):
    cached = await cache.get(key)
    should_refresh = False

    # 1. Handle cache miss
    if cached is None:
        should_refresh = True
    else:
        # 2. Calculate time remaining until expiry
        time_remaining = cached.expiry - time.time()

        # 3. Handle negative time (expired) or do the probabilistic check
        if time_remaining <= 0:
            should_refresh = True
        else:
            # Probability increases as time_remaining approaches 0.
            # Note: the <= 0 check above avoids DivisionByZero / negative probability.
            should_refresh = random.random() < (1.0 / time_remaining)

    if should_refresh:
        try:
            # Collapse requests using a distributed lock or local future map
            return await collapse_request(key, fetch_from_db)
        except Exception:
            if cached:
                return cached.data  # Fallback to stale data on DB failure
            raise

    return cached.data
```
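If you want the textbook version: the X-Fetch check from the cache-stampede literature also weights the dice roll by how long the recompute takes (delta) and an aggressiveness factor (beta, usually 1.0). A minimal sketch of just that check (the helper name is mine):

```python
import math
import random
import time

def should_refresh_xfetch(expiry: float, delta: float, beta: float = 1.0) -> bool:
    # Refresh early with a probability that rises sharply as expiry approaches.
    # `delta` is the observed recompute time; a higher `beta` refreshes more eagerly.
    r = max(random.random(), 1e-12)  # avoid log(0)
    return time.time() - delta * beta * math.log(r) >= expiry
```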
Final Defense: Safety Nets
Sometimes, despite your best efforts with Jitter or Collapsing, a herd still breaks through. In those moments, you need a final line of defense to keep your system alive:
- Load Shedding: When your database connection pool is full, don’t keep queuing requests (which just increases latency). Start dropping them with a 503 Service Unavailable. It’s better to fail 10% of users quickly than to make 100% of users wait 30 seconds for a timeout.
- Circuit Breakers: If your database is struggling, the circuit breaker “trips” and stops all traffic for a cool-down period. This gives your DB the breathing room it needs to recover without being continuously bombarded by retries.
- Rate Limiting: By capping the number of requests per second (globally or per-user), you ensure that even a massive “herd” can’t exceed your system’s hard limits. Excess requests are throttled with a 429 Too Many Requests, protecting your infrastructure from being overwhelmed.
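As a rough illustration of the load-shedding idea (the names, limits, and status codes are illustrative, and get_resilient_data is the function from the earlier example): cap the number of in-flight database-bound requests and fail fast instead of queuing.

```python
import asyncio

MAX_CONCURRENT_DB_CALLS = 100
db_slots = asyncio.Semaphore(MAX_CONCURRENT_DB_CALLS)

async def handle_request(key):
    # Shed load: if every slot is busy, reject immediately instead of queuing.
    if db_slots.locked():
        return 503, "Service Unavailable"
    async with db_slots:
        return 200, await get_resilient_data(key)
```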
Choosing Your Weapon: Strategy Comparison
| Strategy | Implementation Complexity | Best Used For… | Main Drawback |
|---|---|---|---|
| Jitter | Low | Retries, TTL Expirations | Doesn’t stop the initial spike, just spreads it. |
| Request Collapsing | Medium | High-traffic single keys (e.g., Homepage) | Can become a complex “leader” bottleneck. |
| X-Fetch (Probabilistic) | High | Mission-critical low-latency data | Adds pre-emptive load to your database. |
Closing Thoughts
Scaling isn’t just about adding more servers; it’s about managing the coordination between them. By implementing Request Collapsing, you protect your downstream resources. By adding Jitter, you protect your coordination layer from itself.
The next time you set a cache TTL, ask yourself: “What happens if 10,000 people ask for this at the same time?” If the answer is “they all wait for the DB,” it’s time to add some jitter.
If you enjoyed this deep dive into systems engineering, feel free to follow for more insights on building resilient distributed systems.
Build More Resilient Systems with Aonnis
If you’re managing complex caching layers and want to avoid the pitfalls of manual scaling and configuration, check out the Aonnis Valkey Operator. It helps you deploy and manage high-performance Valkey-compatible clusters on Kubernetes with built-in best practices for reliability and scale.
Surprise: it is free for a limited time.
Visit www.aonnis.com to learn more.

