When Your Cache Expires, Your Server Dies
You set a TTL. You feel good about it. You ship it.
Then at 3am, your cache expires and your database gets hit by 50,000 requests at the same time. Alerts go off. The on-call engineer wakes up. Slack is on fire.
This is the thundering herd problem. And basic TTL caching does not protect you from it. Let us talk about what actually does, one strategy at a time.
First, What Actually Goes Wrong?
When you cache something, you give it a time-to-live. Say 60 seconds. Seems fine.
But here is the subtle trap: in a high-traffic system, many users hit your service at the same time. Your cache fills up with entries that were all set at roughly the same moment. Which means they all expire at roughly the same moment too.
At second 60, every key is gone. Every request that arrives finds an empty cache. Every single one of them goes straight to your database. Your database, which was handling maybe 10 queries per second (because 99% of traffic was cached), is now handling 50,000 queries per second. It cannot keep up. Timeouts start. Services back up. What began as a cache expiry becomes a full outage.
This is not a theoretical edge case. It has taken down real systems. IPL streaming on launch day. Flash sales on e-commerce platforms. Product launches where no one thought past “set TTL = 300.”
Here is what basic caching looks like, and why it sets you up for this problem:
There is nothing obviously wrong with this code. But at scale, it is a time bomb. Let us fix it.
Fix 1: TTL Jitter
This is the simplest fix and honestly you should just always do it. The idea is to stop giving every key the exact same TTL. Add a small random offset so keys expire at slightly different times.
Instead of a spike at second 60, you get a gentle trickle of cache misses spread over 15 to 20 seconds. Your database handles it comfortably.
The only tradeoff: some entries stay stale a little longer than others. For most use cases, that is completely fine. If you need data fresh to the exact second, you need something stronger, but jitter alone handles the vast majority of real-world cache stampede scenarios.
Do this everywhere. Today. It costs almost nothing.
Fix 2: Probabilistic Early Expiration
Jitter spreads expiry across a population of keys. But what about a single very hot key? If your homepage hero banner is cached under one key, and that key gets 20,000 requests per second, even a jittered expiry is going to cause a painful spike the moment it expires.
Probabilistic early expiration solves this differently. Instead of waiting for the key to expire, some requests start recomputing it early. The closer the key is to expiring, the more likely any given request will trigger a background refresh.
```mermaid
timeline
    title Probability of Early Recomputation Over Time
    T=0 : Cache set — 0% chance of early recompute
    T=50 : Getting closer — ~5% chance
    T=75 : Halfway through stale window — ~20% chance
    T=90 : Almost expired — ~40% chance
    T=99 : Basically expired — ~90% chance
    T=100 : Key gone — must recompute
```
You do not need to understand the maths to use this pattern. The intuition is: the closer a key is to dying, the more aggressively the system tries to refresh it before it actually dies.
The user always gets a fast response. The recomputation happens in the background. And because it starts before the key actually expires, you almost never get a true cache miss on a hot key.
This is great for a small number of extremely popular keys. It does not replace jitter for the general case.
Fix 3: Mutex Locking
Jitter and probabilistic expiration are both about timing. Mutex locking is about concurrency.
Here is the scenario: one key expires. It is a popular key. In the gap between “key expired” and “key refreshed,” you have 5,000 requests all trying to recompute it at the same time. That is 5,000 database queries for the exact same data, all running in parallel.
Mutex locking says: only one request is allowed to do the work. Everyone else waits.
```mermaid
flowchart TD
    R1[Request 1] --> C{Cache hit?}
    R2[Request 2] --> C
    R3[Request 3] --> C
    C -->|Miss| L{Lock acquired?}
    L -->|Yes — Request 1 wins| DB[(Database)]
    L -->|No — Request 2 waits| W[Wait...]
    L -->|No — Request 3 waits| W
    DB --> S[Set cache]
    S --> RL[Release lock]
    RL --> W
    W --> C2{Cache populated now?}
    C2 -->|Yes| Resp[Serve from cache]
    C -->|Hit| Resp
```
Request 1 grabs the lock and does the work. Requests 2 through 5,000 wait patiently. When Request 1 finishes, it writes to cache and releases the lock. Everyone else reads from cache. Your database sees exactly one query instead of five thousand.
The tradeoff is that waiting requests take slightly longer. If your recomputation takes 500ms, every waiting request is delayed by roughly 500ms. For most systems, that is acceptable. A short delay beats your database going down.
One thing to pay close attention to in the code above: the finally block. If your fetch function throws, you still need to release the lock. Without that, the lock stays held until LOCK_TTL expires, and no request can refresh the cache in that window.
Fix 4: Stale-While-Revalidate
This is probably the most elegant strategy, and it is how CDNs have operated for years.
The idea is beautifully simple: serve the old value immediately, then recompute in the background.
```mermaid
sequenceDiagram
    participant User
    participant Cache
    participant Worker as Background Worker
    participant Database
    User->>Cache: Request data
    Cache-->>User: Serve stale value right away (fast!)
    Cache->>Worker: Go refresh this quietly
    Worker->>Database: Fetch fresh data
    Database-->>Worker: Here you go
    Worker->>Cache: Updated
    Note over User: User got their response in ~1ms.<br/>Next request will get the fresh value.
```
You define two TTLs. The “fresh” TTL is how long the data is considered current. After that, you enter a “stale” grace period: the data is old, but you still serve it while you refresh in the background. After the stale window ends, the key is truly gone and you fall back to a blocking fetch.
Netflix does this constantly. When you open the app, you are probably seeing data that is a few seconds old. But the page loads instantly. In the background, your recommendations are silently being refreshed for the next time you scroll.
The SWR React library is named after this exact pattern. Same concept, applied to client-side data fetching.
The tradeoff is data freshness. Users can see slightly stale data. For inventory counts or live pricing, this is probably a bad idea. For trending lists, recommendation carousels, or user profiles, it is a great trade. Fast experience, invisible updates, almost no stale data in practice.
Fix 5: Cache Warming
The previous four strategies all deal with expiry management. This one is about a completely different problem: what happens when your cache is empty to begin with?
You deploy a new version of your service. Cache is cold. Every request is a miss. Your database, which normally handles 2% of traffic (because 98% is cached), is suddenly handling 100% of it. It was not provisioned for that. Services degrade. Users notice.
Or you run a planned maintenance window. Cache gets flushed. Traffic resumes before the cache rebuilds. Same problem.
Cache warming is the practice of proactively populating the cache before traffic arrives.
```mermaid
flowchart LR
    subgraph "Before Traffic Arrives"
        W[Warm-up Script] -->|Fetch top products| DB[(Database)]
        DB --> W
        W -->|Pre-populate| C[Cache Layer]
    end
    subgraph "After Launch"
        U[User Requests] --> C
        C -->|Hit — fast| R[Response]
        C -.->|Miss — rare| DB
    end
```
For an IPL match day, you warm the match schedule, player profiles, and video stream metadata before the match starts. For a midnight flash sale, you warm the entire sale catalogue an hour before it goes live. When the spike hits, the cache is already full and your database barely notices.
The challenge is knowing what to warm. Use your analytics. If you are warming 1,000 products but the top 10 drive 80% of traffic, you absolutely need to get those top 10 right. If you do not have production data yet, warm broadly and tighten it as you learn.
When to Reach for What
```mermaid
flowchart TD
    Start[What problem are you solving?] --> A{Many keys expiring at once?}
    A -->|Yes| J[TTL Jitter — do this by default]
    A -->|No| B{One hot key causing a stampede?}
    B -->|Yes, and latency is critical| S[Stale-While-Revalidate]
    B -->|Yes, and consistency matters| M[Mutex Locking]
    B -->|Want to prevent any miss window| P[Probabilistic Early Expiration]
    Start --> E{Cold start or planned spike incoming?}
    E -->|Yes| W[Cache Warming]
```
Quick rules of thumb:
Always add TTL jitter. It is a two-line change and it prevents an entire class of problems. No reason not to.
Stale-While-Revalidate is the right default for most user-facing data. Fast responses, invisible refreshes, happy users. Use it for anything where data does not need to be realtime.
Mutex locking is for high-concurrency hot keys where you cannot serve stale data. Pair it with jitter on the cache TTL when you set the refreshed value.
Probabilistic early expiration is more niche. Reach for it when you have a few extremely hot keys and want to keep them continuously warm with no visible miss window.
Cache warming is operational. Put it in your deployment checklist. Run it after any cache flush. Build it into your incident runbook.
The Tradeoff Table
Every strategy makes a different bet on freshness vs latency vs safety.
| Strategy | Freshness | Latency | DB Protection |
|---|---|---|---|
| Basic TTL | High | Low (on hit) | None |
| TTL Jitter | High | Low (on hit) | Moderate |
| Probabilistic Early Expiry | High | Low | Good |
| Mutex Locking | High | Slightly higher on miss | Excellent |
| Stale-While-Revalidate | Medium | Always low | Excellent |
| Cache Warming | High initially | Always low | Excellent |
There is no universally correct answer. A payments service and a recommendation engine have very different requirements. The right combination depends on what your users actually feel and what your database can actually handle.