Distributed Cache (Redis-style cluster)
Problem statement
Provide sub-millisecond reads/writes, TTL, pub/sub, optional persistence, and horizontal scale across AZs with graceful failover.
How it works
- Single-threaded Redis process per shard → shard by hash slot (16,384 slots in Redis Cluster).
- Clients need a cluster-aware driver that follows MOVED/ASK redirects.
- Replication: primary + replicas; sentinel or managed failover (ElastiCache Multi-AZ).
Analogy: a row of vending machines (shards). The key’s hash picks the machine; if one breaks, a replica takes over with the same snacks (data).
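The slot assignment is CRC16 (XMODEM variant) of the key modulo 16384, honoring `{...}` hash tags so related keys can be forced onto one slot. A minimal Python sketch (real clients such as redis-py implement this internally):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM: poly 0x1021, init 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: bytes) -> int:
    """Redis Cluster slot: hash only the {tag} substring if one exists."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:   # non-empty tag between braces
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384
```

Keys sharing a hash tag (e.g. `{user1000}.following` and `{user1000}.followers`) map to the same slot, which is what makes multi-key operations on them possible in a cluster.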
High-level architecture
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Redis client library | Cluster-aware driver handling MOVED/ASK. | Redis Cluster shards keys; client must follow redirects and maintain connection pools. |
| Primary / Replica shards | Redis primary writes; replicas async-replicate for reads. | Read scaling + HA; replicas may return slightly stale data — acceptable for many caches. |
| Gossip / cluster bus | Nodes exchange topology and failure signals. | Enables automatic failover when a primary dies without manual DNS edits. |
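To make the MOVED redirect dance concrete, here is a toy sketch with in-memory stand-ins (no real networking; redis-py's `RedisCluster` handles this for you). The client caches a slot→node map and refreshes it lazily when a node redirects:

```python
class MovedError(Exception):
    """Raised by a node that does not own the requested slot."""
    def __init__(self, slot, addr):
        self.slot, self.addr = slot, addr

class Node:
    """Toy shard: serves slots it owns, redirects to the owner otherwise."""
    def __init__(self, addr, owned, directory):
        self.addr, self.owned, self.directory = addr, owned, directory
        self.data = {}
    def get(self, slot, key):
        if slot not in self.owned:
            owner = next(a for a, n in self.directory.items() if slot in n.owned)
            raise MovedError(slot, owner)
        return self.data.get(key)

class ClusterClient:
    """Caches a slot -> addr map; follows a MOVED redirect and retries."""
    def __init__(self, directory):
        self.directory = directory
        self.slot_map = {}                       # learned lazily from redirects
    def get(self, slot, key):
        addr = self.slot_map.get(slot, next(iter(self.directory)))
        try:
            return self.directory[addr].get(slot, key)
        except MovedError as e:
            self.slot_map[slot] = e.addr         # update topology cache
            return self.directory[e.addr].get(slot, key)
```

Subsequent requests for the same slot go straight to the right node, which is why a warmed-up cluster client adds no extra hops.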
Shared definitions: 00-glossary-common-services.md
Low-level design
Deployment choices
| Option | When |
|---|---|
| ElastiCache Redis Cluster | AWS native, ops minimized |
| Azure Cache for Redis Enterprise | Active geo-replication |
| KeyDB | Multi-threaded, Redis protocol compatible |
| Memcached | Pure cache, no structures — simpler but fewer features |
Cache patterns
- Cache-aside: app reads DB on miss, populates cache.
- Write-through: write cache + DB synchronously — stronger consistency, slower writes.
- Write-behind: write cache first, async flush — risky on crash unless backed by a durable queue.
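A minimal cache-aside sketch, with plain dicts standing in for Redis and the database (in production the cache read would be a Redis GET and the populate a SETEX with a TTL):

```python
import time

class CacheAside:
    def __init__(self, db, ttl=300.0):
        self.db, self.ttl = db, ttl
        self.cache = {}                          # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                      # cache hit
        value = self.db[key]                     # miss: read authoritative store
        self.cache[key] = (value, time.monotonic() + self.ttl)
        return value

    def put(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)                # invalidate, don't update (avoids races)
```

Invalidating on write instead of updating the cached copy sidesteps the race where two concurrent writers leave the cache holding the older value.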
Eviction
- allkeys-lru vs volatile-ttl — choose based on whether every key should be evictable.
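Redis's allkeys-lru is an approximated LRU (it samples candidate keys rather than keeping an exact ordering), but the exact semantics it approximates can be sketched with an `OrderedDict`:

```python
from collections import OrderedDict

class LRUCache:
    """Exact LRU: evicts the least recently used entry once over capacity."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)              # mark as recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)       # evict least recently used
```

Under volatile-ttl, by contrast, only keys carrying a TTL are eviction candidates — so keys written without expiry can pin memory forever.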
Hot keys
- Problem: one celebrity user key on one shard.
- Mitigations: local in-process cache; read replicas; application-level sharding (`user:1234:slot{n}`).
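Because `{…}` marks a hash tag in Redis Cluster, varying the `{n}` suffix lands each copy on a different slot (and hence, usually, a different shard). A toy sketch of the fan-out, with a dict standing in for the cluster and illustrative names:

```python
import random

N_COPIES = 8

def copy_keys(base: str, n: int = N_COPIES):
    # Varying hash tags {0}..{n-1} spread copies across cluster slots.
    return [f"{base}:slot{{{i}}}" for i in range(n)]

def write_hot(cache: dict, base: str, value):
    for k in copy_keys(base):                    # fan-out write to every copy
        cache[k] = value

def read_hot(cache: dict, base: str):
    return cache[random.choice(copy_keys(base))] # any copy can serve a read
```

The trade-off: writes cost N operations and copies can briefly diverge, so this fits read-heavy hot keys (a celebrity profile), not counters.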
E2E: cache-aside read
Tricky parts
| Problem | Solution |
|---|---|
| Thundering herd on expiry | Jitter TTL; singleflight mutex per key |
| Stampede after cold start | Warmup job; probabilistic early refresh |
| Large values | Compress (Snappy); or split hash fields |
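The first two rows can be sketched together: jitter TTLs at write time so co-created keys don't expire in lockstep, and collapse concurrent rebuilds of one key with a per-key singleflight. This is a threading-based, single-process sketch with illustrative names; across processes you would use a Redis `SET NX` lock instead:

```python
import random
import threading

def jittered_ttl(base: float, spread: float = 0.1) -> float:
    """TTL +/- spread fraction, so keys written together expire apart."""
    return base * (1 + random.uniform(-spread, spread))

class SingleFlight:
    """At most one in-flight computation per key; other callers wait and share it."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}                         # key -> (done event, result holder)

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = (threading.Event(), [])
                self._calls[key] = call
        event, result = call
        if leader:
            result.append(fn())                  # only the leader runs fn
            with self._lock:
                del self._calls[key]
            event.set()
        else:
            event.wait()                         # followers block, then reuse result
        return result[0]
```

With N concurrent misses on the same key, the backing store sees one load instead of N — exactly the herd the table warns about.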
Caveats
- Redis is not a queue at scale (BLPOP pitfalls) — prefer SQS / RabbitMQ for durable work queues.
- In-transit TLS adds latency; keep traffic on private networks (VPC peering) and enable TLS where compliance demands it.
Security
- AUTH password + ACL users per service; no public endpoints; Redis 6 ACL least privilege.
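A least-privilege ACL for a hypothetical `svc-orders` service, restricted to its own key prefix and the handful of commands it needs (user name, password, and key pattern are illustrative):

```
ACL SETUSER svc-orders on >s3cretpass ~orders:* +get +set +del +expire
```

Combine this with disabling the default user (`ACL SETUSER default off`) so every service authenticates as its own ACL user.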