Distributed Cache (Redis-style cluster)
Problem statement
Provide sub-millisecond reads/writes, TTL, pub/sub, optional persistence, and horizontal scale across AZs with graceful failover.
How it works
- Single-threaded Redis process per shard → shard by hash slot (16,384 slots in Redis Cluster).
- Clients need a cluster-aware driver that follows MOVED/ASK redirects.
- Replication: primary + replicas; sentinel or managed failover (ElastiCache Multi-AZ).
Analogy: a row of vending machines (shards). The key’s hash picks the machine; if one breaks, a replica takes over with the same snacks (data).
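The slot assignment is CRC16 (XMODEM variant) of the key modulo 16384, honoring `{...}` hash tags so related keys can be forced onto one slot. A minimal Python sketch (real clients such as redis-py implement this internally):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM: poly 0x1021, init 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: bytes) -> int:
    """Redis Cluster slot: hash only the {tag} substring if one exists."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:   # non-empty tag between braces
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384
```

Keys sharing a hash tag (e.g. `{user1000}.following` and `{user1000}.followers`) map to the same slot, which is what makes multi-key operations on them possible in a cluster.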
High-level architecture
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Redis client library | Cluster-aware driver handling MOVED/ASK. | Redis Cluster shards keys; client must follow redirects and maintain connection pools. |
| Primary / Replica shards | Redis primary writes; replicas async-replicate for reads. | Read scaling + HA; replicas may return slightly stale data — acceptable for many caches. |
| Gossip / cluster bus | Nodes exchange topology and failure signals. | Enables automatic failover when a primary dies without manual DNS edits. |
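To make the MOVED redirect dance concrete, here is a toy sketch with in-memory stand-ins (no real networking; redis-py's `RedisCluster` handles this for you). The client caches a slot→node map and refreshes it lazily when a node redirects:

```python
class MovedError(Exception):
    """Raised by a node that does not own the requested slot."""
    def __init__(self, slot, addr):
        self.slot, self.addr = slot, addr

class Node:
    """Toy shard: serves slots it owns, redirects to the owner otherwise."""
    def __init__(self, addr, owned, directory):
        self.addr, self.owned, self.directory = addr, owned, directory
        self.data = {}
    def get(self, slot, key):
        if slot not in self.owned:
            owner = next(a for a, n in self.directory.items() if slot in n.owned)
            raise MovedError(slot, owner)
        return self.data.get(key)

class ClusterClient:
    """Caches a slot -> addr map; follows a MOVED redirect and retries."""
    def __init__(self, directory):
        self.directory = directory
        self.slot_map = {}                       # learned lazily from redirects
    def get(self, slot, key):
        addr = self.slot_map.get(slot, next(iter(self.directory)))
        try:
            return self.directory[addr].get(slot, key)
        except MovedError as e:
            self.slot_map[slot] = e.addr         # update topology cache
            return self.directory[e.addr].get(slot, key)
```

Subsequent requests for the same slot go straight to the right node, which is why a warmed-up cluster client adds no extra hops.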
Shared definitions: 00-glossary-common-services.md
Low-level design
Deployment choices
| Option | When |
|---|---|
| ElastiCache Redis Cluster | AWS native, ops minimized |
| Azure Cache for Redis Enterprise | Active geo-replication |
| KeyDB | Multi-threaded, Redis protocol compatible |
| Memcached | Pure cache, no structures — simpler but fewer features |
Cache patterns
- Cache-aside: app reads DB on miss, populates cache.
- Write-through: write cache + DB synchronously — stronger consistency, slower writes.
- Write-behind: write cache first, async flush — risky on crash unless backed by a durable queue.
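A minimal cache-aside sketch, with plain dicts standing in for Redis and the database (in production the cache read would be a Redis GET and the populate a SETEX with a TTL):

```python
import time

class CacheAside:
    def __init__(self, db, ttl=300.0):
        self.db, self.ttl = db, ttl
        self.cache = {}                          # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                      # cache hit
        value = self.db[key]                     # miss: read authoritative store
        self.cache[key] = (value, time.monotonic() + self.ttl)
        return value

    def put(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)                # invalidate, don't update (avoids races)
```

Invalidating on write instead of updating the cached copy sidesteps the race where two concurrent writers leave the cache holding the older value.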
Eviction
- allkeys-lru vs volatile-ttl — choose based on whether every key should be evictable.
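Redis's allkeys-lru is an approximated LRU (it samples candidate keys rather than keeping an exact ordering), but the exact semantics it approximates can be sketched with an `OrderedDict`:

```python
from collections import OrderedDict

class LRUCache:
    """Exact LRU: evicts the least recently used entry once over capacity."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)              # mark as recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)       # evict least recently used
```

Under volatile-ttl, by contrast, only keys carrying a TTL are eviction candidates — so keys written without expiry can pin memory forever.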
Hot keys
- Problem: one celebrity user key on one shard.
- Mitigations: local in-process cache; read replicas; application-level sharding (`user:1234:slot{n}`).
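Because `{…}` marks a hash tag in Redis Cluster, varying the `{n}` suffix lands each copy on a different slot (and hence, usually, a different shard). A toy sketch of the fan-out, with a dict standing in for the cluster and illustrative names:

```python
import random

N_COPIES = 8

def copy_keys(base: str, n: int = N_COPIES):
    # Varying hash tags {0}..{n-1} spread copies across cluster slots.
    return [f"{base}:slot{{{i}}}" for i in range(n)]

def write_hot(cache: dict, base: str, value):
    for k in copy_keys(base):                    # fan-out write to every copy
        cache[k] = value

def read_hot(cache: dict, base: str):
    return cache[random.choice(copy_keys(base))] # any copy can serve a read
```

The trade-off: writes cost N operations and copies can briefly diverge, so this fits read-heavy hot keys (a celebrity profile), not counters.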
E2E: cache-aside read
Tricky parts
| Problem | Solution |
|---|---|
| Thundering herd on expiry | Jitter TTL; singleflight mutex per key |
| Stampede after cold start | Warmup job; probabilistic early refresh |
| Large values | Compress (Snappy); or split hash fields |
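The first two rows can be sketched together: jitter TTLs at write time so co-created keys don't expire in lockstep, and collapse concurrent rebuilds of one key with a per-key singleflight. This is a threading-based, single-process sketch with illustrative names; across processes you would use a Redis `SET NX` lock instead:

```python
import random
import threading

def jittered_ttl(base: float, spread: float = 0.1) -> float:
    """TTL +/- spread fraction, so keys written together expire apart."""
    return base * (1 + random.uniform(-spread, spread))

class SingleFlight:
    """At most one in-flight computation per key; other callers wait and share it."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}                         # key -> (done event, result holder)

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = (threading.Event(), [])
                self._calls[key] = call
        event, result = call
        if leader:
            result.append(fn())                  # only the leader runs fn
            with self._lock:
                del self._calls[key]
            event.set()
        else:
            event.wait()                         # followers block, then reuse result
        return result[0]
```

With N concurrent misses on the same key, the backing store sees one load instead of N — exactly the herd the table warns about.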
Caveats
- Redis is not a queue at scale (BLPOP pitfalls) — prefer SQS / RabbitMQ for durable work queues.
- In-transit TLS adds latency; keep traffic on private networks (VPC peering) and enable TLS where compliance demands it.
Security
- AUTH password + ACL users per service; no public endpoints; Redis 6 ACL least privilege.
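A least-privilege ACL for a hypothetical `svc-orders` service, restricted to its own key prefix and the handful of commands it needs (user name, password, and key pattern are illustrative):

```
ACL SETUSER svc-orders on >s3cretpass ~orders:* +get +set +del +expire
```

Combine this with disabling the default user (`ACL SETUSER default off`) so every service authenticates as its own ACL user.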