
Distributed Cache (Redis-style cluster)

Problem statement

Provide sub-millisecond reads/writes, TTL, pub/sub, optional persistence, and horizontal scale across AZs with graceful failover.

How it works

  • Single-threaded Redis process per shard → shard by hash slot (16,384 slots in Redis Cluster).
  • Clients must be cluster-aware, following MOVED/ASK redirects to the node that owns a key's slot.
  • Replication: each primary has one or more replicas; Sentinel or a managed service (e.g. ElastiCache Multi-AZ) handles failover.

Analogy: A row of vending machines (shards): the key’s hash picks the machine; if one breaks, a replica takes over with the same snacks (data).
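The slot mapping above can be sketched in plain Python. This is a reimplementation of the CRC16 (XMODEM variant) function the Redis Cluster spec defines — real cluster clients do this internally, so treat it as an illustration, not something you'd ship:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM variant), the checksum Redis Cluster specifies."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16,384 cluster hash slots.

    Honors hash tags: if the key contains {...}, only the tagged part
    is hashed, which lets related keys be forced onto the same slot.
    """
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # non-empty tag only
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

Two keys like `user:1001` and `user:1002` will usually land on different nodes, while `{cart:7}:items` and `{cart:7}:total` share a slot — that is what makes multi-key operations possible in a cluster.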

High-level architecture

(architecture diagram omitted)

Components explained — this design

| Component | What it is | Why we use it here |
|---|---|---|
| Redis client library | Cluster-aware driver handling MOVED/ASK. | Redis Cluster shards keys; the client must follow redirects and maintain connection pools. |
| Primary / replica shards | Redis primary takes writes; replicas async-replicate and serve reads. | Read scaling + HA; replicas may return slightly stale data, acceptable for many caches. |
| Gossip / cluster bus | Nodes exchange topology and failure signals. | Enables automatic failover when a primary dies, without manual DNS edits. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Deployment choices

| Option | When |
|---|---|
| ElastiCache Redis Cluster | AWS-native, ops minimized |
| Azure Cache for Redis Enterprise | Active geo-replication |
| KeyDB | Multi-threaded, Redis-protocol compatible |
| Memcached | Pure cache, no data structures; simpler but fewer features |

Cache patterns

  • Cache-aside: the app reads the DB on a miss and populates the cache.
  • Write-through: write cache + DB synchronously; stronger consistency, slower writes.
  • Write-behind: write the cache first, flush to the DB asynchronously; risky on crash unless backed by a durable queue.
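The three patterns can be contrasted in a minimal sketch. Plain dicts stand in for the Redis client and the system of record (an assumption for illustration — real code would use a Redis client and pass a TTL on SET):

```python
cache = {}                     # stand-in for a Redis client
database = {"user:1": "Ada"}   # stand-in for the system of record

def get_cache_aside(key):
    """Cache-aside: read the cache, fall back to the DB on a miss."""
    value = cache.get(key)
    if value is None:
        value = database.get(key)   # miss: load from the DB
        if value is not None:
            cache[key] = value      # populate for later readers
    return value

def put_write_through(key, value):
    """Write-through: update DB and cache in one synchronous path."""
    database[key] = value
    cache[key] = value

dirty = []  # write-behind buffer; must be durable to survive a crash

def put_write_behind(key, value):
    """Write-behind: ack after the cache write; a worker flushes later."""
    cache[key] = value
    dirty.append((key, value))      # a background worker drains this
```

Note where each pattern fails: cache-aside can serve stale data until the TTL fires, and write-behind loses `dirty` on a crash unless it lives in a durable queue.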

Eviction

  • allkeys-lru vs volatile-ttl: allkeys-lru may evict any key by recency; volatile-ttl evicts only keys that carry a TTL, nearest expiry first. Choose based on whether every key should be evictable.
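In redis.conf the choice comes down to two directives (a sketch; the memory limit is illustrative):

```conf
# Evict the least-recently-used key, regardless of TTL state.
# Use volatile-ttl instead to evict only keys that carry a TTL.
maxmemory 2gb
maxmemory-policy allkeys-lru
```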

Hot keys

  • Problem: a single celebrity-user key concentrates traffic on one shard.
  • Mitigations: local in-process cache; read replicas; application-level sharding (`user:1234:slot{n}`).
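The application-level sharding mitigation can be sketched as follows. A dict stands in for the cluster, and the fan-out factor and function names are illustrative; the point is that each copy gets its own hash tag (`{0}`, `{1}`, …) so the copies land on different slots:

```python
import random

FANOUT = 8  # number of copies; tune to the key's read rate

def spread_write(cache: dict, key: str, value: str) -> None:
    """Write the hot value under FANOUT suffixed keys.

    Distinct hash tags per suffix spread the copies across cluster
    slots instead of piling them onto one overloaded shard.
    """
    for i in range(FANOUT):
        cache[f"{key}:slot{{{i}}}"] = value

def spread_read(cache: dict, key: str):
    """Read a random copy, spreading load across shards."""
    i = random.randrange(FANOUT)
    return cache.get(f"{key}:slot{{{i}}}")
```

The trade-off: FANOUT× write amplification (and FANOUT separate TTLs to keep consistent) in exchange for spreading reads.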

E2E: cache-aside read

(sequence diagram omitted)

Tricky parts

| Problem | Solution |
|---|---|
| Thundering herd on expiry | Jitter TTLs; singleflight mutex per key |
| Stampede after cold start | Warmup job; probabilistic early refresh |
| Large values | Compress (Snappy); or split into hash fields |
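The two thundering-herd mitigations — TTL jitter and a per-key singleflight lock — fit in a short sketch. This is an in-process illustration (a dict stands in for Redis; cross-process singleflight would need a distributed lock such as `SET key NX EX`):

```python
import random
import threading

_locks: dict = {}
_locks_guard = threading.Lock()

def ttl_with_jitter(base_seconds: int = 300, spread: float = 0.1) -> float:
    """Spread expiries over +/-10% so hot keys don't expire in lockstep."""
    return base_seconds * (1 + random.uniform(-spread, spread))

def get_or_compute(cache: dict, key: str, compute):
    """Singleflight: on a miss, only one caller recomputes the value."""
    value = cache.get(key)
    if value is not None:
        return value
    with _locks_guard:                  # one lock object per key
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        value = cache.get(key)          # re-check: a peer may have won the race
        if value is None:
            value = compute()           # the expensive DB/API call
            cache[key] = value          # real code: SET key value EX ttl_with_jitter()
        return value
```

Late arrivals block on the lock and then hit the re-check, so the backend sees one recomputation per expiry instead of one per waiting client.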

Caveats

  • Redis is not a durable queue at scale (BLPOP pitfalls: a popped message is lost if the consumer crashes before processing it); prefer SQS / RabbitMQ for durable work queues.
  • TLS adds handshake and CPU overhead; enable in-transit TLS where compliance demands it, and keep traffic on private networking (VPC peering) regardless.

Security

  • Require AUTH plus per-service ACL users (Redis 6+ least privilege); no public endpoints.
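A least-privilege user might look like this in an ACL file (a sketch; the service name, password, and key pattern are illustrative):

```conf
# users.acl — loaded via the aclfile directive
# svc-orders may only GET/SET/DEL keys under orders:*
user svc-orders on >s3cr3t-from-vault ~orders:* +get +set +del
# lock down the permissive default user
user default off
```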