Designing a Key-Value Store (Dynamo-like)
Problem statement
Build a horizontally scalable KV store with tunable consistency, partitioning, replication, and failure tolerance (not off-the-shelf Redis).
How it works
- Partitioning: consistent hashing ring; each key maps to coordinator node + N replicas.
- Replication: leader-follower or multi-leader (multi-leader makes conflict resolution harder).
- Quorum: R + W > N for strong-ish reads/writes; R=1, W=1 trades consistency for speed (eventual).
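The quorum condition can be checked by brute force: when R + W > N, every W-replica write set must intersect every R-replica read set (pigeonhole). A minimal sketch, not tied to any particular store:

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Brute-force check: does every W-node write set intersect
    every R-node read set across N replicas?"""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

# R=2, W=2, N=3: reads always see the latest acknowledged write.
assert quorums_always_overlap(3, 2, 2)
# R=1, W=1, N=3: fast, but a read can miss the latest write.
assert not quorums_always_overlap(3, 1, 1)
```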
Analogy: Bank safety deposit boxes arranged in aisles (partitions); each box is copied to two other vaults (replicas) so fire in one aisle doesn’t destroy your jewels.
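The partitioning idea can be sketched as a toy consistent-hash ring: hash each node (with virtual nodes for balance) onto a circle, then walk clockwise from the key's hash to find the coordinator and the next N-1 distinct nodes as replicas. `HashRing` and its parameters are hypothetical names for illustration:

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Toy consistent-hash ring: a key maps to the next node clockwise
    (the coordinator), plus the following N-1 distinct nodes as replicas."""
    def __init__(self, nodes, vnodes=8):
        # Each physical node appears vnodes times for smoother balance.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def replicas(self, key: str, n: int = 3):
        idx = bisect_right(self.ring, (self._hash(key), ""))
        owners = []
        # Walk clockwise, wrapping around, skipping duplicate physical nodes.
        for _, node in self.ring[idx:] + self.ring[:idx]:
            if node not in owners:
                owners.append(node)
            if len(owners) == n:
                break
        return owners

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
coordinator, *followers = ring.replicas("user#123", n=3)
```

Adding or removing a node only remaps the keys adjacent to its ring positions, which is the whole point versus naive `hash(key) % num_nodes`.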
High-level architecture
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Client SDK | Routes keys to correct shard using hash ring. | Hides partitioning complexity from application teams. |
| Load balancer | Health checks + spreads requests. | Removes failed nodes from rotation during rolling deploys. |
| Node ring + SSTables | Storage nodes running LSM engine (e.g. RocksDB). | Write-optimized storage engine for high ingest; compaction trades read amplification. |
| Gossip membership | Nodes discover peers and failures. | Decentralized cluster management; must handle split-brain carefully with quorum configs. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Storage engine
- LSM-tree (RocksDB) for write throughput; SSTables immutable on disk.
- Memtable flush triggers; compaction strategies (leveled vs tiered).
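The write path above can be sketched with a toy LSM engine: writes land in an in-memory memtable, which flushes to an immutable sorted "SSTable" at a threshold; reads check the memtable first, then SSTables newest-to-oldest. `TinyLSM` is a hypothetical illustration, not RocksDB's API:

```python
class TinyLSM:
    """Minimal LSM sketch: memtable flushes to an immutable, sorted
    'SSTable' (here just a sorted list) when it exceeds a threshold."""
    def __init__(self, flush_threshold=4):
        self.memtable = {}
        self.sstables = []  # newest first; each is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:  # more tables => more read amplification
            for k, v in table:       # real SSTables binary-search + bloom-filter
                if k == key:
                    return v
        return None
```

Compaction (not shown) merges SSTables to bound the number a read must touch, which is exactly the read-amplification tradeoff the table above mentions.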
Consistency
- Vector clocks or version stamps for multi-master conflicts → last-writer-wins or application merge.
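Vector-clock comparison is what distinguishes "one write supersedes the other" from a genuine conflict that needs LWW or an application merge. A minimal sketch with hypothetical helper names:

```python
def vc_merge(a: dict, b: dict) -> dict:
    """Pointwise max of two vector clocks (per-node counters)."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def vc_compare(a: dict, b: dict) -> str:
    """'before'/'after' if one clock dominates; 'concurrent' means a real
    conflict that needs last-writer-wins or an application-level merge."""
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in a.keys() | b.keys())
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in a.keys() | b.keys())
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

assert vc_compare({"A": 1}, {"A": 2}) == "before"      # causally ordered
assert vc_compare({"A": 1}, {"B": 1}) == "concurrent"  # siblings: must merge
```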
Anti-entropy
- Merkle trees between replicas to detect divergence; repair streams.
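The detection step can be sketched as a one-level Merkle exchange: each replica hashes its key-range buckets, the replicas swap digests, and only buckets whose hashes differ get a repair stream. Function names and the flat bucket layout are illustrative assumptions (real Merkle trees recurse so divergence is localized in O(log n) exchanges):

```python
import hashlib

def _bucket(key: str, buckets: int) -> int:
    """Stable bucket index (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets

def bucket_hashes(kv: dict, buckets: int = 4):
    """One digest per key-range bucket; replicas exchange these instead of
    shipping full data sets."""
    digests = [hashlib.sha256() for _ in range(buckets)]
    for k in sorted(kv):
        digests[_bucket(k, buckets)].update(f"{k}={kv[k]};".encode())
    return [d.hexdigest() for d in digests]

def diverged_buckets(a: dict, b: dict, buckets: int = 4):
    """Only these buckets need a repair stream between the two replicas."""
    return [i for i, (ha, hb) in enumerate(
        zip(bucket_hashes(a, buckets), bucket_hashes(b, buckets))) if ha != hb]
```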
Client routing
- Gossip or a central metadata service (ZooKeeper) publishes the ring map; clients cache it with a bounded staleness tolerance.
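Client-side caching with a staleness budget can be sketched as a TTL cache over the ring map; on expiry (or, in a real system, on a "moved key" error from a node) the client refetches. `RingCache` and `fetch_ring` are hypothetical names:

```python
import time

class RingCache:
    """Client-side cache of the ring map with a staleness budget."""
    def __init__(self, fetch_ring, ttl_seconds=30.0):
        self.fetch_ring = fetch_ring  # e.g. a ZooKeeper or gossip lookup
        self.ttl = ttl_seconds
        self.ring = fetch_ring()
        self.fetched_at = time.monotonic()

    def get_ring(self):
        # Serve the cached map until it is older than the staleness budget.
        if time.monotonic() - self.fetched_at > self.ttl:
            self.ring = self.fetch_ring()
            self.fetched_at = time.monotonic()
        return self.ring
```

A slightly stale map is fine: a request routed to the wrong node gets redirected, and the redirect doubles as a cache-invalidation signal.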
E2E: quorum write
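The end-to-end flow: the coordinator fans the write out to all N replicas and reports success once W acknowledge; unreachable replicas become candidates for hinted handoff and later repair. A minimal sketch with hypothetical names (`Replica` stands in for an RPC client):

```python
class Replica:
    """Stand-in for a storage node's RPC client (hypothetical)."""
    def __init__(self, up=True):
        self.up, self.store = up, {}

    def put(self, key, value):
        if not self.up:
            raise ConnectionError("replica unreachable")
        self.store[key] = value

def quorum_write(replicas, key, value, w=2):
    """Coordinator sketch: send to every replica (in reality, parallel RPCs
    with timeouts) and succeed iff at least W acknowledge."""
    acks, down = 0, []
    for replica in replicas:
        try:
            replica.put(key, value)
            acks += 1
        except ConnectionError:
            down.append(replica)  # candidates for hinted handoff / repair
    return acks >= w

nodes = [Replica(), Replica(), Replica(up=False)]
ok = quorum_write(nodes, "user#123", "profile-v2", w=2)  # True: 2 of 3 acked
```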
Tricky parts
| Problem | Solution |
|---|---|
| Hot partition | Salt hot keys with a random suffix (e.g. user#123 → user#123!42); reads fan out across suffixes and merge |
| Replica temporarily down | Hinted handoff: write to a fallback node, replay to the owner when it recovers |
| Shard-group coordination | Use a Raft library (e.g. etcd) instead of inventing Paxos/Raft yourself |
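The key-salting trick above can be sketched in a few lines: writes pick a random suffix so one hot key spreads across several partitions, and reads fan out to every suffix and merge. `SALTS` and the helper names are illustrative assumptions:

```python
import random

SALTS = 8  # how many suffixes a hot key is spread across (tuning knob)

def salted_key(key: str) -> str:
    """Writes pick a random suffix, spreading load over SALTS partitions."""
    return f"{key}!{random.randrange(SALTS)}"

def all_salted_keys(key: str):
    """Reads must fan out to every suffix and merge the results."""
    return [f"{key}!{i}" for i in range(SALTS)]
```

The tradeoff is explicit: write throughput on the hot key improves roughly by the salt factor, while every read now costs SALTS lookups plus a merge.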
Caveats
- Designing from scratch in interviews = discuss tradeoffs, not production code.
- Dynamo paper (gossip, sloppy quorum, hinted handoff) is canonical reading.
Managed shortcut
- DynamoDB, Cosmos DB, Spanner: only build custom when extreme cost or latency requirements can't be met by a managed service.