Designing a Key-Value Store (Dynamo-like)
Problem statement
Build a horizontally scalable KV store with tunable consistency, partitioning, replication, and failure tolerance (not off-the-shelf Redis).
How it works
- Partitioning: consistent hashing ring; each key maps to coordinator node + N replicas.
- Replication: leader-follower or multi-leader (multi-leader makes conflict resolution harder).
- Quorum: R + W > N for strong-ish reads/writes; R=1, W=1 trades consistency for speed (eventual).
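The quorum condition can be checked by brute force: when R + W > N, every W-replica write set must intersect every R-replica read set (pigeonhole). A minimal sketch, not tied to any particular store:

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Brute-force check: does every W-node write set intersect
    every R-node read set across N replicas?"""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

# R=2, W=2, N=3: reads always see the latest acknowledged write.
assert quorums_always_overlap(3, 2, 2)
# R=1, W=1, N=3: fast, but a read can miss the latest write.
assert not quorums_always_overlap(3, 1, 1)
```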
Analogy: Bank safety deposit boxes arranged in aisles (partitions); each box is copied to two other vaults (replicas) so fire in one aisle doesn’t destroy your jewels.
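The partitioning idea can be sketched as a toy consistent-hash ring: hash each node (with virtual nodes for balance) onto a circle, then walk clockwise from the key's hash to find the coordinator and the next N-1 distinct nodes as replicas. `HashRing` and its parameters are hypothetical names for illustration:

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Toy consistent-hash ring: a key maps to the next node clockwise
    (the coordinator), plus the following N-1 distinct nodes as replicas."""
    def __init__(self, nodes, vnodes=8):
        # Each physical node appears vnodes times for smoother balance.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def replicas(self, key: str, n: int = 3):
        idx = bisect_right(self.ring, (self._hash(key), ""))
        owners = []
        # Walk clockwise, wrapping around, skipping duplicate physical nodes.
        for _, node in self.ring[idx:] + self.ring[:idx]:
            if node not in owners:
                owners.append(node)
            if len(owners) == n:
                break
        return owners

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
coordinator, *followers = ring.replicas("user#123", n=3)
```

Adding or removing a node only remaps the keys adjacent to its ring positions, which is the whole point versus naive `hash(key) % num_nodes`.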
High-level architecture
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Client SDK | Routes keys to correct shard using hash ring. | Hides partitioning complexity from application teams. |
| Load balancer | Health checks + spreads requests. | Removes failed nodes from rotation during rolling deploys. |
| Node ring + SSTables | Storage nodes running LSM engine (e.g. RocksDB). | Write-optimized storage engine for high ingest; compaction trades read amplification. |
| Gossip membership | Nodes discover peers and failures. | Decentralized cluster management; must handle split-brain carefully with quorum configs. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Storage engine
- LSM-tree (RocksDB) for write throughput; SSTables immutable on disk.
- Memtable flush triggers; compaction strategies (leveled vs tiered).
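The write path above can be sketched with a toy LSM engine: writes land in an in-memory memtable, which flushes to an immutable sorted "SSTable" at a threshold; reads check the memtable first, then SSTables newest-to-oldest. `TinyLSM` is a hypothetical illustration, not RocksDB's API:

```python
class TinyLSM:
    """Minimal LSM sketch: memtable flushes to an immutable, sorted
    'SSTable' (here just a sorted list) when it exceeds a threshold."""
    def __init__(self, flush_threshold=4):
        self.memtable = {}
        self.sstables = []  # newest first; each is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:  # more tables => more read amplification
            for k, v in table:       # real SSTables binary-search + bloom-filter
                if k == key:
                    return v
        return None
```

Compaction (not shown) merges SSTables to bound the number a read must touch, which is exactly the read-amplification tradeoff the table above mentions.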
Consistency
- Vector clocks or version stamps for multi-master conflicts → last-writer-wins or application merge.
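Vector-clock comparison is what distinguishes "one write supersedes the other" from a genuine conflict that needs LWW or an application merge. A minimal sketch with hypothetical helper names:

```python
def vc_merge(a: dict, b: dict) -> dict:
    """Pointwise max of two vector clocks (per-node counters)."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def vc_compare(a: dict, b: dict) -> str:
    """'before'/'after' if one clock dominates; 'concurrent' means a real
    conflict that needs last-writer-wins or an application-level merge."""
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in a.keys() | b.keys())
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in a.keys() | b.keys())
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

assert vc_compare({"A": 1}, {"A": 2}) == "before"      # causally ordered
assert vc_compare({"A": 1}, {"B": 1}) == "concurrent"  # siblings: must merge
```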
Anti-entropy
- Merkle trees between replicas to detect divergence; repair streams.
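The detection step can be sketched as a one-level Merkle exchange: each replica hashes its key-range buckets, the replicas swap digests, and only buckets whose hashes differ get a repair stream. Function names and the flat bucket layout are illustrative assumptions (real Merkle trees recurse so divergence is localized in O(log n) exchanges):

```python
import hashlib

def _bucket(key: str, buckets: int) -> int:
    """Stable bucket index (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets

def bucket_hashes(kv: dict, buckets: int = 4):
    """One digest per key-range bucket; replicas exchange these instead of
    shipping full data sets."""
    digests = [hashlib.sha256() for _ in range(buckets)]
    for k in sorted(kv):
        digests[_bucket(k, buckets)].update(f"{k}={kv[k]};".encode())
    return [d.hexdigest() for d in digests]

def diverged_buckets(a: dict, b: dict, buckets: int = 4):
    """Only these buckets need a repair stream between the two replicas."""
    return [i for i, (ha, hb) in enumerate(
        zip(bucket_hashes(a, buckets), bucket_hashes(b, buckets))) if ha != hb]
```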
Client routing
- Gossip or a central metadata service (ZooKeeper) publishes the ring map; clients cache it with a bounded staleness tolerance.
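Client-side caching with a staleness budget can be sketched as a TTL cache over the ring map; on expiry (or, in a real system, on a "moved key" error from a node) the client refetches. `RingCache` and `fetch_ring` are hypothetical names:

```python
import time

class RingCache:
    """Client-side cache of the ring map with a staleness budget."""
    def __init__(self, fetch_ring, ttl_seconds=30.0):
        self.fetch_ring = fetch_ring  # e.g. a ZooKeeper or gossip lookup
        self.ttl = ttl_seconds
        self.ring = fetch_ring()
        self.fetched_at = time.monotonic()

    def get_ring(self):
        # Serve the cached map until it is older than the staleness budget.
        if time.monotonic() - self.fetched_at > self.ttl:
            self.ring = self.fetch_ring()
            self.fetched_at = time.monotonic()
        return self.ring
```

A slightly stale map is fine: a request routed to the wrong node gets redirected, and the redirect doubles as a cache-invalidation signal.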
E2E: quorum write
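The end-to-end flow: the coordinator fans the write out to all N replicas and reports success once W acknowledge; unreachable replicas become candidates for hinted handoff and later repair. A minimal sketch with hypothetical names (`Replica` stands in for an RPC client):

```python
class Replica:
    """Stand-in for a storage node's RPC client (hypothetical)."""
    def __init__(self, up=True):
        self.up, self.store = up, {}

    def put(self, key, value):
        if not self.up:
            raise ConnectionError("replica unreachable")
        self.store[key] = value

def quorum_write(replicas, key, value, w=2):
    """Coordinator sketch: send to every replica (in reality, parallel RPCs
    with timeouts) and succeed iff at least W acknowledge."""
    acks, down = 0, []
    for replica in replicas:
        try:
            replica.put(key, value)
            acks += 1
        except ConnectionError:
            down.append(replica)  # candidates for hinted handoff / repair
    return acks >= w

nodes = [Replica(), Replica(), Replica(up=False)]
ok = quorum_write(nodes, "user#123", "profile-v2", w=2)  # True: 2 of 3 acked
```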
Tricky parts
| Problem | Solution |
|---|---|
| Hot partition | Salt hot keys with a random suffix (e.g. user#123 → user#123!42); reads fan out across suffixes and merge |
| Replica temporarily down | Hinted handoff: write to a fallback node, replay to the owner when it recovers |
| Shard-group coordination | Use a Raft library (e.g. etcd) instead of inventing Paxos/Raft yourself |
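The key-salting trick above can be sketched in a few lines: writes pick a random suffix so one hot key spreads across several partitions, and reads fan out to every suffix and merge. `SALTS` and the helper names are illustrative assumptions:

```python
import random

SALTS = 8  # how many suffixes a hot key is spread across (tuning knob)

def salted_key(key: str) -> str:
    """Writes pick a random suffix, spreading load over SALTS partitions."""
    return f"{key}!{random.randrange(SALTS)}"

def all_salted_keys(key: str):
    """Reads must fan out to every suffix and merge the results."""
    return [f"{key}!{i}" for i in range(SALTS)]
```

The tradeoff is explicit: write throughput on the hot key improves roughly by the salt factor, while every read now costs SALTS lookups plus a merge.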
Caveats
- Designing from scratch in interviews = discuss tradeoffs, not production code.
- Dynamo paper (gossip, sloppy quorum, hinted handoff) is canonical reading.
Managed shortcut
- DynamoDB, Cosmos DB, Spanner: only build custom when extreme cost or latency requirements can't be met by a managed service.