
Designing a Key-Value Store (Dynamo-like)

Problem statement

Build a horizontally scalable KV store with tunable consistency, partitioning, replication, and failure tolerance (rather than reaching for off-the-shelf Redis).

How it works

  • Partitioning: consistent hashing ring; each key maps to a coordinator node plus its successor replicas, N copies in total.
  • Replication: leader-follower, or multi-leader (which makes conflict resolution harder).
  • Quorum: R + W > N guarantees overlapping read and write sets (strong-ish consistency); R=1, W=1 trades consistency for speed (eventual).
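The bullets above can be sketched in a few lines. This is a toy model, not production code: `HashRing`, `preference_list`, and `quorum_ok` are hypothetical names, and MD5 stands in for whatever ring hash a real system would use.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: a key maps to the next node clockwise,
    and the first N distinct nodes clockwise hold the N copies."""

    def __init__(self, nodes, vnodes=8):
        # Virtual nodes smooth out the key distribution across physical nodes.
        self.ring = sorted(
            (self._hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def preference_list(self, key, n=3):
        """Coordinator plus replicas: first n distinct nodes clockwise."""
        start = bisect.bisect(self.ring, (self._hash(key),))
        nodes = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

def quorum_ok(r, w, n):
    """R + W > N forces every read set to intersect every write set."""
    return r + w > n

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
replicas = ring.preference_list("user#123", n=3)
assert len(set(replicas)) == 3       # three distinct nodes hold the key
assert quorum_ok(r=2, w=2, n=3)      # strong-ish
assert not quorum_ok(r=1, w=1, n=3)  # eventual, but fast
```

The virtual-node trick is why adding or removing one physical node only shifts a small fraction of keys.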

Analogy: Bank safe-deposit boxes arranged in aisles (partitions); each box is copied to two other vaults (replicas), so a fire in one aisle doesn’t destroy your jewels.

High-level architecture


Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| Client SDK | Routes keys to the correct shard using the hash ring. | Hides partitioning complexity from application teams. |
| Load balancer | Health checks + spreads requests. | Removes failed nodes from rotation during rolling deploys. |
| Node ring + SSTables | Storage nodes running an LSM engine (e.g. RocksDB). | Write-optimized storage engine for high ingest; compaction trades read amplification. |
| Gossip membership | Nodes discover peers and failures. | Decentralized cluster management; must handle split-brain carefully with quorum configs. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Storage engine

  • LSM-tree (RocksDB) for write throughput; SSTables immutable on disk.
  • Memtable flush triggers; compaction strategies (leveled vs tiered).
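To make the memtable/SSTable mechanics concrete, here is a minimal sketch under simplifying assumptions: `TinyLSM` is a hypothetical toy, the "SSTables" are in-memory sorted lists rather than disk files, and `compact` does a single full merge instead of a real leveled or tiered strategy.

```python
class TinyLSM:
    """Minimal LSM sketch: a mutable in-memory memtable flushes to an
    immutable sorted "SSTable" when it exceeds a size threshold.
    Reads check the memtable first, then SSTables newest-first."""

    def __init__(self, flush_at=4):
        self.memtable = {}
        self.sstables = []  # newest first; each is a sorted list of (key, value)
        self.flush_at = flush_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self._flush()

    def _flush(self):
        # Immutable once written: sort and freeze the memtable contents.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:  # newest table wins: read amplification
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one; newer values overwrite older."""
        merged = {}
        for table in reversed(self.sstables):  # oldest first, newer overwrite
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []

db = TinyLSM(flush_at=2)
db.put("a", 1); db.put("b", 2)  # hits threshold, triggers a flush
db.put("a", 3)                  # newer value lives in the memtable
assert db.get("a") == 3
db.compact()
assert db.get("b") == 2
```

The `get` loop over every SSTable is exactly the read amplification that compaction (and, in real engines, bloom filters) exists to reduce.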

Consistency

  • Vector clocks or version stamps for multi-master conflicts → last-writer-wins or application merge.
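A rough sketch of the vector-clock comparison that decides whether two versions conflict (helper names `vc_merge`/`vc_compare` are made up for illustration):

```python
def vc_merge(a, b):
    """Pointwise max of two vector clocks (used after resolving a conflict)."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def vc_compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    less = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    more = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    if less and more:
        return "concurrent"  # siblings: needs LWW or an application merge
    if less:
        return "before"
    if more:
        return "after"
    return "equal"

v1 = {"node-a": 2, "node-b": 1}  # written via node-a
v2 = {"node-a": 1, "node-b": 2}  # written via node-b
assert vc_compare(v1, v2) == "concurrent"
assert vc_merge(v1, v2) == {"node-a": 2, "node-b": 2}
```

Only the `"concurrent"` case forces a choice between last-writer-wins (simple, loses data) and surfacing siblings to the application.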

Anti-entropy

  • Merkle trees between replicas to detect divergence; repair streams.
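The Merkle comparison can be sketched as follows. Assumptions are loud here: real systems hash key ranges into a fixed-depth tree and walk mismatched subtrees over the network; this toy (`merkle_root`, `diverged` are hypothetical names) compares roots and then falls straight back to leaves.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaves):
    """Build a Merkle root over (key, value) pairs."""
    level = [h(f"{k}={v}".encode()) for k, v in leaves]
    if not level:
        return h(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash to pair up
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

def diverged(replica_a, replica_b):
    """Cheap root comparison first; only inspect contents on mismatch."""
    if merkle_root(replica_a) == merkle_root(replica_b):
        return []  # identical: one hash exchanged, no data transferred
    keys = {k for k, _ in replica_a} | {k for k, _ in replica_b}
    da, db = dict(replica_a), dict(replica_b)
    return sorted(k for k in keys if da.get(k) != db.get(k))

a = [("k1", "v1"), ("k2", "v2")]
b = [("k1", "v1"), ("k2", "STALE")]
assert diverged(a, a) == []
assert diverged(a, b) == ["k2"]  # only k2 needs a repair stream
```

The point of the tree shape is that matching subtrees are skipped entirely, so anti-entropy cost scales with divergence, not with data size.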

Client routing

  • Gossip or central metadata (e.g. ZooKeeper) publishes the ring map; clients cache it with a staleness tolerance.
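A minimal sketch of client-side ring caching with a staleness budget. `RingCache` is a hypothetical helper, and `fetch_ring` stands in for whatever metadata read (gossip digest or ZooKeeper watch) the real system uses; a stale cache is fine because a mis-routed request just gets forwarded or redirected.

```python
import time

class RingCache:
    """Cache the ring map client-side; refetch only past max_age_s."""

    def __init__(self, fetch_ring, max_age_s=30.0):
        self.fetch_ring = fetch_ring
        self.max_age_s = max_age_s
        self.ring = None
        self.fetched_at = 0.0

    def get_ring(self, now=None):
        # `now` is injectable for testing; defaults to a monotonic clock.
        now = time.monotonic() if now is None else now
        if self.ring is None or now - self.fetched_at > self.max_age_s:
            self.ring = self.fetch_ring()
            self.fetched_at = now
        return self.ring

calls = []
cache = RingCache(lambda: calls.append(1) or {"version": len(calls)},
                  max_age_s=30)
cache.get_ring(now=0)
cache.get_ring(now=10)   # within the staleness budget: served from cache
assert len(calls) == 1
cache.get_ring(now=45)   # budget exceeded: refetch
assert len(calls) == 2
```

Tolerating staleness is the design choice: it keeps the metadata service off the hot path of every request.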

E2E: quorum write


Tricky parts

| Problem | Solution |
| --- | --- |
| Hot partition | Salt keys (`user#123!42`): random suffix spreads writes; merge on reads. |
| Hinted handoff | When a replica is down, write temporarily to another node, then repair later. |
| Paxos/Raft | Use a Raft library (etcd) for shard group coordination instead of inventing your own. |
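The key-salting row can be made concrete with a short sketch. Assumptions: `SALT_BUCKETS`, `salted_key`, and the dict-backed counter are all hypothetical; the `!` separator mirrors the `user#123!42` format from the table above.

```python
import random

SALT_BUCKETS = 8  # fan-out chosen for a known-hot key

def salted_key(key):
    """Append a random suffix so writes to one hot key spread across shards."""
    return f"{key}!{random.randrange(SALT_BUCKETS)}"

def all_salted_keys(key):
    """Reads pay the price: fan out to every suffix and merge the results."""
    return [f"{key}!{i}" for i in range(SALT_BUCKETS)]

store = {}  # stand-in for the sharded store

def incr(k):
    store[k] = store.get(k, 0) + 1

for _ in range(1000):
    incr(salted_key("user#123"))  # 1000 writes, spread over up to 8 keys

total = sum(store.get(k, 0) for k in all_salted_keys("user#123"))
assert total == 1000
assert len(store) <= SALT_BUCKETS
```

This is the classic write-throughput-for-read-cost trade: salting only pays off when the key is write-hot and reads can tolerate the fan-out.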

Caveats

  • Designing from scratch in an interview means discussing tradeoffs, not writing production code.
  • Dynamo paper (gossip, sloppy quorum, hinted handoff) is canonical reading.

Managed shortcut

  • DynamoDB, Cosmos DB, Spanner — only build custom when extreme cost or latency requirements are unserved by managed options.