Database Sharding Strategy
Problem statement
A single PostgreSQL (or monolithic DB) hits CPU/IO ceiling; you need horizontal partition of data with minimal cross-shard joins and resharding story.
How it works
- Choose shard key (e.g.
tenant_id,user_id) → deterministic routing functionshard = hash(key) % N. - Each shard is independent DB instance; application router or Vitess/Citus proxy sends queries.
Analogy: City postal codes: mail for 560001 always goes to Bangalore sorting office — workers don’t search all of India for every letter.
High-level design
Rendering diagram…
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Shard router (Vitess/custom) | Computes shard from key; proxies SQL. | Lets apps mostly write normal SQL while infra handles scatter-gather where unavoidable. |
| Shard 0..N | Independent DB instances holding subset of data. | Horizontal scale when single instance CPU/IO maxed. |
| Metadata (etcd/ZK) | Stores shard map and migrations. | Operators need authoritative routing table during resharding. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Shard key selection
- High cardinality + query locality — avoid hot shard (global config table).
- Tenant isolation SaaS:
tenant_idgreat for noisy neighbor containment.
Resharding
- Double-write new + old shards during migration; background backfill; flip read traffic; cutover.
- Consistent hashing with virtual nodes reduces data movement when adding shard.
Cross-shard queries
- Avoid if possible; else scatter-gather (expensive) or CQRS read models in warehouse.
Sequences / auto-increment
- Global uniqueness: Snowflake IDs or UUIDv7 — no cross-shard sequence.
E2E: point read
Rendering diagram…
Tricky parts
| Problem | Solution |
|---|---|
| Join across users | Denormalize or async materialized view |
| Transactions spanning shards | Saga; 2PC rarely worth it |
| Rebalancing pain | Use managed sharding (Citus, Vitess, Dynamo) |
Caveats
- ORM assumptions break — educate team on explicit shard awareness.
- Backup/restore now N operations — automate with Infrastructure as Code.
Azure / AWS
- Cosmos DB partition key; Aurora Limitless (preview patterns); Citus on Azure.