Multi-Region Active-Active
Problem statement
Serve users in the US and EU with low read latency, keep writes available during a regional outage, and meet data-residency requirements, all without silent data loss.
How it works
Patterns:
- Active-passive: one region writes; DR region promotes on failover (simpler).
- Active-active: both regions accept writes — needs conflict resolution and routing rules.
Analogy: Two branch offices editing the same shared Excel without a server — you need rules (who wins conflicts) or separate tabs per branch (partition users by home region).
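
The "separate tabs per branch" option can be sketched as a write router that pins each user to a home region. This is a minimal illustration, not a prescribed implementation: `User`, `home_region`, `route_write`, `RegionUnavailable`, and the region names are hypothetical, and the failover policy is an assumption.

```python
from dataclasses import dataclass


class RegionUnavailable(Exception):
    """Raised when a write cannot be routed to any permitted region."""


@dataclass(frozen=True)
class User:
    user_id: str
    home_region: str  # hypothetical field, e.g. "eu-west-1"


def route_write(user, healthy_regions, allow_failover=False):
    """Return the region that should execute this user's write."""
    # Normal case: all writes for a user land in one region, so two
    # regions never edit the same record concurrently.
    if user.home_region in healthy_regions:
        return user.home_region
    # Home region is down: fail over only if policy permits it
    # (residency rules may forbid this for some users).
    fallbacks = sorted(healthy_regions)
    if allow_failover and fallbacks:
        return fallbacks[0]
    raise RegionUnavailable(f"no permitted region for user {user.user_id}")


eu_user = User("u-1", "eu-west-1")
print(route_write(eu_user, {"us-east-1", "eu-west-1"}))  # eu-west-1
```

Pinning trades some write availability for the absence of cross-region conflicts: a user whose home region is down can write only if failover is explicitly allowed.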
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Route 53 latency routing | DNS returns the healthy regional endpoint with the lowest latency to the client. | Lowers read RTT without a client-side region picker. |
| Regional ALB/ELB | Load balances within one region. | Failure-domain isolation: a misconfiguration affects one region, a smaller blast radius than a single global VIP. |
| DynamoDB global tables / Aurora Global Database | Cross-region replication technologies. | DynamoDB offers multi-master writes with LWW trade-offs; Aurora typically has a single writer region with read replicas in other regions. |
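
The routing row can be modeled in a few lines. This is a sketch of the selection behavior only, not Route 53's implementation: `pick_region`, the latency figures, and the region names are assumptions for illustration.

```python
def pick_region(latency_ms, healthy_regions):
    """Choose the healthy region with the lowest measured latency."""
    # Unhealthy regions are dropped first, mirroring health-checked routing.
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy_regions}
    if not candidates:
        raise LookupError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies observed from a client in Frankfurt.
latencies = {"eu-west-1": 18.0, "us-east-1": 95.0}
print(pick_region(latencies, {"eu-west-1", "us-east-1"}))  # eu-west-1
print(pick_region(latencies, {"us-east-1"}))               # us-east-1
```

The second call shows the failover case: when the EU endpoint fails its health check, the client is sent to the US at higher latency instead of receiving an error.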
Shared definitions: 00-glossary-common-services.md
Low-level design
Data residency
- EU users’ PII must stay in the EU: store a home-region flag on each user, and either block cross-region replication for restricted tables or encrypt the data with region-bound KMS keys.
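
One way to enforce the replication rule is a default-deny policy check that runs before any table is replicated. The sketch below assumes a hypothetical in-process policy map; `TablePolicy`, `POLICIES`, `may_replicate`, and the table names are illustrative and not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TablePolicy:
    name: str
    allowed_regions: frozenset  # regions where this table's data may live


# Hypothetical policies: PII is EU-only, the catalog may go anywhere listed.
POLICIES = {
    "user_pii_eu": TablePolicy("user_pii_eu", frozenset({"eu-west-1"})),
    "product_catalog": TablePolicy(
        "product_catalog", frozenset({"eu-west-1", "us-east-1"})
    ),
}


def may_replicate(table, target_region):
    """Return True only if policy explicitly allows the target region."""
    policy = POLICIES.get(table)
    if policy is None:
        return False  # default deny: unknown tables are never replicated
    return target_region in policy.allowed_regions


print(may_replicate("user_pii_eu", "us-east-1"))      # False
print(may_replicate("product_catalog", "us-east-1"))  # True
```

Default deny matters here: a newly added table stays in its region until someone classifies it.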
DynamoDB global tables
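- Conflicts are resolved last-writer-wins by timestamp: when two regions update the same item concurrently, the later write replaces the earlier one, and the application must tolerate that loss.
- With the default asynchronous replication, strongly consistent reads are guaranteed only within the region that accepted the write; reads served from other replicas are eventually consistent.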
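
To see what "application must tolerate" means in practice, here is a small model of last-writer-wins reconciliation. It is an illustration of the general technique, not DynamoDB's internal logic: the `Write` type, the region-name tie-break, and the timestamps are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    ts_ms: int   # writer's wall-clock timestamp in milliseconds
    region: str


def lww_merge(a, b):
    """Keep the write with the later timestamp and discard the other."""
    # Tie-break on region name (an assumption) so every replica
    # deterministically picks the same winner.
    return max(a, b, key=lambda w: (w.ts_ms, w.region))


eu = Write("shipping=Berlin", ts_ms=1_000, region="eu-west-1")
us = Write("shipping=Boston", ts_ms=1_001, region="us-east-1")
winner = lww_merge(eu, us)
print(winner.value)  # shipping=Boston; the EU write is discarded
```

Neither client sees an error, which is why LWW suits data where losing a concurrent update is acceptable (for example, a last-seen timestamp) and is risky for counters or balances.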
Aurora Global Database
- Single writer primary in one region; read replicas globally; managed failover promotes secondary region (RTO/RPO targets).
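
The single-writer topology can be modeled as a small routing object. This is a toy sketch under stated assumptions: `GlobalCluster`, `region_for`, and `switchover` are hypothetical names, and `switchover` models a planned role swap in which the old primary rejoins as a secondary. It does not call any AWS API.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalCluster:
    primary_region: str
    secondary_regions: list = field(default_factory=list)

    def region_for(self, op, client_region):
        """Return the region that should serve a 'read' or 'write'."""
        if op == "write":
            return self.primary_region  # only the primary accepts writes
        if client_region in self.secondary_regions:
            return client_region        # local read replica
        return self.primary_region      # no local replica: read from primary

    def switchover(self, new_primary):
        """Swap roles between the primary and one secondary region."""
        if new_primary not in self.secondary_regions:
            raise ValueError(f"{new_primary} is not a secondary region")
        self.secondary_regions.remove(new_primary)
        self.secondary_regions.append(self.primary_region)
        self.primary_region = new_primary


cluster = GlobalCluster("us-east-1", ["eu-west-1"])
print(cluster.region_for("write", "eu-west-1"))  # us-east-1
print(cluster.region_for("read", "eu-west-1"))   # eu-west-1
cluster.switchover("eu-west-1")
print(cluster.region_for("write", "us-east-1"))  # eu-west-1
```

EU writes cross the ocean until a switchover, which is the latency cost of avoiding write conflicts entirely.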
Caching
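- ElastiCache Global Datastore for Redis replicates asynchronously from one primary cluster to read-only secondary clusters in other regions; it is not active-active. Replicas can lag, so treat the cache as non-authoritative.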
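
Treating the cache as non-authoritative usually means a cache-aside pattern: the database is the source of truth and any cache entry can be dropped without losing data. The sketch below uses plain dictionaries as stand-ins for the database and Redis; `CacheAside` and its method names are hypothetical.

```python
class CacheAside:
    """Cache-aside access where the database is the only source of truth."""

    def __init__(self, database, cache):
        self.database = database  # stand-in for the authoritative store
        self.cache = cache        # stand-in for a regional Redis replica

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.database[key]  # cache miss: fall back to the database
        self.cache[key] = value     # repopulate for later reads
        return value

    def write(self, key, value):
        self.database[key] = value  # write to the source of truth first
        self.cache.pop(key, None)   # invalidate rather than update the cache


store = CacheAside(database={"plan:u-1": "free"}, cache={})
print(store.read("plan:u-1"))  # free
store.write("plan:u-1", "pro")
print(store.read("plan:u-1"))  # pro
```

Because every value can be rebuilt from the database, flushing a regional cache after a failover costs latency but not correctness.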
E2E: write in EU, read in US
Tricky parts
| Problem | Solution |
|---|---|
| Split-brain writes | Cell architecture: each user is pinned to a home cell |
| Clock skew under LWW | Hybrid logical clocks or version vectors |
| Legal discovery | Immutable audit logs per region |
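
Version vectors, one of the options in the clock-skew row, avoid wall clocks by counting updates per region. The comparison below is a minimal sketch of the standard technique; the function name `compare` and the region keys are illustrative.

```python
def compare(a, b):
    """Compare two version vectors (dicts mapping region -> update count).

    Returns "equal", "before", "after", or "concurrent", describing a
    relative to b. A missing region counts as zero.
    """
    regions = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in regions)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in regions)
    if a_ahead and b_ahead:
        return "concurrent"  # true conflict: neither write saw the other
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


print(compare({"eu": 2, "us": 1}, {"eu": 1, "us": 1}))  # after
print(compare({"eu": 2, "us": 1}, {"eu": 1, "us": 2}))  # concurrent
```

Unlike LWW, a "concurrent" result surfaces the conflict, so the application can merge the values or ask the user instead of silently dropping one write.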
Caveats
- True active-active SQL with foreign keys is hard — often CQRS + event log is cleaner.
- Chaos testing: inject regional partitions regularly (GameDays).
Azure
- Cosmos DB with multi-region writes, Azure SQL active geo-replication, and Front Door for global load balancing.