
Multi-Region Active-Active

Problem statement

Serve users in the US and EU with low read latency, write availability during a regional outage, and data-residency compliance, all without silent data loss.

How it works

Patterns:

  1. Active-passive: one region accepts writes; the DR region is promoted on failover (simpler).
  2. Active-active: both regions accept writes — needs conflict resolution and routing rules.

Analogy: Two branch offices editing the same shared Excel without a server — you need rules (who wins conflicts) or separate tabs per branch (partition users by home region).
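
The "separate tabs per branch" option can be sketched as a routing rule: every user has a home region, writes always go there, and reads are served locally when possible. This is a minimal sketch under stated assumptions; the `User` record, its `home_region` field, the `route` function, and the region names are hypothetical and not part of the design above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    home_region: str  # assumed per-user flag: the region that owns this user's writes


def route(user: User, operation: str, local_region: str, healthy: set[str]) -> str:
    """Return the region that should handle the request.

    Writes always go to the user's home region, so two regions never
    accept conflicting writes for the same user. Reads are served from
    the local region when it is healthy, otherwise from the home region.
    """
    if operation == "write":
        if user.home_region not in healthy:
            # Fail loudly rather than write elsewhere and risk a conflict.
            raise RuntimeError(f"home region {user.home_region} unavailable")
        return user.home_region
    if local_region in healthy:
        return local_region
    if user.home_region in healthy:
        return user.home_region
    raise RuntimeError("no healthy region available")


# Usage: an EU-homed user connecting through the US region.
alice = User(user_id="alice", home_region="eu-west-1")
both_up = {"us-east-1", "eu-west-1"}
print(route(alice, "write", "us-east-1", both_up))  # write is sent to the home region
print(route(alice, "read", "us-east-1", both_up))   # read is served locally
```

Rejecting writes while the home region is down trades availability for safety. A design that needs writes to survive a regional outage would add a controlled re-homing step, which this sketch does not attempt.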

High-level design

[Diagram: high-level design]

Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| Route53 latency routing | DNS returns nearest healthy region endpoint. | Improves read RTT for regional deployments without client-side region picker. |
| Regional ALB/ELB | Load balances within one region. | Failure domain isolation: blast radius smaller than one global VIP misconfig. |
| Dynamo global tables / Aurora Global | Cross-region replication technologies. | Dynamo offers multi-master with LWW tradeoffs; Aurora is typically single writer + read replicas globally. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Data residency

  • EU users’ PII must stay in the EU: store a home-region flag on each user, and either exclude restricted tables from cross-region replication or encrypt them with region-bound KMS keys.
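
The "reject cross-region replication for restricted tables" rule can be expressed as a small policy check. The sketch below is illustrative only: the table name `user_pii`, the `may_replicate` function, and the region-to-jurisdiction mapping are assumptions, and the alternative of encrypting with region-bound KMS keys is not shown.

```python
RESTRICTED_TABLES = {"user_pii"}  # hypothetical names of tables that hold PII

# Hypothetical mapping from region to legal jurisdiction.
JURISDICTION = {"eu-west-1": "EU", "eu-central-1": "EU", "us-east-1": "US"}


def may_replicate(table: str, home_region: str, target_region: str) -> bool:
    """Decide whether a row may be copied to target_region.

    Rows in unrestricted tables replicate anywhere. Rows in restricted
    tables replicate only to regions in the same jurisdiction as the
    user's home region.
    """
    if table not in RESTRICTED_TABLES:
        return True
    return JURISDICTION[home_region] == JURISDICTION[target_region]


# Usage: an EU user's PII may move within the EU but not to the US.
print(may_replicate("user_pii", "eu-west-1", "eu-central-1"))
print(may_replicate("user_pii", "eu-west-1", "us-east-1"))
print(may_replicate("product_catalog", "eu-west-1", "us-east-1"))
```

Keeping the check in one function makes it easy to enforce at the replication boundary and to audit separately from application code.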

DynamoDB global tables

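  • Conflicts resolve by last-writer-wins (LWW) on timestamp; the application must tolerate a concurrent write being silently overwritten.
  • Strongly consistent reads are available only within a region, against that region's replica; they do not cover writes still replicating from other regions.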
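
To make the LWW tradeoff concrete, here is a generic last-writer-wins merge. It illustrates the idea only and is not DynamoDB's internal algorithm; the `Version` type, the tie-break on region name, and the sample values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # wall-clock write time as recorded by the writing region
    region: str


def lww_merge(a: Version, b: Version) -> Version:
    """Keep the version with the later timestamp; break ties by region name."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# Two regions update the same item within a few milliseconds of each other.
us = Version(value="email=old@example.com", timestamp_ms=1_000, region="us-east-1")
eu = Version(value="email=new@example.com", timestamp_ms=1_005, region="eu-west-1")

winner = lww_merge(us, eu)
print(winner.value)  # the US write is discarded without any error being raised
```

The merge is deterministic and order-independent, which is why replicas converge, but the losing write disappears silently. That is the behavior the application has to tolerate.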

Aurora Global Database

  • Single writer primary in one region; read replicas globally; managed failover promotes secondary region (RTO/RPO targets).

Caching

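  • Redis across regions: ElastiCache Global Datastore has a single writable primary with read-only secondaries, while active-active Redis offerings resolve write conflicts automatically (LWW-style for simple values). Either way, keep only non-authoritative data in the cache.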

E2E: write in EU, read in US

[Diagram: write in EU, read in US]

Tricky parts

| Problem | Solution |
| --- | --- |
| Split brain writes | Cell architecture: user pinned to home cell |
| Clock skew LWW | Hybrid logical clocks or version vectors |
| Legal discovery | Immutable audit logs per region |
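
Version vectors avoid relying on wall clocks: each region keeps a counter, and two versions conflict only when neither vector dominates the other. The following is a minimal sketch of that comparison; the function names `compare` and `bump` and the short region keys are illustrative assumptions.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Compare two version vectors (region -> counter).

    Returns "equal", "a_before_b", "b_before_a", or "concurrent".
    """
    regions = a.keys() | b.keys()
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in regions)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in regions)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a_before_b"
    if b_le_a:
        return "b_before_a"
    return "concurrent"  # neither dominates: a real conflict to resolve


def bump(vector: dict[str, int], region: str) -> dict[str, int]:
    """Return a copy of vector with region's counter incremented (a local write)."""
    updated = dict(vector)
    updated[region] = updated.get(region, 0) + 1
    return updated


# Usage: both regions write on top of the same base version.
base = {"us": 1, "eu": 1}
us_write = bump(base, "us")
eu_write = bump(base, "eu")
print(compare(base, us_write))      # base happened before the US write
print(compare(us_write, eu_write))  # the two writes are concurrent
```

Unlike LWW, a "concurrent" result surfaces the conflict instead of hiding it, so the application can merge both values or ask the user to choose.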

Caveats

  • True active-active SQL with foreign keys is hard — often CQRS + event log is cleaner.
  • Chaos testing: inject regional partitions regularly (GameDays).

Azure

  • Equivalents: Cosmos DB multi-region writes; Azure SQL active geo-replication; Front Door for global load balancing.