Distributed Locking
Problem statement
Coordinate exclusive access to a resource (migrate shard, cron leader, inventory mutation) across many unreliable processes without split-brain.
How it works
- Acquire lease in central store with TTL; heartbeat renews while work proceeds.
- On crash, TTL expiry releases lock automatically (fencing still needed for storage writes).
Analogy: an "occupied" sign on a bathroom door that auto-resets if someone faints inside for longer than 30 minutes — the next person may enter, but stale work from the previous occupant must be handled carefully.
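The acquire/renew loop can be sketched against a toy in-memory stand-in for the central store (`LeaseStore` and its method names are illustrative, not a real client API):

```python
import time
import uuid

class LeaseStore:
    """Toy in-memory stand-in for a central store with set-if-absent + TTL."""
    def __init__(self):
        self._leases = {}  # key -> (owner_token, expiry_timestamp)

    def acquire(self, key, owner, ttl_s):
        now = time.monotonic()
        held = self._leases.get(key)
        if held is not None and held[1] > now:
            return False               # someone else holds a live lease
        self._leases[key] = (owner, now + ttl_s)
        return True

    def renew(self, key, owner, ttl_s):
        """Heartbeat: extend the lease only if we still own it and it hasn't expired."""
        now = time.monotonic()
        held = self._leases.get(key)
        if held is None or held[0] != owner or held[1] <= now:
            return False               # lease lost (e.g. TTL lapsed); abort work
        self._leases[key] = (owner, now + ttl_s)
        return True

store = LeaseStore()
me = str(uuid.uuid4())                 # unique owner token per process
assert store.acquire("migrate-shard-7", me, ttl_s=0.05)
assert not store.acquire("migrate-shard-7", "rival", ttl_s=0.05)
time.sleep(0.06)                       # simulate a crash: no heartbeats, TTL lapses
assert store.acquire("migrate-shard-7", "rival", ttl_s=0.05)
```

Note that TTL expiry alone only releases the *lock*; as the next bullet says, storage writes still need fencing.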
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Redis SET NX EX | Set key only if absent with TTL. | Lease pattern for locks; TTL auto-releases after crash (but use fencing for storage!). |
| Fenced storage writes | Storage rejects stale writers using a monotonic token. | Prevents a zombie primary from corrupting data after its lease expires during a GC pause. |
| ZooKeeper/etcd (alternative) | Coordination service with ephemeral sequential nodes. | Stronger lease semantics for leader election than Redis in some cases. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Redis Redlock (controversial)
- Acquire a majority quorum across N independent Redis masters — Martin Kleppmann's critique: its safety rests on timing assumptions (clock drift, GC pauses).
- Pragmatic AWS: DynamoDB conditional PutItem with `lease_owner` + `lease_version` — stronger if used correctly.
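The quorum idea behind Redlock can be illustrated with independent in-memory stores standing in for the N masters — a deliberately simplified sketch that omits the per-master timeout and validity-time checks whose fragility Kleppmann's critique targets:

```python
import time

def try_acquire(store, key, owner, ttl_s):
    """SET key owner NX EX ttl, against a dict acting as one independent master."""
    now = time.monotonic()
    held = store.get(key)
    if held is not None and held[1] > now:
        return False
    store[key] = (owner, now + ttl_s)
    return True

def redlock_acquire(stores, key, owner, ttl_s):
    """The lock is held only if a majority of independent masters granted it."""
    granted = sum(try_acquire(s, key, owner, ttl_s) for s in stores)
    if granted >= len(stores) // 2 + 1:
        return True
    for s in stores:                   # failed: release any partial grants
        if s.get(key, (None,))[0] == owner:
            del s[key]
    return False

masters = [{}, {}, {}, {}, {}]         # five independent "Redis masters"
assert redlock_acquire(masters, "job", "A", ttl_s=5.0)
assert not redlock_acquire(masters, "job", "B", ttl_s=5.0)
```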
Etcd / Consul sessions
- Session TTL + ephemeral keys — good for Kubernetes leader election patterns.
Fencing token
- Problem: delayed old primary writes after lock loss.
- Fix: storage layer rejects writes unless `token > last_committed_token`, with tokens issued from a ZooKeeper/etcd sequence.
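The storage-side check can be sketched as follows (class and field names hypothetical; the reject rule follows the `token > last_committed_token` condition above):

```python
class FencedStorage:
    """Rejects writes whose fencing token is not strictly newer than the last committed one."""
    def __init__(self):
        self.last_committed_token = -1
        self.data = {}

    def write(self, key, value, token):
        if token <= self.last_committed_token:
            raise PermissionError(f"stale fencing token {token}")
        self.last_committed_token = token
        self.data[key] = value

storage = FencedStorage()
storage.write("row", "from-primary-1", token=33)   # current lease holder
storage.write("row", "from-primary-2", token=34)   # new holder after failover
try:
    storage.write("row", "zombie-write", token=33)  # old primary wakes from GC pause
except PermissionError:
    pass                                            # stale writer correctly rejected
assert storage.data["row"] == "from-primary-2"
```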
E2E: acquire → work → release
Tricky parts
| Problem | Solution |
|---|---|
| Long GC pause > TTL | Reasonable TTL + fencing; or DB advisory locks if same DB |
| Unlock wrong owner | Lua script compare token before DEL |
| Thundering herd after expiry | Randomized TTL jitter |
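The compare-before-DEL fix for wrong-owner unlock is usually a short Lua script, so that the check and delete execute atomically on the Redis server (this script is the standard pattern from the Redis documentation); its semantics are sketched below in pure Python against a dict:

```python
# Lua executed atomically by Redis (standard pattern from the Redis docs):
UNLOCK_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

def unlock(store, key, owner_token):
    """Same semantics, in-process: delete only if we still own the lock."""
    if store.get(key) == owner_token:
        del store[key]
        return True
    return False

locks = {"job": "token-A"}
assert not unlock(locks, "job", "token-B")   # wrong owner: key survives
assert unlock(locks, "job", "token-A")       # rightful owner releases
assert "job" not in locks
```

A plain GET-then-DEL from the client would race: the lock could expire and be re-acquired by someone else between the two commands, which is exactly why the script must run server-side.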
Caveats
- An in-process mutex does nothing across hosts — obvious, but a common mistake.
- SQS visibility timeout is not a general-purpose lock — different semantics.
Managed
- Amazon DynamoDB lock client, S3 conditional writes for single-row resources.
When not to use distributed locks
- Prefer idempotent Saga + unique constraints; locks add operational risk.
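The unique-constraint alternative can be sketched with stdlib sqlite3: every worker attempts the insert, and the constraint — not a lock — guarantees the step takes effect exactly once (schema and names illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE applied (step_id TEXT PRIMARY KEY, result TEXT)")

def apply_step_once(conn, step_id, result):
    """Idempotent: the first writer wins; concurrent duplicates and retries are no-ops."""
    try:
        conn.execute("INSERT INTO applied (step_id, result) VALUES (?, ?)",
                     (step_id, result))
        conn.commit()
        return True                    # we performed the step
    except sqlite3.IntegrityError:
        return False                   # someone already did it; safe to skip

assert apply_step_once(db, "order-42:reserve-stock", "ok")
assert not apply_step_once(db, "order-42:reserve-stock", "ok")   # duplicate retry
```

There is no lease to heartbeat, no TTL to tune, and no fencing to implement — which is the operational-risk argument above.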