
Distributed Locking

Problem statement

Coordinate exclusive access to a resource (migrate shard, cron leader, inventory mutation) across many unreliable processes without split-brain.

How it works

  • Acquire lease in central store with TTL; heartbeat renews while work proceeds.
  • On crash, TTL expiry releases lock automatically (fencing still needed for storage writes).

Analogy: a bathroom "occupied" sign that auto-resets after 30 minutes in case the occupant has fainted. The next person may then enter, but any stale work left behind by the previous occupant must be handled carefully.
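The acquire/heartbeat/expiry cycle above can be sketched with an in-memory stand-in for the lease store. `LeaseStore` and its method names are hypothetical; a real deployment would use Redis `SET key value NX EX ttl` (or etcd leases) instead of a dict, but the semantics are the same:

```python
import time

class LeaseStore:
    """In-memory sketch of Redis SET NX EX lease semantics (hypothetical class)."""

    def __init__(self):
        self._data = {}  # key -> (owner, expires_at)

    def acquire(self, key, owner, ttl):
        """Take the lease only if the key is absent or its TTL has lapsed (SET NX EX)."""
        now = time.monotonic()
        holder = self._data.get(key)
        if holder is None or holder[1] <= now:
            self._data[key] = (owner, now + ttl)
            return True
        return False

    def renew(self, key, owner, ttl):
        """Heartbeat: extend the lease, but only while we still hold it."""
        now = time.monotonic()
        holder = self._data.get(key)
        if holder is not None and holder[0] == owner and holder[1] > now:
            self._data[key] = (owner, now + ttl)
            return True
        return False

store = LeaseStore()
assert store.acquire("shard-7", "worker-a", ttl=0.05)       # lease taken
assert not store.acquire("shard-7", "worker-b", ttl=0.05)   # already held
assert store.renew("shard-7", "worker-a", ttl=0.05)         # heartbeat keeps it alive
time.sleep(0.06)                                            # worker-a "crashes"
assert store.acquire("shard-7", "worker-b", ttl=0.05)       # TTL expiry auto-released
```

Note that TTL expiry only releases the *lock*; as the next sections stress, storage writes still need fencing because worker-a may wake up believing it holds the lease.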

High-level design


Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| Redis `SET NX EX` | Sets a key only if absent, with a TTL. | Lease pattern for locks; the TTL auto-releases after a crash (but use fencing for storage!). |
| Fenced storage writes | Storage rejects stale writers using a monotonic token. | Prevents a zombie primary from corrupting data when its lease expires during a GC pause. |
| ZooKeeper/etcd (alternative) | Coordination service with ephemeral sequential nodes. | Stronger lease semantics for leader election than Redis in some cases. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Redis Redlock (controversial)

  • Quorum acquire across N independent Redis masters. Martin Kleppmann's critique: its safety rests on timing assumptions (clock drift, GC pauses), so don't rely on it for correctness.
  • Pragmatic AWS alternative: DynamoDB conditional PutItem keyed on lease_owner + lease_version, which is stronger if used correctly.
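The DynamoDB pattern can be sketched by simulating a conditional put: the write succeeds only when the item is absent or still at the version the caller last read, otherwise it fails the way a `ConditionExpression` on `lease_version` would. `LeaseTable` and its attribute names here are illustrative, not the real DynamoDB lock client API:

```python
class LeaseTable:
    """In-memory sketch of DynamoDB conditional-put lease semantics (hypothetical)."""

    def __init__(self):
        self._items = {}  # lock_id -> {"lease_owner": ..., "lease_version": ...}

    def try_acquire(self, lock_id, owner, expected_version):
        """Conditional put: succeed only if the item is absent (expected_version == 0)
        or its lease_version still matches what the caller last observed."""
        item = self._items.get(lock_id)
        if item is None and expected_version == 0:
            self._items[lock_id] = {"lease_owner": owner, "lease_version": 1}
            return True
        if item is not None and item["lease_version"] == expected_version:
            self._items[lock_id] = {"lease_owner": owner,
                                    "lease_version": expected_version + 1}
            return True
        return False  # real DynamoDB raises ConditionalCheckFailedException

table = LeaseTable()
assert table.try_acquire("inventory-42", "svc-a", expected_version=0)   # fresh lock
assert not table.try_acquire("inventory-42", "svc-b", expected_version=0)  # lost race
assert table.try_acquire("inventory-42", "svc-b", expected_version=1)   # valid takeover
```

Because every successful write bumps `lease_version`, the version doubles as a fencing token for downstream writes.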

Etcd / Consul sessions

  • Session TTL + ephemeral keys — good for Kubernetes leader election patterns.

Fencing token

  • Problem: delayed old primary writes after lock loss.
  • Fix: the storage layer rejects a write unless its token > last_committed_token, where tokens come from a monotonic ZooKeeper / etcd sequence.
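The fencing check is a one-line comparison on the storage side. A minimal sketch, assuming a hypothetical `FencedStore` whose tokens come from a monotonic sequence (ZooKeeper zxid, etcd revision, or the lease version above):

```python
class FencedStore:
    """Storage layer that rejects stale writers via a monotonic fencing token."""

    def __init__(self):
        self.last_committed_token = 0
        self.value = None

    def write(self, token, value):
        # A zombie primary resuming after a GC pause still carries its old
        # token, so anything at or below the newest token seen is rejected.
        if token <= self.last_committed_token:
            return False
        self.last_committed_token = token
        self.value = value
        return True

store = FencedStore()
assert store.write(token=1, value="from old primary")
assert store.write(token=2, value="from new primary")
assert not store.write(token=1, value="stale write after GC pause")  # fenced off
assert store.value == "from new primary"
```

The key point is that the *storage*, not the lock service, enforces the check; a lock alone cannot stop a writer that doesn't know its lease expired.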

E2E: acquire → work → release


Tricky parts

| Problem | Solution |
| --- | --- |
| Long GC pause > TTL | Reasonable TTL + fencing; or DB advisory locks if everything shares one DB |
| Unlock by the wrong owner | Lua script compares the token before DEL |
| Thundering herd after expiry | Randomized TTL jitter |
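Two of these fixes are small enough to show directly. The compare-before-DEL fix is the canonical Redis unlock script (delete only if the stored value is still our token, executed atomically server-side); below it is simulated against a plain dict, together with a jittered TTL helper. `safe_release` and `jittered_ttl` are illustrative names:

```python
import random

# Canonical Redis unlock script: atomic compare-and-delete server-side.
UNLOCK_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

def safe_release(store, key, token):
    """Simulates the script's atomic semantics: never delete a lock we no longer own."""
    if store.get(key) == token:
        del store[key]
        return True
    return False  # another owner holds it now (our TTL lapsed); leave it alone

def jittered_ttl(base_seconds, jitter_frac=0.2):
    """Randomized TTL so contenders don't all retry at the same instant."""
    return base_seconds * (1 + random.uniform(0, jitter_frac))

locks = {"cron:nightly": "token-a"}
assert not safe_release(locks, "cron:nightly", "token-b")  # wrong owner: refused
assert safe_release(locks, "cron:nightly", "token-a")      # rightful owner: released
```

A plain `DEL` instead of the script is the classic bug: a client whose lease already expired deletes the *next* owner's lock.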

Caveats

  • An in-process mutex is useless across hosts; obvious, but a common mistake.
  • SQS visibility timeout is not a general-purpose lock — different semantics.

Managed

  • Amazon DynamoDB lock client, S3 conditional writes for single-row resources.

When not to use distributed locks

  • Prefer idempotent Saga + unique constraints; locks add operational risk.