
Distributed Locking

Problem statement

Coordinate exclusive access to a resource (migrate shard, cron leader, inventory mutation) across many unreliable processes without split-brain.

How it works

  • Acquire lease in central store with TTL; heartbeat renews while work proceeds.
  • On crash, TTL expiry releases lock automatically (fencing still needed for storage writes).

Analogy: a bathroom "occupied" sign that auto-resets after 30 minutes in case the occupant has fainted. The next person may then enter, but any stale work left behind by the previous occupant must be handled carefully.
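The acquire/heartbeat/expiry cycle above can be sketched with an in-memory stand-in for the lease store. `LeaseStore` and its method names are hypothetical; a real deployment would use Redis `SET key value NX EX ttl` (or etcd leases) instead of a dict, but the semantics are the same:

```python
import time

class LeaseStore:
    """In-memory sketch of Redis SET NX EX lease semantics (hypothetical class)."""

    def __init__(self):
        self._data = {}  # key -> (owner, expires_at)

    def acquire(self, key, owner, ttl):
        """Take the lease only if the key is absent or its TTL has lapsed (SET NX EX)."""
        now = time.monotonic()
        holder = self._data.get(key)
        if holder is None or holder[1] <= now:
            self._data[key] = (owner, now + ttl)
            return True
        return False

    def renew(self, key, owner, ttl):
        """Heartbeat: extend the lease, but only while we still hold it."""
        now = time.monotonic()
        holder = self._data.get(key)
        if holder is not None and holder[0] == owner and holder[1] > now:
            self._data[key] = (owner, now + ttl)
            return True
        return False

store = LeaseStore()
assert store.acquire("shard-7", "worker-a", ttl=0.05)       # lease taken
assert not store.acquire("shard-7", "worker-b", ttl=0.05)   # already held
assert store.renew("shard-7", "worker-a", ttl=0.05)         # heartbeat keeps it alive
time.sleep(0.06)                                            # worker-a "crashes"
assert store.acquire("shard-7", "worker-b", ttl=0.05)       # TTL expiry auto-released
```

Note that TTL expiry only releases the *lock*; as the next sections stress, storage writes still need fencing because worker-a may wake up believing it holds the lease.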

High-level design


Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| Redis `SET NX EX` | Sets a key only if absent, with a TTL. | Lease pattern for locks; the TTL auto-releases after a crash (but use fencing for storage!). |
| Fenced storage writes | Storage rejects stale writers using a monotonic token. | Prevents a zombie primary from corrupting data when its lease expires during a GC pause. |
| ZooKeeper/etcd (alternative) | Coordination service with ephemeral sequential nodes. | Stronger lease semantics for leader election than Redis in some cases. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Redis Redlock (controversial)

  • Quorum acquire across N independent Redis masters. Martin Kleppmann's critique: its safety rests on timing assumptions (clock drift, GC pauses), so don't rely on it for correctness.
  • Pragmatic AWS alternative: DynamoDB conditional PutItem keyed on lease_owner + lease_version, which is stronger if used correctly.
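The DynamoDB pattern can be sketched by simulating a conditional put: the write succeeds only when the item is absent or still at the version the caller last read, otherwise it fails the way a `ConditionExpression` on `lease_version` would. `LeaseTable` and its attribute names here are illustrative, not the real DynamoDB lock client API:

```python
class LeaseTable:
    """In-memory sketch of DynamoDB conditional-put lease semantics (hypothetical)."""

    def __init__(self):
        self._items = {}  # lock_id -> {"lease_owner": ..., "lease_version": ...}

    def try_acquire(self, lock_id, owner, expected_version):
        """Conditional put: succeed only if the item is absent (expected_version == 0)
        or its lease_version still matches what the caller last observed."""
        item = self._items.get(lock_id)
        if item is None and expected_version == 0:
            self._items[lock_id] = {"lease_owner": owner, "lease_version": 1}
            return True
        if item is not None and item["lease_version"] == expected_version:
            self._items[lock_id] = {"lease_owner": owner,
                                    "lease_version": expected_version + 1}
            return True
        return False  # real DynamoDB raises ConditionalCheckFailedException

table = LeaseTable()
assert table.try_acquire("inventory-42", "svc-a", expected_version=0)   # fresh lock
assert not table.try_acquire("inventory-42", "svc-b", expected_version=0)  # lost race
assert table.try_acquire("inventory-42", "svc-b", expected_version=1)   # valid takeover
```

Because every successful write bumps `lease_version`, the version doubles as a fencing token for downstream writes.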

Etcd / Consul sessions

  • Session TTL + ephemeral keys — good for Kubernetes leader election patterns.

Fencing token

  • Problem: delayed old primary writes after lock loss.
  • Fix: the storage layer rejects a write unless its token > last_committed_token, where tokens come from a monotonic ZooKeeper / etcd sequence.
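The fencing check is a one-line comparison on the storage side. A minimal sketch, assuming a hypothetical `FencedStore` whose tokens come from a monotonic sequence (ZooKeeper zxid, etcd revision, or the lease version above):

```python
class FencedStore:
    """Storage layer that rejects stale writers via a monotonic fencing token."""

    def __init__(self):
        self.last_committed_token = 0
        self.value = None

    def write(self, token, value):
        # A zombie primary resuming after a GC pause still carries its old
        # token, so anything at or below the newest token seen is rejected.
        if token <= self.last_committed_token:
            return False
        self.last_committed_token = token
        self.value = value
        return True

store = FencedStore()
assert store.write(token=1, value="from old primary")
assert store.write(token=2, value="from new primary")
assert not store.write(token=1, value="stale write after GC pause")  # fenced off
assert store.value == "from new primary"
```

The key point is that the *storage*, not the lock service, enforces the check; a lock alone cannot stop a writer that doesn't know its lease expired.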

E2E: acquire → work → release


Tricky parts

| Problem | Solution |
| --- | --- |
| Long GC pause > TTL | Reasonable TTL + fencing; or DB advisory locks if everything shares one DB |
| Unlock by the wrong owner | Lua script compares the token before DEL |
| Thundering herd after expiry | Randomized TTL jitter |
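Two of these fixes are small enough to show directly. The compare-before-DEL fix is the canonical Redis unlock script (delete only if the stored value is still our token, executed atomically server-side); below it is simulated against a plain dict, together with a jittered TTL helper. `safe_release` and `jittered_ttl` are illustrative names:

```python
import random

# Canonical Redis unlock script: atomic compare-and-delete server-side.
UNLOCK_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

def safe_release(store, key, token):
    """Simulates the script's atomic semantics: never delete a lock we no longer own."""
    if store.get(key) == token:
        del store[key]
        return True
    return False  # another owner holds it now (our TTL lapsed); leave it alone

def jittered_ttl(base_seconds, jitter_frac=0.2):
    """Randomized TTL so contenders don't all retry at the same instant."""
    return base_seconds * (1 + random.uniform(0, jitter_frac))

locks = {"cron:nightly": "token-a"}
assert not safe_release(locks, "cron:nightly", "token-b")  # wrong owner: refused
assert safe_release(locks, "cron:nightly", "token-a")      # rightful owner: released
```

A plain `DEL` instead of the script is the classic bug: a client whose lease already expired deletes the *next* owner's lock.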

Caveats

  • An in-process mutex is useless across hosts; obvious, but a common mistake.
  • SQS visibility timeout is not a general-purpose lock — different semantics.

Managed

  • Amazon DynamoDB lock client, S3 conditional writes for single-row resources.

When not to use distributed locks

  • Prefer idempotent Saga + unique constraints; locks add operational risk.