Saga vs Two-Phase Commit (Distributed Transactions)
Problem statement
Coordinate multiple microservices to complete one business transaction (trip booking + payment + loyalty) while handling partial failures. The choice is between 2PC and a Saga (choreography or orchestration), with the outbox pattern used alongside for reliable event publishing.
How it works — mental models
- 2PC: a voting (prepare) phase, then a commit phase. Participants block while they wait for the decision, and the coordinator is a single point of failure; rarely used across microservices today.
- Saga: a sequence of local transactions, with compensating actions run on failure. Business state is eventually consistent.
- Outbox: one DB transaction writes the business row and an outbox event; a relay then publishes the event to the bus. Delivery is at-least-once, so downstream consumers must be idempotent.
Analogy: 2PC = everyone holds their breath until the conductor says “go”; if the conductor faints, everyone is stuck. Saga = a relay race: if runner 3 drops the baton, runners 1–2 undo their laps (compensate) instead of freezing time.
High-level comparison
When to use what
| Pattern | When |
|---|---|
| 2PC / XA | Same data center, same DBA team, legacy monolithic DB shards only |
| Choreography saga | Few steps, clear events, team prefers loose coupling |
| Orchestrated saga | Complex branching/compensation — Temporal, Camunda, AWS Step Functions |
| Outbox | Need reliable publish after DB commit |
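The outbox row in the table can be illustrated with SQLite from the standard library. This is a sketch under assumptions: the `bookings` and `outbox` table layouts, the `BookingConfirmed` event shape, and the `publish` callback are hypothetical, and a real relay would publish to a message bus rather than call a Python function.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE bookings (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def create_booking(conn: sqlite3.Connection, booking_id: str) -> None:
    # `with conn` commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO bookings (id, status) VALUES (?, 'CONFIRMED')",
            (booking_id,),
        )
        event = {"type": "BookingConfirmed", "booking_id": booking_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(conn: sqlite3.Connection, publish: Callable[[dict], None]) -> int:
    """Publish pending events; mark each as published only after publish returns."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        # A crash after publish but before the UPDATE causes a re-publish later,
        # which is why delivery is at-least-once and consumers must be idempotent.
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A typical use would call `create_booking` from the request handler and run `relay_once` on a timer or from a change-data-capture feed.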
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Coordinator (2PC) | Collects votes, then commits across participants. | Shown as contrast: rarely used across microservices because it blocks and is fragile. |
| Orchestrator (Temporal / Step Functions) | Central workflow with explicit compensation steps. | Readable saga; retries and timeouts built-in vs DIY state machines. |
| Participants | Services owning local transactions. | Each commits only its own DB; no distributed locks across unrelated schemas ideally. |
Shared definitions: 00-glossary-common-services.md
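To make the coordinator row concrete, here is a toy in-memory 2PC. It is a sketch only: `Participant`, `two_phase_commit`, and the `crash_after_voting` flag are hypothetical names, and there is no networking, durable log, or recovery protocol. The flag exists purely to show why the pattern is described as blocking.

```python
from typing import List


class CoordinatorCrash(Exception):
    """Raised to simulate the coordinator dying between the two phases."""


class Participant:
    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def prepare(self) -> bool:
        # Voting phase: a "yes" vote means resources stay locked until a decision.
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.can_commit

    def commit(self) -> None:
        self.state = "COMMITTED"

    def rollback(self) -> None:
        self.state = "ABORTED"


def two_phase_commit(
    participants: List[Participant], crash_after_voting: bool = False
) -> bool:
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if crash_after_voting:
        # Participants that voted "yes" are now stuck in PREPARED.
        raise CoordinatorCrash("decision never delivered")
    if all(votes):  # phase 2: unanimous yes, so commit everywhere
        for p in participants:
            p.commit()
        return True
    for p in participants:  # any "no" vote aborts everyone
        p.rollback()
    return False
```

With `crash_after_voting=True`, every participant is left in `PREPARED` and cannot safely commit or abort on its own, which is the "conductor faints" case from the analogy.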
Low-level design notes
Temporal.io example
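- Workflow code explicitly defines retries, timers, and saga compensation functions, and is versioned so that in-flight workflows keep replaying deterministically.
- Workers run idempotent activities, deduplicating on workflow id + run id (plus an activity identifier if one run can invoke the same activity more than once).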
Step Functions
- State machine defined in JSON (Amazon States Language, ASL); native AWS service integrations; failures routed to a DLQ.
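An ASL definition for the trip saga might look roughly like the structure below, built here as a Python dict so it can be checked and serialized with `json.dumps`. The state names, Lambda function names, region, and account id are hypothetical placeholders; only the ASL field names (`StartAt`, `States`, `Type`, `Resource`, `Retry`, `Catch`, `Next`) are taken from the language itself. DLQ wiring is not shown.

```python
import json
from typing import Optional


def task(function_name: str, next_state: str, on_error: Optional[str] = None) -> dict:
    """Build one ASL Task state with a retry policy and an optional Catch."""
    state = {
        "Type": "Task",
        # Placeholder ARN: region, account id, and function name are made up.
        "Resource": f"arn:aws:lambda:us-east-1:123456789012:function:{function_name}",
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
        "Next": next_state,
    }
    if on_error is not None:
        # After retries are exhausted, jump to the compensation state.
        state["Catch"] = [
            {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": on_error}
        ]
    return state


def build_saga_definition() -> dict:
    return {
        "Comment": "Trip booking saga (illustrative)",
        "StartAt": "BookTrip",
        "States": {
            "BookTrip": task("book-trip", "ChargePayment", on_error="SagaFailed"),
            "ChargePayment": task("charge-payment", "AwardLoyalty", on_error="CancelTrip"),
            "AwardLoyalty": task("award-loyalty", "SagaSucceeded", on_error="RefundPayment"),
            # Compensation chain: refund first, then cancel the trip.
            "RefundPayment": task("refund-payment", "CancelTrip"),
            "CancelTrip": task("cancel-trip", "SagaFailed"),
            "SagaSucceeded": {"Type": "Succeed"},
            "SagaFailed": {"Type": "Fail", "Error": "SagaFailed", "Cause": "A step failed"},
        },
    }


def to_asl_json() -> str:
    return json.dumps(build_saga_definition(), indent=2)
```

Each forward step names the compensation state it should fall back to, so the undo order is visible in the definition rather than hidden in handler code.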
Pitfalls
- Compensation failure: compensations must be idempotent so they can be retried safely; monitor and alert on stuck sagas.
- Semantic lock: the user sees “failed” while compensation is still running; design explicit intermediate UI states for this window.
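Both pitfalls can be handled by giving the saga an explicit status and deriving the UI text and the stuck check from it. The sketch below assumes a hypothetical four-value `SagaStatus`, made-up user messages, and an arbitrary five-minute idle threshold; none of these come from a specific framework.

```python
from enum import Enum


class SagaStatus(Enum):
    RUNNING = "RUNNING"
    COMPENSATING = "COMPENSATING"  # failure detected, undo still in progress
    COMPLETED = "COMPLETED"
    COMPENSATED = "COMPENSATED"    # undo finished, safe to show a final failure


# Hypothetical wording; the point is that COMPENSATING has its own message.
_USER_MESSAGES = {
    SagaStatus.RUNNING: "Booking in progress",
    SagaStatus.COMPENSATING: "Booking could not be completed; cancelling",
    SagaStatus.COMPLETED: "Booking confirmed",
    SagaStatus.COMPENSATED: "Booking failed; any charge has been reversed",
}


def user_visible_state(status: SagaStatus) -> str:
    return _USER_MESSAGES[status]


def is_stuck(
    status: SagaStatus,
    seconds_since_progress: float,
    max_idle_seconds: float = 300.0,  # assumed threshold, tune per saga
) -> bool:
    """A saga is stuck if it is non-terminal and has made no progress for too long."""
    non_terminal = status in (SagaStatus.RUNNING, SagaStatus.COMPENSATING)
    return non_terminal and seconds_since_progress > max_idle_seconds
```

A monitoring job could run `is_stuck` over open sagas and page someone when it returns true, while the UI only ever renders `user_visible_state`.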
E2E: orchestrated happy path
Tricky parts
| Problem | Solution |
|---|---|
| Duplicate charges | Idempotency-Key on Pay; saga id correlation |
| Visibility timeout | Not a saga mechanism: don’t build sagas on raw SQS visibility timeouts alone |
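The duplicate-charge row can be sketched as a payment handler that stores its result under the idempotency key. This is an in-memory illustration with hypothetical names (`PaymentService`, `pay`, the `saga-42:charge` key format); a real service would enforce the key with a unique constraint in its own database and would need to handle concurrent requests.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PaymentService:
    # idempotency key -> stored charge result (in-memory stand-in for a DB table)
    charges: Dict[str, dict] = field(default_factory=dict)
    total_charged_cents: int = 0

    def pay(self, idempotency_key: str, saga_id: str, amount_cents: int) -> dict:
        existing = self.charges.get(idempotency_key)
        if existing is not None:
            # Same key with a different amount is a caller bug, not a retry.
            if existing["amount_cents"] != amount_cents:
                raise ValueError("idempotency key reused with a different amount")
            return existing  # retry: return the original result, charge nothing
        result = {
            "charge_id": str(uuid.uuid4()),
            "saga_id": saga_id,  # correlation id for tracing and compensation
            "amount_cents": amount_cents,
        }
        self.charges[idempotency_key] = result
        self.total_charged_cents += amount_cents
        return result
```

For example, an orchestrator that retries `pay("saga-42:charge", "saga-42", 1500)` after a timeout gets the first charge back instead of creating a second one.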
Caveats
- ACID across services is a smell — redesign bounded context boundaries first.
- Read-your-writes across services needs CQRS materialized views, with the replication lag disclosed to users.
Azure
- Durable Functions orchestration; Service Bus sessions for ordered processing.