
Saga vs Two-Phase Commit (Distributed Transactions)

Problem statement

Coordinate multiple microservices to complete a business transaction (trip booking + payment + loyalty) with partial failure handling — choose between 2PC, Saga (choreography/orchestration), and outbox.

How it works — mental models

  • 2PC: a voting phase, then a commit phase. It is blocking and fragile when the single coordinator fails; rarely used across microservices today.
  • Saga: sequence of local transactions + compensating actions on failure — eventually consistent business state.
  • Outbox: the same DB transaction writes the business row and an outbox event; a relay then publishes the event to the bus. Downstream delivery is at-least-once, so consumers must be idempotent.
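The outbox bullet above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not a production relay: the table names (`bookings`, `outbox`), the `event_id` format, and the `LoyaltyConsumer` class are all assumptions made up for the example, and the in-memory database stands in for the service's own DB.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    # Hypothetical schema: one business table plus one outbox table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_id TEXT UNIQUE, payload TEXT, published INTEGER DEFAULT 0)"
    )
    return conn


def create_booking(conn: sqlite3.Connection, booking_id: str) -> None:
    # One local transaction: both rows commit, or neither does.
    with conn:
        conn.execute("INSERT INTO bookings VALUES (?, 'CONFIRMED')", (booking_id,))
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (f"booking-confirmed:{booking_id}", json.dumps({"booking_id": booking_id})),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending events in order; mark a row only after publish returns."""
    rows = conn.execute(
        "SELECT seq, event_id, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, payload in rows:
        # If publish raises, or the process dies before the UPDATE commits,
        # the row stays pending and is sent again later: at-least-once.
        publish(event_id, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


class LoyaltyConsumer:
    """Hypothetical downstream consumer that dedupes on event_id."""

    def __init__(self) -> None:
        self.seen = set()
        self.points = 0

    def handle(self, event_id: str, payload: dict) -> None:
        if event_id in self.seen:
            return  # duplicate delivery, already applied
        self.seen.add(event_id)
        self.points += 10
```

Calling `create_booking(conn, "b1")` and then `relay_once(conn, consumer.handle)` delivers the event once; a redelivery of the same `event_id` leaves `consumer.points` unchanged, which is the idempotency the bullet asks for.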

Analogy: 2PC = everyone holds breath until conductor says “go” — if conductor faints, everyone stuck. Saga = relay race — if runner 3 drops baton, runners 1–2 undo their laps (compensate) instead of freezing time.
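The relay-race picture can be made concrete with a small in-process sketch: run local steps in order and, when one fails, run the compensations of the already completed steps in reverse. The names `Step`, `run_saga`, and `demo` are hypothetical, and the steps only append to a log instead of calling real services.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # the local transaction
    compensate: Callable[[], None]   # undoes the action's business effect


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action()
            done.append(step)
        except Exception:
            for prior in reversed(done):
                prior.compensate()
            return False
    return True


def demo(fail_at: Optional[str] = None) -> Tuple[bool, List[str]]:
    """Trip booking + payment + loyalty, with one step optionally failing."""
    log: List[str] = []

    def act(name: str) -> Callable[[], None]:
        def _run() -> None:
            if name == fail_at:
                raise RuntimeError(f"{name} failed")
            log.append(name)
        return _run

    def undo(name: str) -> Callable[[], None]:
        return lambda: log.append(f"undo:{name}")

    names = ("book_trip", "charge_payment", "add_loyalty")
    ok = run_saga([Step(n, act(n), undo(n)) for n in names])
    return ok, log
```

With `demo("add_loyalty")` the log ends up as `book_trip`, `charge_payment`, `undo:charge_payment`, `undo:book_trip`. A real saga would also persist its progress so that compensation can resume after a crash; this sketch keeps everything in memory.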

High-level comparison

[Diagram: high-level comparison]

When to use what

| Pattern | When |
| --- | --- |
| 2PC / XA | Same data center, same DBA team, legacy monolithic DB shards only |
| Choreography saga | Few steps, clear events, team prefers loose coupling |
| Orchestrated saga | Complex branching/compensation (Temporal, Camunda, AWS Step Functions) |
| Outbox | Need reliable publish after DB commit |

Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| Coordinator (2PC) | Collects votes, then commits across participants. | Shown as contrast: rarely used across microservices due to blocking and fragility. |
| Orchestrator (Temporal / Step Functions) | Central workflow with explicit compensation steps. | Readable saga; retries and timeouts are built in, versus DIY state machines. |
| Participants | Services owning local transactions. | Each commits only its own DB; ideally no distributed locks across unrelated schemas. |
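For contrast with the saga approach, the coordinator's role in 2PC can be sketched as follows. This is a toy in-memory model under stated assumptions: `Participant`, `two_phase_commit`, and `CoordinatorCrashed` are hypothetical names, and a real participant would durably log its vote and hold locks while prepared.

```python
from typing import List


class CoordinatorCrashed(Exception):
    """Simulates the coordinator dying between the two phases."""


class Participant:
    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def prepare(self) -> bool:
        # Voting "yes" means promising to commit later, so locks stay held.
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.can_commit

    def commit(self) -> None:
        self.state = "COMMITTED"

    def rollback(self) -> None:
        self.state = "ABORTED"


def two_phase_commit(participants: List[Participant],
                     crash_after_voting: bool = False) -> str:
    # Phase 1: voting. Every participant must vote yes for a commit.
    votes = [p.prepare() for p in participants]
    if crash_after_voting:
        # Participants that voted yes are now stuck in PREPARED.
        raise CoordinatorCrashed("no decision was delivered")
    decision = "COMMIT" if all(votes) else "ABORT"
    # Phase 2: deliver the decision to everyone.
    for p in participants:
        if decision == "COMMIT":
            p.commit()
        else:
            p.rollback()
    return decision
```

Running it with `crash_after_voting=True` leaves every participant in `PREPARED`, unable to commit or abort on its own. That is the blocking behaviour the table refers to.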

Shared definitions: 00-glossary-common-services.md

Low-level design notes

Temporal.io example

  • Workflow code defines retries, timers, and saga compensation functions; the code is explicitly versioned.
  • Workers run idempotent activities, deduplicating on workflow ID + run ID.

Step Functions

  • The state machine is defined in JSON using Amazon States Language (ASL); it has native AWS integrations; failures can be routed to a DLQ.
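As a rough illustration of what such a definition looks like, the sketch below builds a small ASL document as a Python dict and serializes it with `json.dumps`. The state names, the retry numbers, and the three ARN parameters are assumptions invented for this example; only the ASL field names (`StartAt`, `States`, `Type`, `Resource`, `Retry`, `Catch`, `Next`) and the error names `States.Timeout` and `States.ALL` come from the language itself. The `dangling_targets` helper is a hypothetical local check, not an AWS API.

```python
import json


def booking_state_machine(book_arn: str, pay_arn: str, cancel_arn: str) -> dict:
    """Book a trip, charge payment, and cancel the trip if payment fails."""
    return {
        "Comment": "Illustrative saga: book, pay, compensate on failure",
        "StartAt": "BookTrip",
        "States": {
            "BookTrip": {"Type": "Task", "Resource": book_arn,
                         "Next": "ChargePayment"},
            "ChargePayment": {
                "Type": "Task",
                "Resource": pay_arn,
                # Retry transient timeouts with exponential backoff.
                "Retry": [{"ErrorEquals": ["States.Timeout"],
                           "IntervalSeconds": 2,
                           "MaxAttempts": 3,
                           "BackoffRate": 2.0}],
                # Any remaining error triggers the compensation step.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "CancelTrip"}],
                "Next": "Done",
            },
            "CancelTrip": {"Type": "Task", "Resource": cancel_arn,
                           "Next": "BookingFailed"},
            "BookingFailed": {"Type": "Fail", "Error": "BookingFailed",
                              "Cause": "Payment failed; trip was cancelled"},
            "Done": {"Type": "Succeed"},
        },
    }


def dangling_targets(definition: dict) -> set:
    """Return transition targets that do not name a defined state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return targets - set(states)
```

`json.dumps(booking_state_machine(...), indent=2)` yields the JSON text to deploy; checking that `dangling_targets` returns an empty set beforehand catches misspelled state names early.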

Pitfalls

  • Compensation failure: compensations must be idempotent so they can be retried; monitor and alert on stuck sagas.
  • Semantic lock: the user sees “failed” while compensation is still running; design explicit UI states for this window.
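One way to approach the UI-state pitfall is to give the saga an explicit status that distinguishes "still compensating" from "fully rolled back", and to derive user-facing text from it. The status names, messages, and transition table below are hypothetical choices for illustration, not a prescribed model.

```python
from enum import Enum


class SagaStatus(Enum):
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    COMPENSATING = "compensating"        # failed, undo still running
    COMPENSATED = "compensated"          # failed, undo finished
    NEEDS_ATTENTION = "needs_attention"  # compensation itself is stuck


# Which status changes are legal; terminal states allow none.
_ALLOWED = {
    SagaStatus.IN_PROGRESS: {SagaStatus.COMPLETED, SagaStatus.COMPENSATING},
    SagaStatus.COMPENSATING: {SagaStatus.COMPENSATED, SagaStatus.NEEDS_ATTENTION},
    SagaStatus.COMPLETED: set(),
    SagaStatus.COMPENSATED: set(),
    SagaStatus.NEEDS_ATTENTION: set(),
}

_MESSAGES = {
    SagaStatus.IN_PROGRESS: "Booking in progress",
    SagaStatus.COMPLETED: "Booking confirmed",
    SagaStatus.COMPENSATING: "Booking failed; reversing charges",
    SagaStatus.COMPENSATED: "Booking failed; nothing was charged",
    SagaStatus.NEEDS_ATTENTION: "Booking failed; support has been notified",
}


def transition(current: SagaStatus, new: SagaStatus) -> SagaStatus:
    """Return the new status, or raise if the change is not allowed."""
    if new not in _ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


def user_message(status: SagaStatus) -> str:
    return _MESSAGES[status]


def can_retry(status: SagaStatus) -> bool:
    # Only offer "try again" once the previous attempt is fully undone.
    return status is SagaStatus.COMPENSATED
```

Keeping `COMPENSATING` separate from `COMPENSATED` lets the UI avoid offering a retry while a refund is still in flight, and `NEEDS_ATTENTION` gives monitoring a concrete state to alert on.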

E2E: orchestrated happy path

[Diagram: orchestrated happy path, end to end]

Tricky parts

| Problem | Solution |
| --- | --- |
| Duplicate charges | Idempotency-Key on Pay; saga ID correlation |
| Visibility timeout | Not a saga mechanism; don’t build sagas on raw SQS visibility timeouts alone |
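The duplicate-charge row can be illustrated with a toy payment handler that remembers the result for each idempotency key. `PaymentService`, its `pay` signature, and the receipt fields are hypothetical; a real service would keep the key in a durable store with a unique constraint so that the check and the insert happen atomically.

```python
from typing import Dict, List, Tuple


class PaymentService:
    """In-memory sketch of a Pay endpoint guarded by an idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, Tuple[Tuple[str, int], dict]] = {}
        self.charges: List[dict] = []

    def pay(self, idempotency_key: str, saga_id: str, amount_cents: int) -> dict:
        request = (saga_id, amount_cents)
        if idempotency_key in self._results:
            stored_request, receipt = self._results[idempotency_key]
            if stored_request != request:
                # Same key with a different request is a client bug.
                raise ValueError("idempotency key reused with different request")
            return receipt  # retry: return the original result, charge nothing
        receipt = {
            "charge_id": f"ch_{len(self.charges) + 1}",
            "saga_id": saga_id,  # correlates the charge with its saga
            "amount_cents": amount_cents,
        }
        self.charges.append(receipt)
        self._results[idempotency_key] = (request, receipt)
        return receipt
```

If the orchestrator retries `pay("saga-42:pay", "saga-42", 5000)` after a timeout, it gets the same receipt back and `charges` still holds a single entry.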

Caveats

  • ACID across services is a smell — redesign bounded context boundaries first.
  • Read-your-writes across services needs CQRS materialized views, with replication lag disclosed to readers.

Azure

  • Durable Functions orchestration; Service Bus sessions for ordered processing.