
Distributed Tracing (Observability)

Problem statement

Debug microservices by correlating logs, metrics, and traces across process boundaries with low overhead and PII safety.

How it works

  • Instrumentation libraries wrap operations in spans; the trace ID propagates across process boundaries via HTTP headers (W3C traceparent) or gRPC metadata.
  • The Collector receives spans over OTLP and exports them to a backend that stores and queries them (Jaeger, Grafana Tempo, AWS X-Ray).
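
The span mechanics above can be sketched with a toy in-process tracer. This is a minimal illustration under stated assumptions, not the OpenTelemetry SDK: the names Span, start_span, and FINISHED are hypothetical, and a real SDK would hand finished spans to an exporter instead of a list.

```python
import contextvars
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

# Active span for the current execution context (hypothetical mini-tracer).
_current_span = contextvars.ContextVar("current_span", default=None)
FINISHED = []  # stand-in for an exporter queue


@dataclass
class Span:
    name: str
    trace_id: str             # shared by every span in one request
    span_id: str              # unique per operation
    parent_id: Optional[str]  # None marks the root span
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None


@contextmanager
def start_span(name):
    """Open a span as a child of whatever span is currently active."""
    parent = _current_span.get()
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
    )
    token = _current_span.set(span)
    try:
        yield span
    finally:
        span.end_ns = time.time_ns()
        _current_span.reset(token)  # restore the parent as the active span
        FINISHED.append(span)


# Usage: the inner span inherits the trace ID and points at its parent.
with start_span("GET /checkout") as root:
    with start_span("db.query") as child:
        pass
```

Child spans finish before their parents, so a backend rebuilds the tree from parent_id rather than from arrival order.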

Analogy: FedEx tracking number on every package leg — you see the full journey even though different trucks (services) moved it.

High-level design


Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| Service A/B | Instrumented apps emitting spans. | Spans nest RPC calls into a tree for one user request. |
| OTLP Collector | Agent/sidecar receiving telemetry; can batch/export. | Vendor-neutral pipeline; can redact PII before export. |
| Tempo / X-Ray / Honeycomb | Trace storage + UI. | Trade self-host cost vs vendor features (tail sampling, high-cardinality). |
| Prometheus remote write (optional) | Sends metrics alongside traces. | Correlated triage: jump from metric spike to example traces. |
| Grafana | Dashboards across metrics/logs/traces. | Single pane for SLO burn investigations. |

Shared definitions: 00-glossary-common-services.md

Low-level design

OpenTelemetry

  • SDK auto-instrumentation for Node/Java + manual spans for critical sections.
  • Sampling: head-based (decide at the root span, before the outcome is known) vs tail-based (decide after the trace completes, in the Collector or a vendor backend); tail-based can keep rare errors after the fact.
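
A head-based decision can be made deterministic by deriving it from the trace ID, so every service reaches the same verdict without coordination. The sketch below is one way to do that, assuming a 32-hex-character trace ID; should_sample is a hypothetical helper, not an SDK function.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based decision: keep the trace if its ID falls in the lowest
    `ratio` fraction of the ID space."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random value.
    value = int(trace_id_hex[-16:], 16)
    return value < int(ratio * 2**64)


# Usage: same trace ID, same answer, on any host.
decision = should_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.01)
```

Because the decision happens before the request runs, it cannot favor error traces; that is the gap tail-based sampling fills.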

Propagation

  • Use the W3C Trace Context standard (traceparent and tracestate headers); avoid B3 headers (X-B3-*) unless bridging to legacy Zipkin.
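
A traceparent value has four dash-separated lowercase-hex fields: version, trace ID (32 chars), parent span ID (16 chars), and flags (2 chars, low bit = sampled). The following is a simplified sketch of parsing and forwarding it; the function names are hypothetical, only version-00 style headers with exactly four fields are accepted, and header lookup assumes a lowercase key.

```python
import re
import secrets

# version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def format_traceparent(trace_id, span_id, sampled=True):
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_headers(incoming_headers):
    """Continue the caller's trace, or start a new one if none was sent."""
    ctx = parse_traceparent(incoming_headers.get("traceparent", ""))
    if ctx is None:
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, sampled = ctx["trace_id"], ctx["sampled"]
    span_id = secrets.token_hex(8)  # this service's own span becomes the parent
    return {"traceparent": format_traceparent(trace_id, span_id, sampled)}


# Usage: service B receives a request from A and calls service C.
inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
downstream = outgoing_headers(inbound)
```

The trace ID stays constant across hops while the parent ID changes at each one, which is what lets the backend stitch spans from different services into one tree.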

Storage

  • High-cardinality attributes (user_id on every span) are expensive to index and store; prefer low-cardinality tags, and use baggage sparingly because it travels with every downstream request.

PII scrubbing

  • Use a Collector processor to hash emails and drop request/response bodies; never record raw tokens in spans or logs.
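
The scrubbing rules above can be illustrated as a plain function over a span's attributes. This is a sketch of the logic only, not a Collector processor: the attribute key names, the HASH_KEY value, and scrub_attributes itself are assumptions made for the example. A keyed hash (HMAC) is used here so that hashed emails cannot be reversed with a simple dictionary of known addresses.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical attribute names and key; adapt to your own conventions.
DROP_KEYS = {"http.request.body", "http.response.body"}
SECRET_KEYS = {"authorization", "cookie", "api_key"}
HASH_KEY = b"rotate-me"  # placeholder; load from a secret store in practice


def _hash_email(match):
    email = match.group(0).lower().encode("utf-8")
    digest = hmac.new(HASH_KEY, email, hashlib.sha256).hexdigest()
    return f"email_hmac:{digest[:16]}"


def scrub_attributes(attrs):
    """Return a copy of attrs with bodies dropped, secrets redacted,
    and email addresses replaced by a keyed hash."""
    clean = {}
    for key, value in attrs.items():
        if key in DROP_KEYS:
            continue
        if key.lower() in SECRET_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub(_hash_email, value)
        else:
            clean[key] = value
    return clean


# Usage
scrubbed = scrub_attributes({
    "user.email": "Ann@Example.com",
    "http.request.body": '{"card": "4111..."}',
    "Authorization": "Bearer abc123",
    "http.status_code": 200,
})
```

Hashing keeps emails joinable across spans (same input, same hash) without exposing the address itself.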

E2E: one HTTP request across services


Tricky parts

| Problem | Solution |
| --- | --- |
| Async work loses context | Propagate OpenTelemetry context through message headers |
| Batch jobs | Use span links to the originating traces instead of a single parent-child chain |
| Cost explosion | Dynamic sampling: keep 100% of error traces, 1% of successful ones |
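
The "100% errors, 1% success" policy is a tail-based decision: it runs only after all spans of a trace have been buffered. A minimal sketch, assuming each span is a dict with a "status" field (a hypothetical shape, not an OTLP structure) and that keep_trace is called once per completed trace:

```python
import random


def keep_trace(spans, success_rate=0.01, rng=random.random):
    """Tail-based decision made after every span of the trace has arrived."""
    if any(span.get("status") == "error" for span in spans):
        return True  # keep every trace that contains an error
    return rng() < success_rate  # keep a small share of healthy traces


# Usage: an error anywhere in the trace forces retention.
trace = [{"name": "GET /checkout", "status": "ok"},
         {"name": "charge-card", "status": "error"}]
kept = keep_trace(trace)
```

The cost of this approach is memory: the component making the decision has to hold each trace until it is complete, or until a timeout expires.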

Caveats

  • Tracing is not logging: traces show the structure of a request, but you still need structured logs carrying a trace_id correlation field.
  • Clock skew across hosts distorts span ordering and durations; keep hosts synchronized with NTP.
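
The trace_id correlation field can be attached to logs with the standard logging module. The sketch below assumes the active trace ID lives in a context variable named current_trace_id (hypothetical; a real setup would read it from the tracing SDK's current span) and writes to an in-memory stream only to keep the example self-contained.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Stamp every log record with the trace ID of the current context."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the log backend can index trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Usage: set the ID when the request starts, then log as usual.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("payment authorized")
```

With the field in place, a trace view can link to the matching log lines and a log line can link back to its trace.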

Cloud picks

| AWS | Azure | GCP |
| --- | --- | --- |
| X-Ray + ADOT | App Insights + OTel | Cloud Trace |