Distributed Tracing (Observability)
Problem statement
Debug microservices by correlating logs, metrics, and traces across process boundaries with low overhead and PII safety.
How it works
- Instrumentation libraries add spans around operations; the trace ID propagates via HTTP headers (W3C `traceparent`) or gRPC metadata.
- Collector receives OTLP spans → backend stores & queries (Jaeger UI, Grafana Tempo, AWS X-Ray).
Analogy: FedEx tracking number on every package leg — you see the full journey even though different trucks (services) moved it.
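To make the propagation step concrete, here is a minimal sketch of building and parsing a W3C `traceparent` value with only the Python standard library. The helper names (`new_traceparent`, `parse_traceparent`, `child_traceparent`) are hypothetical and it only handles version `00`; in a real service the OpenTelemetry SDK's propagator would normally do this work.

```python
import re
import secrets
from typing import Optional

# Version "00" layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the trace context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()  # nothing usable to continue, start a new trace
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

The trace ID is the "tracking number" from the analogy: it stays constant across hops, while each hop contributes its own span ID.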
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Service A/B | Instrumented apps emitting spans. | Spans nest RPC calls into a tree for one user request. |
| OTLP Collector | Agent/sidecar receiving telemetry; can batch/export. | Vendor-neutral pipeline; can redact PII before export. |
| Tempo / X-Ray / Honeycomb | Trace storage + UI. | Trade self-host cost vs vendor features (tail sampling, high-cardinality). |
| Prometheus remote write (optional) | Sends metrics alongside traces. | Correlated triage: jump from metric spike to example traces. |
| Grafana | Dashboards across metrics/logs/traces. | Single pane for SLO burn investigations. |
Shared definitions: 00-glossary-common-services.md
Low-level design
OpenTelemetry
- SDK auto-instrumentation for Node/Java + manual spans for critical sections.
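- Sampling: head-based (decide at the root span) vs tail-based (decide after the trace completes, in the Collector or a vendor backend). Tail-based sampling can keep rare errors that a head-based decision would already have dropped.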
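The difference between the two sampling styles can be sketched as follows. This is an illustrative stand-in rather than the OpenTelemetry samplers themselves: the `Span` type and both function names are assumptions, and the ratio check simply compares the low 64 bits of the trace ID against a threshold so that every service reaches the same decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: str  # 32 lowercase hex characters
    name: str
    error: bool = False


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide at the root from the trace ID alone."""
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound


def tail_sample(spans: List[Span], ratio: float) -> bool:
    """Tail-based: decide once all spans of the trace have been buffered."""
    if any(s.error for s in spans):
        return True  # keep every trace that contains an error
    return head_sample(spans[0].trace_id, ratio)  # thin out the successes
```

The trade-off is that the tail-based path has to buffer whole traces before deciding, which costs memory wherever the decision is made.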
Propagation
- W3C Trace Context standard; avoid custom `X-B3-*` headers unless you need a legacy Zipkin bridge.
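Where a legacy Zipkin bridge is unavoidable, the translation is mostly mechanical. The sketch below converts B3 multi-headers into a `traceparent` value; the function name `b3_to_traceparent` is hypothetical, and it assumes 64-bit B3 trace IDs should be left-padded with zeros to 128 bits.

```python
from typing import Mapping, Optional

_HEX = set("0123456789abcdef")


def _is_hex(value: str, lengths: tuple) -> bool:
    return len(value) in lengths and set(value) <= _HEX


def b3_to_traceparent(headers: Mapping[str, str]) -> Optional[str]:
    """Translate X-B3-* headers into a W3C traceparent, or None if unusable."""
    h = {k.lower(): v.strip().lower() for k, v in headers.items()}
    trace_id = h.get("x-b3-traceid", "")
    span_id = h.get("x-b3-spanid", "")
    if not _is_hex(trace_id, (16, 32)) or not _is_hex(span_id, (16,)):
        return None
    trace_id = trace_id.rjust(32, "0")  # widen 64-bit IDs to 128 bits
    # The debug flag implies the trace is sampled.
    sampled = h.get("x-b3-sampled") in ("1", "true") or h.get("x-b3-flags") == "1"
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

Keeping this conversion at the edge (gateway or Collector) means services behind it only ever deal with one header format.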
Storage
- High cardinality (e.g. user_id on every span) is expensive to index and store. Prefer low-cardinality tags, and use baggage sparingly because it is propagated on every downstream call.
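One rough way to spot offending attributes before they reach storage is to count distinct values per key over a sample of spans. This is a sketch only: the function names and the `limit` threshold are assumptions, and spans are modeled as plain attribute dictionaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def attribute_cardinality(spans: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count the distinct values observed for each attribute key."""
    seen = defaultdict(set)
    for attrs in spans:
        for key, value in attrs.items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}


def high_cardinality_keys(spans: Iterable[Mapping[str, str]], limit: int) -> List[str]:
    """Keys whose distinct-value count exceeds the chosen limit."""
    counts = attribute_cardinality(spans)
    return sorted(key for key, n in counts.items() if n > limit)
```

Keys flagged this way are candidates for dropping, bucketing, or moving out of indexed tags.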
PII scrubbing
- Collector processor to hash emails, drop bodies; never log raw tokens.
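The scrubbing rule can be sketched as a pure function over span attributes. In production this logic would normally live in a Collector processor rather than application code; the attribute names, the `[REDACTED]` marker, and the `scrub` function itself are illustrative assumptions.

```python
import hashlib
import re
from typing import Any, Dict, Mapping

_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_DROP_KEYS = {"http.request.body", "http.response.body"}  # hypothetical names
_SECRET_HINTS = ("authorization", "cookie", "api_key", "token")


def _hash_email(match: "re.Match") -> str:
    digest = hashlib.sha256(match.group(0).lower().encode("utf-8")).hexdigest()
    return f"email:{digest[:16]}"


def scrub(attrs: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the attributes with PII hashed, dropped, or redacted."""
    clean: Dict[str, Any] = {}
    for key, value in attrs.items():
        lowered = key.lower()
        if lowered in _DROP_KEYS:
            continue  # drop request/response bodies entirely
        if any(hint in lowered for hint in _SECRET_HINTS):
            clean[key] = "[REDACTED]"  # never export raw tokens
        elif isinstance(value, str):
            clean[key] = _EMAIL.sub(_hash_email, value)  # hash embedded emails
        else:
            clean[key] = value
    return clean
```

An unsalted hash of a low-entropy value such as an email address can be reversed by brute force, so a keyed hash (HMAC with a secret held by the Collector) is often the safer choice.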
E2E: one HTTP request across services
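As a rough walk-through of one request, the sketch below simulates two services in a single process: Service A starts the trace, calls Service B with a `traceparent` header, and both append finished spans to an in-memory list that stands in for the Collector and backend. The service names, routes, and helpers (`span`, `render`, `COLLECTED`) are all hypothetical.

```python
import secrets
import time
from contextlib import contextmanager
from typing import Dict, List, Optional

COLLECTED: List[dict] = []  # stand-in for Collector + trace backend


@contextmanager
def span(service: str, name: str, trace_id: str, parent_id: Optional[str]):
    record = {"service": service, "name": name, "trace_id": trace_id,
              "span_id": secrets.token_hex(8), "parent_id": parent_id,
              "start": time.time()}
    try:
        yield record
    finally:
        record["end"] = time.time()
        COLLECTED.append(record)  # "export" the span once it finishes


def service_b(headers: Dict[str, str]) -> str:
    # Extract: continue the caller's trace instead of starting a new one.
    _, trace_id, parent_id, _ = headers["traceparent"].split("-")
    with span("service-b", "GET /inventory", trace_id, parent_id) as server:
        with span("service-b", "SELECT stock", trace_id, server["span_id"]):
            return "in stock"


def service_a() -> str:
    trace_id = secrets.token_hex(16)
    with span("service-a", "GET /checkout", trace_id, None) as root:
        with span("service-a", "HTTP GET service-b", trace_id, root["span_id"]) as client:
            # Inject: the client span becomes the parent of B's server span.
            headers = {"traceparent": f"00-{trace_id}-{client['span_id']}-01"}
            return service_b(headers)


def render(spans: List[dict], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Rebuild the span tree from parent IDs, as a trace UI would."""
    lines: List[str] = []
    children = sorted((s for s in spans if s["parent_id"] == parent_id),
                      key=lambda s: s["start"])
    for s in children:
        lines.append("  " * depth + f"{s['service']}: {s['name']}")
        lines.extend(render(spans, s["span_id"], depth + 1))
    return lines
```

Calling `service_a()` and then `render(COLLECTED)` yields a four-level tree, with Service B's spans nested under Service A's outgoing call even though the spans were exported in the opposite order (innermost first).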
Tricky parts
| Problem | Solution |
|---|---|
| Async work loses context | OpenTelemetry context propagation through message headers |
| Batch jobs | Links to parent trace instead of single child chain |
| Cost explosion | Dynamic sampling — 100% errors, 1% success |
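The first two rows of the table can be sketched together. A producer copies its context into message headers, a single-message consumer continues that trace as a child, and a batch consumer starts its own trace and records links to every producer context. The `Message` and `SpanRecord` types and the function names are illustrative assumptions, not the OpenTelemetry API.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Message:
    body: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class SpanRecord:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None
    links: List[str] = field(default_factory=list)  # traceparent values of linked spans


def publish(body: str, trace_id: str, span_id: str) -> Message:
    """Producer: carry the current context in the message headers."""
    return Message(body, {"traceparent": f"00-{trace_id}-{span_id}-01"})


def consume_one(msg: Message) -> SpanRecord:
    """Single message: continue the producer's trace as a child span."""
    _, trace_id, parent_id, _ = msg.headers["traceparent"].split("-")
    return SpanRecord("process message", trace_id, secrets.token_hex(8), parent_id)


def consume_batch(msgs: List[Message]) -> SpanRecord:
    """Batch: many parents are possible, so start a new trace and link to each."""
    links = [m.headers["traceparent"] for m in msgs if "traceparent" in m.headers]
    return SpanRecord("process batch", secrets.token_hex(16),
                      secrets.token_hex(8), links=links)
```

Links fit batches because a span can have only one parent, while a batch span is causally related to every message it processes.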
Caveats
- Tracing != logging: traces show structure; you still need structured logs with a `trace_id` correlation field.
- Clock skew across hosts affects span ordering, so keep hosts NTP-synchronized.
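A minimal sketch of the log-correlation caveat, using only the standard `logging` module: each record is emitted as JSON and carries the `trace_id` passed via `extra`. The logger name, field names, and sample IDs are assumptions; with the OpenTelemetry SDK the IDs would usually be read from the active span instead of being passed by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including trace correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service logger
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: write to an in-memory buffer so the output can be inspected.
buf = io.StringIO()
log = make_logger(buf)
log.info("payment declined",
         extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                "span_id": "00f067aa0ba902b7"})
```

With the same `trace_id` on logs and spans, a log search result can be pivoted directly to the matching trace in the tracing UI.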
Cloud picks
| AWS | Azure | GCP |
|---|---|---|
| X-Ray + ADOT | App Insights + OTel | Cloud Trace |