Webhook Delivery Platform (Outbound Events)

Problem statement

Deliver HTTP callbacks to customer endpoints for product events (for example payment.succeeded, user.created), with retries, signing, versioning, and observability.

How it works

  1. Internal services publish a DomainEvent to the event bus.
  2. The subscriptions table maps (event_type, customer_id) → HTTPS URL + secret.
  3. The dispatcher POSTs JSON with an HMAC signature and retries 429/5xx responses with exponential backoff until the event is dead-lettered.
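
The lookup in step 2 can be sketched as follows. This is a minimal illustration, not the platform's actual code: the `DomainEvent` fields, the `Subscription` type, the in-memory `SUBSCRIPTIONS` dict (standing in for the subscriptions table), and `plan_deliveries` are all hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainEvent:
    event_type: str
    customer_id: str
    data: dict


@dataclass(frozen=True)
class Subscription:
    url: str
    secret: bytes


# Hypothetical in-memory stand-in for the subscriptions table.
SUBSCRIPTIONS = {
    ("user.created", "cust_1"): [Subscription("https://example.com/hooks", b"s3cret")],
}


def plan_deliveries(event, subscriptions=SUBSCRIPTIONS):
    """Return one (subscription, raw_body) pair per matching subscription."""
    # Serialize once so every retry sends (and signs) byte-identical content.
    body = json.dumps(
        {"type": event.event_type, "data": event.data},
        separators=(",", ":"),
        sort_keys=True,
    ).encode()
    key = (event.event_type, event.customer_id)
    return [(sub, body) for sub in subscriptions.get(key, [])]
```

An event with no matching subscription yields an empty list, so the dispatcher has nothing to send for it.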

Analogy: certified mail. You get proof of delivery (delivery logs), redelivery attempts if nobody is home, and a dead letter office if the address is permanently wrong.

High-level design

Components explained — this design

Component | What it is | Why we use it here
Internal microservices | Domain apps publishing DomainEvent. | Should not synchronously HTTP POST to flaky customer URLs; async delivery improves reliability.
EventBridge / Kafka | Event backbone. | Fan-out to webhook dispatcher + other internal consumers without N HTTP calls from producer.
Dispatcher | Pulls subscriptions, signs payloads, POSTs HTTPS. | Implements retry/backoff, HMAC signing, SSRF protections.
PostgreSQL subscriptions | Customer URL + secret + event filters. | Relational model for admin UI and auditing who subscribed to what.
Per-tenant SQS (optional) | Isolated queues for noisy tenants. | Fairness: one customer’s slow endpoint doesn’t exhaust the shared worker pool.
ClickHouse delivery logs | Columnar analytics on attempts/latency. | Great for SLI dashboards and support questions such as “why wasn’t this webhook delivered?”.

Shared definitions: 00-glossary-common-services.md

Low-level design

Security

  • HMAC-SHA256 over the raw request body plus a timestamp sent in a header, to prevent replay; receivers reject timestamps outside a ±5 minute tolerance window.
  • Optional mTLS for enterprise customers; document the customer-side IP allowlist.
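
A minimal sketch of the signing and verification described above. The signed-string layout (`<timestamp>.<raw body>`) and the function names are assumptions chosen for illustration; the text only fixes HMAC-SHA256, a timestamp, and the ±5 minute window.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 5 * 60  # the ±5 minute window from the text


def sign(secret: bytes, timestamp: int, raw_body: bytes) -> str:
    # Assumed layout: the timestamp is signed together with the body so an
    # attacker cannot replay an old body under a fresh timestamp.
    message = str(timestamp).encode() + b"." + raw_body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, raw_body: bytes,
           signature: str, now=None) -> bool:
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future: treat as replay
    expected = sign(secret, timestamp, raw_body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Verification has to run on the raw bytes as received; re-serializing parsed JSON can change whitespace or key order and break the signature.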

Retries

  • Exponential backoff with jitter; once an event is older than 72 h, move it to the dead-letter queue (DLQ).
  • Respect the Retry-After header on a customer's 429 response.
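
One way to combine these two rules is sketched below. `BASE_DELAY` and `MAX_DELAY` are hypothetical tuning values, "full jitter" (a uniform draw between zero and the exponential cap) is one of several jitter strategies, and `retry_after` is assumed to be already parsed into seconds (the header may also carry an HTTP date).

```python
import random
from typing import Optional

BASE_DELAY = 30.0        # seconds; hypothetical starting delay
MAX_DELAY = 6 * 3600.0   # seconds; hypothetical per-attempt ceiling
MAX_AGE = 72 * 3600.0    # the 72 h limit from the text


def next_delay(attempt: int, age_seconds: float,
               retry_after: Optional[float] = None,
               rng=random) -> Optional[float]:
    """Return seconds to wait before the next attempt, or None to dead-letter."""
    if age_seconds >= MAX_AGE:
        return None
    # Clamp the exponent so very high attempt counts cannot overflow a float.
    cap = min(MAX_DELAY, BASE_DELAY * 2 ** min(attempt, 20))
    delay = rng.uniform(0, cap)  # full jitter
    if retry_after is not None:
        # Never retry sooner than the customer asked for.
        delay = max(delay, retry_after)
    return delay
```

Jitter spreads retries from many deliveries over time, so a recovering endpoint is not hit by a synchronized burst.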

Idempotency (customer side)

  • Send the header X-Delivery-Id: <uuid>, reusing the same value on every retry of a delivery, so the customer can dedupe even if we retry.
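
On the customer side, deduplication on this header might look like the sketch below. The `DeliveryDeduper` class and its in-memory set are illustrative assumptions; a real receiver would more likely use a durable store with a uniqueness constraint and an expiry.

```python
class DeliveryDeduper:
    """Run a handler at most once per X-Delivery-Id value."""

    def __init__(self):
        self._seen = set()  # stand-in for a durable store

    def handle(self, headers: dict, process) -> bool:
        """Call process() for new deliveries; return False for duplicates."""
        delivery_id = headers["X-Delivery-Id"]
        if delivery_id in self._seen:
            return False  # already processed; still acknowledge with 2xx
        process()
        # Record only after success so a failed attempt can be retried.
        self._seen.add(delivery_id)
        return True
```

A duplicate should still be answered with a success status; otherwise the sender keeps retrying a delivery that was already processed.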

Signing secret rotation

  • Include secret_version in the payload footer; during the rotation window, customers verify against every secret that is still active.
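
A customer-side verifier for the rotation window could be sketched like this. The `active_secrets` mapping and the fallback of trying every active secret when the version is unknown are assumptions; for brevity the signature here covers only the raw body.

```python
import hashlib
import hmac


def verify_with_rotation(active_secrets: dict, secret_version,
                         raw_body: bytes, signature: str) -> bool:
    """Check the signature against the hinted secret, or all active ones."""
    hinted = active_secrets.get(secret_version)
    # Prefer the version named in the payload; otherwise try each active secret.
    candidates = [hinted] if hinted is not None else list(active_secrets.values())
    for secret in candidates:
        expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
        if hmac.compare_digest(expected.encode(), signature.encode()):
            return True
    return False
```

Once the rotation window closes, removing the old entry from `active_secrets` is enough to retire it.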

Azure-specific

  • Azure Event Grid pushes events to webhook handlers using the CloudEvents spec; Event Grid provides built-in retries and dead-letters undelivered events to Blob Storage.
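
For orientation, a CloudEvents 1.0 envelope can be built as in the sketch below. The helper name `to_cloudevent` and the example `source` value are hypothetical; `specversion`, `id`, `source`, and `type` are the attributes the CloudEvents specification requires, and the rest are optional.

```python
import uuid
from datetime import datetime, timezone


def to_cloudevent(event_type: str, source: str, data: dict) -> dict:
    """Wrap a domain payload in a CloudEvents 1.0 JSON envelope."""
    return {
        # Required context attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        # Optional context attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }
```

For example, `to_cloudevent("user.created", "/services/accounts", {"user_id": "u_42"})` returns a dict that can be serialized with `json.dumps` and published.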

E2E: dispatch flow

Tricky parts

Problem | Solution
Customer endpoint slow | Per-tenant concurrency cap
SSRF if URL user-controlled | Block private IP ranges; DNS rebinding checks
Payload PII | Minimize event schema; reference IDs only
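
The SSRF row can be illustrated with the sketch below, which resolves the customer URL and rejects it if any resolved address is not publicly routable. The function names and the injectable `resolver` parameter are assumptions; the check relies on the standard library's `ipaddress` classification rather than a hand-maintained range list.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(ip: str) -> bool:
    # is_global is False for private, loopback, link-local and similar ranges.
    return ipaddress.ip_address(ip).is_global


def resolve_and_check(url: str, resolver=socket.getaddrinfo) -> list:
    """Return the vetted IPs for url, or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError("only https URLs with a host are allowed")
    infos = resolver(parts.hostname, parts.port or 443, type=socket.SOCK_STREAM)
    ips = sorted({info[4][0] for info in infos})
    # Reject if ANY record is non-public: a mixed answer is a rebinding signal.
    if not ips or not all(is_public_ip(ip) for ip in ips):
        raise ValueError("URL resolves to a non-public address")
    return ips
```

To guard against DNS rebinding, the dispatcher would then connect to one of the returned addresses instead of resolving the hostname a second time at request time.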

Caveats

  • Ordering: do not guarantee global order across event types unless a single-partition contract is documented.
  • At-least-once delivery is the realistic default, so customers must dedupe.

Observability

  • Delivery dashboard: latency percentiles, success ratio, and replay from the DLQ with an audit trail.
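
The dashboard figures can be derived from delivery-attempt records roughly as follows. This is a plain Python sketch with a hypothetical `summarize` helper and a `(latency_ms, succeeded)` tuple layout; in the design above the equivalent aggregation would run as a query over the ClickHouse delivery logs.

```python
from statistics import quantiles


def summarize(attempts):
    """attempts: list of (latency_ms, succeeded) tuples, at least two entries."""
    latencies = [latency for latency, _ in attempts]
    # n=100 yields 99 cut points; index 49 is the median, 94 is p95, 98 is p99.
    cuts = quantiles(latencies, n=100, method="inclusive")
    successes = sum(1 for _, ok in attempts if ok)
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "success_ratio": successes / len(attempts),
    }
```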