Webhook Delivery Platform (Outbound Events)
Problem statement
Deliver HTTP callbacks to customer endpoints for product events (payment succeeded, user.created) with retries, signing, versioning, and observability.
How it works
- Internal services publish DomainEvent to bus.
- Subscriptions table maps (event_type, customer_id) → https URL + secret.
- Dispatcher POSTs JSON with HMAC signature, handles 429/5xx with exponential backoff until dead letter.
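A minimal sketch of how the dispatcher might classify a customer response, assuming 2xx means delivered and 429/5xx means retry, as described above. The `Outcome` enum and `classify` function are hypothetical names, and sending every other status (for example 400 or 404) straight to the dead letter queue is an assumption this document does not state.

```python
from enum import Enum


class Outcome(Enum):
    DELIVERED = "delivered"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


def classify(status_code: int) -> Outcome:
    """Map an HTTP status from the customer endpoint to a dispatcher action."""
    if 200 <= status_code < 300:
        return Outcome.DELIVERED
    if status_code == 429 or 500 <= status_code < 600:
        # Throttling and server errors are retried with backoff.
        return Outcome.RETRY
    # Assumption: any other response is treated as permanent and not retried.
    return Outcome.DEAD_LETTER
```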
Analogy: certified mail. You get signature proof (delivery logs), redelivery attempts if nobody is home, and a dead letter office if the address is permanently wrong.
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Internal microservices | Domain apps publishing DomainEvent. | Should not synchronously HTTP POST to flaky customer URLs — async improves reliability. |
| EventBridge / Kafka | Event backbone. | Fan-out to webhook dispatcher + other internal consumers without N HTTP calls from producer. |
| Dispatcher | Pulls subscriptions, signs payloads, POSTs HTTPS. | Implements retry/backoff, HMAC signing, SSRF protections. |
| PostgreSQL subscriptions | Customer URL + secret + event filters. | Relational model for admin UI and auditing who subscribed to what. |
| Per-tenant SQS (optional) | Isolated queues for noisy tenants. | Fairness: one customer’s slow endpoint doesn’t exhaust shared worker pool. |
| ClickHouse delivery logs | Columnar analytics on attempts/latency. | Great for SLI dashboards and support “why wasn’t webhook delivered?”. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Security
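- HMAC-SHA256 over the timestamp and the raw body; the timestamp travels in a header, and receivers reject requests outside a ±5 minute tolerance window to limit replay.
- Optional mTLS for enterprise customers; document our egress IP ranges so customers can allowlist them on their side.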
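The signing scheme could look roughly like the sketch below. The signing string `"<timestamp>.<raw body>"` and the function names `sign` and `verify` are assumptions for illustration; only HMAC-SHA256, the raw body, the timestamp, and the ±5 minute window come from this design.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 5 * 60  # the ±5 minute window from this design


def sign(secret: bytes, timestamp: int, raw_body: bytes) -> str:
    """Return a hex HMAC-SHA256 over an assumed '<timestamp>.<body>' string."""
    message = str(timestamp).encode() + b"." + raw_body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, raw_body: bytes,
           signature: str, now: float = None) -> bool:
    """Customer-side check: reject stale timestamps, then compare signatures."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # outside the replay window
    expected = sign(secret, timestamp, raw_body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Verification must run on the exact bytes received. Re-serializing parsed JSON can change whitespace or key order and break the signature.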
Retries
- Exponential backoff with jitter; max age 72h then DLQ.
- Respect Retry-After header from customer 429.
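A sketch of the retry schedule, assuming "full jitter" (a uniform random delay up to an exponentially growing ceiling). The base delay, the per-attempt cap, and the function names are hypothetical; only the 72h max age and honoring Retry-After come from this design. Retry-After can also be an HTTP-date, so the sketch assumes it has already been converted to seconds.

```python
import random
from typing import Optional

BASE_DELAY_S = 5.0      # assumed first-retry ceiling
MAX_DELAY_S = 3600.0    # assumed cap on a single backoff interval
MAX_AGE_S = 72 * 3600   # 72h max age from this design


def next_delay(attempt: int, retry_after_s: Optional[float] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Clamp the exponent so very high attempt counts cannot overflow.
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** min(attempt, 20)))
    delay = random.uniform(0, ceiling)
    if retry_after_s is not None:
        # Never retry sooner than the customer asked for.
        delay = max(delay, retry_after_s)
    return delay


def should_dead_letter(first_attempt_at: float, now: float) -> bool:
    """True once the event has been retried for the full max age."""
    return now - first_attempt_at >= MAX_AGE_S
```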
Idempotency (customer side)
- Send header X-Delivery-Id: <uuid> so the customer can dedupe even if we retry.
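On the customer side, deduplication by delivery ID could be sketched as below. `DeliveryDeduper` is a hypothetical name, and the in-memory set stands in for a durable store. A real receiver would likely persist IDs with a TTL at least as long as the retry window.

```python
class DeliveryDeduper:
    """Tracks X-Delivery-Id values that have already been processed."""

    def __init__(self) -> None:
        self._seen = set()

    def first_time(self, delivery_id: str) -> bool:
        """Return True only the first time a given delivery ID is seen."""
        if delivery_id in self._seen:
            return False  # a retry of something already handled
        self._seen.add(delivery_id)
        return True
```

A handler would call `first_time(request_headers["X-Delivery-Id"])` and acknowledge with a 2xx without reprocessing when it returns False.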
Signing secret rotation
- secret_version in payload footer; customers verify with multiple active secrets during the rotation window.
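A customer-side sketch of verification during rotation, assuming the receiver keeps a small map of active secrets keyed by version. The function name and the map shape are hypothetical, and the sketch signs the raw body only to stay short.

```python
import hashlib
import hmac
from typing import Mapping


def verify_with_active_secrets(active_secrets: Mapping[str, bytes],
                               secret_version: str,
                               raw_body: bytes,
                               signature: str) -> bool:
    """Verify against the secret named by secret_version, if still active."""
    secret = active_secrets.get(secret_version)
    if secret is None:
        return False  # unknown or already-retired version
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

During the rotation window the map holds both the old and the new secret. Once the window closes, the old entry is removed and deliveries signed with it are rejected.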
Azure-specific
- Azure Event Grid push delivery to webhook handlers using the CloudEvents schema; Event Grid provides built-in retries and dead-lettering to Blob Storage.
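For reference, wrapping a domain event in a CloudEvents 1.0 JSON envelope might look like the sketch below. The function name and the example `source` value are hypothetical. `specversion`, `id`, `source`, and `type` are the attributes the CloudEvents spec requires; the rest are optional.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def to_cloudevent(event_type: str, source: str, data: dict,
                  event_id: Optional[str] = None) -> dict:
    """Build a structured-mode CloudEvents 1.0 envelope around `data`."""
    return {
        "specversion": "1.0",
        "id": event_id or str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),  # RFC 3339 timestamp
        "datacontenttype": "application/json",
        "data": data,
    }


envelope = to_cloudevent("user.created", "/services/accounts", {"user_id": "u_1"})
body = json.dumps(envelope)
```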
E2E: dispatch flow
Tricky parts
| Problem | Solution |
|---|---|
| Customer endpoint slow | Per-tenant concurrency cap |
| SSRF if URL user-controlled | Block private IP ranges; DNS rebinding checks |
| Payload PII | Minimize event schema; reference IDs only |
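The SSRF row could be approached as in the sketch below: resolve the hostname and refuse to deliver if any resolved address is non-public. The function names and the injectable `resolver` parameter are hypothetical. To address DNS rebinding, the dispatcher would also need to connect to the same address it validated rather than resolving again, which this sketch does not show.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve(host: str) -> list:
    """Resolve a hostname to its IP address strings."""
    infos = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
    return [info[4][0] for info in infos]


def is_public_ip(addr: str) -> bool:
    """True only for globally routable unicast addresses."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the check.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    blocked = (ip.is_private or ip.is_loopback or ip.is_link_local
               or ip.is_multicast or ip.is_reserved or ip.is_unspecified)
    return ip.is_global and not blocked


def is_safe_url(url: str, resolver=resolve) -> bool:
    """Accept only https URLs whose every resolved address is public."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        addrs = resolver(parts.hostname)
    except OSError:
        return False  # resolution failure: do not deliver
    return bool(addrs) and all(is_public_ip(a) for a in addrs)
```

Running the check at subscription time only is not enough, since DNS records can change afterwards. It would need to run again on each delivery attempt.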
Caveats
- Ordering: we do not guarantee global order across event types unless a single-partition contract is documented.
- At-least-once is the realistic default — customers must dedupe.
Observability
- Delivery dashboard: latency percentiles, success ratio, and audited replay from the DLQ.
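As an illustration of the dashboard metrics, the sketch below computes nearest-rank latency percentiles and a success ratio from a list of attempts. The `(latency_ms, status_code)` tuple shape and the function names are assumptions. In practice these aggregates would more likely be queries over the delivery log store.

```python
import math


def percentile(sorted_values: list, p: float):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(attempts: list) -> dict:
    """Summarize (latency_ms, status_code) attempts; 2xx counts as success."""
    if not attempts:
        raise ValueError("no attempts to summarize")
    latencies = sorted(latency for latency, _ in attempts)
    successes = sum(1 for _, status in attempts if 200 <= status < 300)
    return {
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "success_ratio": successes / len(attempts),
    }
```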