Multi-channel Notification System
Problem statement
Deliver email, SMS, push, in-app, and webhooks reliably with templates, preferences, throttling, and auditability at high volume.
How it works
- Product services emit NotificationRequested events (user id, channel, template id, payload).
- Orchestrator resolves user preferences, quiet hours, and locale, then fans out to per-channel queues consumed by channel workers.
- Each channel adapter calls provider APIs (SES, Twilio, FCM) with retries and idempotency.
Analogy: a hotel concierge desk. Many departments (channels) share one queue ticket, so guests are not spammed across five phones at once.
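The flow above can be sketched roughly as follows. This is a minimal illustration, not the actual service: the `UserPreferences` shape, the default quiet-hours window, the `route` function, and the plain dict standing in for the per-channel queues are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class NotificationRequested:
    user_id: str
    channel: str        # "email", "sms", "push", "in_app", "webhook"
    template_id: str
    payload: dict


@dataclass
class UserPreferences:
    opted_in: set                     # channels the user allows
    locale: str = "en"
    quiet_start: time = time(22, 0)   # hypothetical default quiet hours
    quiet_end: time = time(7, 0)


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if now is in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def route(event, prefs, now, queues) -> str:
    """Apply opt-in and quiet-hours rules, then enqueue on the channel queue.

    `queues` is a dict of lists standing in for the per-channel queues.
    Returns "queued", "deferred", or "dropped".
    """
    if event.channel not in prefs.opted_in:
        return "dropped"
    if in_quiet_hours(now, prefs.quiet_start, prefs.quiet_end):
        return "deferred"  # a real orchestrator would schedule a later retry
    queues.setdefault(event.channel, []).append({
        "user_id": event.user_id,
        "template_id": event.template_id,
        "locale": prefs.locale,
        "payload": event.payload,
    })
    return "queued"
```

Keeping these rules in one function is the point of the orchestrator: producers only emit the event and never see preferences or quiet hours.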
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Product services | Core domain apps emitting business events. | They should not directly call SES/Twilio for every user action — decouple via bus/queue. |
| Kafka notification.events | Durable stream of “something happened”. | Replay for new notification channels; multiple consumers (email, push) without coupling producers. |
| Orchestrator | Reads preferences, fans out to channel queues. | Central place for quiet hours, locale, channel opt-in rules — avoids duplicating in every producer. |
| SQS per channel | Isolated queues for email/SMS/push/webhook workers. | Bulkheads: a stuck SMS provider doesn’t block push processing; tune concurrency per queue. |
| Channel workers + providers | Lambdas/containers calling SES, Twilio, FCM, HTTP. | Wraps provider-specific retries, DLQ, and idempotency (Idempotency-Key + Dynamo). |
| DynamoDB dedupe | Fast conditional writes for idempotency keys. | Prevents duplicate charges or double emails under at-least-once delivery. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Idempotency
- Idempotency-Key per (user, template, logical_event_id) stored in DynamoDB with TTL 48h to drop duplicates from at-least-once Kafka.
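A minimal sketch of the dedupe check, assuming the key is a hash of the three fields and using an in-memory `DedupeStore` as a hypothetical stand-in for the DynamoDB table. Against DynamoDB the `claim` step would typically be a conditional put (for example with `attribute_not_exists` on the key) plus a TTL attribute; that mapping is an assumption, not something the design spells out.

```python
import hashlib
import time


def idempotency_key(user_id: str, template_id: str, logical_event_id: str) -> str:
    """Derive a stable key from (user, template, logical_event_id)."""
    # A separator that cannot appear in normal ids avoids ("ab", "c") == ("a", "bc").
    raw = "\x1f".join((user_id, template_id, logical_event_id))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupeStore:
    """In-memory stand-in for the dedupe table (hypothetical)."""

    TTL_SECONDS = 48 * 3600  # 48h, as in the design

    def __init__(self, clock=time.time):
        self._expires_at = {}
        self._clock = clock

    def claim(self, key: str) -> bool:
        """Return True only for the first caller within the TTL window."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate delivery: skip the send
        self._expires_at[key] = now + self.TTL_SECONDS
        return True
```

A worker would call `claim` before invoking the provider and skip the send when it returns `False`.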
Preferences & compliance
- Opt-in flags per channel in PostgreSQL or Dynamo.
- Twilio STOP handling: an inbound webhook sets the user's SMS preference to false.
- CAN-SPAM / GDPR: unsubscribe link + Suppression list in Redis SET.
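The rules above could be combined into a single send gate, sketched below. `PreferenceStore` is an in-memory stand-in for the preference table and the Redis suppression set, the default-deny behaviour for unknown users is an assumption, and the keyword set is illustrative only; the provider's own opt-out keyword list should be treated as authoritative.

```python
# Illustrative only: check the SMS provider's documented opt-out keywords.
STOP_KEYWORDS = {"STOP", "UNSUBSCRIBE"}


class PreferenceStore:
    """In-memory stand-in for opt-in flags plus a suppression set (hypothetical)."""

    def __init__(self):
        self._opt_in = {}         # (user_id, channel) -> bool
        self._suppressed = set()  # addresses that must never be contacted

    @staticmethod
    def _normalise(address: str) -> str:
        return address.strip().lower()

    def set_opt_in(self, user_id: str, channel: str, value: bool) -> None:
        self._opt_in[(user_id, channel)] = value

    def suppress(self, address: str) -> None:
        self._suppressed.add(self._normalise(address))

    def may_send(self, user_id: str, channel: str, address: str) -> bool:
        """Suppression wins over opt-in; unknown users default to no consent."""
        if self._normalise(address) in self._suppressed:
            return False
        return self._opt_in.get((user_id, channel), False)


def handle_inbound_sms(store: PreferenceStore, user_id: str,
                       from_number: str, body: str) -> bool:
    """Process an inbound SMS webhook; return True if it was an opt-out."""
    if body.strip().upper() in STOP_KEYWORDS:
        store.set_opt_in(user_id, "sms", False)
        store.suppress(from_number)
        return True
    return False
```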
Templates
- Handlebars / Jinja stored in S3 versioned; cache in worker memory with ETag invalidation.
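A rough sketch of ETag-based cache invalidation in a worker. `FakeTemplateStore` is a hypothetical stand-in for the versioned bucket, and Python's `string.Template` is swapped in for Handlebars/Jinja purely to keep the example dependency-free. With S3, the cheap `head` check would typically be a HEAD request or a conditional GET.

```python
from string import Template


class FakeTemplateStore:
    """Hypothetical stand-in for a versioned bucket."""

    def __init__(self):
        self._objects = {}     # key -> (version, body)
        self.body_fetches = 0  # counts full downloads, for illustration

    def put(self, key: str, body: str) -> None:
        version = self._objects.get(key, (0, ""))[0] + 1
        self._objects[key] = (version, body)

    def head(self, key: str) -> str:
        """Cheap metadata lookup: returns the current ETag only."""
        return f"v{self._objects[key][0]}"

    def get(self, key: str):
        """Full download: returns (etag, body)."""
        self.body_fetches += 1
        version, body = self._objects[key]
        return f"v{version}", body


class TemplateCache:
    """Keeps compiled templates in memory and reloads when the ETag changes."""

    def __init__(self, store):
        self._store = store
        self._cache = {}  # key -> (etag, compiled template)

    def render(self, key: str, **values) -> str:
        etag = self._store.head(key)
        cached = self._cache.get(key)
        if cached is None or cached[0] != etag:
            etag, body = self._store.get(key)
            cached = (etag, Template(body))
            self._cache[key] = cached
        return cached[1].substitute(**values)
```

Checking the ETag on every render is the simplest variant; a worker under load would more likely recheck on a timer and serve from memory in between.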
Webhooks (outbound to customers)
- Exponential backoff + max attempts; HMAC signature X-Signature for authenticity.
- Azure Event Grid / AWS EventBridge patterns for first-party; custom worker for arbitrary HTTPS.
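One way the custom worker's delivery loop might look. The `sha256=` prefix on the header value, the full-jitter backoff, the default limits, and the `send` callable (which returns an HTTP status code) are assumptions for the sketch; only the X-Signature header name comes from the design.

```python
import hashlib
import hmac
import random
import time


def sign(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 over the raw request body, hex-encoded."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 300.0,
                   rng=random.random) -> list:
    """Delay in seconds before each retry: exponential, capped, full jitter."""
    return [rng() * min(cap, base * 2 ** attempt)
            for attempt in range(max_attempts - 1)]


def deliver(send, secret: bytes, body: bytes, max_attempts: int = 5,
            sleep=time.sleep):
    """Try to deliver a webhook; return the attempt number that succeeded, or None.

    `send(headers, body)` is a hypothetical callable that performs the HTTPS
    POST and returns the response status code.
    """
    headers = {"X-Signature": sign(secret, body),
               "Content-Type": "application/json"}
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        status = send(headers, body)
        if 200 <= status < 300:
            return attempt + 1
        if attempt < max_attempts - 1:
            sleep(delays[attempt])
    return None  # caller would route the message to a DLQ
```

The signature is computed once over the exact bytes sent, so every retry carries the same header and the receiver can verify it without re-serializing the payload.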
Push
- FCM topic vs token model; device token rotation handled by periodic re-registration.
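For the token model, periodic re-registration could be tracked with a small registry like the one below. The class name, the per-device keying, and the 30-day staleness window are assumptions for illustration; the design only states that clients re-register periodically.

```python
class DeviceTokenRegistry:
    """Tracks the latest push token per (user, device) and ignores stale ones."""

    def __init__(self, max_age_seconds: float = 30 * 24 * 3600):
        self.max_age = max_age_seconds  # hypothetical staleness window
        self._tokens = {}               # (user_id, device_id) -> (token, registered_at)

    def register(self, user_id: str, device_id: str, token: str, now: float) -> None:
        """Called on each app start or refresh; a rotated token replaces the old one."""
        self._tokens[(user_id, device_id)] = (token, now)

    def active_tokens(self, user_id: str, now: float) -> list:
        """Tokens for the user that were re-registered within the window."""
        return [token
                for (uid, _device), (token, registered_at) in self._tokens.items()
                if uid == user_id and now - registered_at <= self.max_age]
```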
E2E: order shipped notification
Tricky parts
| Problem | Solution |
|---|---|
| Thundering herd (Black Friday) | Token bucket per provider account; multiple SES dedicated IPs |
| Provider outage | Failover region; secondary provider (Postmark) |
| PII in logs | Redact payload fields; structured logging with allowlist |
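The token bucket in the first row can be sketched as below, with one bucket instance per provider account. The rate and capacity values in the usage note are made up; real limits come from the provider's quota.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity  # start full
        self._clock = clock
        self._last = clock()

    def try_acquire(self, n: float = 1) -> bool:
        """Take n tokens if available; never blocks."""
        now = self._clock()
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False
```

A worker might hold something like `TokenBucket(rate=50, capacity=100)` per account and leave a message on its queue (or extend its visibility timeout) when `try_acquire()` returns `False`, so bursts are smoothed rather than dropped.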
Caveats
- SMS cost — never use SMS for bulk marketing without consent.
- Webhook security — mTLS or signed payloads + replay protection with nonce store.
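On the receiving side, signed payloads with replay protection might be checked as sketched here. The signed string layout (`timestamp.nonce.body`), the 300-second tolerance, and the in-memory `NonceStore` are assumptions; a shared store with key expiry would replace it in production.

```python
import hashlib
import hmac


class NonceStore:
    """In-memory stand-in for a shared nonce store with expiry (hypothetical)."""

    def __init__(self, ttl_seconds: float = 600):
        # Keep nonces for at least twice the timestamp tolerance so a nonce
        # outlives the window in which its timestamp is still accepted.
        self.ttl = ttl_seconds
        self._seen = {}  # nonce -> expiry time

    def add_if_new(self, nonce: str, now: float) -> bool:
        self._seen = {n: exp for n, exp in self._seen.items() if exp > now}
        if nonce in self._seen:
            return False
        self._seen[nonce] = now + self.ttl
        return True


def signature(secret: bytes, timestamp: int, nonce: str, body: bytes) -> str:
    """HMAC-SHA256 over timestamp, nonce, and body so none can be swapped."""
    message = f"{timestamp}.{nonce}.".encode("utf-8") + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, nonce: str, body: bytes,
           received_sig: str, store: NonceStore, now: float,
           tolerance: float = 300) -> bool:
    expected = signature(secret, timestamp, nonce, body)
    if not hmac.compare_digest(expected, received_sig):
        return False  # tampered or wrong secret
    if abs(now - timestamp) > tolerance:
        return False  # too old (or too far in the future)
    return store.add_if_new(nonce, now)  # False on replay
```

The signature is checked before the nonce is recorded, so unauthenticated requests cannot fill the nonce store.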