Multi-channel Notification System

Problem statement

Deliver email, SMS, push, in-app, and webhooks reliably with templates, preferences, throttling, and auditability at high volume.

How it works

  1. Product services emit NotificationRequested events (user id, channel, template id, payload).
  2. Orchestrator resolves user preferences, quiet hours, and locale, then fans out to channel workers.
  3. Each channel adapter calls provider APIs (SES, Twilio, FCM) with retries and idempotency.
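
The three steps above can be sketched as a small routing function. This is a minimal illustration, not the design's actual code: the event fields follow step 1, while `UserPrefs`, its field names, the default quiet-hours window, and the `notifications-<channel>` queue naming are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass(frozen=True)
class NotificationRequested:
    """Event emitted by a product service (fields taken from step 1)."""
    user_id: str
    channel: str          # e.g. "email", "sms", "push"
    template_id: str
    payload: dict


@dataclass
class UserPrefs:
    """Hypothetical preference record; field names are assumptions."""
    opted_in: set = field(default_factory=set)   # channels the user allows
    locale: str = "en"
    quiet_start: time = time(22, 0)
    quiet_end: time = time(7, 0)


def in_quiet_hours(prefs: UserPrefs, now: time) -> bool:
    """True when `now` falls in the quiet window, which may wrap past midnight."""
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= now < prefs.quiet_end
    return now >= prefs.quiet_start or now < prefs.quiet_end


def route(event: NotificationRequested, prefs: UserPrefs, now: time):
    """Return (queue_name, message) for the channel worker, or None if not sendable now."""
    if event.channel not in prefs.opted_in:
        return None                      # user has not opted in to this channel
    if in_quiet_hours(prefs, now):
        return None                      # a real orchestrator would likely defer, not drop
    message = {
        "user_id": event.user_id,
        "template_id": event.template_id,
        "locale": prefs.locale,
        "payload": event.payload,
    }
    return (f"notifications-{event.channel}", message)
```

Keeping the opt-in and quiet-hours checks in one function mirrors the point that producers should not have to repeat them.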

Analogy: a hotel concierge desk. Many departments (channels) share one queue ticket, so guests are not spammed across five phones at once.

High-level design

Components explained — this design

| Component | What it is | Why we use it here |
|---|---|---|
| Product services | Core domain apps emitting business events. | They should not directly call SES/Twilio for every user action; decouple via bus/queue. |
| Kafka notification.events | Durable stream of “something happened”. | Replay for new notification channels; multiple consumers (email, push) without coupling producers. |
| Orchestrator | Reads preferences, fans out to channel queues. | Central place for quiet hours, locale, and channel opt-in rules; avoids duplicating them in every producer. |
| SQS per channel | Isolated queues for email/SMS/push/webhook workers. | Bulkheads: a stuck SMS provider doesn’t block push processing; tune concurrency per queue. |
| Channel workers + providers | Lambdas/containers calling SES, Twilio, FCM, HTTP. | Wraps provider-specific retries, DLQ, and idempotency (Idempotency-Key + DynamoDB). |
| DynamoDB dedupe | Fast conditional writes for idempotency keys. | Prevents duplicate charges or double emails under at-least-once delivery. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Idempotency

  • One Idempotency-Key per (user, template, logical_event_id), stored in DynamoDB with a 48h TTL, so duplicates from at-least-once Kafka delivery are dropped.
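
A minimal sketch of the dedupe check follows. To keep it runnable without AWS, an in-memory class stands in for the DynamoDB table; against DynamoDB the same "claim" would typically be a conditional put using an `attribute_not_exists` condition on the key. The class name, the hashing of the key tuple, and the `handle` helper are assumptions for illustration.

```python
import hashlib
import time

TTL_SECONDS = 48 * 3600  # 48h, as stated above


def idempotency_key(user_id: str, template_id: str, logical_event_id: str) -> str:
    """Stable key for one logical notification; a separator avoids ambiguous joins."""
    raw = "\x1f".join((user_id, template_id, logical_event_id))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryDedupeStore:
    """Stand-in for the DynamoDB table: put-if-absent with a TTL."""

    def __init__(self, clock=time.time):
        self._expires_at = {}   # key -> expiry timestamp
        self._clock = clock

    def claim(self, key: str) -> bool:
        """Return True if this caller is the first to claim `key` within the TTL."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False                  # duplicate within the TTL window
        self._expires_at[key] = now + TTL_SECONDS
        return True


def handle(store, user_id, template_id, logical_event_id, send):
    """Call `send` at most once per key, even if the event is consumed repeatedly."""
    if store.claim(idempotency_key(user_id, template_id, logical_event_id)):
        send()
        return "sent"
    return "duplicate"
```

Note that this sketch claims the key before sending, so a crash between the claim and the send would suppress the retry; whether to claim before or after the provider call is a design choice the text does not specify.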

Preferences & compliance

  • Opt-in flags per channel, stored in PostgreSQL or DynamoDB.
  • Twilio STOP handling: an inbound webhook sets the user's SMS preference to false.
  • CAN-SPAM / GDPR: unsubscribe link plus a suppression list kept in a Redis SET.
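
The checks above could look roughly like this. It is a sketch under stated assumptions: a Python dict and set stand in for the preference table and the Redis SET, the opt-out keyword list is an assumed example rather than a provider's exact list, and all class and function names are hypothetical.

```python
class PreferenceStore:
    """In-memory stand-in for the opt-in table and the suppression set."""

    def __init__(self):
        self.opt_in = {}           # (user_id, channel) -> bool
        self.suppressed = set()    # normalized addresses that must never be contacted

    def unsubscribe(self, address: str) -> None:
        """Called when a recipient follows the unsubscribe link."""
        self.suppressed.add(address.strip().lower())

    def may_send(self, user_id: str, channel: str, address: str) -> bool:
        """Suppression wins over opt-in; a missing opt-in flag means 'do not send'."""
        if address.strip().lower() in self.suppressed:
            return False
        return self.opt_in.get((user_id, channel), False)


# Assumed keyword list for illustration only.
OPT_OUT_KEYWORDS = {"stop", "unsubscribe", "cancel", "end", "quit"}


def handle_inbound_sms(store: PreferenceStore, user_id: str, body: str) -> bool:
    """Process an inbound-SMS webhook; returns True if it was treated as an opt-out."""
    if body.strip().lower() in OPT_OUT_KEYWORDS:
        store.opt_in[(user_id, "sms")] = False
        return True
    return False
```

Defaulting to "not opted in" when no flag exists is the conservative choice for consent-based channels.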

Templates

  • Handlebars / Jinja templates stored in a versioned S3 bucket; cached in worker memory and invalidated when the object's ETag changes.
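
One way to sketch ETag-based invalidation is shown below. The `head` and `get` callables are hypothetical stand-ins for the object-store client (a cheap metadata lookup and a full download), and Python's built-in `string.Template` substitutes for Handlebars or Jinja so the example has no dependencies.

```python
from string import Template


class TemplateCache:
    """Caches compiled templates by id and revalidates them against the stored ETag."""

    def __init__(self, head, get):
        # head(template_id) -> etag            (metadata only, cheap)
        # get(template_id)  -> (etag, source)  (full download)
        # Both callables are assumptions standing in for the storage client.
        self._head = head
        self._get = get
        self._cache = {}   # template_id -> (etag, Template)

    def load(self, template_id: str) -> Template:
        current_etag = self._head(template_id)
        cached = self._cache.get(template_id)
        if cached is not None and cached[0] == current_etag:
            return cached[1]                   # ETag unchanged: reuse without download
        etag, source = self._get(template_id)  # ETag changed or first use: refetch
        compiled = Template(source)
        self._cache[template_id] = (etag, compiled)
        return compiled

    def render(self, template_id: str, payload: dict) -> str:
        return self.load(template_id).substitute(payload)
```

This version revalidates on every call; a worker under load would probably check the ETag only every few seconds or minutes and accept slightly stale templates in between.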

Webhooks (outbound to customers)

  • Exponential backoff with a maximum number of attempts; an HMAC signature in the X-Signature header proves authenticity.
  • Azure Event Grid / AWS EventBridge patterns for first-party consumers; a custom worker for arbitrary HTTPS endpoints.
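
A custom webhook worker along these lines might sign and retry as follows. This is a hedged sketch: HMAC-SHA256 over the raw body, the hex encoding, the jitter strategy, and the `post(url, body, headers)` callable (returning an HTTP status code) are all assumptions; only the X-Signature header name and the backoff-with-max-attempts idea come from the text.

```python
import hashlib
import hmac
import json
import random
import time


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body, to be sent in the X-Signature header."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 300.0,
                   rng=random.random):
    """Delays (seconds) between attempts: exponential, capped, with full jitter."""
    return [rng() * min(cap, base * (2 ** attempt))
            for attempt in range(max_attempts - 1)]


def deliver(post, url, secret, event, max_attempts=5, sleep=time.sleep):
    """POST `event` until a 2xx status; returns attempts used, or None if exhausted."""
    body = json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Signature": sign_payload(secret, body),
    }
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        status = post(url, body, headers)
        if 200 <= status < 300:
            return attempt + 1
        if attempt < max_attempts - 1:
            sleep(delays[attempt])       # wait before the next attempt
    return None                          # caller would route the message to a DLQ
```

Signing the exact bytes that are sent, rather than a re-serialized copy, avoids mismatches caused by differing JSON key order or whitespace on the receiving side.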

Push

  • FCM topic vs token model; device token rotation handled by periodic re-registration.

E2E: order shipped notification

Tricky parts

| Problem | Solution |
|---|---|
| Thundering herd (Black Friday) | Token bucket per provider account; multiple SES dedicated IPs |
| Provider outage | Failover region; secondary provider (Postmark) |
| PII in logs | Redact payload fields; structured logging with allowlist |
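
Two of the solutions above are small enough to sketch directly: a token bucket that caps the send rate per provider account, and allowlist-based redaction before logging. The rate and capacity values, the allowlisted field names, and the `[REDACTED]` marker are assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full
        self._last = clock()

    def try_acquire(self, n: float = 1) -> bool:
        """Take `n` tokens if available; otherwise return False without blocking."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False


# Assumed set of fields that are safe to log; everything else is masked.
LOG_ALLOWLIST = {"event_id", "template_id", "channel", "status"}


def redact(record: dict) -> dict:
    """Keep allowlisted fields verbatim and mask all others."""
    return {k: (v if k in LOG_ALLOWLIST else "[REDACTED]") for k, v in record.items()}
```

A worker that gets `False` from `try_acquire` can leave the message on its queue and retry later, which lets the queue absorb the burst instead of the provider. An allowlist fails closed: a newly added payload field is masked until someone decides it is safe to log.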

Caveats

  • SMS cost: never use SMS for bulk marketing without consent.
  • Webhook security: mTLS or signed payloads, plus replay protection with a nonce store.
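
On the receiving side, signed payloads with replay protection could be verified roughly as below. This is a sketch under assumptions: the signed string format (`timestamp.nonce.body`), the 300-second skew tolerance, and the in-memory `NonceStore` are illustrative choices, not something the design prescribes.

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 300   # assumed tolerance for clock drift and delivery delay


class NonceStore:
    """In-memory stand-in; remembers each nonce for as long as it could be replayed."""

    def __init__(self):
        self._seen = {}   # nonce -> expiry timestamp

    def add_if_new(self, nonce: str, now: float) -> bool:
        # Drop expired entries, then record the nonce only if it is unseen.
        self._seen = {n: exp for n, exp in self._seen.items() if exp > now}
        if nonce in self._seen:
            return False
        self._seen[nonce] = now + 2 * MAX_SKEW_SECONDS
        return True


def sign(secret: bytes, timestamp: int, nonce: str, body: bytes) -> str:
    """HMAC-SHA256 over timestamp, nonce, and body so none can be altered alone."""
    msg = f"{timestamp}.{nonce}.".encode("utf-8") + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def verify(secret, timestamp, nonce, body, signature, nonces, now=None) -> bool:
    """Accept only fresh, correctly signed requests whose nonce is unseen."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False                                  # too old or too far in the future
    expected = sign(secret, timestamp, nonce, body)
    if not hmac.compare_digest(expected, signature):  # constant-time comparison
        return False
    return nonces.add_if_new(nonce, now)              # False on replay
```

The nonce is recorded only after the signature checks out, so unauthenticated traffic cannot fill the store, and the timestamp window bounds how long nonces need to be kept.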