Distributed Job Scheduler (Cron at Scale)
Problem statement
Run millions of recurring and one-off jobs (HTTP hooks, data pipelines) with at-least-once or exactly-once semantics, retries, DLQ, and multi-tenant fairness.
How it works
- Scheduler picks due jobs from a time-ordered store and enqueues execution messages.
- Workers pull jobs, execute, ack or retry with backoff.
Analogy: A school bell system that doesn’t miss periods even if one classroom’s clock is wrong — the central office clock (scheduler) is authoritative.
High-level design
Rendering diagram…
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Scheduler API | CRUD on job definitions + manual triggers. | Human/admin interface; validates cron TZ and permissions. |
| Leader scheduler (K8s lease) | Exactly one active scheduler tick loop. | Prevents duplicate firing of same cron across replicas without external vendor lock-in. |
| PostgreSQL metadata | Job definitions, next_run, last_status. | Transactional updates to job rows with FOR UPDATE SKIP LOCKED for claiming due work. |
| SQS / Rabbit delayed | Executable job messages. | Workers compete horizontally; visibility timeout acts like a lease on job execution. |
| Worker pool | Executes HTTP hooks / internal tasks / K8s Jobs. | Isolates failures per job with retries/DLQ. |
| DLQ | Dead-letter queue for poison jobs. | Prevents infinite retry loops; surfaces failures to on-call dashboards. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Due job discovery
- Cassandra with time-bucket partitions (
2025-04-25-14) + secondary index — good write spread. - Simpler: PostgreSQL
SELECT ... WHERE next_run_at <= now() FOR UPDATE SKIP LOCKEDwith indexednext_run_at— works to moderate scale. - Managed: AWS EventBridge Scheduler, Azure Logic Apps recurrence, Google Cloud Scheduler.
Exactly-once illusion
- True exactly-once execution is impossible across networks; use idempotent handlers + dedupe store (
job_run_idin DynamoDB TTL).
Fairness
- Per-tenant queues to prevent one tenant’s million jobs from starving others.
Misfire policy
- Fire now, skip, or coalesce — configurable per job type.
E2E: recurring daily report
Rendering diagram…
Tricky parts
| Problem | Solution |
|---|---|
| Split brain two schedulers | K8s lease / DynamoDB conditional lock / Redlock careful |
| Clock skew | UTC only; NTP monitoring |
| Long jobs | Separate long-running runner (Step Functions) vs short HTTP |
Caveats
- Cron syntax timezones — store IANA tz per job; DST bugs are common.
- Backfill storms after outage — rate limit catch-up executions.
Azure
- Durable Functions timers; Azure Functions + Storage Queues; Service Bus sessions for ordering.