Distributed Job Scheduler (Cron at Scale)

Problem statement

Run millions of recurring and one-off jobs (HTTP hooks, data pipelines) with at-least-once or effectively exactly-once semantics, retries, a dead-letter queue (DLQ), and multi-tenant fairness.

How it works

The scheduler picks due jobs from a time-ordered store and enqueues execution messages.
Workers pull jobs, execute them, and then ack on success or retry with backoff on failure.
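
The ack/retry decision above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `Message`, `backoff_delay`, `process`, and the default limits (5 attempts, 300-second cap) are hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Message:
    job_id: str
    attempt: int = 1  # 1-based delivery count (hypothetical field)


def backoff_delay(attempt, base=1.0, cap=300.0, jitter=False):
    """Seconds to wait before the next retry: exponential, capped, optional full jitter."""
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, ceiling) if jitter else ceiling


def process(msg, execute, max_attempts=5):
    """Run one message and decide what the worker should do with it."""
    try:
        execute(msg.job_id)
    except Exception:
        if msg.attempt >= max_attempts:
            return ("dlq", None)  # give up and park the message in the DLQ
        return ("retry", backoff_delay(msg.attempt))  # redeliver after a delay
    return ("ack", None)  # success: remove the message from the queue
```

Turning on `jitter` spreads retries out so that many jobs failing at the same moment do not all retry together.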

Analogy: A school bell system that doesn’t miss periods even if one classroom’s clock is wrong — the central office clock (scheduler) is authoritative.

High-level design


Components explained — this design

Component	What it is	Why we use it here
Scheduler API	CRUD on job definitions + manual triggers.	Human/admin interface; validates cron TZ and permissions.
Leader scheduler (K8s lease)	Exactly one active scheduler tick loop.	Prevents duplicate firing of the same cron job across replicas without external vendor lock-in.
PostgreSQL metadata	Job definitions, next_run_at, last_status.	Transactional updates to job rows with `FOR UPDATE SKIP LOCKED` for claiming due work.
SQS / RabbitMQ delayed messages	Executable job messages.	Workers compete horizontally; the visibility timeout acts like a lease on job execution.
Worker pool	Executes HTTP hooks / internal tasks / K8s Jobs.	Isolates failures per job with retries/DLQ.
DLQ	Dead-letter queue for poison jobs.	Prevents infinite retry loops; surfaces failures to on-call dashboards.

Shared definitions: 00-glossary-common-services.md

Low-level design

Due job discovery

Cassandra with time-bucket partitions (for example, an hourly bucket such as 2025-04-25-14) plus a secondary index: writes spread across buckets.
Simpler: PostgreSQL SELECT ... WHERE next_run_at <= now() FOR UPDATE SKIP LOCKED with an index on next_run_at; works up to moderate scale.
Managed: AWS EventBridge Scheduler, Azure Logic Apps recurrence, Google Cloud Scheduler.
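
To illustrate the time-bucket idea, the sketch below derives an hourly bucket key in the 2025-04-25-14 format and lists the buckets a scheduler tick would need to scan. The function names are hypothetical, and the code assumes timezone-aware datetimes; it does not touch any real Cassandra API.

```python
from datetime import datetime, timedelta, timezone


def bucket_for(ts: datetime) -> str:
    """Hourly partition key in UTC, e.g. '2025-04-25-14'. Expects an aware datetime."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d-%H")


def buckets_to_scan(last_tick: datetime, now: datetime) -> list[str]:
    """Every hourly bucket between the previous tick and now, inclusive."""
    cursor = last_tick.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end = now.astimezone(timezone.utc)
    buckets = []
    while cursor <= end:
        buckets.append(bucket_for(cursor))
        cursor += timedelta(hours=1)
    return buckets
```

Scanning from the previous tick rather than only the current hour means a tick that straddles an hour boundary, or resumes after a pause, still covers the earlier bucket.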

Exactly-once illusion

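True exactly-once execution is impossible across networks, because a worker can finish a job and crash before its ack arrives. Use idempotent handlers plus a dedupe store (for example, job_run_id keys in a DynamoDB table with TTL) so that redelivered messages become no-ops.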
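
A minimal in-memory sketch of the dedupe idea follows. `DedupeStore` and `run_once` are hypothetical names; the dictionary stands in for an external store, where the `claim` step would typically be a single conditional write rather than a read followed by a write.

```python
import time


class DedupeStore:
    """In-memory stand-in for a TTL-backed dedupe table keyed by job_run_id."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires = {}  # job_run_id -> expiry time

    def claim(self, job_run_id) -> bool:
        """Return True only for the first caller within the TTL window."""
        now = self._clock()
        expiry = self._expires.get(job_run_id)
        if expiry is not None and expiry > now:
            return False
        self._expires[job_run_id] = now + self._ttl
        return True

    def release(self, job_run_id):
        self._expires.pop(job_run_id, None)


def run_once(store, job_run_id, handler):
    """Skip duplicates; release the claim on failure so a retry can run."""
    if not store.claim(job_run_id):
        return "duplicate"
    try:
        handler()
    except Exception:
        store.release(job_run_id)
        raise
    return "executed"
```

The handler should still be idempotent: a worker that crashes after the side effect but before releasing or completing leaves the claim in place until the TTL expires.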

Fairness

Per-tenant queues to prevent one tenant’s million jobs from starving others.
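
One way to picture per-tenant queues is a round-robin dispatcher that takes one job per tenant per turn. The sketch below is an in-memory illustration only; `FairQueue` is a hypothetical name, and a real deployment would more likely use separate broker queues per tenant.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin over tenants: each turn serves one job from the next tenant."""

    def __init__(self):
        self._queues = OrderedDict()  # tenant -> deque of pending jobs

    def put(self, tenant, job):
        self._queues.setdefault(tenant, deque()).append(job)

    def get(self):
        """Return (tenant, job) for the tenant at the front, or None if empty."""
        if not self._queues:
            return None
        tenant, queue = next(iter(self._queues.items()))
        job = queue.popleft()
        del self._queues[tenant]
        if queue:
            self._queues[tenant] = queue  # tenant goes to the back of the rotation
        return tenant, job
```

With tenant "a" holding three jobs and tenant "b" holding one, the dispatch order is a, b, a, a: the small tenant waits for at most one job from the large tenant.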

Misfire policy

Fire now, skip, or coalesce — configurable per job type.
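
The three policies can be sketched as below, under one interpretation that the source does not spell out: "fire now" runs every missed occurrence, "skip" drops them all, and "coalesce" collapses them into a single run. The names are hypothetical, and a fixed interval stands in for a real cron expression.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class MisfirePolicy(Enum):
    FIRE_NOW = "fire_now"   # assumed meaning: run every missed occurrence
    SKIP = "skip"           # assumed meaning: drop missed occurrences
    COALESCE = "coalesce"   # assumed meaning: one run for all missed occurrences


def resolve_misfire(policy, next_run_at, interval, now):
    """Return (runs_to_enqueue, new_next_run_at) for a fixed-interval job."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    missed = []
    cursor = next_run_at
    while cursor <= now:  # collect every scheduled time that has already passed
        missed.append(cursor)
        cursor += interval
    if policy is MisfirePolicy.FIRE_NOW:
        runs = missed
    elif policy is MisfirePolicy.COALESCE:
        runs = missed[-1:]  # keep only the most recent missed occurrence
    else:
        runs = []
    return runs, cursor  # cursor is the first occurrence strictly after now
```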

E2E: recurring daily report


Tricky parts

Problem	Solution
Split brain (two active schedulers)	K8s lease, DynamoDB conditional-write lock, or Redlock (use with care)
Clock skew	UTC only; NTP monitoring
Long jobs	Run them on a separate long-running runner (e.g. Step Functions), apart from short HTTP jobs
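
The lease approach to split brain can be sketched as a time-bounded lock that only its holder may renew. `LeaseStore` below is a hypothetical in-memory stand-in; in a real store the check and the update would have to happen in one atomic conditional write.

```python
class LeaseStore:
    """In-memory stand-in for a lease record: one holder plus an expiry time."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate, now, ttl) -> bool:
        """Acquire or renew the lease; fail if another holder's lease is still live."""
        held_by_other = (
            self.holder is not None
            and self.holder != candidate
            and self.expires_at > now
        )
        if held_by_other:
            return False
        self.holder = candidate
        self.expires_at = now + ttl
        return True
```

A scheduler replica would call `try_acquire` at the start of each tick and run the tick loop only when it returns True. Because expiry depends on each replica's clock, the lease TTL should comfortably exceed the expected clock skew.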

Caveats

Cron syntax timezones — store IANA tz per job; DST bugs are common.
Backfill storms after outage — rate limit catch-up executions.
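
Rate limiting catch-up executions could be done with a token bucket, sketched below. `TokenBucket` and its parameters are illustrative assumptions; the caller passes the current time explicitly so the behaviour is easy to reason about.

```python
class TokenBucket:
    """Allow at most `burst` immediate catch-up runs, then `rate_per_sec` sustained."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

After an outage, the scheduler would call `allow` before enqueuing each backlog run and leave the remainder for later ticks, so downstream systems see a bounded ramp instead of a spike.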

Azure

Durable Functions timers; Azure Functions + Storage Queues; Service Bus sessions for ordering.