
Distributed Job Scheduler (Cron at Scale)

Problem statement

Run millions of recurring and one-off jobs (HTTP hooks, data pipelines) with at-least-once or exactly-once semantics, retries, DLQ, and multi-tenant fairness.

How it works

  • Scheduler picks due jobs from a time-ordered store and enqueues execution messages.
  • Workers pull jobs, execute, ack or retry with backoff.
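The worker's ack-or-retry decision can be sketched as a small pure function. This is a minimal illustration, not a prescribed implementation: `RetryPolicy`, `backoff_delay`, `next_action`, and all default values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RetryPolicy:
    # All defaults are illustrative assumptions, not values from the design.
    base_delay_s: float = 1.0    # delay before the first retry
    max_delay_s: float = 300.0   # cap so delays do not grow without bound
    max_attempts: int = 5        # attempts allowed before the job goes to the DLQ


def backoff_delay(policy: RetryPolicy, attempt: int) -> float:
    """Exponential delay to wait after the given 1-based failed attempt."""
    return min(policy.max_delay_s, policy.base_delay_s * 2 ** (attempt - 1))


def next_action(policy: RetryPolicy, attempt: int, succeeded: bool) -> Tuple[str, float]:
    """Return ("ack" | "retry" | "dlq", delay_seconds) for one execution result."""
    if succeeded:
        return ("ack", 0.0)
    if attempt >= policy.max_attempts:
        return ("dlq", 0.0)  # retries exhausted: hand the job to the dead-letter queue
    return ("retry", backoff_delay(policy, attempt))
```

With the defaults, failed attempts 1, 2, and 3 are retried after 1, 2, and 4 seconds, and a fifth failure routes the job to the DLQ. Production systems usually add random jitter to the delay so that many failing jobs do not retry in lockstep; it is left out here to keep the sketch deterministic.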

Analogy: A school bell system that doesn’t miss periods even if one classroom’s clock is wrong — the central office clock (scheduler) is authoritative.

High-level design

[Diagram: high-level design]

Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| Scheduler API | CRUD on job definitions + manual triggers. | Human/admin interface; validates cron TZ and permissions. |
| Leader scheduler (K8s lease) | Exactly one active scheduler tick loop. | Prevents duplicate firing of same cron across replicas without external vendor lock-in. |
| PostgreSQL metadata | Job definitions, next_run, last_status. | Transactional updates to job rows with FOR UPDATE SKIP LOCKED for claiming due work. |
| SQS / Rabbit delayed | Executable job messages. | Workers compete horizontally; visibility timeout acts like a lease on job execution. |
| Worker pool | Executes HTTP hooks / internal tasks / K8s Jobs. | Isolates failures per job with retries/DLQ. |
| DLQ | Dead-letter queue for poison jobs. | Prevents infinite retry loops; surfaces failures to on-call dashboards. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Due job discovery

  • Cassandra with hourly time-bucket partitions (for example, 2025-04-25-14) plus a secondary index; gives good write spread.
  • Simpler: PostgreSQL SELECT ... WHERE next_run_at <= now() FOR UPDATE SKIP LOCKED on an indexed next_run_at column; works up to moderate scale.
  • Managed: AWS EventBridge Scheduler, Azure Logic Apps recurrence, Google Cloud Scheduler.
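For the PostgreSQL option, the claim step might look like the sketch below. The `jobs` table, its `status` column, and the `'pending'` / `'enqueued'` values are assumptions made for illustration; only `next_run_at` and `FOR UPDATE SKIP LOCKED` come from the text above. `conn` is assumed to be a DB-API connection that uses the `%s` parameter style, as psycopg2 does.

```python
# Hypothetical schema: jobs(id, status, next_run_at). Only next_run_at and the
# SKIP LOCKED clause come from the design notes; the rest is assumed.
CLAIM_DUE_SQL = """
UPDATE jobs
SET status = 'enqueued'
WHERE id IN (
    SELECT id
    FROM jobs
    WHERE status = 'pending' AND next_run_at <= now()
    ORDER BY next_run_at
    LIMIT %s
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def claim_due_jobs(conn, batch_size: int = 100) -> list:
    """Claim up to batch_size due jobs in one transaction and return their ids.

    Rows locked by another scheduler transaction are skipped rather than
    waited on, so concurrent claimers do not block or double-claim.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_DUE_SQL, (batch_size,))
        ids = [row[0] for row in cur.fetchall()]
        conn.commit()
        return ids
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

The caller would enqueue one execution message per returned id. A crash between the commit and the enqueue would leave rows marked as enqueued with no message, so a real deployment would likely need a sweeper or an outbox table; neither is shown here.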

Exactly-once illusion

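  • True exactly-once execution cannot be guaranteed over an unreliable network; approximate it with at-least-once delivery plus idempotent handlers and a dedupe store (for example, job_run_id keys in a DynamoDB table with TTL expiry).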
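The dedupe idea can be illustrated with an in-memory stand-in for the TTL-backed table. `DedupeStore` and `run_once` are hypothetical names; against DynamoDB the `claim` step would typically be a conditional write that fails when the key already exists, which this sketch only imitates.

```python
import time
from typing import Callable, Dict


class DedupeStore:
    """In-memory stand-in for a key-value table with TTL expiry (assumed design)."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def claim(self, job_run_id: str) -> bool:
        """Return True only for the first claim of job_run_id within the TTL."""
        now = self._clock()
        expiry = self._expires_at.get(job_run_id)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[job_run_id] = now + self._ttl_s
        return True

    def release(self, job_run_id: str) -> None:
        """Forget a claim so that a later retry is allowed to run."""
        self._expires_at.pop(job_run_id, None)


def run_once(store: DedupeStore, job_run_id: str, handler: Callable[[], None]) -> bool:
    """Run handler unless job_run_id was already claimed; return True if it ran."""
    if not store.claim(job_run_id):
        return False  # duplicate delivery: skip the side effects
    try:
        handler()
    except Exception:
        store.release(job_run_id)  # let the queue's retry run the job again
        raise
    return True
```

Releasing the claim on failure keeps the at-least-once guarantee intact. The handler still has to be idempotent, because a worker can crash after the side effect but before the claim becomes durable.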

Fairness

  • Per-tenant queues to prevent one tenant’s million jobs from starving others.
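One way to drain per-tenant queues fairly is plain round-robin across tenants, sketched below. `FairDispatcher` is a hypothetical name, and round-robin is just one possible policy; the notes do not say which one the design uses.

```python
from collections import OrderedDict, deque
from typing import Deque, Optional, Tuple


class FairDispatcher:
    """Round-robin over per-tenant FIFO queues (illustrative, in-memory only)."""

    def __init__(self) -> None:
        self._queues: "OrderedDict[str, Deque[str]]" = OrderedDict()

    def enqueue(self, tenant: str, job_id: str) -> None:
        self._queues.setdefault(tenant, deque()).append(job_id)

    def next_job(self) -> Optional[Tuple[str, str]]:
        """Pop one job from the tenant at the front, then rotate that tenant back."""
        if not self._queues:
            return None
        tenant, queue = next(iter(self._queues.items()))
        job_id = queue.popleft()
        if queue:
            self._queues.move_to_end(tenant)  # tenant waits behind the others
        else:
            del self._queues[tenant]          # drop empty queues
        return tenant, job_id
```

If tenant `a` enqueues `a1`, `a2`, `a3` and tenant `b` then enqueues `b1`, the dispatch order is `a1`, `b1`, `a2`, `a3`: tenant `b` waits for one job from `a` rather than for its whole backlog.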

Misfire policy

  • Fire now, skip, or coalesce — configurable per job type.
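The three policies can be made concrete with a small sketch. The semantics below are an assumed reading, since the notes do not define them: "fire_now" runs every missed occurrence immediately, "skip" drops them all and waits for the next scheduled time, and "coalesce" collapses them into a single run for the most recent missed time. A fixed interval stands in for a cron expression to keep the example short.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def missed_fire_times(last_fired: datetime, interval: timedelta, now: datetime) -> List[datetime]:
    """List scheduled times after last_fired that are already due at now."""
    times = []
    t = last_fired + interval
    while t <= now:
        times.append(t)
        t += interval
    return times


def apply_misfire_policy(policy: str, missed: List[datetime]) -> List[datetime]:
    """Return the fire times to execute now under the assumed policy semantics."""
    if not missed or policy == "skip":
        return []
    if policy == "coalesce":
        return [missed[-1]]   # one run standing in for all missed ones
    if policy == "fire_now":
        return list(missed)   # replay every missed occurrence
    raise ValueError(f"unknown misfire policy: {policy}")


# Usage: an hourly job whose scheduler was down from 00:00 to 03:30 UTC.
last = datetime(2025, 4, 25, 0, 0, tzinfo=timezone.utc)
now = datetime(2025, 4, 25, 3, 30, tzinfo=timezone.utc)
missed = missed_fire_times(last, timedelta(hours=1), now)
```

Here `missed` holds the 01:00, 02:00, and 03:00 occurrences; "coalesce" keeps only 03:00, while "fire_now" replays all three.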

E2E: recurring daily report

[Diagram: end-to-end flow for a recurring daily report]

Tricky parts

| Problem | Solution |
| --- | --- |
| Split brain (two active schedulers) | K8s lease, DynamoDB conditional lock, or Redlock (with care) |
| Clock skew | UTC only; NTP monitoring |
| Long jobs | Separate long-running runner (Step Functions) from short HTTP jobs |
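The lease idea behind the split-brain row can be sketched in memory. `Lease` and `try_acquire` are hypothetical names; a real lease lives in a shared store, and the check-and-update must be a single atomic conditional write there (a Kubernetes Lease object or a DynamoDB conditional update, for example), which this single-process sketch does not provide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None   # id of the scheduler replica holding the lease
    expires_at: float = 0.0        # time after which the lease may be taken over


def try_acquire(lease: Lease, candidate: str, now: float, ttl_s: float) -> bool:
    """Acquire or renew the lease; return True if candidate is the leader."""
    if lease.holder == candidate and lease.expires_at > now:
        lease.expires_at = now + ttl_s   # current leader renews
        return True
    if lease.holder is None or lease.expires_at <= now:
        lease.holder = candidate         # free or expired: take over
        lease.expires_at = now + ttl_s
        return True
    return False                         # someone else holds a live lease
```

Each replica would call `try_acquire` on every tick and run the scheduling loop only when it returns True. Because expiry depends on comparing clocks, the TTL should comfortably exceed the expected clock skew between replicas.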

Caveats

  • Cron time zones: store an IANA time zone per job; DST bugs are common because a local time can be skipped or repeated on transition days.
  • Backfill storms after an outage: rate-limit catch-up executions.
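Rate limiting catch-up executions could be done with a token bucket, sketched below. `TokenBucket`, its parameters, and the injected clock are illustrative assumptions; the notes do not name a specific algorithm.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `burst`, then a steady `rate_per_s` executions per second."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate_per_s
        self._capacity = float(burst)
        self._tokens = float(burst)   # start full so normal traffic is not delayed
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        """Consume one token if available; return whether the job may run now."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

During catch-up, the scheduler would call `try_acquire()` before enqueuing each backfilled run and leave the job for a later tick when it returns False, so downstream systems see a bounded rate instead of the whole backlog at once.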

Azure

  • Durable Functions timers; Azure Functions + Storage Queues; Service Bus sessions for ordering.