Web Crawler at Scale
Problem statement
Continuously discover and fetch billions of pages, respect robots.txt and rate limits, deduplicate, and feed an indexing pipeline without getting blocked or melting origin sites.
How it works
- Scheduler prioritizes URLs (PageRank, freshness, sitemap hints).
- Fetcher pulls HTTP(S); politeness enforces per-host QPS using token buckets.
- Robots cached with TTL; refetch honoring `Cache-Control` hints when present.
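The per-host politeness rule above can be sketched as a token bucket keyed by host. This is a minimal single-process sketch; `HostTokenBucket`, `qps`, and `burst` are illustrative names, and a real deployment would coordinate the state via Redis as described later.

```python
import time
from collections import defaultdict

class HostTokenBucket:
    """Enforce a per-host fetch rate (QPS) with a token bucket per host."""
    def __init__(self, qps: float, burst: int = 2):
        self.qps = qps      # tokens refilled per second
        self.burst = burst  # max tokens a host may accumulate
        self.tokens = defaultdict(lambda: float(burst))
        self.last = defaultdict(time.monotonic)

    def try_acquire(self, host: str) -> bool:
        """Return True if a fetch to `host` is allowed right now."""
        now = time.monotonic()
        elapsed = now - self.last[host]
        self.last[host] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[host] = min(self.burst, self.tokens[host] + elapsed * self.qps)
        if self.tokens[host] >= 1.0:
            self.tokens[host] -= 1.0
            return True
        return False
```

Each host gets its own bucket, so a slow origin never throttles fetches to other hosts.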
Analogy: A polite librarian who reads the “only 2 books per minute” sign (robots.txt) before pulling books from each shelf (host).
High-level design
(architecture diagram)
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Seed / sitemap ingest | Bootstraps known URLs. | Gives legal starting points vs random IP scanning. |
| Frontier (Kafka per host shard) | Ordered work queue of URLs. | Kafka preserves per-host ordering for politeness while still parallelizing across hosts. |
| Fetcher workers | HTTP clients with pooling, DNS cache, robots cache. | Stateless scale-out unit; autoscale on queue lag. |
| S3 raw WARC/HTML | Durable storage of fetched bytes. | Replay parsing/indexing without re-fetching (saves bandwidth and respects sites). |
| Bloom filter | Probabilistic “probably seen URL” set. | Cheap dedup before hitting expensive DB checks (see bloom doc). |
| Indexing Spark jobs | Distributed batch builds inverted index. | TB-scale merges not feasible on one machine. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Frontier storage
- Kafka partitioned by `hash(host)` for ordering and fairness across hosts.
- Cassandra optional for URL metadata (last fetch, ETag).
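Partitioning by `hash(host)` can be sketched as below. A stable hash (not Python's salted built-in `hash`) keeps a host pinned to one partition across restarts; `NUM_PARTITIONS` is an assumed partition count, not from the source.

```python
import zlib
from urllib.parse import urlsplit

NUM_PARTITIONS = 64  # assumed Kafka partition count for illustration

def frontier_partition(url: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a URL to a frontier partition by stable hash of its host.

    All URLs for one host land on one partition, preserving per-host
    ordering for politeness while parallelizing across hosts.
    """
    host = urlsplit(url).netloc.lower()
    return zlib.crc32(host.encode()) % num_partitions
```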
Politeness
- Per-host semaphore in Redis, or a local token bucket coordinated via the redis-cell rate-limiting module.
- robots.txt parser cached in Memcached under key `robots:host`.
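The robots cache above can be sketched with the standard-library parser and a TTL check. A plain dict stands in for Memcached here; `cache_robots`, `allowed`, and the `ROBOTS_TTL` default are illustrative assumptions (a real deployment would derive TTL from `Cache-Control` when present).

```python
import time
from typing import Optional
from urllib import robotparser

ROBOTS_TTL = 3600.0  # assumed default TTL when no Cache-Control hint exists
_cache = {}          # stand-in for Memcached: "robots:<host>" -> (parser, fetched_at)

def cache_robots(host: str, robots_txt: str) -> None:
    """Parse robots.txt bytes a worker already fetched and cache the parser."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    _cache[f"robots:{host}"] = (rp, time.monotonic())

def allowed(host: str, path: str, agent: str = "mycrawler") -> Optional[bool]:
    """Return None on cache miss or TTL expiry (caller must refetch robots.txt)."""
    entry = _cache.get(f"robots:{host}")
    if entry is None or time.monotonic() - entry[1] > ROBOTS_TTL:
        return None
    parser, _fetched_at = entry
    return parser.can_fetch(agent, f"https://{host}{path}")
```

Returning `None` on expiry pushes the refetch decision to the fetcher, which already owns the network path and the politeness budget.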
Dedup
- Bloom filter (in-memory per worker plus a Redis Bloom cluster) answers “probably seen?”: a rare false positive only costs an extra authoritative DB check, while a negative means the URL is definitely new. SURT canonical URL normalization runs before the lookup.
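The dedup step can be sketched as canonicalization followed by a Bloom membership check. The canonicalizer below is a rough stand-in for full SURT normalization, and `BloomFilter` with its `m_bits`/`k` defaults is an illustrative in-memory sketch, not the Redis-backed cluster.

```python
import hashlib
from urllib.parse import urlsplit, urlencode, parse_qsl

def canonicalize(url: str) -> str:
    """Normalize a URL before the seen-check: lowercase scheme/host,
    drop the fragment, sort query parameters."""
    parts = urlsplit(url)
    host = parts.netloc.lower().rstrip(".")
    query = urlencode(sorted(parse_qsl(parts.query)))
    path = parts.path or "/"
    return f"{parts.scheme.lower()}://{host}{path}" + (f"?{query}" if query else "")

class BloomFilter:
    """Tiny Bloom filter using double hashing to derive k bit positions."""
    def __init__(self, m_bits: int = 1 << 20, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # All k bits set => "probably seen"; any bit clear => definitely new.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

Canonicalizing first is what makes the filter effective: `?b=2&a=1` and `?a=1&b=2` collapse to one key instead of two frontier entries.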
Rendering
- Headless Chrome pool only for domains known to need JS — cost guardrail.
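The routing decision can be sketched as a cheap gate in front of the Chrome pool. `JS_DOMAINS` and the byte/script-count heuristic are illustrative assumptions; real systems typically maintain the known-JS list from past render outcomes.

```python
JS_DOMAINS = {"app.example.com"}  # hypothetical allowlist of domains known to need JS

def needs_rendering(host: str, html: str) -> bool:
    """Route to the headless-Chrome pool only when cheap signals say so."""
    if host in JS_DOMAINS:
        return True
    # Heuristic: a near-empty document that is mostly <script> tags
    # suggests a client-rendered app.
    return html.count("<script") >= 3 and len(html) < 2048
```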
Compliance
- Do-not-crawl list; geo IP block for embargoed regions.
E2E: fetch one URL
(sequence diagram: fetch one URL)
Tricky parts
| Problem | Solution |
|---|---|
| DNS as bottleneck | Async resolver pool; cache aggressively |
| Redirect chains | Max depth; canonical URL detection |
| Infinite calendars | URL pattern denylist; depth budget |
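The redirect-chain mitigation above can be sketched as a bounded follow loop with loop detection. `follow_redirects`, the injected `fetch` callable, and `MAX_REDIRECTS` are illustrative names and defaults, not from the source.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # assumed depth budget

def follow_redirects(url, fetch, max_depth=MAX_REDIRECTS):
    """Follow a redirect chain up to max_depth hops.

    `fetch` is any callable returning (status_code, location_header_or_None).
    Raises ValueError on a loop or when the budget is exhausted.
    """
    seen = set()
    for _ in range(max_depth + 1):
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
        status, location = fetch(url)
        if status not in (301, 302, 303, 307, 308) or location is None:
            return url  # terminal response: this is the canonical target
        url = urljoin(url, location)  # resolve relative Location headers
    raise ValueError("redirect chain exceeded max_depth")
```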
Caveats
- Legal: ToS of target sites; copyright on snapshots.
- Ethics: robots.txt is only advisory in some jurisdictions, but honoring it is the industry norm; follow it.
Azure mapping
- Azure Event Hubs frontier; Blob Storage raw; Azure DNS; Playwright on Container Apps.