
Web Crawler at Scale

Problem statement

Continuously discover and fetch billions of pages, respect robots.txt and rate limits, deduplicate, and feed an indexing pipeline without getting blocked or melting origin sites.

How it works

  • Scheduler prioritizes URLs using signals such as PageRank, freshness, and sitemap hints.
  • Fetchers download pages over HTTP(S); a politeness layer enforces per-host QPS with token buckets.
  • robots.txt responses are cached with a TTL and refetched earlier when Cache-Control headers advise it.
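The robots.txt caching step can be sketched with Python's stdlib `urllib.robotparser` plus a simple TTL cache. This is a minimal sketch, not the production design: the class name `RobotsCache`, the default TTL, and the injected `fetch` callable are all illustrative, and a real deployment would honor Cache-Control hints rather than a fixed TTL.

```python
import time
import urllib.robotparser

class RobotsCache:
    """TTL cache of parsed robots.txt rules, keyed by host (illustrative sketch)."""

    def __init__(self, ttl_seconds=3600, fetch=None):
        self.ttl = ttl_seconds
        self.fetch = fetch   # callable: host -> robots.txt body (injected so tests need no network)
        self._cache = {}     # host -> (parser, fetched_at)

    def allowed(self, host: str, path: str, agent: str = "mycrawler") -> bool:
        entry = self._cache.get(host)
        if entry is None or time.time() - entry[1] > self.ttl:
            # Cache miss or stale entry: re-parse robots.txt for this host.
            parser = urllib.robotparser.RobotFileParser()
            parser.parse(self.fetch(host).splitlines())
            entry = (parser, time.time())
            self._cache[host] = entry
        return entry[0].can_fetch(agent, path)
```

Injecting `fetch` keeps the politeness logic testable offline and lets the same cache sit in front of whatever HTTP client the fetchers use.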

Analogy: A polite librarian who reads the “only 2 books per minute” sign (robots.txt) before pulling books from each shelf (host).

High-level design

[Diagram: high-level crawl architecture]

Components explained — this design

| Component | What it is | Why we use it here |
|---|---|---|
| Seed / sitemap ingest | Bootstraps known URLs. | Gives legitimate starting points vs. random IP scanning. |
| Frontier (Kafka, per-host shard) | Ordered work queue of URLs. | Kafka preserves per-host ordering for politeness while still parallelizing across hosts. |
| Fetcher workers | HTTP clients with connection pooling, DNS cache, robots cache. | Stateless scale-out unit; autoscales on queue lag. |
| S3 raw WARC/HTML | Durable storage of fetched bytes. | Replay parsing/indexing without re-fetching (saves bandwidth and respects sites). |
| Bloom filter | Probabilistic "probably seen this URL" set. | Cheap dedup before hitting expensive DB checks (see bloom doc). |
| Indexing Spark jobs | Distributed batch jobs that build the inverted index. | TB-scale merges are not feasible on one machine. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Frontier storage

  • Kafka partitioned by hash(host) for ordering and fairness across hosts.
  • Cassandra (optional) for URL metadata (last fetch time, ETag).
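Partitioning by hash(host) is what keeps each host's URLs on a single ordered partition while spreading distinct hosts across the cluster. A minimal sketch of the partition-key choice (the partition count and function name are illustrative; a real producer would pass the host as the Kafka message key):

```python
import hashlib
from urllib.parse import urlparse

NUM_PARTITIONS = 64  # illustrative; matches the frontier topic's partition count

def frontier_partition(url: str) -> int:
    """Map every URL from a given host to the same partition, preserving
    per-host ordering while parallelizing across hosts."""
    host = urlparse(url).netloc.lower()
    # Use a stable digest rather than Python's hash(), which varies per process.
    digest = hashlib.sha1(host.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
```

Because the digest depends only on the host, `https://example.com/a` and `http://example.com/b` land on the same partition, so one consumer sees that host's URLs in order.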

Politeness

  • Per-host limit via a Redis-backed semaphore, or a local token bucket coordinated through the redis-cell module (CL.THROTTLE).
  • Parsed robots.txt rules cached in Memcached under key robots:<host>.
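The local-token-bucket variant might look like the sketch below (the Redis-coordinated version would delegate this arithmetic to redis-cell's CL.THROTTLE instead). Class name and the injected clock are illustrative; the clock is injectable so the refill logic can be tested deterministically.

```python
import time

class TokenBucket:
    """Per-host rate limiter: refill `rate` tokens/sec up to `capacity` (sketch)."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now            # injectable clock for testing
        self.tokens = capacity    # start full
        self.last = now()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; never blocks."""
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A fetcher would keep one bucket per host and simply re-enqueue (or delay) URLs whose bucket returns False, rather than blocking a worker thread.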

Dedup

  • Bloom filter (in-memory per worker plus a Redis Bloom cluster): a false positive occasionally skips a never-seen URL, an accepted trade-off since there are no false negatives; URLs are normalized to SURT canonical form before the check.
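Both halves of the dedup path can be sketched in a few lines: a simplified SURT transform (real SURT handles more cases, e.g. query-parameter sorting) and a toy Bloom filter standing in for the RedisBloom cluster. Sizes and hash counts are illustrative.

```python
import hashlib
from urllib.parse import urlparse

def surt(url: str) -> str:
    """Simplified SURT canonical form: lowercase, strip leading 'www.',
    and reverse host labels so related hosts sort together."""
    p = urlparse(url.lower())
    host = p.netloc.split(":")[0]
    if host.startswith("www."):
        host = host[4:]
    return ",".join(reversed(host.split("."))) + ")" + (p.path or "/")

class BloomFilter:
    """Toy Bloom filter: k hash positions in an m-bit array.
    False positives possible; false negatives are not."""

    def __init__(self, m_bits: int = 1 << 20, k: int = 5):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Normalizing to SURT first is what makes `http://www.example.com/a` and `https://example.com/a` hash to the same filter entry.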

Rendering

  • Headless Chrome pool reserved for domains known to require JavaScript rendering (a cost guardrail).

Compliance

  • Do-not-crawl list; GeoIP blocking for embargoed regions.

E2E: fetch one URL

[Diagram: end-to-end fetch of a single URL]

Tricky parts

| Problem | Solution |
|---|---|
| DNS as a bottleneck | Async resolver pool; cache aggressively |
| Redirect chains | Cap chain depth; canonical URL detection |
| Infinite calendars | URL-pattern denylist; depth budget |
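The redirect-chain mitigation is a loop with a hop budget and cycle detection. A sketch under stated assumptions: `fetch` is a hypothetical injected callable returning `(status, location_or_body)` so the logic is testable without network I/O, and the default depth of 5 is illustrative.

```python
class RedirectLoopError(Exception):
    pass

def fetch_following_redirects(url, fetch, max_depth=5):
    """Follow 3xx Location hops up to max_depth; raise on cycles or
    chains that exceed the budget.
    `fetch` is any callable: url -> (status_code, location_or_body)."""
    seen = set()
    for _ in range(max_depth + 1):
        if url in seen:
            raise RedirectLoopError(f"redirect cycle at {url}")
        seen.add(url)
        status, payload = fetch(url)
        if 300 <= status < 400:
            url = payload  # payload is the Location target here
            continue
        return url, status, payload
    raise RedirectLoopError(f"redirect chain exceeded {max_depth} hops")
```

Returning the final URL alongside the body lets the dedup layer record the canonical target, so the same destination reached via different chains is only stored once.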

Caveats

  • Legal: respect target sites' terms of service; stored snapshots may raise copyright questions.
  • Ethics: robots.txt is only advisory in many jurisdictions, but honoring it is the industry norm; follow it.

Azure mapping

  • Azure Event Hubs as the frontier; Blob Storage for raw WARC/HTML; Azure DNS; Playwright on Container Apps for JS rendering.