Web Crawler at Scale
Problem statement
Continuously discover and fetch billions of pages, respect robots.txt and rate limits, deduplicate, and feed an indexing pipeline without getting blocked or melting origin sites.
How it works
- Scheduler prioritizes URLs (PageRank, freshness, sitemap hints).
- Fetcher pulls HTTP(S); politeness enforces per-host QPS using token buckets.
- Robots cached with TTL; refetch honoring `Cache-Control` hints when present.
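The per-host politeness rule above can be sketched as a token bucket keyed by host. This is a minimal single-process sketch; `HostTokenBucket`, `qps`, and `burst` are illustrative names, and a real deployment would coordinate the state via Redis as described later.

```python
import time
from collections import defaultdict

class HostTokenBucket:
    """Enforce a per-host fetch rate (QPS) with a token bucket per host."""
    def __init__(self, qps: float, burst: int = 2):
        self.qps = qps      # tokens refilled per second
        self.burst = burst  # max tokens a host may accumulate
        self.tokens = defaultdict(lambda: float(burst))
        self.last = defaultdict(time.monotonic)

    def try_acquire(self, host: str) -> bool:
        """Return True if a fetch to `host` is allowed right now."""
        now = time.monotonic()
        elapsed = now - self.last[host]
        self.last[host] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[host] = min(self.burst, self.tokens[host] + elapsed * self.qps)
        if self.tokens[host] >= 1.0:
            self.tokens[host] -= 1.0
            return True
        return False
```

Each host gets its own bucket, so a slow origin never throttles fetches to other hosts.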
Analogy: A polite librarian who reads the “only 2 books per minute” sign (robots.txt) before pulling books from each shelf (host).
High-level design
(architecture diagram)
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Seed / sitemap ingest | Bootstraps known URLs. | Gives legal starting points vs random IP scanning. |
| Frontier (Kafka per host shard) | Ordered work queue of URLs. | Kafka preserves per-host ordering for politeness while still parallelizing across hosts. |
| Fetcher workers | HTTP clients with pooling, DNS cache, robots cache. | Stateless scale-out unit; autoscale on queue lag. |
| S3 raw WARC/HTML | Durable storage of fetched bytes. | Replay parsing/indexing without re-fetching (saves bandwidth and respects sites). |
| Bloom filter | Probabilistic “probably seen URL” set. | Cheap dedup before hitting expensive DB checks (see bloom doc). |
| Indexing Spark jobs | Distributed batch builds inverted index. | TB-scale merges not feasible on one machine. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Frontier storage
- Kafka partitioned by `hash(host)` for ordering and fairness across hosts.
- Cassandra optional for URL metadata (last fetch, ETag).
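Partitioning by `hash(host)` can be sketched as below. A stable hash (not Python's salted built-in `hash`) keeps a host pinned to one partition across restarts; `NUM_PARTITIONS` is an assumed partition count, not from the source.

```python
import zlib
from urllib.parse import urlsplit

NUM_PARTITIONS = 64  # assumed Kafka partition count for illustration

def frontier_partition(url: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a URL to a frontier partition by stable hash of its host.

    All URLs for one host land on one partition, preserving per-host
    ordering for politeness while parallelizing across hosts.
    """
    host = urlsplit(url).netloc.lower()
    return zlib.crc32(host.encode()) % num_partitions
```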
Politeness
- Per-host semaphore in Redis, or a local token bucket coordinated via the redis-cell rate-limiting module.
- robots.txt parser cached in Memcached under key `robots:host`.
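The robots cache above can be sketched with the standard-library parser and a TTL check. A plain dict stands in for Memcached here; `cache_robots`, `allowed`, and the `ROBOTS_TTL` default are illustrative assumptions (a real deployment would derive TTL from `Cache-Control` when present).

```python
import time
from typing import Optional
from urllib import robotparser

ROBOTS_TTL = 3600.0  # assumed default TTL when no Cache-Control hint exists
_cache = {}          # stand-in for Memcached: "robots:<host>" -> (parser, fetched_at)

def cache_robots(host: str, robots_txt: str) -> None:
    """Parse robots.txt bytes a worker already fetched and cache the parser."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    _cache[f"robots:{host}"] = (rp, time.monotonic())

def allowed(host: str, path: str, agent: str = "mycrawler") -> Optional[bool]:
    """Return None on cache miss or TTL expiry (caller must refetch robots.txt)."""
    entry = _cache.get(f"robots:{host}")
    if entry is None or time.monotonic() - entry[1] > ROBOTS_TTL:
        return None
    parser, _fetched_at = entry
    return parser.can_fetch(agent, f"https://{host}{path}")
```

Returning `None` on expiry pushes the refetch decision to the fetcher, which already owns the network path and the politeness budget.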
Dedup
- Bloom filter (in-memory per worker plus a Redis Bloom cluster) answers “probably seen?”: a rare false positive only costs an extra authoritative DB check, while a negative means the URL is definitely new. SURT canonical URL normalization runs before the lookup.
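The dedup step can be sketched as canonicalization followed by a Bloom membership check. The canonicalizer below is a rough stand-in for full SURT normalization, and `BloomFilter` with its `m_bits`/`k` defaults is an illustrative in-memory sketch, not the Redis-backed cluster.

```python
import hashlib
from urllib.parse import urlsplit, urlencode, parse_qsl

def canonicalize(url: str) -> str:
    """Normalize a URL before the seen-check: lowercase scheme/host,
    drop the fragment, sort query parameters."""
    parts = urlsplit(url)
    host = parts.netloc.lower().rstrip(".")
    query = urlencode(sorted(parse_qsl(parts.query)))
    path = parts.path or "/"
    return f"{parts.scheme.lower()}://{host}{path}" + (f"?{query}" if query else "")

class BloomFilter:
    """Tiny Bloom filter using double hashing to derive k bit positions."""
    def __init__(self, m_bits: int = 1 << 20, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # All k bits set => "probably seen"; any bit clear => definitely new.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

Canonicalizing first is what makes the filter effective: `?b=2&a=1` and `?a=1&b=2` collapse to one key instead of two frontier entries.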
Rendering
- Headless Chrome pool only for domains known to need JS — cost guardrail.
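The routing decision can be sketched as a cheap gate in front of the Chrome pool. `JS_DOMAINS` and the byte/script-count heuristic are illustrative assumptions; real systems typically maintain the known-JS list from past render outcomes.

```python
JS_DOMAINS = {"app.example.com"}  # hypothetical allowlist of domains known to need JS

def needs_rendering(host: str, html: str) -> bool:
    """Route to the headless-Chrome pool only when cheap signals say so."""
    if host in JS_DOMAINS:
        return True
    # Heuristic: a near-empty document that is mostly <script> tags
    # suggests a client-rendered app.
    return html.count("<script") >= 3 and len(html) < 2048
```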
Compliance
- Do-not-crawl list; geo IP block for embargoed regions.
E2E: fetch one URL
(sequence diagram: fetch one URL)
Tricky parts
| Problem | Solution |
|---|---|
| DNS as bottleneck | Async resolver pool; cache aggressively |
| Redirect chains | Max depth; canonical URL detection |
| Infinite calendars | URL pattern denylist; depth budget |
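The redirect-chain mitigation above can be sketched as a bounded follow loop with loop detection. `follow_redirects`, the injected `fetch` callable, and `MAX_REDIRECTS` are illustrative names and defaults, not from the source.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # assumed depth budget

def follow_redirects(url, fetch, max_depth=MAX_REDIRECTS):
    """Follow a redirect chain up to max_depth hops.

    `fetch` is any callable returning (status_code, location_header_or_None).
    Raises ValueError on a loop or when the budget is exhausted.
    """
    seen = set()
    for _ in range(max_depth + 1):
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
        status, location = fetch(url)
        if status not in (301, 302, 303, 307, 308) or location is None:
            return url  # terminal response: this is the canonical target
        url = urljoin(url, location)  # resolve relative Location headers
    raise ValueError("redirect chain exceeded max_depth")
```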
Caveats
- Legal: ToS of target sites; copyright on snapshots.
- Ethics: robots.txt is only advisory in some jurisdictions, but honoring it is the industry norm; follow it.
Azure mapping
- Azure Event Hubs frontier; Blob Storage raw; Azure DNS; Playwright on Container Apps.