Search Engine (web-scale)
Problem statement
Crawl the web (or corpus), index documents, serve low-latency ranked queries with spelling correction and snippets.
How it works
- Crawler discovers URLs (BFS frontier), respects robots.txt, rate limits per host.
- Parser extracts text, title, anchors; pipeline cleans and tokenizes.
- Indexer builds inverted index:
  term → list of (doc_id, positions, tf).
- Query service retrieves postings, scores (BM25 + signals), returns top-k.
Analogy: A library card catalog where each word on a card points to every book containing that word — but distributed across millions of drawers (shards).
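The pipeline above can be sketched end to end. This is a minimal in-memory sketch (the tokenizer, AND-only query semantics, and document dict are illustrative assumptions, not the production design):

```python
from collections import defaultdict

def build_index(docs):
    """Build term -> [(doc_id, positions, tf)] from {doc_id: text}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, token in enumerate(text.lower().split()):
            positions[token].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list, len(pos_list)))
    return index

def search(index, query):
    """AND semantics: return doc_ids containing every query term."""
    term_docs = [{doc_id for doc_id, _, _ in index.get(t, [])}
                 for t in query.lower().split()]
    return set.intersection(*term_docs) if term_docs else set()
```

A real indexer would add stemming, positional phrase matching, and scored (not boolean) retrieval, but the data shape of the postings list is the same.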
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| URL frontier (Redis/Kafka) | Work queue of URLs to fetch. | Kafka partitions by host enforce politeness and parallelism without hammering one domain from all workers. |
| Crawler workers | Fetch HTTP, respect robots, emit raw pages. | Stateless horizontal scale; failures retry with backoff per host. |
| Parser + NLP | Extract text, title, anchors; normalize encoding. | Feeds clean tokens to indexer; reduces junk in inverted lists. |
| Index builder (Spark/MapReduce) | Batch builds inverted index segments. | Offline heavy computation; merges postings lists efficiently at TB scale. |
| Query router + shards | Splits queries to index shards; merges top results. | Each shard holds subset of postings; parallelism reduces query latency. |
| Feature store / PageRank store | Precomputed signals (popularity, quality). | Ranking needs signals beyond BM25; stored separately for fast join at query time. |
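The query router's merge step is a scatter-gather: each shard returns its local top-k, and the router combines them into a global top-k. A minimal sketch, assuming each shard result is a list of (score, doc_id) pairs (the shape is an illustrative assumption):

```python
import heapq
from itertools import chain

def merge_topk(shard_results, k):
    """Merge per-shard (score, doc_id) lists into the global top-k by score."""
    return heapq.nlargest(k, chain.from_iterable(shard_results),
                          key=lambda sd: sd[0])
```

Because each shard only needs to return k candidates, fan-out cost stays proportional to shards × k rather than total postings size.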
Shared definitions: 00-glossary-common-services.md
Low-level design
Frontier & politeness
- Per-host queues in Redis ZSET (next_fetch_time) to enforce crawl-delay.
- Bloom filter of seen URLs to avoid duplicate enqueue (tunable false-positive rate).
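The per-host scheduling logic can be shown with an in-memory stand-in for the Redis ZSET (a heap keyed by next_fetch_time; class and method names are illustrative):

```python
import heapq
from urllib.parse import urlparse

class PoliteFrontier:
    """Per-host queues; a host is never fetched before its crawl-delay elapses."""

    def __init__(self, crawl_delay=1.0):
        self.crawl_delay = crawl_delay
        self.heap = []    # (next_fetch_time, host) -- the ZSET stand-in
        self.queues = {}  # host -> pending URLs

    def add(self, url, now):
        host = urlparse(url).netloc
        if host not in self.queues:
            self.queues[host] = []
            heapq.heappush(self.heap, (now, host))
        self.queues[host].append(url)

    def next_url(self, now):
        # Only release a host whose next_fetch_time has passed.
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.queues[host].pop(0)
        if self.queues[host]:
            # Re-schedule the host crawl_delay seconds in the future.
            heapq.heappush(self.heap, (now + self.crawl_delay, host))
        return url
```

In Redis the same idea is a ZSET scored by next_fetch_time, popped only when the lowest score is ≤ now, so all workers share one politeness clock per host.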
Storage
- Raw HTML: S3 for replay/debug.
- Index: OpenSearch / Elasticsearch per shard; or custom Lucene on disk per node.
- Link graph: Bigtable / Cassandra for PageRank batch jobs in Spark.
Ranking
- BM25 baseline + learning-to-rank with click logs in ClickHouse.
- Freshness boost for news queries (detect QDF — query deserves freshness).
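For reference, the BM25 baseline for a single term is small enough to write out (k1 and b are the conventional defaults; the +1 inside the log is the Lucene-style smoothing that keeps IDF non-negative):

```python
import math

def bm25(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """BM25 contribution of one term to one document's score."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # tf saturates via k1; b controls document-length normalization.
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```

Learned ranking then treats this score as one feature alongside popularity, freshness, and click signals rather than as the final answer.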
Query path
- Spellcheck: SymSpell index sidecar or Elasticsearch suggester.
- Snippets: positions in inverted list → fetch window from stored forward index (doc store).
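Snippet generation follows directly from the positions: the inverted list says where the term occurs, the forward index supplies the tokens, and the service cuts a window around a hit. A minimal sketch (window size and the first-hit choice are illustrative assumptions; real snippeting scores candidate windows):

```python
def snippet(tokens, hit_positions, window=5):
    """Cut a window of tokens around the first hit position."""
    if not hit_positions:
        return ""
    center = hit_positions[0]
    start = max(0, center - window)
    end = min(len(tokens), center + window + 1)
    return " ".join(tokens[start:end])
```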
E2E: user query
Tricky parts
| Problem | Solution |
|---|---|
| Hot queries | CDN for instant answers; Redis top-query cache |
| Shard skew | Partition by hash(term) not alphabet (avoid “a” overload) |
| Deep pagination | search_after cursors; discourage page 1000 |
| JavaScript-heavy sites | Headless Chrome render farm — expensive, selective |
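The shard-skew fix is simple in code: a stable hash spreads hot prefixes ("a…", "the…") evenly, where alphabetical ranges would overload one shard. A sketch (MD5 and the shard count are illustrative choices; any stable hash works):

```python
import hashlib

def shard_for(term, n_shards=8):
    """Route a term to a shard by stable hash, not alphabetical range."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_shards
```

Note Python's built-in hash() is salted per process, so a cryptographic or explicitly seeded hash is needed for routing that must agree across machines.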
Caveats
- Legal: copyright on snippets; GDPR right to be forgotten → tombstone in index.
- Spam: TrustRank, Penguin-like link penalties — ongoing ML ops.
Managed options
- Amazon OpenSearch Service, Azure AI Search, Elastic Cloud — trade control for ops savings.