
Search Engine (web-scale)

Problem statement

Crawl the web (or a corpus), index the documents, and serve low-latency ranked queries with spelling correction and snippets.

How it works

  1. Crawler discovers URLs (BFS frontier), respects robots.txt, and rate-limits per host.
  2. Parser extracts text, title, anchors; pipeline cleans and tokenizes.
  3. Indexer builds inverted index: term → list of (doc_id, positions, tf).
  4. Query service retrieves postings, scores (BM25 + signals), returns top-k.
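The indexer's core data structure (step 3) can be sketched in a few lines. Tokenization here is a naive lowercase split, standing in for the real parser/NLP stage:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build term -> list of (doc_id, positions, tf) postings.

    `docs` maps doc_id -> raw text. Positions are token offsets,
    which later feed phrase matching and snippet extraction.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, token in enumerate(text.lower().split()):
            positions[token].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list, len(pos_list)))
    return index

docs = {1: "the quick brown fox", 2: "the lazy dog the fox"}
index = build_inverted_index(docs)
# index["fox"] -> [(1, [3], 1), (2, [4], 1)]
```

A production indexer would write compressed, immutable segment files and merge them in the background; the in-memory dict above only shows the shape of the postings.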

Analogy: A library card catalog where each word on a card points to every book containing that word — but distributed across millions of drawers (shards).

High-level design


Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| URL frontier (Redis/Kafka) | Work queue of URLs to fetch. | Kafka partitions by host enforce politeness and parallelism without hammering one domain from all workers. |
| Crawler workers | Fetch HTTP, respect robots, emit raw pages. | Stateless horizontal scale; failures retry with backoff per host. |
| Parser + NLP | Extract text, title, anchors; normalize encoding. | Feeds clean tokens to the indexer; reduces junk in inverted lists. |
| Index builder (Spark/MapReduce) | Batch-builds inverted index segments. | Offline heavy computation; merges postings lists efficiently at TB scale. |
| Query router + shards | Splits queries across index shards; merges top results. | Each shard holds a subset of postings; parallelism reduces query latency. |
| Feature store / PageRank store | Precomputed signals (popularity, quality). | Ranking needs signals beyond BM25; stored separately for fast join at query time. |
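The router's merge step can be sketched as a k-way merge, assuming each shard already returns its own top-k sorted by descending score (a simplification of real scatter-gather, which also handles timeouts and partial results):

```python
import heapq
import itertools

def merge_topk(shard_results, k):
    """Merge per-shard (score, doc_id) lists into a global top-k.

    Each shard list must be sorted by descending score; the router
    then needs only a k-way merge, never a full re-sort.
    """
    merged = heapq.merge(*shard_results, key=lambda r: -r[0])
    return list(itertools.islice(merged, k))

shard_a = [(0.9, "d1"), (0.4, "d3")]
shard_b = [(0.8, "d2"), (0.1, "d4")]
top = merge_topk([shard_a, shard_b], 3)
# top -> [(0.9, "d1"), (0.8, "d2"), (0.4, "d3")]
```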

Shared definitions: 00-glossary-common-services.md

Low-level design

Frontier & politeness

  • Per-host queues; a Redis ZSET keyed by next_fetch_time enforces crawl-delay.
  • Bloom filter of seen URLs to avoid duplicate enqueue (tunable false-positive rate).
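The per-host scheduling can be sketched in memory: a min-heap of (next_fetch_time, host) pairs plays the role of the Redis ZSET (ZADD on enqueue, "pop hosts whose score ≤ now" on poll). Class and field names here are illustrative:

```python
import heapq

class Frontier:
    """In-memory sketch of per-host politeness scheduling."""

    def __init__(self, crawl_delay=1.0):
        self.delay = crawl_delay
        self.ready = []        # min-heap of (next_fetch_time, host)
        self.queues = {}       # host -> pending URLs for that host
        self.scheduled = set() # hosts currently on the heap

    def add(self, host, url, now):
        self.queues.setdefault(host, []).append(url)
        if host not in self.scheduled:
            self.scheduled.add(host)
            heapq.heappush(self.ready, (now, host))

    def poll(self, now):
        """Return a URL whose host is due, or None if nothing is due."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].pop(0)
        if self.queues[host]:
            # Re-schedule the host no earlier than crawl_delay from now.
            heapq.heappush(self.ready, (now + self.delay, host))
        else:
            self.scheduled.discard(host)
        return url
```

Workers that poll and get None simply back off; no worker can fetch the same host twice within crawl_delay, which is the politeness property the ZSET buys at scale.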

Storage

  • Raw HTML: S3 for replay/debug.
  • Index: OpenSearch / Elasticsearch per shard; or custom Lucene on disk per node.
  • Link graph: Bigtable / Cassandra, consumed by Spark batch jobs that compute PageRank.

Ranking

  • BM25 baseline + learning-to-rank with click logs in ClickHouse.
  • Freshness boost for news queries (detect QDF — query deserves freshness).
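The BM25 baseline fits in a few lines. This sketch uses one common formulation of the per-term score (the IDF variant with +1 inside the log, similar to Lucene's); parameter defaults k1=1.2, b=0.75 are conventional, not mandated:

```python
import math

def bm25(tf, df, doc_len, N, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one term to one document's score.

    tf: term frequency in the document, df: documents containing the term,
    N: total documents, doc_len/avg_len: length normalization inputs.
    """
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm
```

A query's score is the sum of bm25(...) over its terms; the learning-to-rank stage then re-ranks the BM25 top candidates using click-log features.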

Query path

  • Spellcheck: SymSpell index sidecar or Elasticsearch suggester.
  • Snippets: term positions from the inverted list select a window fetched from the stored forward index (doc store).

E2E: user query


Tricky parts

| Problem | Solution |
| --- | --- |
| Hot queries | CDN for instant answers; Redis top-query cache |
| Shard skew | Partition by hash(term), not alphabetically (avoids "a"-prefix overload) |
| Deep pagination | search_after cursors; discourage page 1000 |
| JavaScript-heavy sites | Headless Chrome render farm — expensive, so use selectively |
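The hash(term) partitioning can be sketched with any stable hash. This example uses md5 because Python's built-in hash() is salted per process, and the index builder and query router must agree on the same shard for a term:

```python
import hashlib

def shard_for(term, num_shards):
    """Route a term to a shard via a stable hash, not alphabet ranges.

    Hashing spreads common prefixes ("a...", "the...") across all
    shards instead of piling them onto one alphabetical range.
    """
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```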

Caveats

  • Legal: copyright on snippets; GDPR right to be forgotten → tombstone in index.
  • Spam: TrustRank, Penguin-like link penalties — ongoing ML ops.

Managed options

  • Amazon OpenSearch Service, Azure AI Search, Elastic Cloud — trade control for ops savings.