
Search Engine (web-scale)

Problem statement

Crawl the web (or a corpus), index the documents, and serve low-latency ranked queries with spelling correction and snippets.

How it works

  1. Crawler discovers URLs (BFS frontier), respects robots.txt, and rate-limits per host.
  2. Parser extracts text, title, anchors; pipeline cleans and tokenizes.
  3. Indexer builds inverted index: term → list of (doc_id, positions, tf).
  4. Query service retrieves postings, scores (BM25 + signals), returns top-k.
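The indexer's core data structure (step 3) can be sketched in a few lines. Tokenization here is a naive lowercase split, standing in for the real parser/NLP stage:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build term -> list of (doc_id, positions, tf) postings.

    `docs` maps doc_id -> raw text. Positions are token offsets,
    which later feed phrase matching and snippet extraction.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, token in enumerate(text.lower().split()):
            positions[token].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list, len(pos_list)))
    return index

docs = {1: "the quick brown fox", 2: "the lazy dog the fox"}
index = build_inverted_index(docs)
# index["fox"] -> [(1, [3], 1), (2, [4], 1)]
```

A production indexer would write compressed, immutable segment files and merge them in the background; the in-memory dict above only shows the shape of the postings.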

Analogy: A library card catalog where each word on a card points to every book containing that word — but distributed across millions of drawers (shards).

High-level design


Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| URL frontier (Redis/Kafka) | Work queue of URLs to fetch. | Kafka partitions by host enforce politeness and parallelism without hammering one domain from all workers. |
| Crawler workers | Fetch HTTP, respect robots, emit raw pages. | Stateless horizontal scale; failures retry with backoff per host. |
| Parser + NLP | Extract text, title, anchors; normalize encoding. | Feeds clean tokens to the indexer; reduces junk in inverted lists. |
| Index builder (Spark/MapReduce) | Batch-builds inverted index segments. | Offline heavy computation; merges postings lists efficiently at TB scale. |
| Query router + shards | Splits queries across index shards; merges top results. | Each shard holds a subset of postings; parallelism reduces query latency. |
| Feature store / PageRank store | Precomputed signals (popularity, quality). | Ranking needs signals beyond BM25; stored separately for fast join at query time. |
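The router's merge step can be sketched as a k-way merge, assuming each shard already returns its own top-k sorted by descending score (a simplification of real scatter-gather, which also handles timeouts and partial results):

```python
import heapq
import itertools

def merge_topk(shard_results, k):
    """Merge per-shard (score, doc_id) lists into a global top-k.

    Each shard list must be sorted by descending score; the router
    then needs only a k-way merge, never a full re-sort.
    """
    merged = heapq.merge(*shard_results, key=lambda r: -r[0])
    return list(itertools.islice(merged, k))

shard_a = [(0.9, "d1"), (0.4, "d3")]
shard_b = [(0.8, "d2"), (0.1, "d4")]
top = merge_topk([shard_a, shard_b], 3)
# top -> [(0.9, "d1"), (0.8, "d2"), (0.4, "d3")]
```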

Shared definitions: 00-glossary-common-services.md

Low-level design

Frontier & politeness

  • Per-host queues; a Redis ZSET keyed by next_fetch_time enforces crawl-delay.
  • Bloom filter of seen URLs to avoid duplicate enqueue (tunable false-positive rate).
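The per-host scheduling can be sketched in memory: a min-heap of (next_fetch_time, host) pairs plays the role of the Redis ZSET (ZADD on enqueue, "pop hosts whose score ≤ now" on poll). Class and field names here are illustrative:

```python
import heapq

class Frontier:
    """In-memory sketch of per-host politeness scheduling."""

    def __init__(self, crawl_delay=1.0):
        self.delay = crawl_delay
        self.ready = []        # min-heap of (next_fetch_time, host)
        self.queues = {}       # host -> pending URLs for that host
        self.scheduled = set() # hosts currently on the heap

    def add(self, host, url, now):
        self.queues.setdefault(host, []).append(url)
        if host not in self.scheduled:
            self.scheduled.add(host)
            heapq.heappush(self.ready, (now, host))

    def poll(self, now):
        """Return a URL whose host is due, or None if nothing is due."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].pop(0)
        if self.queues[host]:
            # Re-schedule the host no earlier than crawl_delay from now.
            heapq.heappush(self.ready, (now + self.delay, host))
        else:
            self.scheduled.discard(host)
        return url
```

Workers that poll and get None simply back off; no worker can fetch the same host twice within crawl_delay, which is the politeness property the ZSET buys at scale.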

Storage

  • Raw HTML: S3 for replay/debug.
  • Index: OpenSearch / Elasticsearch per shard; or custom Lucene on disk per node.
  • Link graph: Bigtable / Cassandra, consumed by Spark batch jobs that compute PageRank.

Ranking

  • BM25 baseline + learning-to-rank with click logs in ClickHouse.
  • Freshness boost for news queries (detect QDF — query deserves freshness).
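The BM25 baseline fits in a few lines. This sketch uses one common formulation of the per-term score (the IDF variant with +1 inside the log, similar to Lucene's); parameter defaults k1=1.2, b=0.75 are conventional, not mandated:

```python
import math

def bm25(tf, df, doc_len, N, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one term to one document's score.

    tf: term frequency in the document, df: documents containing the term,
    N: total documents, doc_len/avg_len: length normalization inputs.
    """
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm
```

A query's score is the sum of bm25(...) over its terms; the learning-to-rank stage then re-ranks the BM25 top candidates using click-log features.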

Query path

  • Spellcheck: SymSpell index sidecar or Elasticsearch suggester.
  • Snippets: term positions from the inverted list select a window fetched from the stored forward index (doc store).

E2E: user query


Tricky parts

| Problem | Solution |
| --- | --- |
| Hot queries | CDN for instant answers; Redis top-query cache |
| Shard skew | Partition by hash(term), not alphabetically (avoids "a"-prefix overload) |
| Deep pagination | search_after cursors; discourage page 1000 |
| JavaScript-heavy sites | Headless Chrome render farm — expensive, so use selectively |
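The hash(term) partitioning can be sketched with any stable hash. This example uses md5 because Python's built-in hash() is salted per process, and the index builder and query router must agree on the same shard for a term:

```python
import hashlib

def shard_for(term, num_shards):
    """Route a term to a shard via a stable hash, not alphabet ranges.

    Hashing spreads common prefixes ("a...", "the...") across all
    shards instead of piling them onto one alphabetical range.
    """
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```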

Caveats

  • Legal: copyright on snippets; GDPR right to be forgotten → tombstone in index.
  • Spam: TrustRank, Penguin-like link penalties — ongoing ML ops.

Managed options

  • Amazon OpenSearch Service, Azure AI Search, Elastic Cloud — trade control for ops savings.