Design a Search Engine
Web crawling, inverted indexes, ranking algorithms (PageRank, BM25), query parsing, autocomplete, and serving results with low latency.
Problem Statement
A web search engine crawls the internet, indexes billions of pages, and returns the most relevant results for a query in under 200 ms. A production search engine is one of the most complex software systems ever built, but in an interview you can decompose it into four well-defined subsystems: crawling, indexing, ranking, and serving.
Requirements
- Crawl and index the web (assume ~10 B pages, re-crawl every 30 days).
- Return top-10 relevant results for a query in < 200 ms.
- Support autocomplete / query suggestions.
- Handle 100,000 queries per second at peak.
- Keep index fresh: new pages indexed within 24 hours.
- Support phrase search, boolean operators, and site: filter.
Capacity Estimation
| Metric | Estimate |
|---|---|
| Indexed pages | 10 B |
| Avg page size (text after stripping HTML) | 20 KB |
| Total index size (raw text) | ~200 TB |
| Inverted index size (~3x compression) | ~60 TB |
| Query QPS | 100,000/sec |
| New pages/day to index | ~300 M |
| Crawl throughput needed | ~4,000 pages/sec (10 B pages / 30-day re-crawl) |
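The table's figures follow from a few lines of arithmetic. A quick sanity check (back-of-envelope, using the assumptions above):

```python
# Back-of-envelope check of the capacity estimates above.
pages = 10e9                          # indexed pages
page_kb = 20                          # avg text per page after stripping HTML
raw_tb = pages * page_kb / 1e9        # KB -> TB
index_tb = raw_tb / 3                 # ~3x compression in the inverted index
recrawl_qps = pages / (30 * 86400)    # re-crawl every page every 30 days

print(f"raw text: ~{raw_tb:.0f} TB")               # ~200 TB
print(f"inverted index: ~{index_tb:.0f} TB")       # ~67 TB
print(f"crawl rate: ~{recrawl_qps:.0f} pages/sec") # ~3,858 pages/sec
```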
High-Level Architecture
Web Crawler
The crawler is a distributed system with three components: a URL Frontier (priority queue of URLs to crawl), Fetcher workers (download HTML, respect `robots.txt`), and a Link Extractor (parse HTML, extract new URLs, feed back to frontier).
- Politeness: add domain-level rate limiting — crawl the same domain at most once per second. Use a per-domain queue with a delay.
- Deduplication: use a Bloom filter (false-positive OK) to avoid recrawling already-seen URLs. For 10 B URLs at a ~1% false-positive rate (~10 bits per URL), the filter needs ~12 GB of memory.
- Priority: prioritize high-PageRank pages and frequently updated sites (news, Wikipedia). Use a priority queue with scores.
- DNS caching: cache DNS lookups per domain for 5 minutes to avoid DNS becoming a bottleneck.
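The frontier's politeness and dedup logic can be sketched in a few lines. This is a minimal single-process illustration (class and method names are illustrative, not a real crawler API): each domain gets its own FIFO queue, and a heap of `(next_allowed_fetch_time, domain)` entries enforces the one-request-per-second rule. A plain `set` stands in for the Bloom filter, which you would use at real scale.

```python
import heapq
import time
from collections import defaultdict, deque

CRAWL_DELAY = 1.0  # seconds between requests to the same domain

class URLFrontier:
    def __init__(self):
        self.domain_queues = defaultdict(deque)  # domain -> URLs waiting
        self.ready = []                          # heap of (next_fetch_time, domain)
        self.seen = set()                        # stand-in for a Bloom filter

    def add(self, url: str) -> None:
        if url in self.seen:                     # dedup before enqueueing
            return
        self.seen.add(url)
        domain = url.split("/")[2]               # crude netloc extraction
        if not self.domain_queues[domain]:       # domain not scheduled yet
            heapq.heappush(self.ready, (time.monotonic(), domain))
        self.domain_queues[domain].append(url)

    def next_url(self):
        """Return the next URL whose domain may be fetched now, else None."""
        while self.ready:
            next_time, domain = self.ready[0]
            if next_time > time.monotonic():
                return None                      # every domain is cooling down
            heapq.heappop(self.ready)
            queue = self.domain_queues[domain]
            if not queue:
                continue
            url = queue.popleft()
            if queue:                            # re-schedule after the delay
                heapq.heappush(self.ready,
                               (time.monotonic() + CRAWL_DELAY, domain))
            return url
        return None
```

In a distributed crawler the same idea is partitioned: URLs are routed to fetcher workers by domain hash, so each domain's delay is enforced by exactly one worker.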
Inverted Index
The inverted index maps each term to a posting list: the sorted list of document IDs (and metadata) containing that term. It is the core data structure of every search engine.
Inverted Index (simplified):
"python" → [(docId=42, tf=5, pos=[12,34,67]), (docId=101, tf=2, pos=[3,19]), ...]
"design" → [(docId=7, tf=1, pos=[4]), (docId=42, tf=3, pos=[1,5,89]), ...]
"system" → [(docId=7, tf=8, pos=[2,6,...]), ...]
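A toy in-memory version of this structure, built from raw documents (a sketch for illustration, not a production indexer), keeps the same posting layout: doc ID, term frequency, and term positions, with posting lists sorted by doc ID:

```python
from collections import defaultdict

def build_index(docs: dict[int, str]):
    """term -> [(doc_id, tf, positions), ...], posting lists sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

docs = {
    7: "system design of a large system",
    42: "python system design notes",
}
index = build_index(docs)
print(index["system"])  # [(7, 2, [0, 5]), (42, 1, [1])]
```

The positions make phrase search possible: "system design" matches a document only if some position of "design" is exactly one greater than a position of "system".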
To answer query "python system design":
1. Look up posting lists for each term
2. Intersect docId lists (AND query)
3. Score each document: BM25(tf, idf, doc_length)
4. Re-rank with PageRank signal
5. Return top 10
Ranking: BM25 + PageRank
Two signals dominate search ranking: BM25 (relevance of the document to the query based on term frequency and inverse document frequency) and PageRank (authority of the page based on inbound links). Modern search uses learned ranking models (LambdaRank, neural IR), but BM25 + PageRank is the canonical interview answer.
| Signal | Measures | Computed |
|---|---|---|
| TF (Term Frequency) | How often the term appears in the doc | At index time |
| IDF (Inverse Doc Frequency) | How rare the term is across all docs | At index time |
| BM25 Score | Relevance of doc to query (combines TF, IDF, doc length) | At query time |
| PageRank | Authority based on inbound link graph | Offline batch (Spark) |
| Final Score | Weighted combination of BM25 + PageRank + freshness + clicks | At query time |
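The scoring steps above can be sketched directly. This is a hedged illustration: `k1`, `b`, and the PageRank weight `w_pr` are common illustrative defaults, not tuned production values, and real engines learn the blend weights from click data.

```python
import math

def bm25(tf: int, df: int, doc_len: int, avg_len: float, n_docs: int,
         k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term to one document."""
    # IDF: rare terms (small df) score higher than common ones.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # TF saturation with length normalization: long docs are penalized via b.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm

def final_score(bm25_score: float, pagerank: float, w_pr: float = 0.2) -> float:
    """Simple linear blend of relevance and authority (weights illustrative)."""
    return (1 - w_pr) * bm25_score + w_pr * pagerank
```

Note the division of labor: TF, IDF, and doc lengths are stored at index time, so the query-time work is just this arithmetic over the intersected posting lists.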
Query Serving
Autocomplete
Autocomplete returns the top-K completions for a query prefix in < 50 ms. The two main approaches are a Trie (prefix tree) and a database-backed approach with prefix indexes. For 100 M common queries, a Trie fits in RAM (~5 GB). Each node stores the top-10 completions pre-sorted by search frequency, updated daily by a batch job that processes query logs.
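A minimal version of the pre-computed-top-K trie (a sketch, with K=3 instead of 10 for brevity): because each node caches its best completions, a lookup costs O(len(prefix)) with no subtree traversal at query time.

```python
K = 3  # top-K completions cached per node (top-10 in the design above)

class TrieNode:
    __slots__ = ("children", "top")
    def __init__(self):
        self.children = {}
        self.top = []  # [(query, frequency)], sorted by frequency desc

class Autocomplete:
    def __init__(self):
        self.root = TrieNode()

    def add(self, query: str, freq: int) -> None:
        """Insert a query; every prefix node keeps only its K best completions."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top.append((query, freq))
            node.top.sort(key=lambda item: -item[1])
            del node.top[K:]

    def suggest(self, prefix: str) -> list[str]:
        """Walk to the prefix node and return its cached completions."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for q, _ in node.top]
```

In production the trie is rebuilt (or incrementally updated) by the daily batch job over query logs, then shipped to the serving fleet as an immutable snapshot.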
Scaling Considerations
- Index sharding: shard the inverted index by term hash. Each shard holds a subset of the term dictionary. Queries fan out to all shards in parallel.
- Index replication: replicate each shard for fault tolerance and read parallelism.
- Caching: cache results for the top 10% of queries (which account for ~90% of traffic) in Redis with a 5-minute TTL.
- Incremental indexing: new pages are indexed into a small 'delta' index served alongside the main index. Merged periodically in a Spark job.
- PageRank computation: run as a Spark GraphX job over the entire link graph nightly. Store scores in a key-value store read by the ranker.
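The PageRank job itself is power iteration over the link graph. A single-process sketch on a toy graph (at web scale this runs distributed, as noted above; the damping factor 0.85 is the conventional choice):

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    """Power iteration: rank flows along out-links, damped by d."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}   # teleport mass
        for src, outs in links.items():
            if not outs:                        # dangling node: spread evenly
                for p in pages:
                    new[p] += d * rank[src] / n
            else:
                for dst in outs:
                    new[dst] += d * rank[src] / len(outs)
        rank = new
    return rank

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(graph)
# "a" receives links from both b and c, so it ends with the highest score
```

Ranks always sum to 1, so they can be stored as-is in the key-value store and blended into the final score without renormalization.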
Interview Tip
Don't try to design all of Google in 45 minutes. Focus on: (1) the inverted index data structure and how queries traverse it, (2) the two-phase query pipeline (BM25 scoring on index nodes, re-ranking at the query service), and (3) the crawler's politeness and deduplication mechanisms. These three topics cover 80% of what interviewers probe.
Practice this pattern
Design a web search engine like Google