Design a Search Engine
Web crawling, inverted indexes, ranking algorithms (PageRank, BM25), query parsing, autocomplete, and serving results with low latency.
Problem Statement
A web search engine crawls the internet, indexes billions of pages, and returns the most relevant results for a query in under 200 ms. A production search engine is one of the most complex software systems ever built, but in an interview you can decompose it into four well-defined subsystems: crawling, indexing, ranking, and serving.
Requirements
- Crawl and index the web (assume ~10 B pages, re-crawl every 30 days).
- Return top-10 relevant results for a query in < 200 ms.
- Support autocomplete / query suggestions.
- Handle 100,000 queries per second at peak.
- Keep index fresh: new pages indexed within 24 hours.
- Support phrase search, boolean operators, and site: filter.
Capacity Estimation
| Metric | Estimate |
|---|---|
| Indexed pages | 10 B |
| Avg page size (text after stripping HTML) | 20 KB |
| Total index size (raw text) | ~200 TB |
| Inverted index size (~3x compression) | ~60 TB |
| Query QPS | 100,000/sec |
| New pages/day to index | ~300 M |
| Crawl throughput needed | ~4,000 pages/sec (10 B pages / 30-day re-crawl) |
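The table's figures follow from a few lines of arithmetic. A quick sanity check (back-of-envelope, using the assumptions above):

```python
# Back-of-envelope check of the capacity estimates above.
pages = 10e9                          # indexed pages
page_kb = 20                          # avg text per page after stripping HTML
raw_tb = pages * page_kb / 1e9        # KB -> TB
index_tb = raw_tb / 3                 # ~3x compression in the inverted index
recrawl_qps = pages / (30 * 86400)    # re-crawl every page every 30 days

print(f"raw text: ~{raw_tb:.0f} TB")               # ~200 TB
print(f"inverted index: ~{index_tb:.0f} TB")       # ~67 TB
print(f"crawl rate: ~{recrawl_qps:.0f} pages/sec") # ~3,858 pages/sec
```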
High-Level Architecture
Web Crawler
The crawler is a distributed system with three components: a URL Frontier (priority queue of URLs to crawl), Fetcher workers (download HTML, respect `robots.txt`), and a Link Extractor (parse HTML, extract new URLs, feed back to frontier).
- Politeness: add domain-level rate limiting — crawl the same domain at most once per second. Use a per-domain queue with a delay.
- Deduplication: use a Bloom filter (false-positive OK) to avoid recrawling already-seen URLs. For 10 B URLs at a ~1% false-positive rate (~10 bits per URL), the filter needs ~12 GB of memory.
- Priority: prioritize high-PageRank pages and frequently updated sites (news, Wikipedia). Use a priority queue with scores.
- DNS caching: cache DNS lookups per domain for 5 minutes to avoid DNS becoming a bottleneck.
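The frontier's politeness and dedup logic can be sketched in a few lines. This is a minimal single-process illustration (class and method names are illustrative, not a real crawler API): each domain gets its own FIFO queue, and a heap of `(next_allowed_fetch_time, domain)` entries enforces the one-request-per-second rule. A plain `set` stands in for the Bloom filter, which you would use at real scale.

```python
import heapq
import time
from collections import defaultdict, deque

CRAWL_DELAY = 1.0  # seconds between requests to the same domain

class URLFrontier:
    def __init__(self):
        self.domain_queues = defaultdict(deque)  # domain -> URLs waiting
        self.ready = []                          # heap of (next_fetch_time, domain)
        self.seen = set()                        # stand-in for a Bloom filter

    def add(self, url: str) -> None:
        if url in self.seen:                     # dedup before enqueueing
            return
        self.seen.add(url)
        domain = url.split("/")[2]               # crude netloc extraction
        if not self.domain_queues[domain]:       # domain not scheduled yet
            heapq.heappush(self.ready, (time.monotonic(), domain))
        self.domain_queues[domain].append(url)

    def next_url(self):
        """Return the next URL whose domain may be fetched now, else None."""
        while self.ready:
            next_time, domain = self.ready[0]
            if next_time > time.monotonic():
                return None                      # every domain is cooling down
            heapq.heappop(self.ready)
            queue = self.domain_queues[domain]
            if not queue:
                continue
            url = queue.popleft()
            if queue:                            # re-schedule after the delay
                heapq.heappush(self.ready,
                               (time.monotonic() + CRAWL_DELAY, domain))
            return url
        return None
```

In a distributed crawler the same idea is partitioned: URLs are routed to fetcher workers by domain hash, so each domain's delay is enforced by exactly one worker.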
Inverted Index
The inverted index maps each term to a posting list: the sorted list of document IDs (and metadata) containing that term. It is the core data structure of every search engine.
Inverted Index (simplified):
"python" → [(docId=42, tf=5, pos=[12,34,67]), (docId=101, tf=2, pos=[3,19]), ...]
"design" → [(docId=7, tf=1, pos=[4]), (docId=42, tf=3, pos=[1,5,89]), ...]
"system" → [(docId=7, tf=8, pos=[2,6,...]), ...]
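A toy in-memory version of this structure, built from raw documents (a sketch for illustration, not a production indexer), keeps the same posting layout: doc ID, term frequency, and term positions, with posting lists sorted by doc ID:

```python
from collections import defaultdict

def build_index(docs: dict[int, str]):
    """term -> [(doc_id, tf, positions), ...], posting lists sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

docs = {
    7: "system design of a large system",
    42: "python system design notes",
}
index = build_index(docs)
print(index["system"])  # [(7, 2, [0, 5]), (42, 1, [1])]
```

The positions make phrase search possible: "system design" matches a document only if some position of "design" is exactly one greater than a position of "system".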
To answer query "python system design":
1. Look up posting lists for each term
2. Intersect docId lists (AND query)
3. Score each document: BM25(tf, idf, doc_length)
4. Re-rank with PageRank signal
5. Return top 10
Ranking: BM25 + PageRank
Two signals dominate search ranking: BM25 (relevance of the document to the query based on term frequency and inverse document frequency) and PageRank (authority of the page based on inbound links). Modern search uses learned ranking models (LambdaRank, neural IR), but BM25 + PageRank is the canonical interview answer.
| Signal | Measures | Computed |
|---|---|---|
| TF (Term Frequency) | How often the term appears in the doc | At index time |
| IDF (Inverse Doc Frequency) | How rare the term is across all docs | At index time |
| BM25 Score | Relevance of doc to query (combines TF, IDF, doc length) | At query time |
| PageRank | Authority based on inbound link graph | Offline batch (Spark) |
| Final Score | Weighted combination of BM25 + PageRank + freshness + clicks | At query time |
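The scoring steps above can be sketched directly. This is a hedged illustration: `k1`, `b`, and the PageRank weight `w_pr` are common illustrative defaults, not tuned production values, and real engines learn the blend weights from click data.

```python
import math

def bm25(tf: int, df: int, doc_len: int, avg_len: float, n_docs: int,
         k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term to one document."""
    # IDF: rare terms (small df) score higher than common ones.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # TF saturation with length normalization: long docs are penalized via b.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm

def final_score(bm25_score: float, pagerank: float, w_pr: float = 0.2) -> float:
    """Simple linear blend of relevance and authority (weights illustrative)."""
    return (1 - w_pr) * bm25_score + w_pr * pagerank
```

Note the division of labor: TF, IDF, and doc lengths are stored at index time, so the query-time work is just this arithmetic over the intersected posting lists.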
Query Serving
Autocomplete
Autocomplete returns the top-K completions for a query prefix in < 50 ms. The two main approaches are a Trie (prefix tree) and a database-backed approach with prefix indexes. For 100 M common queries, a Trie fits in RAM (~5 GB). Each node stores the top-10 completions pre-sorted by search frequency, updated daily by a batch job that processes query logs.
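A minimal version of the pre-computed-top-K trie (a sketch, with K=3 instead of 10 for brevity): because each node caches its best completions, a lookup costs O(len(prefix)) with no subtree traversal at query time.

```python
K = 3  # top-K completions cached per node (top-10 in the design above)

class TrieNode:
    __slots__ = ("children", "top")
    def __init__(self):
        self.children = {}
        self.top = []  # [(query, frequency)], sorted by frequency desc

class Autocomplete:
    def __init__(self):
        self.root = TrieNode()

    def add(self, query: str, freq: int) -> None:
        """Insert a query; every prefix node keeps only its K best completions."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top.append((query, freq))
            node.top.sort(key=lambda item: -item[1])
            del node.top[K:]

    def suggest(self, prefix: str) -> list[str]:
        """Walk to the prefix node and return its cached completions."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for q, _ in node.top]
```

In production the trie is rebuilt (or incrementally updated) by the daily batch job over query logs, then shipped to the serving fleet as an immutable snapshot.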
Scaling Considerations
- Index sharding: shard the inverted index by term hash. Each shard holds a subset of the term dictionary. Queries fan out to all shards in parallel.
- Index replication: replicate each shard for fault tolerance and read parallelism.
- Caching: cache results for the top 10% of queries (which account for ~90% of traffic) in Redis with a 5-minute TTL.
- Incremental indexing: new pages are indexed into a small 'delta' index served alongside the main index. Merged periodically in a Spark job.
- PageRank computation: run as a Spark GraphX job over the entire link graph nightly. Store scores in a key-value store read by the ranker.
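The PageRank job itself is power iteration over the link graph. A single-process sketch on a toy graph (at web scale this runs distributed, as noted above; the damping factor 0.85 is the conventional choice):

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    """Power iteration: rank flows along out-links, damped by d."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}   # teleport mass
        for src, outs in links.items():
            if not outs:                        # dangling node: spread evenly
                for p in pages:
                    new[p] += d * rank[src] / n
            else:
                for dst in outs:
                    new[dst] += d * rank[src] / len(outs)
        rank = new
    return rank

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(graph)
# "a" receives links from both b and c, so it ends with the highest score
```

Ranks always sum to 1, so they can be stored as-is in the key-value store and blended into the final score without renormalization.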
Interview Tip
Don't try to design all of Google in 45 minutes. Focus on: (1) the inverted index data structure and how queries traverse it, (2) the two-phase query pipeline (BM25 scoring on index nodes, re-ranking at the query service), and (3) the crawler's politeness and deduplication mechanisms. These three topics cover 80% of what interviewers probe.
Practice this pattern
Design a web search engine like Google