Bloom Filters & HyperLogLog
Probabilistic data structures for scale: Bloom filters for membership testing, counting Bloom filters, HyperLogLog for cardinality estimation, and practical applications.
Why Probabilistic Data Structures?
At large scale, exact answers are expensive. Storing every URL seen by a web crawler to check for duplicates might require terabytes of memory. Counting distinct users across billions of events is impractical with a hash set. Probabilistic data structures trade a small, bounded error rate for dramatic reductions in memory and CPU usage — often by several orders of magnitude.
Bloom Filters
A Bloom filter is a space-efficient probabilistic data structure that answers membership queries: 'is element X in the set?' It can produce false positives (saying X is in the set when it is not) but never false negatives (if it says X is not in the set, X is definitely not there). This asymmetry is exploitable in many systems.
A Bloom filter consists of a bit array of size m (initially all zeros) and k hash functions. To add an element, hash it with all k functions and set those bit positions to 1. To query, hash it with all k functions and check if all those bit positions are 1 — if any are 0, the element is definitely not in the set.
False Positive Rate
The false positive rate depends on the bit array size m, the number of hash functions k, and the number of inserted elements n; it is approximately `(1 - e^(-kn/m))^k`. Practically: with m=10 bits/element and the optimal k=7 hash functions, the false positive rate is about 1%. Doubling to 20 bits/element (with k=14) drops it to roughly 0.01%.
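The quoted figures follow directly from that formula. A quick check (the `bloomFpr` helper name is ours, not a standard API):

```typescript
// False positive rate for a Bloom filter storing n items in m bits
// with k hash functions: (1 - e^(-k·n/m))^k, expressed via bits per item = m/n.
function bloomFpr(bitsPerItem: number, k: number): number {
  return (1 - Math.exp(-k / bitsPerItem)) ** k;
}

console.log(bloomFpr(10, 7));  // ≈ 0.0082 — about 1%
console.log(bloomFpr(20, 14)); // ≈ 0.000067 — about 0.01%
```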
```typescript
class BloomFilter {
  private bits: Uint8Array;
  private m: number; // bit array size
  private k: number; // number of hash functions

  constructor(expectedItems: number, falsePositiveRate: number) {
    // Optimal sizing: m = -n·ln(p) / (ln 2)², k = (m/n)·ln 2
    this.m = Math.ceil(-(expectedItems * Math.log(falsePositiveRate)) / Math.LN2 ** 2);
    this.k = Math.round((this.m / expectedItems) * Math.LN2);
    this.bits = new Uint8Array(Math.ceil(this.m / 8));
  }

  add(item: string): void {
    for (const pos of this.hashPositions(item)) {
      this.bits[Math.floor(pos / 8)] |= 1 << (pos % 8);
    }
  }

  mightContain(item: string): boolean {
    return this.hashPositions(item).every(pos =>
      ((this.bits[Math.floor(pos / 8)] >> (pos % 8)) & 1) !== 0
    );
  }

  private hashPositions(item: string): number[] {
    // In production: use MurmurHash3 with different seeds
    return Array.from({ length: this.k }, (_, i) =>
      Math.abs(this.simpleHash(item, i)) % this.m
    );
  }

  private simpleHash(s: string, seed: number): number {
    let h = seed;
    for (const c of s) h = Math.imul(h ^ c.charCodeAt(0), 0x9e3779b9);
    return h;
  }
}

// Usage: avoid DB lookups for definitely-absent keys
const filter = new BloomFilter(1_000_000, 0.01); // 1M items, 1% FPR
// ~1.2 MB vs ~32 MB for a hash set of 1M strings
```

Real-World Bloom Filter Usage
| System | Bloom Filter Use |
|---|---|
| Apache Cassandra | Each SSTable has a Bloom filter; avoids disk reads for keys not in that SSTable |
| Google Bigtable | Per-tablet Bloom filter reduces disk seeks for non-existent rows |
| PostgreSQL | The `bloom` extension provides a Bloom filter-based index type for multi-column equality lookups |
| Web browsers | Safe Browsing API: local Bloom filter for malicious URLs; server query only on positive |
| Akamai CDN | One-hit-wonder detection: only cache objects seen at least twice (avoids cache pollution) |
Counting Bloom Filters
Standard Bloom filters do not support deletion — you cannot unset bits because multiple elements may share a bit position. Counting Bloom filters replace each bit with a small counter (typically 4 bits). Increment on add, decrement on delete. This supports deletion at the cost of 4x memory. Used in network routers for flow counting.
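A minimal sketch of the idea, in the same style as the Bloom filter above (one byte per counter for readability; real implementations pack 4-bit saturating counters, and the multiply-xor hash here is illustrative, not production-grade):

```typescript
class CountingBloomFilter {
  private counters: Uint8Array; // one counter per slot instead of one bit

  constructor(private m: number, private k: number) {
    this.counters = new Uint8Array(m);
  }

  private positions(item: string): number[] {
    return Array.from({ length: this.k }, (_, seed) => {
      let h = seed;
      for (const c of item) h = Math.imul(h ^ c.charCodeAt(0), 0x9e3779b9);
      return Math.abs(h) % this.m;
    });
  }

  add(item: string): void {
    for (const pos of this.positions(item))
      if (this.counters[pos] < 255) this.counters[pos]++; // saturate, never wrap
  }

  remove(item: string): void {
    // Only safe for items actually added; deleting a non-member corrupts the filter.
    for (const pos of this.positions(item))
      if (this.counters[pos] > 0) this.counters[pos]--;
  }

  mightContain(item: string): boolean {
    return this.positions(item).every(pos => this.counters[pos] > 0);
  }
}
```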
HyperLogLog
HyperLogLog (HLL) estimates the cardinality (count of distinct elements) of a multiset using a tiny, fixed amount of memory. The key insight: hash every element, and the maximum number of leading zeros observed across the hashes estimates the cardinality — seeing a hash with k leading zeros suggests roughly 2^k distinct elements. A single maximum has huge variance, so HLL uses the first p bits of each hash to select one of 2^p registers, each register keeping the maximum leading-zero count seen in the remaining bits; averaging across registers brings the standard error down to `1.04 / sqrt(2^p)`.
- 12 KB memory estimates cardinality up to billions with ~0.81% standard error (p=14).
- HLL++ (Google): improved HLL with bias correction for small and large cardinalities. Used in BigQuery.
- Redis HyperLogLog: `PFADD key element...` / `PFCOUNT key`. Uses ~12 KB per key, 0.81% standard error.
- Mergeability: two HLL sketches can be merged via element-wise max — enabling distributed counting across shards without communication overhead.
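The register mechanism and merge step can be sketched in a few dozen lines. This is a simplified illustration, not Redis's or Google's implementation: the class name, the default p=12, and the FNV-1a hash with a murmur3-style finalizer are our choices, and only the small-range linear-counting correction is included (none of HLL++'s other bias corrections):

```typescript
// 32-bit avalanche finalizer (murmur3 fmix32) to spread FNV-1a output bits.
function fmix32(h: number): number {
  h ^= h >>> 16; h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13; h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

function hash32(s: string): number {
  let h = 0x811c9dc5; // FNV-1a offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return fmix32(h >>> 0);
}

class HyperLogLog {
  private registers: Uint8Array;
  private m: number; // number of registers = 2^p

  constructor(private p: number = 12) {
    this.m = 1 << p;
    this.registers = new Uint8Array(this.m);
  }

  add(item: string): void {
    const h = hash32(item);
    const idx = h >>> (32 - this.p);      // first p bits pick a register
    const rest = (h << this.p) >>> 0;     // remaining 32-p bits
    const rank = rest === 0 ? 32 - this.p + 1 : Math.clz32(rest) + 1;
    if (rank > this.registers[idx]) this.registers[idx] = rank;
  }

  count(): number {
    let sum = 0, zeros = 0;
    for (const r of this.registers) {
      sum += 2 ** -r;
      if (r === 0) zeros++;
    }
    const alpha = 0.7213 / (1 + 1.079 / this.m);
    let e = (alpha * this.m * this.m) / sum; // harmonic-mean raw estimate
    if (e <= 2.5 * this.m && zeros > 0) e = this.m * Math.log(this.m / zeros); // linear counting
    return Math.round(e);
  }

  merge(other: HyperLogLog): void {
    // Element-wise max = sketch of the union of the two underlying sets.
    for (let i = 0; i < this.m; i++)
      this.registers[i] = Math.max(this.registers[i], other.registers[i]);
  }
}
```

Usage: p=12 gives 4096 registers (~4 KB here) and ~1.6% standard error; adding 100k distinct strings and calling `count()` should land within a few percent of 100,000, and merging two shards' sketches estimates the cardinality of their union.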
| Data Structure | Use Case | Memory (1M items) | Error Rate |
|---|---|---|---|
| Hash Set | Exact membership | ~32 MB | 0% (exact) |
| Bloom Filter | Membership testing | ~1.2 MB (1% FPR) | 1% false positives, 0% false negatives |
| Counting Bloom Filter | Membership + deletion | ~4.8 MB | ~1% false positives |
| HyperLogLog | Cardinality estimation | 12 KB (fixed!) | ~0.81% standard error |
Interview Tip
Bloom filters come up in storage system design questions ('how does Cassandra avoid disk reads for missing keys?'). HyperLogLog comes up in analytics questions ('how does YouTube count unique video views without storing every user ID?'). The key interview point for both: these structures are space-optimal approximations with mathematically bounded error. Mention Redis natively supports HyperLogLog (`PFADD`, `PFCOUNT`), and Cassandra uses Bloom filters per-SSTable to avoid unnecessary disk I/O.