Split Block Bloom Filter
A Bloom filter is a fantastic data structure for cheap set-membership checks. But what if I told you that you are leaving a significant amount of performance on the table?
CPU Cache performance is a blind spot for many developers, especially in data-heavy and AI-adjacent systems. To understand where performance is lost, we need to briefly look at how modern CPUs actually access memory.
CPU Cache Hierarchy
Registers
├─ Latency: ~0.3 ns
├─ Cycles: 1
└─ Scope: Single core
L1 Cache (Data / Instruction)
├─ Latency: ~0.8 – 1 ns
├─ Cycles: 3 – 5
└─ Size: ~32–64 KB per core
L2 Cache
├─ Latency: ~3 – 5 ns
├─ Cycles: 10 – 15
└─ Size: ~256 KB – 1 MB per core
L3 Cache (Last Level Cache)
├─ Latency: ~10 – 20 ns
├─ Cycles: 30 – 60
└─ Size: ~8 – 64 MB (shared)
Main Memory (DRAM)
├─ Latency: ~60 – 120 ns
├─ Cycles: 180 – 360
└─ Bandwidth-limited, NUMA-sensitive
Why Classic Bloom Filters Struggle
A classic Bloom filter, while elegant and memory-efficient, is fundamentally memory-access heavy:
- Each lookup requires k independent hash probes
- Each probe may touch a different cache line
- Under load, this results in frequent L3 and DRAM accesses
- Cache misses and branch mispredictions inflate tail latency (p99)
In practice, the filter becomes memory-bound, not CPU-bound.
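The access pattern behind those bullet points can be sketched as follows. This is a minimal classic Bloom filter, using the standard double-hashing trick (h1 + i*h2) to stand in for k separate hash functions; the point to notice is that each of the k probes can land anywhere in the bit array, i.e. in a different cache line.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal classic Bloom filter: k probes scattered across the whole array.
class ClassicBloom {
    std::vector<std::uint64_t> bits_;
    std::size_t nbits_;
    int k_;

public:
    ClassicBloom(std::size_t nbits, int k)
        : bits_((nbits + 63) / 64, 0), nbits_(nbits), k_(k) {}

    void insert(std::uint64_t h1, std::uint64_t h2) {
        for (int i = 0; i < k_; ++i) {
            // Probe i lands anywhere in [0, nbits): potentially a fresh cache line.
            std::size_t bit = (h1 + static_cast<std::uint64_t>(i) * h2) % nbits_;
            bits_[bit / 64] |= 1ULL << (bit % 64);
        }
    }

    bool maybe_contains(std::uint64_t h1, std::uint64_t h2) const {
        for (int i = 0; i < k_; ++i) {
            std::size_t bit = (h1 + static_cast<std::uint64_t>(i) * h2) % nbits_;
            if (!(bits_[bit / 64] & (1ULL << (bit % 64)))) return false;
        }
        return true;  // all k bits set: "probably present"
    }
};
```

With k = 7 and a filter larger than L3, a single negative lookup can trigger up to 7 dependent-on-nothing but widely scattered loads, which is exactly the memory-bound behavior described above.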
Enter the Split Block Bloom Filter
A split block Bloom filter restructures the data layout to match CPU reality:
- Instead of scattering k probes across the entire bit array, each key hashes to exactly one small, fixed-size block
- Each block is sized to fit a single CPU cache line (typically 256 or 512 bits)
- All k bits for a key are set and tested inside that one block
- A lookup therefore costs at most one cache miss, and the in-block work is branch-light and SIMD-friendly
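This layout can be sketched in a few dozen lines. The design below follows the split-block scheme popularized by Apache Parquet's Bloom filter (256-bit blocks, eight 32-bit lanes, one bit per lane); the salt constants and the `splitmix64` mixer here are illustrative choices, and a tuned implementation would round the block count to a power of two to replace the modulo with a mask.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One block = 256 bits = eight 32-bit lanes, small enough to sit in a
// single cache line on most modern CPUs.
class SplitBlockBloom {
    // Odd multiplier salts, one per lane (illustrative constants in the
    // style of Parquet's implementation; any fixed odd constants work).
    static constexpr std::uint32_t kSalts[8] = {
        0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
        0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

    struct Block { std::uint32_t w[8]; };
    std::vector<Block> blocks_;  // value-initialized: all bits start at zero

    // Derive one bit position per lane from the low 32 bits of the hash:
    // the top 5 bits of (h * salt) select a bit 0..31 in that lane.
    static void mask(std::uint32_t h, std::uint32_t out[8]) {
        for (int i = 0; i < 8; ++i) out[i] = 1u << ((h * kSalts[i]) >> 27);
    }

public:
    explicit SplitBlockBloom(std::size_t num_blocks)  // must be > 0
        : blocks_(num_blocks) {}

    void insert(std::uint64_t hash) {
        // High hash bits pick the one block every probe will stay inside.
        Block& b = blocks_[(hash >> 32) % blocks_.size()];
        std::uint32_t m[8];
        mask(static_cast<std::uint32_t>(hash), m);
        for (int i = 0; i < 8; ++i) b.w[i] |= m[i];
    }

    bool maybe_contains(std::uint64_t hash) const {
        const Block& b = blocks_[(hash >> 32) % blocks_.size()];
        std::uint32_t m[8];
        mask(static_cast<std::uint32_t>(hash), m);
        for (int i = 0; i < 8; ++i)
            if ((b.w[i] & m[i]) == 0) return false;  // still one cache line
        return true;
    }
};

// A 64-bit finalizer (splitmix64) to turn raw keys into well-mixed hashes.
static std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}
```

Because all eight probes hit the same 32-byte block, a lookup touches at most one cache line, and the fixed lane structure maps directly onto a single AVX2 compare in vectorized implementations.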