Split Block Bloom Filter
A Bloom filter is a fantastic data structure for cheap set-membership checks. But what if I told you that you are leaving a significant amount of performance on the table?
CPU Cache performance is a blind spot for many developers, especially in data-heavy and AI-adjacent systems. To understand where performance is lost, we need to briefly look at how modern CPUs actually access memory.
CPU Cache Hierarchy
Registers
├─ Latency: ~0.3 ns
├─ Cycles: 1
└─ Scope: Single core
L1 Cache (Data / Instruction)
├─ Latency: ~0.8 – 1 ns
├─ Cycles: 3 – 5
└─ Size: ~32–64 KB per core
L2 Cache
├─ Latency: ~3 – 5 ns
├─ Cycles: 10 – 15
└─ Size: ~256 KB – 1 MB per core
L3 Cache (Last Level Cache)
├─ Latency: ~10 – 20 ns
├─ Cycles: 30 – 60
└─ Size: ~8 – 64 MB (shared)
Main Memory (DRAM)
├─ Latency: ~60 – 120 ns
├─ Cycles: 180 – 360
└─ Bandwidth-limited, NUMA-sensitive
Why Classic Bloom Filters Struggle
A classic Bloom filter, while elegant and memory-efficient, is fundamentally memory-access heavy:
- Each lookup requires k independent hash probes
- Each probe may touch a different cache line
- Under load, this results in frequent L3 and DRAM accesses
- Cache misses and branch mispredictions inflate tail latency (p99)
In practice, the filter becomes memory-bound, not CPU-bound.
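The access pattern behind those bullet points can be sketched as follows. This is a minimal classic Bloom filter, using the standard double-hashing trick (h1 + i*h2) to stand in for k separate hash functions; the point to notice is that each of the k probes can land anywhere in the bit array, i.e. in a different cache line.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal classic Bloom filter: k probes scattered across the whole array.
class ClassicBloom {
    std::vector<std::uint64_t> bits_;
    std::size_t nbits_;
    int k_;

public:
    ClassicBloom(std::size_t nbits, int k)
        : bits_((nbits + 63) / 64, 0), nbits_(nbits), k_(k) {}

    void insert(std::uint64_t h1, std::uint64_t h2) {
        for (int i = 0; i < k_; ++i) {
            // Probe i lands anywhere in [0, nbits): potentially a fresh cache line.
            std::size_t bit = (h1 + static_cast<std::uint64_t>(i) * h2) % nbits_;
            bits_[bit / 64] |= 1ULL << (bit % 64);
        }
    }

    bool maybe_contains(std::uint64_t h1, std::uint64_t h2) const {
        for (int i = 0; i < k_; ++i) {
            std::size_t bit = (h1 + static_cast<std::uint64_t>(i) * h2) % nbits_;
            if (!(bits_[bit / 64] & (1ULL << (bit % 64)))) return false;
        }
        return true;  // all k bits set: "probably present"
    }
};
```

With k = 7 and a filter larger than L3, a single negative lookup can trigger up to 7 dependent-on-nothing but widely scattered loads, which is exactly the memory-bound behavior described above.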
Enter the Split Block Bloom Filter
A split block Bloom filter restructures the data layout to match CPU reality:
- Instead of scattering k probes across the entire bit array, each key hashes to exactly one small, fixed-size block
- Each block is sized to fit a single CPU cache line (typically 256 or 512 bits)
- All k bits for a key are set and tested inside that one block
- A lookup therefore costs at most one cache miss, and the in-block work is branch-light and SIMD-friendly
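This layout can be sketched in a few dozen lines. The design below follows the split-block scheme popularized by Apache Parquet's Bloom filter (256-bit blocks, eight 32-bit lanes, one bit per lane); the salt constants and the `splitmix64` mixer here are illustrative choices, and a tuned implementation would round the block count to a power of two to replace the modulo with a mask.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One block = 256 bits = eight 32-bit lanes, small enough to sit in a
// single cache line on most modern CPUs.
class SplitBlockBloom {
    // Odd multiplier salts, one per lane (illustrative constants in the
    // style of Parquet's implementation; any fixed odd constants work).
    static constexpr std::uint32_t kSalts[8] = {
        0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
        0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

    struct Block { std::uint32_t w[8]; };
    std::vector<Block> blocks_;  // value-initialized: all bits start at zero

    // Derive one bit position per lane from the low 32 bits of the hash:
    // the top 5 bits of (h * salt) select a bit 0..31 in that lane.
    static void mask(std::uint32_t h, std::uint32_t out[8]) {
        for (int i = 0; i < 8; ++i) out[i] = 1u << ((h * kSalts[i]) >> 27);
    }

public:
    explicit SplitBlockBloom(std::size_t num_blocks)  // must be > 0
        : blocks_(num_blocks) {}

    void insert(std::uint64_t hash) {
        // High hash bits pick the one block every probe will stay inside.
        Block& b = blocks_[(hash >> 32) % blocks_.size()];
        std::uint32_t m[8];
        mask(static_cast<std::uint32_t>(hash), m);
        for (int i = 0; i < 8; ++i) b.w[i] |= m[i];
    }

    bool maybe_contains(std::uint64_t hash) const {
        const Block& b = blocks_[(hash >> 32) % blocks_.size()];
        std::uint32_t m[8];
        mask(static_cast<std::uint32_t>(hash), m);
        for (int i = 0; i < 8; ++i)
            if ((b.w[i] & m[i]) == 0) return false;  // still one cache line
        return true;
    }
};

// A 64-bit finalizer (splitmix64) to turn raw keys into well-mixed hashes.
static std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}
```

Because all eight probes hit the same 32-byte block, a lookup touches at most one cache line, and the fixed lane structure maps directly onto a single AVX2 compare in vectorized implementations.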