notes

Experiment 057

Capstone: the best stacked implementation

Synthesis of the project (entries 001–056). Cohere v3 1M × 1024, cosine, single 8-core box (Granite, Xeon 6975P-C). The question this whole project asked: how fast can exact-ish top-k search go inside one IVF cell, and what's the best way to build and serve it?

The winner

A 1-bit binary funnel — rotation + residual + tiling — served by the carousel.

Headline numbers (in-RAM, 8 cores, 1M × 1024)

objectiverecall@10QPSp50
max recall0.998688310.6 ms
balanced0.99529638.0 ms
predictable tail (p99≈p50)0.995~35012 ms (p99 16 ms)
serving envelope0.998no-cliff to ~94012→49 ms

Footprint: 128 MB of codes in RAM (+ f32 for rerank, RAM or SSD).

The full frontier — everything we tried, and where it landed

Recall vs QPS vs bytes/vector (Cohere 1M, recall@10 with rerank):

methodbytes/vecrecallQPSverdict
binary funnel (rot+residual)1280.995–0.999880–960winner (QPS)
PQ M=64 (054)640.999862footprint, slow
PQ M=32 (054)320.992140footprint, slow
PQ M=16 (054)160.905273footprint, slow
OPQ M=16 (056)16~0.95 (+2.3 vs PQ)(=PQ)better bytes, same slow
ITQ rotation (055)(binary)+1.5 vs random(=binary)tiny recall knob
HNSW-in-cell (043)graph≤0.98≪ funnel @ high-Dloses in-cell
exact / PDX (032/044, cut)f321.09–55exact tier only

Two Pareto frontiers, and they point opposite ways:

The one physical reason it all comes out this way

Popcount beats gather. The binary scan's per-doc op is count_ones → VPOPCNTDQ (8 u64/instruction, autovectorized). Every alternative's per-doc op is a data-dependent gather — PQ/OPQ's ADC table lookups, the asymmetric LUT (011), int8 reranking — and gathers don't vectorize. So even when PQ uses 8× fewer bytes and fits in cache, it's gather-compute-bound and loses to popcount. This single fact (first seen in 011, reconfirmed at scale in 050/054) decides the whole frontier: on a CPU, 1-bit + popcount is the throughput-optimal representation, and "fewer bytes" only helps the footprint, not the speed, unless you write SIMD ADC (parked, 050).

What each knob actually contributes

knobaxiseffect
binary funnel (009)the engine1-bit scan + rerank, popcount-speed
tiling, tile=8 (016/038)throughput+73% QPS (amortize base read)
rotation ×2 (026)recallfree, spreads info across bits
residual (046)recallfree, +3–9 pts at low bits (biggest recall lever)
ITQ (055)recall+1.5 pts over random rotation (small, dense-rotation cost)
carousel (039–041)tail latencyno cliff under burst
disk hybrid (045)RAM cost32× less committed RAM (when corpus > RAM)
adaptive-C (048)disk rerankkeeps SSD reads affordable
PQ/OPQ (054/056)footprint8–32 B/vec, but lose QPS

These are orthogonal axes (049's roofline: QPS ≈ 1.10e9·cores/N, separable from recall = 1 − e^12.45·N^0.28·C^−1.08·bits^−1.97). You tune recall (rotation/residual/ bits/C) without moving the throughput roofline.

Verdict

The best stacked implementation is the rotated-residual binary funnel served by the carousel: ~0.995 recall @ ~960 QPS @ 8 ms, no-cliff serving to ~940 QPS, 128 MB codes, on one 8-core box. The PQ/ITQ/OPQ exploration confirms it rather than beating it — PQ/OPQ are the footprint alternatives (use when RAM-bound), ITQ a marginal recall nudge. The throughput crown belongs to popcount.

Loose ends (all optional)