Experiment 057
Capstone: the best stacked implementation
Synthesis of the project (entries 001–056). Cohere v3 1M × 1024, cosine, single 8-core box (Granite, Xeon 6975P-C). The question this whole project asked: how fast can exact-ish top-k search go inside one IVF cell, and what's the best way to build and serve it?
The winner
A 1-bit binary funnel — rotation + residual + tiling — served by the carousel.
- Scan: sign-bit (1 bit/dim) Hamming codes,
count_ones→ VPOPCNTDQ, query-tiled. - Codes: random rotation (026) + residual / centroid-subtraction (046) — both free, both pure recall.
- Rerank: exact f32 on the top-C candidates.
- Serving: carousel (cooperative scan-sharing, fan-out per load) — bounded tail latency, no cliff.
Headline numbers (in-RAM, 8 cores, 1M × 1024)
| objective | recall@10 | QPS | p50 |
|---|---|---|---|
| max recall | 0.9986 | 883 | 10.6 ms |
| balanced | 0.9952 | 963 | 8.0 ms |
| predictable tail (p99≈p50) | 0.995 | ~350 | 12 ms (p99 16 ms) |
| serving envelope | 0.998 | no-cliff to ~940 | 12→49 ms |
Footprint: 128 MB of codes in RAM (+ f32 for rerank, RAM or SSD).
The full frontier — everything we tried, and where it landed
Recall vs QPS vs bytes/vector (Cohere 1M, recall@10 with rerank):
| method | bytes/vec | recall | QPS | verdict |
|---|---|---|---|---|
| binary funnel (rot+residual) | 128 | 0.995–0.999 | 880–960 | winner (QPS) |
| PQ M=64 (054) | 64 | 0.9998 | 62 | footprint, slow |
| PQ M=32 (054) | 32 | 0.992 | 140 | footprint, slow |
| PQ M=16 (054) | 16 | 0.905 | 273 | footprint, slow |
| OPQ M=16 (056) | 16 | ~0.95 (+2.3 vs PQ) | (=PQ) | better bytes, same slow |
| ITQ rotation (055) | (binary) | +1.5 vs random | (=binary) | tiny recall knob |
| HNSW-in-cell (043) | graph | ≤0.98 | ≪ funnel @ high-D | loses in-cell |
| exact / PDX (032/044, cut) | f32 | 1.0 | 9–55 | exact tier only |
Two Pareto frontiers, and they point opposite ways:
- recall-per-QPS → the binary funnel wins outright (10–15× the QPS of PQ at equal recall).
- recall-per-byte → PQ/OPQ win (32 B at 0.99 vs 128 B), but only matters when RAM is the hard constraint.
The one physical reason it all comes out this way
Popcount beats gather. The binary scan's per-doc op is count_ones → VPOPCNTDQ (8
u64/instruction, autovectorized). Every alternative's per-doc op is a data-dependent
gather — PQ/OPQ's ADC table lookups, the asymmetric LUT (011), int8 reranking — and
gathers don't vectorize. So even when PQ uses 8× fewer bytes and fits in cache, it's
gather-compute-bound and loses to popcount. This single fact (first seen in 011,
reconfirmed at scale in 050/054) decides the whole frontier: on a CPU, 1-bit + popcount
is the throughput-optimal representation, and "fewer bytes" only helps the footprint,
not the speed, unless you write SIMD ADC (parked, 050).
What each knob actually contributes
| knob | axis | effect |
|---|---|---|
| binary funnel (009) | the engine | 1-bit scan + rerank, popcount-speed |
| tiling, tile=8 (016/038) | throughput | +73% QPS (amortize base read) |
| rotation ×2 (026) | recall | free, spreads info across bits |
| residual (046) | recall | free, +3–9 pts at low bits (biggest recall lever) |
| ITQ (055) | recall | +1.5 pts over random rotation (small, dense-rotation cost) |
| carousel (039–041) | tail latency | no cliff under burst |
| disk hybrid (045) | RAM cost | 32× less committed RAM (when corpus > RAM) |
| adaptive-C (048) | disk rerank | keeps SSD reads affordable |
| PQ/OPQ (054/056) | footprint | 8–32 B/vec, but lose QPS |
These are orthogonal axes (049's roofline: QPS ≈ 1.10e9·cores/N, separable from
recall = 1 − e^12.45·N^0.28·C^−1.08·bits^−1.97). You tune recall (rotation/residual/
bits/C) without moving the throughput roofline.
Verdict
The best stacked implementation is the rotated-residual binary funnel served by the carousel: ~0.995 recall @ ~960 QPS @ 8 ms, no-cliff serving to ~940 QPS, 128 MB codes, on one 8-core box. The PQ/ITQ/OPQ exploration confirms it rather than beating it — PQ/OPQ are the footprint alternatives (use when RAM-bound), ITQ a marginal recall nudge. The throughput crown belongs to popcount.
Loose ends (all optional)
- Adaptive fan-out controller (sub-5 ms at light load) — only unbuilt engine piece.
- SIMD ADC (4-bit +
vpshufb) — the one thing that could make PQ QPS-competitive. - Above-us layer: build the IVF router to make this a full index.
web/writeup + CLAUDE.md thesis refresh.