Experiment 035
Capstone: training-free research arc (031–035)
Perf record: 035-research-capstone-2.json.
Cohere v3, 1M × 1024, cosine, Granite box.
The principle (stated, then confirmed)
"I don't think you're going to beat binarization QPS, so anything novel must be stacked on this principle — value-added."
Correct, and the data proves it. Binarization (1 bit/dim + popcount + tiling) is the throughput floor of this engine — 851 QPS. Every training-free technique we researched either (a) stacks on the binary funnel and adds value, or (b) lives in a different Pareto region (exact / very-high recall) that is useful but never competitive on QPS.
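For reference, the funnel shape described above (1 bit/dim, popcount scan, exact rerank of the top C) can be sketched in a few lines. This is a minimal pure-Python illustration, not the engine's kernel: the function names (`binarize`, `hamming`, `funnel_search`, `exact_search`) are hypothetical, codes are packed into Python ints rather than SIMD words, and tiling is omitted entirely.

```python
def binarize(vec):
    """Pack the sign bits of a vector into one int (1 bit per dimension)."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then popcount."""
    return bin(a ^ b).count("1")

def cosine(u, v):
    """Exact cosine similarity (assumes neither vector is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(y * y for y in v) ** 0.5
    return dot / (nu * nv)

def funnel_search(query, vectors, codes, k=10, c=100):
    """Stage 1: rank all codes by Hamming distance and keep the best c.
    Stage 2: rerank those c candidates by exact cosine, return the top k ids."""
    qcode = binarize(query)
    candidates = sorted(range(len(codes)),
                        key=lambda i: hamming(qcode, codes[i]))[:c]
    candidates.sort(key=lambda i: cosine(query, vectors[i]), reverse=True)
    return candidates[:k]

def exact_search(query, vectors, k=10):
    """Brute-force baseline: rank every vector by exact cosine."""
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]
```

With `c` equal to the corpus size the funnel degenerates to the exact scan; shrinking `c` trades recall for less rerank work, which is the C knob in the frontier table.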
Consolidated frontier (this session's measured points)
| tier | config | recall@10 | QPS |
|---|---|---|---|
| binary value-add | rotated + tiled, C=1000 (029) | 0.996 | 845 |
| binary value-add | rotated + tiled, C=2000 (029) | 0.999 | 730 |
| binary value-add | binads C=4000, ADSampling rerank (034, untiled) | 0.999 | 463 |
| exact accelerator | PDX + ADSampling (033) | 0.999 | 37 |
| exact accelerator | PDX plain (032) | 1.000 | 18 |
| exact baseline | naive brute force | 1.000 | 9 |
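The rows above can be checked for Pareto dominance and for speedup over the naive baseline with a short script. This is a sketch over the numbers exactly as printed in the table; the helper names (`dominates`, `pareto_front`, `speedup_over`) are illustrative, not part of the perf record.

```python
# (config, recall@10, QPS) copied verbatim from the table above.
POINTS = [
    ("rotated + tiled, C=1000", 0.996, 845),
    ("rotated + tiled, C=2000", 0.999, 730),
    ("binads C=4000, ADSampling rerank", 0.999, 463),
    ("PDX + ADSampling", 0.999, 37),
    ("PDX plain", 1.000, 18),
    ("naive brute force", 1.000, 9),
]

def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on one."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def speedup_over(baseline, points):
    """QPS ratio of every config relative to the named baseline row."""
    base_qps = next(p[2] for p in points if p[0] == baseline)
    return {p[0]: p[2] / base_qps for p in points}

if __name__ == "__main__":
    for label, recall, qps in pareto_front(POINTS):
        print(f"{label}: recall@10={recall:.3f}, QPS={qps}")
    for label, ratio in speedup_over("naive brute force", POINTS).items():
        print(f"{label}: {ratio:.1f}x naive")
```

Rows are compared at the table's three-decimal precision, so ties at 0.999 resolve on QPS alone; any recall difference hidden by rounding is invisible to this check. The binads row is also untiled, so its position relative to the tiled rows is not apples-to-apples.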
What each technique did (all training-free)
| entry | technique (paper) | outcome |
|---|---|---|
| 031 | ADSampling (SIGMOD'23) | 2.16× exact scan @ 0.999, but bandwidth-limited on row-major |
| 032 | PDX layout (SIGMOD'25) | 2× exact scan, 3× lower latency — free, autovectorized |
| 033 | PDX + ADSampling | 4× naive exact @ 0.999 — layout restores the pruning speedup |
| 034 | ADSampling rerank on the binary funnel | +5–10% funnel QPS at high C, near-lossless |
| (026–029) | RaBitQ/ITQ rotation on binary | free recall; +64% effective via the Pareto shift |
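To make the 032/033 rows concrete, here is a rough sketch of how a dimension-major block scan and ADSampling-style pruning fit together. Everything here is an assumption for illustration: the names (`to_dimension_major`, `scan_block`, `knn`), the block size, the check interval `step`, and the pruning rule, which follows the ADSampling test as I understand it (prune when the partial squared distance over the first d of D dimensions, scaled by D/d, exceeds (1 + eps0/sqrt(d))^2 times the current k-th best squared distance, with eps0 = 2.1). It also assumes the vectors and query have already been randomly rotated, and it is pure Python, so it shows the control flow only, not the autovectorized speed.

```python
import heapq
import math

def to_dimension_major(block):
    """Transpose a block of row vectors so each dimension is contiguous."""
    return [list(col) for col in zip(*block)]

def scan_block(query, columns, r_sq, eps0=2.1, step=4):
    """Accumulate squared L2 distance one dimension at a time over a block.
    Every `step` dimensions, drop vectors whose scaled partial distance
    already exceeds the ADSampling-style bound for threshold r_sq.
    Returns {local_id: exact squared distance} for the survivors."""
    total_dims = len(columns)
    partial = {i: 0.0 for i in range(len(columns[0]))}
    for d in range(total_dims):
        col, q = columns[d], query[d]
        for i in partial:  # inner loop runs over vectors, not dimensions
            diff = col[i] - q
            partial[i] += diff * diff
        seen = d + 1
        if seen % step == 0 and seen < total_dims and r_sq < math.inf:
            bound = (1.0 + eps0 / math.sqrt(seen)) ** 2 * r_sq * seen / total_dims
            partial = {i: p for i, p in partial.items() if p <= bound}
    return partial

def knn(query, vectors, k, block_size=64, **scan_kwargs):
    """k nearest neighbours by squared L2, scanning block by block.
    The pruning threshold is the current k-th best distance (inf until
    k results exist). Returns a sorted list of (squared distance, id)."""
    heap = []  # max-heap via negated distances: (-dist_sq, id)
    for start in range(0, len(vectors), block_size):
        columns = to_dimension_major(vectors[start:start + block_size])
        r_sq = -heap[0][0] if len(heap) == k else math.inf
        for i, dist in scan_block(query, columns, r_sq, **scan_kwargs).items():
            if len(heap) < k:
                heapq.heappush(heap, (-dist, start + i))
            elif dist < -heap[0][0]:
                heapq.heapreplace(heap, (-dist, start + i))
    return sorted((-neg, i) for neg, i in heap)
```

With a very large `eps0` nothing is pruned and the result is the exact top k; at the default, a vector can occasionally be pruned wrongly, which is consistent with the 0.999 recall reported for the pruned variants rather than 1.000.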
The synthesis
Two clean conclusions, both consistent with the whole project:
- The value-adds that matter stack on binary. The random rotation (026–029, from RaBitQ) gave free recall and reshaped the frontier; ADSampling rerank (034) makes the high-recall end of the funnel cheaper. Neither changes the fact that the popcount binary scan + tiling is the throughput engine; both let it reach higher recall at a given QPS. That's the right kind of win.
- The exact accelerators (PDX, ADSampling, PDX+ADSampling) are a separate tier. 4× over brute force is a real result and genuinely useful when you need recall beyond what the binary funnel reaches (it tops out around 0.999 in these runs), but at ~20–40 QPS they are not, and were never going to be, in the binary funnel's league. Knowing where a technique lives on the Pareto frontier is the point: PDX/ADSampling are the exact tier's accelerators, binary+rotation+tiling is the approximate tier's engine, and they serve different SLAs.
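The rotation step from the first conclusion is simple enough to sketch: multiply by a random orthogonal matrix, then take sign bits. The version below is a toy, assuming a dense matrix built by Gram-Schmidt on Gaussian rows; the names (`random_orthogonal`, `rotate`, `sign_bits`) are hypothetical and say nothing about how 026–029 actually generated or applied the rotation. The usual rationale is that a rotation preserves angles and norms while spreading energy across dimensions, so each sign bit carries a more comparable share of the information.

```python
import random

def random_orthogonal(dim, seed=0):
    """Random orthogonal matrix (list of rows) via Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:  # remove components along the rows accepted so far
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-9:  # skip the (vanishingly rare) degenerate draw
            rows.append([a / norm for a in v])
    return rows

def rotate(matrix, vec):
    """Apply the rotation: one dot product per output dimension."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def sign_bits(vec):
    """1 bit per dimension: 1 where the coordinate is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]

if __name__ == "__main__":
    R = random_orthogonal(8, seed=42)
    spiky = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # all energy in one dim
    print(sign_bits(spiky))             # only one informative bit
    print(sign_bits(rotate(R, spiky)))  # energy now spread over all 8 dims
```

The same matrix must be applied to both the indexed vectors and every query, so the rotation is a one-time cost at index build plus one matrix-vector product per query.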
The deployable recommendation is unchanged and now sharper: rotated binary + tiling for throughput (0.996 @ 845 / 0.999 @ 730), with ADSampling rerank when over-retrieving heavily (large C) to push recall, and PDX(+ADSampling) only when the SLA demands near-exact results.
Open (still the highest-leverage, but harder)
The fast SIMD vpshufb/LUT kernel (parked since 011/028) is the one unbuilt piece
that would let RaBitQ's superior recall-per-bit (028) run at popcount speed — the
only remaining lever that could move the binary tier itself rather than stack
beside it. Everything tested here was training-free and CPU-portable.