notes

Experiment 009

Binary quantization + two-stage rerank (the Exa funnel)

Perf record: 009-binary-rerank.json. Cohere v3 Wikipedia, 1,000,000 × 1024-d, cosine (unit-norm → L2/dot), measured on the Granite Rapids box (memory-bound).

What we did

Binary quantization (1 bit/dim = sign), Hamming distance via XOR + popcount, then the two-stage funnel: binary scan → top-C candidates → exact f32 rerank → top-k. Swept C to get the recall curve. This is Exa's middle layer (concepts.md §L218, §L232).

The full curve

schemerecall@10QPSp50 latencymemory
f32 (exact)1.0009.5644 ms3906 MB
int8 (008)0.98319.8589 ms977 MB
binary, no rerank0.46861310 ms122 MB
binary + rerank C=200.6305838 ms122 MB
binary + rerank C=500.7985739 ms122 MB
binary + rerank C=1000.8876158 ms122 MB
binary + rerank C=2000.9426109 ms122 MB
binary + rerank C=5000.98359810 ms122 MB
binary + rerank C=10000.99458510 ms122 MB

Conclusions

  1. Without rerank, binary is unusable: recall 0.468. Sign-bit Hamming alone loses half the true neighbors — but it's blazing (613 QPS = 64× f32, 122 MB = 32× smaller). So binary alone = fast + tiny + wrong.

  2. Rerank is the whole trick, and it's nearly free. Recall climbs 0.47 → 0.89 (C=100) → 0.98 (C=500) → 0.99 (C=1000), while QPS barely moves (615 → 585 across C=10→1000). Exact-f32 rescoring a few hundred candidates costs almost nothing next to the binary scan over 1M. This is why the funnel works: stage-1 just has to get the true neighbors into the top-C pool; stage-2 fixes the order for free.

  3. Binary + rerank Pareto-dominates everything. At C=1000: recall 0.994 (≈ f32), 62× the throughput (585 vs 9.5 QPS), 65× lower latency (10 vs 644 ms), 32× less memory (122 MB vs 3.9 GB). It beats both f32 (quality-tied, vastly cheaper) and int8 (better on every axis). This is the headline result of the whole project: ~exact quality at a fraction of the cost.

  4. "What's a good C": the knee is ~500–1000. recall 0.98 at C=500, 0.99 at C=1000; diminishing after. And since QPS is flat across C (rerank is cheap), over-retrieving is nearly free — pick C for your recall target and stop worrying about its cost. (Confirms the earlier prediction with a curve instead of a guess.)

Where we are (001 → 009)

recallQPSmemoryvs f32
f32 exact1.0009.53906 MB
binary + rerank C=10000.994585122 MB62× faster, 32× smaller, ~same recall

The arc: exact baseline → diagnosed compute-bound → fixed the kernel (FMA) → flipped to memory-bound → quantization (int8 4×, binary 32×) → rerank recovers quality at no cost. The funnel lands ~exactly where Exa's design says it should.

Caveats