notes

Experiment 028

RaBitQ unbiased estimator: best recall-per-bit (slow scan)

Perf record: 028-rabitq-estimator.json. Cohere v3, 1M × 1024, cosine, Granite box. --quant rabitq (rotation auto, 2 rounds).

Research → implementation

The full RaBitQ method (not just the rotation of 026) replaces Hamming with an unbiased inner-product estimator. In the rotated space, for unit vectors:

⟨q, o⟩ ≈ ⟨q', code⟩ / ⟨code, o'⟩ = (2·Σ_{set bits} q'ᵢ − Σ q') / ‖o'‖₁

where q' is the rotated query (kept full precision), the code is the {±1/√D} sign vector, and ‖o'‖₁ is stored per data vector — that per-vector factor is what unbiases the estimate (the √D cancels). Implemented as RaBitQ::build (rotated sign bits + norm1[]) and knn_rabitq (the estimator).

Result — far better recall-per-bit, but the scan is slow

methodstage-1C=100C=500C=1000scan QPS
plain binary0.442583
rot-symmetric (026)0.4630.8990.9880.998~580
RaBitQ0.6060.9851.0001.00014.7

Conclusions

  1. RaBitQ has the best stage-1 recall by a wide margin — 0.606 vs 0.463 for rotated-symmetric (+14 pts) and 0.442 for plain binary. The unbiased estimator, keeping the query's magnitude and correcting per-vector, is a much sharper ranker than sign-vs-sign Hamming. This reproduces the paper's central claim on our data.
  2. It needs ~5–10× smaller rerank C for the same recall. RaBitQ hits 0.985 at C=100; symmetric needs C≈500–1000 to get there. At billion scale, where the f32 rerank tier dominates cost, that smaller C is the whole game.
  3. But as implemented the scan is 40× slower (14.7 vs ~580 QPS) — the estimator is an asymmetric set-bit gather (no popcount), the same kernel wall as 010/011. So today RaBitQ is recall-best, not throughput-best. RaBitQ's real implementation closes this by quantizing the query to a few bits and computing the estimate with bit-sliced popcount / vpshufb LUTs — the parked SIMD kernel.

Where this leaves it

Caveats