Experiment 010
Asymmetric scoring (full-precision query × binary doc)
Perf record: 010-asymmetric.json.
Cohere v3, 1M × 1024, cosine, Granite box. --quant asym.
What we did
Symmetric binary (009) binarizes both sides → Hamming. Asymmetric keeps the query in full precision and scores it against the docs' ±1 sign bits: score = Σ qᵢ·signᵢ = 2·(Σ qᵢ over set bits) − Σq, so ranking is just the masked query-sum. Docs stay 1 bit (122 MB); the query keeps its magnitude. Direct kernel (iterate set bits + gather) — no vpshufb LUT yet.
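The identity above can be checked with a small sketch. It assumes a sign convention the note does not state: bit i is set when the doc component is positive, and a zero component maps to −1. The names `pack_signs`, `asym_score_direct`, and `asym_score_reference` are illustrative and not taken from the experiment code. A Python int stands in for the packed words that the real kernel presumably uses.

```python
import random

def pack_signs(doc):
    """Pack sign bits into one int: bit i is set iff doc[i] > 0 (assumed convention)."""
    bits = 0
    for i, x in enumerate(doc):
        if x > 0:
            bits |= 1 << i
    return bits

def asym_score_direct(query, bits):
    """Direct kernel: iterate set bits, gather query values, then 2*masked - total."""
    masked = 0.0
    b = bits
    while b:
        low = b & -b                           # isolate the lowest set bit
        masked += query[low.bit_length() - 1]  # gather the matching query value
        b ^= low                               # clear that bit
    return 2.0 * masked - sum(query)

def asym_score_reference(query, doc):
    """Reference: dot product of the query with the doc's +/-1 signs."""
    return sum(q if x > 0 else -q for q, x in zip(query, doc))

rng = random.Random(0)
dim = 64
query = [rng.uniform(-1, 1) for _ in range(dim)]
doc = [rng.uniform(-1, 1) for _ in range(dim)]
bits = pack_signs(doc)
```

Because Σq is the same for every doc under a given query, dropping it and the factor of 2 leaves the ranking unchanged, which is why the masked query-sum alone is enough for stage 1.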
Result — recall up at every C, but the kernel is slow
| rerank C | symmetric recall | asym recall |
|---|---|---|
| none | 0.466 | 0.607 |
| 20 | 0.630 | 0.799 |
| 50 | 0.798 | 0.935 |
| 100 | 0.887 | 0.976 |
| 200 | 0.942 | 0.993 |
| scan QPS | 690 | 6.7 |
Conclusions
- Asymmetric lifts stage-1 recall substantially: confirmed. Keeping the query's real values (not just its sign) makes the coarse ranking much better at every C. Asym reaches 0.993 at C=200; symmetric needed C=1000 for 0.994 → ~5× smaller rerank C for the same final quality. This is the design rationale behind Exa's asymmetric scoring (stories.md§L59).
- But the naive asym kernel is ~100× slower than Hamming (6.7 vs 690 QPS). Symmetric is 16 popcounts/doc; asym is ~512 gathers/doc (iterate set bits, index into the float query). So as implemented, asym is not a Pareto win: symmetric+rerank (0.994 @ 585 QPS, 009) still beats it on speed.
- The vpshufb/vpermb LUT is exactly what closes this gap. It computes the same asymmetric score via precomputed 4-bit tables at popcount-like speed (concepts.md§L218); that's why Exa uses lookup tables for asymmetric rather than a direct loop. With the LUT, asym would keep the recall advantage (smaller C) at symmetric-like throughput → then it dominates.
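The 4-bit table idea can be sketched in scalar Python. For each group of 4 dimensions, the query's signed partial sum is precomputed for all 16 sign patterns, so scoring a doc becomes one table lookup per nibble instead of one gather per set bit. This is only a sketch under stated assumptions: the function names are hypothetical, the tables hold floats (a SIMD byte-shuffle version would presumably need them quantized to bytes, which is not modeled here), and bit i set means a positive doc component.

```python
def build_nibble_tables(query):
    """Per group of 4 dims, tabulate the +/-1 weighted sum for all 16 bit patterns."""
    assert len(query) % 4 == 0
    tables = []
    for g in range(0, len(query), 4):
        chunk = query[g:g + 4]
        tables.append([
            sum(q if (pattern >> j) & 1 else -q for j, q in enumerate(chunk))
            for pattern in range(16)
        ])
    return tables

def asym_score_lut(tables, bits):
    """Score one doc: one table lookup per 4-bit nibble of its packed sign bits."""
    score = 0.0
    for table in tables:
        score += table[bits & 0xF]  # low nibble selects the precomputed partial sum
        bits >>= 4
    return score

# Doc signs (+,-,+,-,-,-,+,+) packed with bit i set for a positive component.
example_query = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
example_bits = 0b11000101
example_tables = build_nibble_tables(example_query)
example_score = asym_score_lut(example_tables, example_bits)
```

The tables depend only on the query, so they are built once per query and reused across every doc in the scan.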
Where this leaves the funnel
- Best viable today: symmetric binary + rerank C=1000 (009) — 0.994 recall, 585 QPS, 122 MB.
- Asymmetric's promise: same recall at ~5× smaller C, if the LUT brings the scan speed up. Recall half proven here; speed half is the LUT (next).
- At billion scale the smaller-C win compounds (less rerank = less of the expensive f32 tier touched), so the LUT is worth it there; at our scale it's marginal vs symmetric+rerank.
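The funnel itself can be sketched as a two-stage search: a coarse scan over the 1-bit docs keeps the top C candidates, and only those are rescored in full precision. This is a minimal illustration, not the experiment's code. `funnel_search` and its helpers are hypothetical names, a plain dot product stands in for cosine (which assumes normalized vectors), and stage 1 uses the masked query-sum as the coarse score.

```python
import heapq
import random

def pack_signs(doc):
    """Pack sign bits into one int: bit i is set iff doc[i] > 0 (assumed convention)."""
    bits = 0
    for i, x in enumerate(doc):
        if x > 0:
            bits |= 1 << i
    return bits

def masked_query_sum(query, bits):
    """Coarse asymmetric score: sum of query values where the doc bit is set."""
    return sum(q for i, q in enumerate(query) if (bits >> i) & 1)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def funnel_search(query, docs, doc_bits, c, k):
    """Stage 1: top-C ids by coarse score. Stage 2: rerank those by exact dot product."""
    candidates = heapq.nlargest(
        c, range(len(docs)), key=lambda i: masked_query_sum(query, doc_bits[i])
    )
    return heapq.nlargest(k, candidates, key=lambda i: dot(query, docs[i]))

rng = random.Random(0)
dim, n = 32, 200
docs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
doc_bits = [pack_signs(d) for d in docs]
query = [rng.gauss(0, 1) for _ in range(dim)]
top = funnel_search(query, docs, doc_bits, c=50, k=10)
```

Only the C candidates touch the full-precision vectors in stage 2, which is the cost that a smaller C reduces.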
Caveats
- Direct asym kernel (no LUT) → speed is a lower bound; the LUT is the real implementation.
- Cohere v3 is compression-aware (favorable to binary in general).