notes

Experiment 010

Asymmetric scoring (full-precision query × binary doc)

Perf record: 010-asymmetric.json. Cohere v3, 1M × 1024, cosine, Granite box. --quant asym.

What we did

Symmetric binary (009) binarizes both sides and scores by Hamming distance. Asymmetric keeps the query in full precision and scores it against each doc's ±1 sign bits: score = Σᵢ qᵢ·signᵢ = 2·(Σ qᵢ over set bits) − Σᵢ qᵢ. The Σᵢ qᵢ term is the same for every doc, so ranking depends only on the masked query-sum. Docs stay at 1 bit per dimension (1M × 1024 bits ≈ 122 MiB); only the query keeps its magnitudes. The kernel is direct (iterate set bits + gather); no vpshufb LUT yet.
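A minimal NumPy sketch of the identity above, to check that the masked query-sum reproduces the ±1 score and preserves ranking. The data is random and the sizes are illustrative (not the 1M Cohere set); the variable names are assumptions for this sketch, not the names used in the benchmark code.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs = 1024, 1000            # illustrative sizes, not the 1M benchmark

docs = rng.standard_normal((n_docs, dim))
query = rng.standard_normal(dim)

bits = docs > 0                     # set bit <=> sign +1, clear bit <=> sign -1
signs = np.where(bits, 1.0, -1.0)

# Reference: full-precision query against the docs' +/-1 signs.
score_signed = signs @ query

# Masked form: sum the query only over each doc's set bits.
masked_sum = (bits * query).sum(axis=1)
score_masked = 2.0 * masked_sum - query.sum()

# Both forms agree up to floating-point rounding.
assert np.allclose(score_signed, score_masked)

# Storage check: 1024 sign bits pack into 128 bytes per doc.
packed = np.packbits(bits, axis=1)
print(packed.shape[1], "bytes per doc")
```

Because `2 * masked_sum - query.sum()` is an increasing function of `masked_sum`, sorting by either quantity gives the same order, so a kernel only needs the masked sum.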

Result — recall up at every C, but the kernel is slow

rerank C    symmetric recall    asym recall
none        0.466               0.607
20          0.630               0.799
50          0.798               0.935
100         0.887               0.976
200         0.942               0.993
scan QPS    690                 6.7

Conclusions

  1. Asymmetric lifts stage-1 recall substantially: confirmed. Keeping the query's real values (not just its signs) makes the coarse ranking much better at every C. Asym reaches 0.993 at C=200; symmetric needed C=1000 for 0.994 (009), so asym gets the same final quality with a ~5× smaller rerank C. This is the design rationale behind Exa's asymmetric scoring (stories.md §L59).

  2. But the naive asym kernel is ~100× slower than Hamming (6.7 vs 690 QPS). Symmetric is 16 popcounts/doc (1024 bits in 64-bit words); asym is ~512 gathers/doc (about half of the 1024 bits are set, and each set bit indexes into the float query). So as implemented, asym is not a Pareto win: symmetric+rerank (0.994 @ 585 QPS, 009) still beats it on speed.

  3. The vpshufb/vpermb LUT is exactly what closes this gap. It computes the same asymmetric score via precomputed 4-bit tables at popcount-like speed (concepts.md §L218) — that's why Exa uses lookup tables for asymmetric rather than a direct loop. With the LUT, asym would keep the recall advantage (smaller C) at symmetric-like throughput → then it dominates.
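A scalar NumPy sketch of the 4-bit table idea from point 3, assuming one 16-entry table per group of 4 dimensions. It only checks that table lookups give the same masked query-sum as the direct gather loop; it says nothing about speed. The layout (bit k of nibble g maps to dimension 4g + k) and all names are assumptions for this sketch, and the tables stay in float, whereas a real vpshufb kernel shuffles bytes and would need the table entries quantized to 8 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs = 1024, 200                     # illustrative sizes
query = rng.standard_normal(dim)
bits = rng.random((n_docs, dim)) < 0.5      # hypothetical doc sign bits

# Direct kernel: gather the query at every set bit (~dim/2 gathers per doc).
direct = np.array([query[np.flatnonzero(row)].sum() for row in bits])

# Per-query precompute: table[g, v] = sum of query[4g + k] over bits k set in v.
n_groups = dim // 4
nibble_bits = (np.arange(16)[:, None] >> np.arange(4)) & 1    # (16, 4)
table = query.reshape(n_groups, 4) @ nibble_bits.T            # (n_groups, 16)

# Encode each doc as nibble values in the same bit order as the tables.
weights = 1 << np.arange(4)                                   # [1, 2, 4, 8]
nibbles = (bits.reshape(n_docs, n_groups, 4) * weights).sum(axis=2)

# Scoring: one table lookup per nibble, then a sum (dim/4 lookups per doc).
lut = table[np.arange(n_groups), nibbles].sum(axis=1)

assert np.allclose(direct, lut)

# Full asymmetric score, if needed beyond ranking.
score = 2.0 * lut - query.sum()
```

The tables depend only on the query, so they are built once per query and reused across all docs; the per-doc work drops from ~512 data-dependent gathers to 256 fixed-position lookups, which is the access pattern a shuffle instruction can do many lanes at a time.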

Where this leaves the funnel

Caveats