Experiment 011
Asymmetric LUT (precompute + indexed lookup)
Perf record: 011-asym-lut.json. Cohere v3 1M, Granite box.
What we did
010's asymmetric scan iterated set bits + gathered (slow). Since docs are binary, precompute per query a table for each 4-bit nibble position — table[p][pattern] = masked query-sum for those 4 dims — and scan a doc with one indexed lookup per nibble instead of per-bit gathers. 1024-d → 256 tables × 16 = 16 KB (L1-resident). Same score → same recall. Scalar version (no SIMD gather yet).
Result
| recall (no-rr / C=100 / C=200) | QPS | |
|---|---|---|
| asym direct (010) | 0.607 / 0.976 / 0.993 | 6.7 |
| asym LUT (011) | 0.608 / 0.977 / 0.994 | 14.3 |
| symmetric + rerank C=1000 (009) | 0.994 | 465 |
Conclusions
- Recall identical to 010 — the LUT computes the same score, confirming correctness. Recall-per-C stays excellent (0.994 at C=200 vs symmetric's C=1000).
- Scalar LUT is only ~2.1× faster (6.7 → 14.3 QPS). The 256 nibble lookups/doc are scalar, data-dependent gathers into the table — not vectorized — so still ~16× more work/doc than symmetric's 16 popcounts. Net: still ~33× slower than symmetric+rerank.
- Asymmetric is still not a Pareto win. Its recall advantage (5× smaller C) is real, but the scalar kernel doesn't close the speed gap. The win requires the SIMD
vpshufb/vpermbgather over an int8-quantized table (Exa's actual kernel, 32–64 lookups/instruction) — that's the next entry.
Practical standing
Best viable today remains symmetric binary + rerank (009): recall 0.994 at ~465–585 QPS, 122 MB. Asymmetric beats it on recall-per-C but needs the SIMD LUT to beat it on speed. At billion scale the smaller-C win compounds (less rerank load), making the SIMD LUT worth the effort; at this scale symmetric+rerank is the pragmatic choice.