notes

Experiment 059

SIMD ADC (PQ4 + swizzle_dyn): gather isn't dead — it ties/beats popcount

Perf record: 059-simd-adc-pq4.json. Granite box (AVX-512, 8 vCPU). src/bin/pq4simd.rs (nightly core::simd, no unsafe). Random codes, kernel-isolated scan throughput. Task #2 — the question parked in 050/054.

What we built

4-bit PQ codes (16 centroids/subspace) + an int8 LUT looked up with swizzle_dyn (lowers to pshufb/tbl) — 16 lookups per instruction, FAISS PQ4 style, in safe Rust. M=16 subspaces = 8 B/vec (vs the binary code's 128 B).

Result

scansingle-thread8-core @10M8-core @50M
popcount (128 B)0.200.360.37
scalar PQ (16 B)0.14
pq4-simd (8 B)1.830.720.73

What this overturns

054/057 concluded "popcount beats gather, so 1-bit binary is throughput-optimal." That was true for scalar ADC. With a SIMD LUT (swizzle_dyn), gather is no longer dominated — it's competitive-to-faster, mainly via the byte advantage. The parked question (050/054) resolves as: SIMD ADC is a real high-QPS contender.

Honest caveats (why this isn't yet "pq4 beats the funnel")

  1. The funnel tiles; this popcount doesn't. Tiling (016) amortizes the 128 B base read across a query tile, so the real funnel's effective popcount throughput is well above this microbench's 0.20–0.37. A fair tiled-vs-tiled comparison is the follow-up. (pq4 can tile too — process a tile of queries against one code block.)
  2. 4-bit recall is coarse and untested. M=16 byte-PQ was 0.90 (054); 4-bit is lower → pq4 needs a rerank tier (funnel-style) to reach 0.99. The end-to-end recall/QPS isn't measured here.

Conclusion

SIMD ADC (PQ4 + swizzle_dyn) is fast and not dominated by popcount — it closes the 13× gap scalar PQ had and wins on the byte axis at scale. It's the one technique that could rival the binary funnel on QPS. The decisive next step is tiled-vs-tiled + a recall pass (pq4-scan → exact rerank). Built with safe core::simd (no unsafe), nightly only.