notes

Experiment 054

Product Quantization: wins recall-per-byte, loses recall-per-QPS

Perf record: 054-product-quantization.json. Granite box (8 vCPU). src/bin/pq.rs. Cohere 1M × 1024, K=256/subspace, trained on a 100k sample (12 k-means iters), recall@10 vs exact GT. Closes the open branch from 044.

What PQ is

Split each vector into M subvectors, k-means each subspace (K=256 → 1 byte/subspace), so a vector is M bytes. Search by ADC: per query, precompute an M×256 table of subvector→centroid distances; each doc's distance is M table lookups summed.

Result

Mbytes/vecrecall C=0recall C=200recall C=1000QPS
880.0710.3730.675~590
16160.2060.6830.905~290
32320.4050.9240.992140
64640.5930.9930.999862
binary funnel (051)1280.998963

The verdict: better bytes, worse speed

Why — the 011 lesson, at scale

PQ's ADC is M data-dependent gathers per doc (look up table[s][code[s]]), and gathers don't autovectorize. The binary funnel's per-doc op is count_onesVPOPCNTDQ (8 u64/instr). And PQ here is cache-resident (16–64 MB ≪ L3), so it's not bandwidth-bound — it's gather-compute-bound, and scalar gather loses to vector popcount. This is exactly 011 (the asymmetric LUT lost to popcount) reproduced at 1M scale: fewer bytes don't buy QPS when the per-doc op is a slow gather.

Conclusion

The binary funnel stays the QPS winner. PQ is the footprint / recall-per-byte option — valuable only when RAM is the binding constraint (8–32 B/vec vs 128 B), and even then the funnel + disk hybrid (045) is an alternative. To make PQ QPS-competitive you'd need a SIMD ADC (4-bit codes + vpshufb lookup, the FAISS PQ4 trick) — which is the same unsafe-SIMD path parked in 050. Scalar PQ on CPU does not beat popcount.

Caveats