Experiment 054
Product Quantization: wins recall-per-byte, loses recall-per-QPS
Perf record: 054-product-quantization.json. Granite
box (8 vCPU). src/bin/pq.rs. Cohere 1M × 1024, K=256/subspace, trained on a 100k
sample (12 k-means iters), recall@10 vs exact GT. Closes the open branch from 044.
What PQ is
Split each vector into M subvectors, k-means each subspace (K=256 → 1 byte/subspace), so a vector is M bytes. Search by ADC: per query, precompute an M×256 table of subvector→centroid distances; each doc's distance is M table lookups summed.
Result
| M | bytes/vec | recall C=0 | recall C=200 | recall C=1000 | QPS |
|---|---|---|---|---|---|
| 8 | 8 | 0.071 | 0.373 | 0.675 | ~590 |
| 16 | 16 | 0.206 | 0.683 | 0.905 | ~290 |
| 32 | 32 | 0.405 | 0.924 | 0.992 | 140 |
| 64 | 64 | 0.593 | 0.993 | 0.9998 | 62 |
| binary funnel (051) | 128 | — | — | 0.998 | 963 |
The verdict: better bytes, worse speed
- Recall-per-byte: PQ wins. M=32 matches the funnel's recall (0.992) at ¼ the bytes (32 vs 128); M=64 edges it (0.9998).
- Recall-per-QPS: PQ loses, badly. M=32 runs 140 QPS vs the funnel's 963 — 7× slower at the same recall; M=64 is 62 QPS (15×).
Why — the 011 lesson, at scale
PQ's ADC is M data-dependent gathers per doc (look up table[s][code[s]]), and
gathers don't autovectorize. The binary funnel's per-doc op is count_ones →
VPOPCNTDQ (8 u64/instr). And PQ here is cache-resident (16–64 MB ≪ L3), so it's
not bandwidth-bound — it's gather-compute-bound, and scalar gather loses to vector
popcount. This is exactly 011 (the asymmetric LUT lost to popcount) reproduced at 1M
scale: fewer bytes don't buy QPS when the per-doc op is a slow gather.
Conclusion
The binary funnel stays the QPS winner. PQ is the footprint / recall-per-byte
option — valuable only when RAM is the binding constraint (8–32 B/vec vs 128 B), and even
then the funnel + disk hybrid (045) is an alternative. To make PQ QPS-competitive you'd
need a SIMD ADC (4-bit codes + vpshufb lookup, the FAISS PQ4 trick) — which is the
same unsafe-SIMD path parked in 050. Scalar PQ on CPU does not beat popcount.
Caveats
- Scalar ADC (the slow path); SIMD ADC (4-bit + pshufb) is the known fast variant and would change the QPS story — untested (needs the parked SIMD kernel).
- K=256, 100k training sample, 12 iters; more training/iters lifts recall a little, not the QPS verdict. OPQ (learned rotation + PQ) improves recall-per-byte further (056) but not the gather-vs-popcount QPS gap.
- One dataset (Cohere 1024). Recall vs within-set exact GT.