Experiment 032

PDX vertical layout: 2× faster exact scan, free

Perf record: 032-pdx-layout.json. Cohere v3, 1M × 1024, cosine, Granite box. --quant pdx --block B.

Research → idea

PDX (Kuffo, Krippner & Boncz, SIGMOD 2025, arXiv:2503.04422) changes the data layout, not the algorithm. Instead of storing each vector's dims contiguously (horizontal/row-major), it groups vectors into blocks and stores each block dimension-major (transposed): all vectors' dim 0, then all vectors' dim 1, …
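A minimal sketch of the regrouping described above, in plain Python for readability. The function name to_pdx_blocks and the (flat, width) return shape are illustrative assumptions, not the names used in this codebase.

```python
def to_pdx_blocks(base, block=64):
    """Regroup row-major vectors into blocks stored dimension-major.

    base: list of equal-length vectors (lists of floats), one per row.
    Returns a list of (flat, width) pairs, where flat[j * width + v]
    holds dim j of the block's v-th vector. The last block may be
    narrower than `block`.
    """
    dims = len(base[0])
    blocks = []
    for start in range(0, len(base), block):
        chunk = base[start:start + block]      # horizontal: one row per vector
        width = len(chunk)
        # Vertical: all vectors' dim 0, then all vectors' dim 1, ...
        flat = [chunk[v][j] for j in range(dims) for v in range(width)]
        blocks.append((flat, width))
    return blocks


# Four 3-dim vectors, grouped into blocks of two.
base = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
blocks = to_pdx_blocks(base, block=2)
first_flat, first_width = blocks[0]
```

In the first block, the two vectors' dim-0 values (0.0, 3.0) sit next to each other, followed by their dim-1 values, and so on. The bytes are the same as in the horizontal layout; only their order changes.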

Computing distances for a block becomes a loop over dimensions whose inner loop runs over vectors ("multiple-vectors-at-a-time"). That inner loop is branch-free and autovectorizes cleanly, and crucially it needs no per-vector horizontal reduction: the partial sums accumulate into a partial[v] array across the dim loop. The paper reports PDX running ~40% faster than SIMD-optimized kernels on the horizontal layout, and that the layout restores the benefit of pruning algorithms (next entry). It is training-free and plain scalar code that the compiler vectorizes, with no intrinsics.
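The loop order can be sketched as follows. This is an illustration of the kernel's shape only, assuming squared L2 as the distance and the flat dimension-major block layout described above. The function names are hypothetical, and Python will not autovectorize anything; in a compiled language the inner loop over v is the one the compiler turns into SIMD.

```python
def pdx_block_l2_sq(flat, width, query):
    """Squared L2 from `query` to every vector in one dimension-major block.

    flat[j * width + v] is dim j of the block's v-th vector.
    """
    partial = [0.0] * width                # one running sum per vector
    for j, q in enumerate(query):          # outer loop: dimensions
        row = j * width
        for v in range(width):             # inner loop: vectors, branch-free
            diff = flat[row + v] - q
            partial[v] += diff * diff      # no per-vector horizontal reduction
    return partial


def horizontal_l2_sq(vec, query):
    """Row-major reference: one vector at a time, reduced into a scalar."""
    acc = 0.0
    for x, q in zip(vec, query):
        diff = x - q
        acc += diff * diff
    return acc


# Block of two 3-dim vectors, [0, 1, 2] and [3, 4, 5], stored dim-major.
flat = [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
query = [1.0, 1.0, 1.0]
dists = pdx_block_l2_sq(flat, 2, query)
```

Both functions produce the same distances. The difference is that the vertical kernel keeps a whole array of accumulators live and never has to sum across SIMD lanes at the end of a vector.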

How it pushes the boundary

Our exact f32 scan was the slowest path (9 QPS), and 031 showed pruning on the horizontal layout under-delivered. PDX attacks the layout itself: same bytes read, but arranged so the compute vectorizes far better and pruning (033) can read only the dims it needs across many vectors at once.

Result — ~2× throughput, ~3× lower latency, recall identical

layout	recall@10	QPS	p50 latency
horizontal exact (`knn_batch`)	1.0000	9.3	715 ms
horizontal tiled (batch=16)	1.0000	14.9	707 ms
PDX (block=64)	1.0000	18.4	233 ms

Throughput is insensitive to block size (32→256 all ≈ 18 QPS).
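For reference, the ratios behind the "~2×" and "~3×" headline can be recomputed from the table. The dictionary below simply restates the three measured rows; the variable names are illustrative.

```python
# (QPS, p50 latency in ms) copied from the results table.
results = {
    "horizontal exact": (9.3, 715.0),
    "horizontal tiled": (14.9, 707.0),
    "PDX (block=64)": (18.4, 233.0),
}

base_qps, base_p50 = results["horizontal exact"]

# Gains relative to the horizontal exact baseline.
gains = {
    name: (qps / base_qps, base_p50 / p50)
    for name, (qps, p50) in results.items()
}

for name, (qps_gain, latency_gain) in gains.items():
    print(f"{name}: {qps_gain:.2f}x QPS, {latency_gain:.2f}x lower p50")
```

PDX comes out at about 1.98× the baseline QPS and about 3.07× lower p50. Tiling alone gives about 1.60× QPS but leaves p50 essentially unchanged.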

Conclusions

PDX ~doubles exact-scan throughput (9.3 → 18.4) and cuts single-query latency ~3× (715 → 233 ms) at bit-identical recall — for free, by transposing the block layout. It even beats our own bandwidth-tuned tiled horizontal (14.9).
The latency win (3×) exceeds the throughput win (2×) for the now-familiar reason: single-threaded, PDX's vectorized no-reduction kernel is much faster; but the 8-core batch re-saturates memory bandwidth (PDX still reads the whole base), so the aggregate gain compresses. Same compute-vs-bandwidth tension as 012–025, seen from the layout side.
It's the substrate for fast pruning. The real PDX payoff is that dimension-pruning (ADSampling) on this layout reads only the dims it needs across a whole block at once — tested in 033.

Caveats

This is the exact tier (recall 1.0). It is still far below the binary funnel's 851 QPS, but that is the approximate tier; PDX is the exact accelerator.
We gained ~2× versus the paper's ~40%. Our horizontal l2_sq baseline is plain autovectorized code, not a hand-tuned AVX-512 kernel, so PDX has more to gain here.
The build transposes the base once (a one-time ~3.9 GB copy).