notes

Experiment 050

SIMD step 1 (safe hints): the binary scan is memory-bound, not popcount-bound

Perf record: 050-simd-hints-scan.json. Granite box (Xeon 6975P-C, AVX-512 VPOPCNTDQ). src/bin/simdbench.rs. Single-thread Hamming-kernel microbench, 1024-bit codes, isolating the kernel (no heap; data cache-resident).

The question

Can we push the binary-scan Hamming kernel with SIMD? Step 1: safe, stable autovectorization hints only — before reaching for nightly core::simd or unsafe std::arch. The current kernel is a naive count_ones loop that already lowers to VPOPCNTDQ; 012/023 showed hand-restructuring it loses. So: is there headroom at all?

Result

Single-thread Gcmp/s, in-L2 (512 KB, compute-isolated) vs at-scale (25 MB, memory):

variantin-L2at scale
baseline count_ones0.2890.230
iter.zip().map().sum()0.2940.230
chunks_exact(8)0.2820.228
interleave4 (doc-level ILP)0.357 (+23%)0.233 (~0%)
manual_acc4 (the 012 loser)0.092 (−68%)0.091
  1. Safe hints buy nothing. chunks_exact, iterator fusion → parity with baseline. The compiler already emits VPOPCNTDQ; you can't hint past what it already does.
  2. manual_acc4 is 3× slower — 012/023 reconfirmed on AVX-512. Hand-splitting the popcount into manual accumulators defeats the autovectorizer (forces scalar popcnt).
  3. Doc-level ILP wins +23% — but only in cache. Keeping 4 independent full-width Hammings in flight (distinct from 023, which interleaved queries per word and went scalar) fills the popcount pipeline. At 25 MB it vanishes (0.230→0.233).

Why the in-cache win doesn't matter at scale

Two things collapse the ILP win at deployment scale:

Scan-sharing flips single-thread to compute-bound (+73%)

The first cut above missed the scan-sharing axis (T queries riding one doc-read — the tiling/carousel model). Single-thread, at-scale (25 MB), as T rises:

T (riders/doc)at-scale Gcmp/s
10.215 (memory-bound)
40.355
80.371 (+73%)
16–64~0.37 (plateau)

At T≥8 the at-scale throughput equals the in-L2 compute ceiling — the memory tax is fully amortized and it's now popcount-bound. So scan-sharing does push single- thread QPS up, hard. (The engine already captures this: tiling 016/038, carousel 039–041.)

But at full chip it's the bandwidth wall, and sharding ≠ throughput

8 cores, N=5M (640 MB > L3), T=16:

schemebase DRAM readsGcmp/s
sharded carousel×1 (shards)1.139
tiled batch (047 model)×cores1.071

They're equal — the shared 480 MB L3 absorbs the "redundant" reads, so reading the base 8× isn't 8× the DRAM traffic. Both saturate at ~1.1 Gcmp/s = 0.14/core, far below the 0.33–0.37/core single-thread compute ceiling → the chip is memory/L3- bandwidth-bound. The carousel's sharding gives no peak-throughput win over the tiled batch (its win is latency-under-burst, 041), and faster popcount can't help.

Conclusion (revised)

This closes the SIMD question for the scan: not worth core::simd/unsafe — chase bytes, not popcount.

(Where SIMD could still pay is a different kernel — e.g. the asymmetric LUT (vpshufb gather, history 011) for a recall-per-byte gain — but that's a separate path, not the symmetric Hamming scan, and was cut as non-competitive scalar.)

Caveats