Experiment 023

Register-tiled hamming kernel: another negative result

Perf record: 023-register-tiled-kernel.json. Cohere v3, 1M × 1024, cosine, Granite box. --quant binary --batch 16 --tile-rt.

The idea

The tiled scan (016) compares each doc against T queries via T hamming() calls. Hypothesis: reorder to doc-word outer — load each doc word once into a register, reuse it across all T queries in the tile — to cut loads. Classic register blocking.
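A minimal sketch of the two loop orders, for reference. This is not the kernel from the repo: the function names, the const-generic tile size T, and the slice-of-slices query layout are assumptions made for illustration. Only hamming() and count_ones are named in this note.

```rust
/// Per-query Hamming distance: XOR + popcount over the whole word slice.
/// This whole-slice shape is the one the note describes as autovectorizing.
pub fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Query-outer order (the 016 shape): one hamming() call per query in the tile.
/// Hypothetical signature; the real tile layout may differ.
pub fn tile_per_query<const T: usize>(doc: &[u64], queries: &[&[u64]; T]) -> [u32; T] {
    let mut out = [0u32; T];
    for (o, q) in out.iter_mut().zip(queries.iter()) {
        *o = hamming(doc, q);
    }
    out
}

/// Doc-word-outer order (the register-tiled idea): read each doc word once,
/// then reuse it across all T queries. The popcount now sits in the inner
/// loop and is applied to one word at a time.
/// Assumes every query has at least doc.len() words.
pub fn tile_register_tiled<const T: usize>(doc: &[u64], queries: &[&[u64]; T]) -> [u32; T] {
    let mut out = [0u32; T];
    for (w, &d) in doc.iter().enumerate() {
        for t in 0..T {
            out[t] += (d ^ queries[t][w]).count_ones();
        }
    }
    out
}
```

Both functions return the same distances for the same inputs; only the loop nesting differs, which is the whole experiment.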

Result — 40–60% slower

scan-bits	per-query (VPOPCNTDQ)	register-tiled	Δ
1024	846 QPS	339 QPS	−60%
512	1056 QPS	620 QPS	−41%

Recall identical (same arithmetic).

Why it loses (the 012 lesson, again)

Per-query hamming runs count_ones over the whole word slice, which the compiler autovectorizes to VPOPCNTDQ, popcounting several 64-bit words per instruction. The register-tiled form moves the popcount inside the per-query inner loop, indexed one word at a time, which forces scalar popcnt and throws away the vectorization. And the load it was trying to save wasn't a real cost: a doc is at most 128 bytes (1024 bits) and stays in L1 across the whole tile, so re-reading doc[w] per query is an L1 hit, not a trip to main memory.
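The size arithmetic behind that argument, as a sketch for the 1024-bit setting. The constant names are made up for illustration, and the 512-bit vector width is an assumption: it is the widest form of VPOPCNTDQ, and the compiler may choose a narrower one.

```rust
// Hypothetical constants mirroring the 1024-bit, tile=16 run in this note.
pub const SCAN_BITS: usize = 1024;
pub const TILE: usize = 16;
pub const WORD_BITS: usize = u64::BITS as usize; // 64

// A doc (or a query) is 16 u64 words, i.e. 128 bytes.
pub const WORDS_PER_DOC: usize = SCAN_BITS / WORD_BITS;
pub const DOC_BYTES: usize = SCAN_BITS / 8;

// One doc plus the T queries it is compared against: 128 + 16 * 128 bytes.
pub const TILE_WORKING_SET_BYTES: usize = DOC_BYTES + TILE * DOC_BYTES;

// Assumed vector width: 512 bits, so 8 u64 lanes per vector popcount.
pub const VECTOR_BITS: usize = 512;
pub const VECTOR_POPCOUNTS_PER_DOC: usize = SCAN_BITS / VECTOR_BITS;

// The scalar form needs one popcnt per word instead.
pub const SCALAR_POPCOUNTS_PER_DOC: usize = WORDS_PER_DOC;
```

Under those assumptions the tile's working set is 2176 bytes, far below a typical L1 data cache of a few tens of KiB (the cache size of the Granite box is not stated here), while each doc-query comparison goes from 2 vector popcounts to 16 scalar ones.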

So the trade was: save a free L1 hit, lose vectorized popcount. It loses — exactly as 012 found for the non-tiled hamming. Two data points now say the same thing: don't hand-restructure popcount; let count_ones autovectorize.

Conclusion

Reverted to per-query as the default (--tile-rt kept opt-in to document the result). Combined with 012, the binary kernel's compute is at the hardware vector-popcount ceiling — there's no compute lever left that doesn't fight the autovectorizer.

Caveats

Batch path, tile=16, reps=6, CV < 0.7%; clean.