Experiment 024

Rerank candidate prefetch: neutral

Perf record: 024-rerank-prefetch.json. Cohere v3, 1M × 1024, cosine, Granite box. --quant binary --batch 16 --rerank-pf.

The idea

The rerank rescores C candidates by random-access gather into the 3.9 GB f32 store — unlike the sequential scan, a legit software-prefetch target. The hardware prefetcher only streams within a row once touched; it can't see the next random candidate. So prefetch the row PF=8 ahead to hide its initial DRAM latency.

Result — no change

scan-bits	C	recall	no prefetch	prefetch
1024	1000	0.9931	848.8	848.0
1024	2000	0.9975	731.3	729.7
512	4000	0.9601	644.6	634.4

Within noise (CV < 1%), slightly negative at high C.

Why it does nothing here

Prefetch buys latency-hiding only when there is spare memory bandwidth. Under the 8-core batch this workload is bandwidth-saturated (it's why tiling worked), so there's no idle bandwidth for prefetched lines to ride on — the prefetch just competes for the same saturated bus. And rayon-over-queries already issues many independent gathers concurrently, so the memory system has plenty of in-flight requests (memory-level parallelism) without explicit hints. Net: nothing to gain, and at high C the extra prefetch instructions are slight overhead.

This is consistent with 015 (rerank isn't the batch bottleneck) and 016/023 (the scan, and the saturated memory bus, are the real constraints).

Conclusion

Dropped as a default (--rerank-pf kept opt-in). Prefetch would only matter in a latency path with one query and spare bandwidth — but there the scan dominates (021/022), so it still wouldn't move the needle.

Caveats

x86-only intrinsic; no-op on other arches. Batch path; reps=6.