Experiment 005

Bound detector: compute-bound on both NEON and AVX-512

Perf records: 005-m3-bound.json (M3 NEON-128), 005-cascadelake-bound.json (Lightsail Cascade Lake AVX-512).

What we did

Replaced inference with an explicit compute-vs-memory bound detector. The Lightsail box is virtualized and exposes no PMU (perf stat → <not supported> for all hardware counters), so a direct top-down read is impossible. The PMU-free method:

Working-set sweep. Vary the base size from cache-resident to RAM-sized (--base-subset N) and measure ns-per-distance (time ÷ queries ÷ N, so the result is normalized and size cancels out). If ns/distance is flat across the cache→DRAM boundary, memory speed doesn't matter → compute-bound. If it steps up once the base no longer fits in cache → memory-bound.
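The normalization and the decision rule can be sketched in a few lines. This is a minimal illustration, not the detector's actual code: the function names, the flat_ratio cutoff, and the two example sweeps are all invented for the sketch.

```python
def ns_per_distance(elapsed_s: float, queries: int, n: int) -> float:
    """Wall time divided by the number of distance evaluations, in ns."""
    return elapsed_s * 1e9 / (queries * n)


def verdict(ns_by_size: list[float], flat_ratio: float = 1.25) -> str:
    """Classify a sweep of ns/distance values, one per working-set size.

    flat_ratio is a hypothetical cutoff chosen for this sketch; the
    experiment does not state the threshold it uses.
    """
    ratio = max(ns_by_size) / min(ns_by_size)
    return "COMPUTE-BOUND" if ratio <= flat_ratio else "MEMORY-BOUND"


# 2 s of wall time for 100 queries against 1M vectors.
print(ns_per_distance(2.0, 100, 1_000_000))

# Invented sweeps: one flat, one that steps up at the largest size.
print(verdict([20.0, 20.1, 19.9]))
print(verdict([20.0, 20.5, 60.0]))
```

A plain max/min ratio ignores direction, so a real detector would probably also check that any increase occurs at the large-working-set end.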

Also added the rigor that was missing: --reps R with the first run discarded as warmup, reporting median + coefficient of variation instead of a single shot.
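The reporting step might look like the sketch below. It assumes the first sample in the list is the warmup run and that the coefficient of variation is the sample standard deviation over the mean; neither detail is spelled out in this entry, and the sample values are invented.

```python
import statistics


def summarize(samples_ns: list[float]) -> tuple[float, float]:
    """Drop the warmup run, then return (median, coefficient of variation)."""
    measured = samples_ns[1:]  # first run is treated as warmup
    if len(measured) < 2:
        raise ValueError("need at least two measured runs after warmup")
    median = statistics.median(measured)
    # Assumption: CV = sample standard deviation / mean.
    cv = statistics.stdev(measured) / statistics.mean(measured)
    return median, cv


# Invented samples: a slow first run followed by five steady ones.
median, cv = summarize([30.0, 20.0, 20.2, 19.8, 20.1, 19.9])
print(f"median {median:.2f} ns/dist, CV {cv * 100:.2f}%")
```

The median keeps one outlier rep from moving the reported figure, while the CV shows how much the reps disagree.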

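Sizes (512 B/vector; MB here means 2^20 bytes): 1k=0.5 MB (L1), 8k=3.9 MB (L2), 64k=31 MB (L3), 256k=125 MB, 1M=488 MB (DRAM). The cache-tier labels are nominal, since cache sizes differ between the two machines. 6 reps each; query count scaled to hold total work ~constant.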
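The quoted sizes follow from N × 512 B expressed in binary megabytes, and holding total work constant means the query count shrinks as N grows. A small sketch of both calculations, where the distance budget and the function names are hypothetical (the real query counts are not given here):

```python
VECTOR_BYTES = 512
SUBSETS = [1_000, 8_000, 64_000, 256_000, 1_000_000]


def working_set_mib(n: int) -> float:
    """Size of an n-vector base in units of 2**20 bytes."""
    return n * VECTOR_BYTES / 2**20


def queries_for(n: int, distance_budget: int) -> int:
    """Query count that keeps queries * n near a fixed distance budget."""
    return max(1, distance_budget // n)


# Hypothetical budget of one billion distance evaluations per run.
BUDGET = 1_000_000_000
for n in SUBSETS:
    print(f"N={n:>9}: {working_set_mib(n):8.2f} MiB, "
          f"{queries_for(n, BUDGET):>9} queries")
```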

Results

working set	M3 NEON-128 ns/dist	Cascade Lake AVX-512 ns/dist
0.5 MB (L1)	13.34	20.52
3.9 MB (L2)	14.21	20.06
31 MB (L3)	15.02	19.90
125 MB	12.98	19.83
488 MB (DRAM)	12.66	19.74
max/min ratio	1.19×	1.04×
CV	2–6%	<1.2%
verdict	COMPUTE-BOUND	COMPUTE-BOUND
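The ratio row can be recomputed from the table. The sketch below also adds a direction check that is not part of the original analysis: the largest working set divided by the smallest, which would exceed 1 if leaving cache made distances slower.

```python
# ns/distance from the results table, smallest to largest working set.
NS_PER_DIST = {
    "M3 NEON-128": [13.34, 14.21, 15.02, 12.98, 12.66],
    "Cascade Lake AVX-512": [20.52, 20.06, 19.90, 19.83, 19.74],
}


def max_min_ratio(values: list[float]) -> float:
    """Spread of the sweep: slowest point over fastest point."""
    return max(values) / min(values)


def largest_vs_smallest(values: list[float]) -> float:
    """DRAM-sized point over the smallest cache-resident point."""
    return values[-1] / values[0]


for name, values in NS_PER_DIST.items():
    print(f"{name}: max/min {max_min_ratio(values):.3f}x, "
          f"largest/smallest {largest_vs_smallest(values):.3f}x")
```

On both machines the largest/smallest figure comes out below 1, meaning the DRAM-sized run is faster per distance than the smallest one.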

Conclusions

Compute-bound on both architectures, and this time directly measured rather than inferred. Growing the working set ~1000× (L1 → DRAM) changes ns/distance by <19% (M3) / <4% (Cascade Lake). If this were memory-bound, ns/distance would step up once the base exceeds L2/L3. It doesn't: the DRAM-sized point is the fastest on both machines. Memory speed is not what limits this kernel.
Two independent methods agree. The batching-invariance test (003: cut bytes 32×, QPS unchanged) and this working-set sweep (cut the working set to cache size, ns/distance unchanged) attack the question from opposite directions and reach the same verdict on both machines. That's the cross-check the single-method 003/004 lacked.
The numbers are trustworthy now. CVs are 2–6% on the M3 (laptop, thermal/background noise) and <1.2% on Cascade Lake (one at 0.03% — a quiet server CPU). These are medians of 6 reps with warmup discarded, not single shots. Earlier entries' single-shot QPS should be read as ±~10%.
The small-N values are slightly higher, not lower. Counterintuitive for "cache is faster", but it's per-query fixed overhead (rayon dispatch, heap setup) amortized over fewer distances at small N. The trend is cleanest on Cascade Lake, where ns/distance falls monotonically with N; the M3 series is noisier and peaks at 31 MB before dropping. True compute cost ≈ the large-N asymptote (M3 ~12.6 ns, Cascade Lake ~19.7 ns). Another tell that it's compute, not memory.
Cascade Lake is ~1.6× slower per distance than the M3 (19.7 vs 12.6 ns) despite 4× wider SIMD — consistent with 004 (fewer/older cores, likely AVX-512 downclock, FMA-less reduction). Wider lanes don't help a serial-reduction-bound kernel.
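The fixed-overhead reading of the small-N values can be made concrete with a two-term model, ns(N) = c + F/N, where c is the per-distance compute cost and F is the per-query overhead in ns. The model and the two-point fit are assumptions made for illustration and were not part of the experiment; because the fit uses only the 1k and 1M Cascade Lake points, it reproduces those endpoints by construction.

```python
def fit_fixed_overhead(n_small: int, ns_small: float,
                       n_large: int, ns_large: float) -> tuple[float, float]:
    """Solve ns(N) = c + F / N through two measured points; returns (c, F)."""
    f = (ns_small - ns_large) / (1 / n_small - 1 / n_large)
    c = ns_large - f / n_large
    return c, f


def model(n: int, c: float, f: float) -> float:
    """Predicted ns/distance for a base of n vectors."""
    return c + f / n


# Cascade Lake endpoints from the results table.
c, f = fit_fixed_overhead(1_000, 20.52, 1_000_000, 19.74)
print(f"asymptotic cost c = {c:.3f} ns/dist, overhead F = {f:.0f} ns/query")
print(f"predicted at N=8k: {model(8_000, c, f):.2f} (measured 20.06)")
```

The implied overhead is a little under a microsecond per query. The measured mid-size points sit slightly above this curve, so the 1/N model is only a rough reading of the trend, and it does not describe the non-monotonic M3 series at all.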

Caveats

This is a software detector. The definitive PMU top-down (% memory-bound, achieved GB/s) still requires a bare-metal box; Lightsail can't provide it. But agreement between two independent software methods is strong evidence.
Verdict is for this kernel. The whole point is that the kernel's serial, FMA-less reduction is the binding constraint — fix that and the bound may shift to memory (at which point the detector earns its keep again).