notes

Experiment 005

Bound detector: compute-bound on both NEON and AVX-512

Perf records: 005-m3-bound.json (M3 NEON-128), 005-cascadelake-bound.json (Lightsail Cascade Lake AVX-512).

What we did

Replaced inference with an explicit compute-vs-memory bound detector. The Lightsail box is virtualized and exposes no PMU (perf stat reports <not supported> for every hardware counter), so a direct top-down read is impossible. The PMU-free method:

Working-set sweep. Vary the base size from cache-resident up to RAM-sized (--base-subset N) and measure ns-per-distance (time ÷ queries ÷ N, which is normalized so that size cancels out). If ns/distance stays flat across the cache→DRAM boundary, memory speed does not matter and the kernel is compute-bound. If it steps up once the working set outgrows cache, the kernel is memory-bound.
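
The normalization can be sketched as below. This is a minimal illustration, not the detector's actual code: the function name ns_per_distance is hypothetical, and it assumes a brute-force scan in which every query is compared against all N base vectors, so the number of distances is queries × N.

```rust
use std::time::Duration;

/// Nanoseconds per distance evaluation: elapsed time / (queries * base vectors).
/// Assumes every query is compared against all `base_len` vectors,
/// so the total number of distances is queries * base_len.
fn ns_per_distance(elapsed: Duration, queries: u64, base_len: u64) -> f64 {
    elapsed.as_nanos() as f64 / (queries as f64 * base_len as f64)
}

fn main() {
    // Made-up timing for illustration only, not a measured value.
    let elapsed = Duration::from_millis(20);
    let ns = ns_per_distance(elapsed, 1_000, 1_000);
    println!("{ns:.2} ns/distance");
}
```

Because the result is divided by both the query count and N, two sweep points with very different base sizes can be compared directly.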

Also added the rigor that was missing: --reps R with the first run discarded as warmup, reporting median + coefficient of variation instead of a single shot.
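
A rough sketch of that reporting step follows. The function name median_and_cv is hypothetical, and the choice of sample standard deviation (n − 1 denominator) for the CV is an assumption; the notes do not say which variant the tool uses.

```rust
/// Drop the first (warmup) sample, then return (median, coefficient of variation).
/// CV is computed here as sample standard deviation / mean (an assumption).
/// Returns None if fewer than two samples remain after the warmup is dropped.
fn median_and_cv(samples: &[f64]) -> Option<(f64, f64)> {
    let mut kept: Vec<f64> = samples.iter().skip(1).copied().collect();
    if kept.len() < 2 {
        return None;
    }
    kept.sort_by(|a, b| a.total_cmp(b));
    let n = kept.len();
    let median = if n % 2 == 1 {
        kept[n / 2]
    } else {
        (kept[n / 2 - 1] + kept[n / 2]) / 2.0
    };
    let mean = kept.iter().sum::<f64>() / n as f64;
    let var = kept.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n as f64 - 1.0);
    Some((median, var.sqrt() / mean))
}

fn main() {
    // Hypothetical ns/distance samples; the first (30.0) is the cold run.
    let samples = [30.0, 10.0, 12.0, 11.0];
    if let Some((median, cv)) = median_and_cv(&samples) {
        println!("median = {median:.2} ns/dist, CV = {:.1}%", cv * 100.0);
    }
}
```

The median keeps a single slow repetition from skewing the reported figure, while the CV shows how much the repetitions disagree.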

Sizes (512 B/vector): 1k=0.5 MB (L1), 8k=3.9 MB (L2), 64k=31 MB (L3), 256k=125 MB, 1M=488 MB (DRAM). 6 reps each; query count scaled to hold total work ~constant.
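
The size arithmetic can be checked with a short sketch. The listed figures reproduce if "k" means 1,000 vectors and "MB" means 2^20 bytes. The query-scaling rule and the target of 10^9 distances per sweep point are assumptions made for illustration; the notes only say that total work was held roughly constant.

```rust
const BYTES_PER_VECTOR: u64 = 512;

/// Working-set size in MiB (2^20 bytes) for a base of `n_vectors` vectors.
fn working_set_mib(n_vectors: u64) -> f64 {
    (n_vectors * BYTES_PER_VECTOR) as f64 / (1024.0 * 1024.0)
}

/// Query count that keeps queries * n_vectors close to `target_distances`.
/// Hypothetical rule: the notes do not give the exact scaling used.
fn queries_for(n_vectors: u64, target_distances: u64) -> u64 {
    (target_distances / n_vectors).max(1)
}

fn main() {
    // Assumed total work per sweep point, for illustration only.
    let target_distances: u64 = 1_000_000_000;
    for n in [1_000u64, 8_000, 64_000, 256_000, 1_000_000] {
        println!(
            "N = {n:>9}  working set = {:>6.1} MiB  queries = {}",
            working_set_mib(n),
            queries_for(n, target_distances)
        );
    }
}
```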

Results

| working set | M3 NEON-128 ns/dist | Cascade Lake AVX-512 ns/dist |
| --- | --- | --- |
| 0.5 MB (L1) | 13.34 | 20.52 |
| 3.9 MB (L2) | 14.21 | 20.06 |
| 31 MB (L3) | 15.02 | 19.90 |
| 125 MB | 12.98 | 19.83 |
| 488 MB (DRAM) | 12.66 | 19.74 |
| max/min ratio | 1.18× | 1.04× |
| CV | 2–6% | <1.2% |
| verdict | COMPUTE-BOUND | COMPUTE-BOUND |
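
The max/min row and the verdict can be recomputed from the per-size medians with a sketch like the one below. The 1.5× threshold is purely hypothetical, since the notes do not state the cutoff the detector uses, and the names max_min_ratio, classify and Verdict are illustrative. Both functions assume a non-empty slice of positive values.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    ComputeBound,
    MemoryBound,
}

/// Ratio of the slowest to the fastest ns/distance across the sweep.
/// Assumes a non-empty slice of positive values.
fn max_min_ratio(ns_per_dist: &[f64]) -> f64 {
    let max = ns_per_dist.iter().copied().fold(f64::MIN, f64::max);
    let min = ns_per_dist.iter().copied().fold(f64::MAX, f64::min);
    max / min
}

/// Flat sweep (ratio below the threshold) => compute-bound, otherwise memory-bound.
fn classify(ns_per_dist: &[f64], threshold: f64) -> Verdict {
    if max_min_ratio(ns_per_dist) < threshold {
        Verdict::ComputeBound
    } else {
        Verdict::MemoryBound
    }
}

fn main() {
    // Medians from the results table, smallest to largest working set.
    let m3 = [13.34, 14.21, 15.02, 12.98, 12.66];
    let cascade_lake = [20.52, 20.06, 19.90, 19.83, 19.74];
    // Hypothetical cutoff; not taken from the experiment.
    const THRESHOLD: f64 = 1.5;
    for (name, sweep) in [("M3", &m3), ("Cascade Lake", &cascade_lake)] {
        println!(
            "{name}: max/min = {:.2}x, verdict = {:?}",
            max_min_ratio(sweep),
            classify(sweep, THRESHOLD)
        );
    }
}
```

A plain max/min ratio ignores direction. A stricter check would also require that the slow points sit at the large working sets, since that is the memory-bound signature.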

Conclusions

  1. Compute-bound on both architectures, directly measured rather than inferred. Growing the working set ~1000× (L1 → DRAM) changes ns/distance by <19% (M3) / <4% (Cascade Lake), and on both machines the largest working set is the fastest point, not the slowest. If the kernel were memory-bound, ns/distance would step up as soon as the base exceeds L2/L3. It does not. Memory speed is irrelevant to this kernel.

  2. Two independent methods agree. The batching-invariance test (003: cut bytes 32×, QPS unchanged) and this working-set sweep (cut the working set to cache size, ns/distance unchanged) attack the question from opposite directions and reach the same verdict on both machines. That's the cross-check the single-method 003/004 lacked.

  3. The numbers are trustworthy now. CVs are 2–6% on the M3 (laptop, thermal/background noise) and <1.2% on Cascade Lake (one at 0.03% — a quiet server CPU). These are medians of 6 reps with warmup discarded, not single shots. Earlier entries' single-shot QPS should be read as ±~10%.

  4. The small-N values are slightly higher, not lower. That is counterintuitive if "cache is faster", but it is consistent with per-query fixed overhead (rayon dispatch, heap setup) being amortized over fewer distances at small N. The trend is monotonic on Cascade Lake (20.52 → 19.74 ns); on the M3 it is noisier, peaking at 31 MB before dropping. True compute cost ≈ the large-N asymptote (M3 ~12.6 ns, Cascade Lake ~19.7 ns). Another tell that it's compute, not memory.

  5. Cascade Lake is ~1.6× slower per distance than the M3 (19.7 vs 12.6 ns) despite 4× wider SIMD — consistent with 004 (fewer/older cores, likely AVX-512 downclock, FMA-less reduction). Wider lanes don't help a serial-reduction-bound kernel.

Caveats