notes

Experiment 006

FMA + parallel-accumulator kernel; the width experiment

Perf records: 006-m3-fma-bound.json, 006-cascadelake-fma-bound.json, 006-cascadelake-fma-batch.json.

What we did

003–005 proved the scan was compute-bound on the kernel's serial, FMA-less reduction (the .map().sum() de-vectorized to per-lane scalar adds; fmla/ vfmadd = 0). This entry fixes that and tests whether wider SIMD helps.

Results — the kernel fix helped both architectures

QPS before → afterns/dist (DRAM)latency p50
M3 (NEON-128)98 → 131 (+34%)12.6 → 7.249 → 33 ms
Cascade Lake (256-bit FMA)50 → 63 (+26%)20 → 16140 → ~110 ms

recall 0.9994 throughout. Bound (working-set sweep):

Batching still flat on both (Cascade Lake FMA: 63→67 QPS across batch 1→64) — consistent with still-compute-bound.

The width experiment: should we use 512-bit?

On Cascade Lake, target-cpu=native made LLVM emit 256-bit (ymm) FMA, not 512-bit (zmm) — deliberately, via its prefer-256-bit heuristic (AVX-512 downclock avoidance). We tested forcing 512-bit (-C target-feature=-prefer-256-bit, verified zmm in the binary):

Cascade LakeQPSns/distCV
256-bit FMA (default)63.116.0
512-bit FMA (forced)42.723.40.5%

Forcing 512-bit is 32% slower. Clean signal (CV 0.5%). The AVX-512 downclock costs more than the extra lanes buy — so the compiler's 256-bit choice was optimal, and the answer to "use 512-bit?" on this chip is a measured no.

Conclusions

  1. The real lever was the kernel, exactly as 003–005 predicted. FMA + parallel accumulators (fixing the serial reduction) gave +26–34% — portably, on both ISAs, recall unchanged. Not SIMD width, not batching, not bandwidth.

  2. SIMD width is not a lever here — and can be negative. The 128-bit M3 is the fastest per-distance machine (7.2 ns) — 2.2× faster than the 512-bit-capable Cascade Lake (16 ns). And forcing 512-bit on Cascade Lake made it worse (downclock). Wider lanes don't help a kernel that isn't throughput-bound, and on older Intel they hurt. Modern high-IPC cores (M3) beat older wide-SIMD cores.

  3. Still compute-bound — but the M3 is nearing the flip. Faster kernel → memory latency starting to show (M3 1.23×). To fully flip to memory-bound (where batching/quantization finally pay), the kernel must get faster still — which on this hardware means fewer bytes per distance (quantization), not wider f32 SIMD. That's Exa's path (binary + LUT), and it's where the roofline points next.

  4. Rigor: all numbers are medians of 6 reps, warmup discarded. Cascade Lake CV ≤0.5% (trustworthy). M3 full-base CV was ~22% (laptop thermal/turbo) — its medians hold but single M3 runs swing ±20%; the bound-sweep M3 CVs were 1–5%.

Caveats