Experiment 007
Granite Rapids: the cascade closes (batching finally pays)
Perf records: 007-granite-bound.json, 007-granite-batch.json.
Box: c8i.2xlarge spot (Intel Xeon 6975P-C, Granite Rapids, AVX-512 + VBMI), provisioned via infra/ CDK in ap-southeast-1.
What we did
Ran the 006 FMA kernel on the newest Intel gen — the chip closest to Exa's stack (has vpermb/VBMI). Bound sweep + batch sweep + the forced-512 test, same protocol as Cascade Lake (004/006).
Results — two firsts
1. The bound FLIPPED — first machine to show it.
| working set | ns/dist |
|---|---|
| 0.5 MB (L1) | 8.75 |
| 3.9 MB (L2) | 8.69 |
| 31 MB (L3) | 8.65 |
| 125 MB | 12.20 |
| 488 MB (DRAM) | 13.50 |
Ratio 1.56× (13.50 / 8.65) → MIXED / memory-sensitive. While cache-resident, the FMA kernel runs at ~8.6 ns/distance (close to the M3's 7.2 ns full-scan figure), but the full-base scan jumps to 13.5 ns: the fast kernel outran this box's memory bandwidth, so at full size it is partly memory-bound. Every prior machine was flat or nearly so (compute-bound); this is the first to step up.
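The ratio and verdict can be reproduced from the table with a few lines. This is a minimal sketch, not the 005 detector itself: the names `dram_to_cache_ratio` and `classify_bound`, the choice of the fastest cache-resident row as the baseline, and the 1.25 / 2.0 thresholds are all assumptions made for illustration.

```python
# ns per distance by working-set size, copied from the table above.
SWEEP_NS = {
    "0.5 MB": 8.75,
    "3.9 MB": 8.69,
    "31 MB": 8.65,
    "125 MB": 12.20,
    "488 MB": 13.50,
}
CACHE_RESIDENT = ("0.5 MB", "3.9 MB", "31 MB")  # rows labelled L1/L2/L3
FULL_SCAN = "488 MB"                            # row labelled DRAM


def dram_to_cache_ratio(sweep, cache_rows, full_row):
    """Full-scan cost relative to the fastest cache-resident cost."""
    best_cached = min(sweep[row] for row in cache_rows)
    return sweep[full_row] / best_cached


def classify_bound(ratio, mixed_at=1.25, memory_at=2.0):
    """Map a ratio to a verdict. Both thresholds are hypothetical."""
    if ratio < mixed_at:
        return "compute"
    if ratio < memory_at:
        return "mixed"
    return "memory"


ratio = dram_to_cache_ratio(SWEEP_NS, CACHE_RESIDENT, FULL_SCAN)
verdict = classify_bound(ratio)
print(f"{ratio:.2f}x -> {verdict}")  # prints "1.56x -> mixed"
```

With these assumed thresholds, the other two ratios quoted in this write-up (1.23× and 1.09×) both come out as "compute", matching their reported verdicts.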
2. Batching finally HELPS — first machine where it does.
| batch | 1 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| QPS | 73.0 | 112.3 | 112.3 | 113.1 | 114.3 | 114.5 |
+57% (73.0 → 114.5 QPS). Nearly all of the gain arrives by batch 4 (112.3), followed by a clean plateau. Because the full-base scan is now memory-bound, amortizing each loaded vector across a query tile pays, which is exactly what did nothing in the compute-bound experiments (003–006). Recall stays at 0.9994 throughout.
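To make "amortizing each loaded vector across a query tile" concrete, here is a small sketch of the two loop orders. It is illustrative only: the real kernel is SIMD code, the function names are hypothetical, and squared L2 is an assumed metric (the notes do not name one). The `loads` counter stands in for base-vector fetches from memory.

```python
import random


def sq_l2(a, b):
    """Squared Euclidean distance (assumed metric)."""
    return sum((x - y) * (x - y) for x, y in zip(a, b))


def scan_unbatched(base, queries):
    """One full pass over the base per query."""
    loads = 0
    nearest = []
    for q in queries:
        best_i, best_d = -1, float("inf")
        for i, v in enumerate(base):
            loads += 1  # base vector fetched for a single query
            d = sq_l2(q, v)
            if d < best_d:
                best_i, best_d = i, d
        nearest.append(best_i)
    return nearest, loads


def scan_tiled(base, queries, tile):
    """One pass over the base per tile of queries."""
    loads = 0
    best_i = [-1] * len(queries)
    best_d = [float("inf")] * len(queries)
    for start in range(0, len(queries), tile):
        stop = min(start + tile, len(queries))
        for i, v in enumerate(base):
            loads += 1  # fetched once, reused by every query in the tile
            for qi in range(start, stop):
                d = sq_l2(queries[qi], v)
                if d < best_d[qi]:
                    best_i[qi], best_d[qi] = i, d
    return best_i, loads


rng = random.Random(7)
dim = 16  # toy dimension, not the experiment's
base = [[rng.random() for _ in range(dim)] for _ in range(200)]
queries = [[rng.random() for _ in range(dim)] for _ in range(8)]

nn_a, loads_a = scan_unbatched(base, queries)
nn_b, loads_b = scan_tiled(base, queries, tile=4)
print(nn_a == nn_b, loads_a, loads_b)  # prints "True 1600 400"
```

The number of distance computations is identical in both versions; only the base-vector fetches drop (by the tile size). That is why tiling can only help when the scan is waiting on memory, and why the gain flattens once compute becomes the limit again.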
This is the re-pricing cascade, demonstrated end to end
- 003–005: batching did nothing → compute-bound (serial reduction).
- 006: fixed the kernel (FMA + accumulators) → ~25–34% faster.
- 007: the faster kernel makes the full scan memory-bound on this box → batching now works (+57%), plateauing at the compute ceiling.
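The levers unlock each other in order: kernel first, then batching. Batching alone did nothing (003–005) and the kernel fix alone bought ~25–34% (006); only with the faster kernel in place does batching add its +57%, and together they break the ceiling. The bound detector earned its keep: it flagged MIXED, and batching independently helped (the two methods agree, again).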
The 512-bit question, settled across two Intel gens
Default build (LLVM prefer-256-bit) → 256-bit (ymm) FMA → 73 QPS / 13.5 ns. Forced 512-bit (zmm) → 52.9 QPS / 18.9 ns, about 27% lower QPS (CV 0.9%). So even on Granite Rapids (minimal downclock, VBMI), forcing 512-bit loses, just less catastrophically than Cascade Lake's −32%. LLVM's 256-bit choice is right on both Intel generations. Width is not the lever.
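The two ways of quoting the forced-512 penalty can be checked against each other from the figures above. The sketch below is plain arithmetic on those four numbers; the helper names are hypothetical, and the "implied distances per query" cross-check assumes QPS and ns/dist describe the same single-threaded full scan, which the notes do not state explicitly.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def distances_per_query(qps, ns_per_dist):
    """Distances per query implied by a QPS and a per-distance cost.

    Assumption: one thread spends its whole second computing distances.
    """
    return 1e9 / (qps * ns_per_dist)


ymm = {"qps": 73.0, "ns": 13.5}  # default 256-bit build
zmm = {"qps": 52.9, "ns": 18.9}  # forced 512-bit build

qps_delta = pct_change(ymm["qps"], zmm["qps"])
ns_ratio = zmm["ns"] / ymm["ns"]
n_ymm = distances_per_query(ymm["qps"], ymm["ns"])
n_zmm = distances_per_query(zmm["qps"], zmm["ns"])

print(f"QPS change: {qps_delta:.1f}%")
print(f"ns/dist ratio: {ns_ratio:.2f}x")
print(f"implied distances/query: {n_ymm:,.0f} vs {n_zmm:,.0f}")
```

Under that assumption both builds imply roughly the same number of distances per query (within about 1.5%), so the throughput drop and the per-distance slowdown are two views of one measurement rather than independent evidence.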
Cross-machine summary (FMA kernel)
| machine | widest SIMD used | ns/dist cache→DRAM | bound | batching | forced-512 |
|---|---|---|---|---|---|
| M3 | NEON-128 | 5.9 → 7.2 (1.23×) | compute (near wall) | none | n/a |
| Cascade Lake | 256-bit (ymm) | ~15 → 16 (1.09×) | compute | none | −32% |
| Granite Rapids | 256-bit (ymm) | 8.6 → 13.5 (1.56×) | MIXED/memory | +57% | −27% |
The progression is the whole story: as cores get fast relative to memory, you move from compute-bound (batching useless) toward memory-bound (batching pays). Granite Rapids got there first because it pairs fast cores with a modest VM memory-bandwidth slice.
Caveats
- Virtualized (no PMU) → bound via the software working-set detector (005), not hardware top-down. The detector's MIXED verdict is corroborated by the independent batching gain.
- This is an 8-vCPU slice of a Granite Rapids socket; a full socket has far more memory bandwidth and would flip to memory-bound later (it would take a faster kernel or more cores to outrun it). The relative result (kernel→memory→batching) is what carries over.
- Once memory-bound, the next lever is fewer bytes per distance (quantization) — still parked, but now the data shows exactly why it's next.
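As a rough illustration of "fewer bytes per distance", the sketch below compares the footprint of one f32 vector with an int8 scalar-quantized copy. Nothing here comes from the experiment: the dimension of 128, the symmetric per-vector scale, and the function names are assumptions, and scalar quantization is only one of several possible schemes.

```python
from array import array


def quantize_i8(vec, scale):
    """Round vec[i] / scale to the nearest integer, clamped to [-127, 127]."""
    return array("b", (max(-127, min(127, round(x / scale))) for x in vec))


def sq_l2_i8(a, b, scale):
    """Squared L2 on int8 codes, rescaled back to the original units."""
    acc = sum((int(x) - int(y)) ** 2 for x, y in zip(a, b))
    return acc * scale * scale


DIM = 128  # hypothetical dimension
vec_f32 = array("f", (i / DIM for i in range(DIM)))
scale = max(abs(x) for x in vec_f32) / 127.0  # symmetric per-vector scale
vec_i8 = quantize_i8(vec_f32, scale)

bytes_f32 = vec_f32.itemsize * len(vec_f32)
bytes_i8 = vec_i8.itemsize * len(vec_i8)
print(bytes_f32, bytes_i8, bytes_f32 / bytes_i8)  # prints "512 128 4.0"
```

Each distance then pulls a quarter of the bytes from memory, which is the quantity a memory-bound scan is limited by. The cost is quantization error, so recall would need to be re-measured.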