Experiment 008
int8 quantization: a bandwidth win (so it depends on the machine)
Perf records: 008-granite-i8.json, 008-granite-f32.json.
Dataset: Cohere v3 Wikipedia, 1,000,000 × 1024-d, real floats, cosine (unit-normalized → L2 == cosine). Measured on the Granite Rapids box (memory-bound) and the M3 (compute-bound).
What we did
First real lossy quantization (SIFT was uint8 → lossless; Cohere floats are genuinely lossy). --quant i8: symmetric scalar int8, global scale, dot-product ranking. No rerank yet.
Result — same code, opposite outcome by machine
| machine | bound | f32 QPS | int8 QPS | int8/f32 | f32 ns/dist | int8 ns/dist | int8 recall | int8 mem |
|---|---|---|---|---|---|---|---|---|
| M3 | compute-bound | 15.4 | 10.5 | 0.68× (slower) | 64.8 | 95.2 | 0.9815 | 977 MB |
| Granite | memory-bound | 9.5 | 19.5 | 2.05× (faster) | 105.6 | 51.3 | 0.9830 | 977 MB |
(f32 index 3906 MB both; recall 1.0 for f32. CV ≤0.5%.)
Conclusions
-
int8's speedup is a bandwidth win — it only appears where bandwidth binds. The exact same binary is 0.68× (slower) on the compute-bound M3 and 2.05× (faster) on the memory-bound Granite box. Streaming 4× fewer bytes only helps when bytes are the constraint. This is the cleanest demonstration yet of "measure on the machine whose bottleneck matches the optimization" — and why running this on the M3 alone would have given the wrong conclusion ("quantization is useless").
-
Quality cost is tiny: recall@10 ≈ 0.983, no rerank. Cohere v3's compression-aware training showing — int8 barely moves the rankings. Rerank would push it back toward 1.0.
-
Memory drops 4× everywhere (3906 → 977 MB) — that's bandwidth-independent, true on both machines.
-
Only ~2× faster, not 4× (from 4× fewer bytes) — the re-pricing cascade again. Quantization eased the memory wall, so compute starts to re-bind (int8 ns/dist 51 is ~half f32's, not a quarter). Two reasons it's not 4×: (a) the int8 kernel is still naive —
i8→i32widen + multiply, not the hardware int-dot instructions (NEONSDOT/UDOT, x86VNNI); (b) at 977 MB you're partway back to compute-bound. A VNNI/DotProd kernel should push the speedup further toward 4×. -
"Which is better" — answered, with a caveat: on a memory-bound serving box, int8 is strictly better (2× faster, 4× smaller, ~same recall). On a compute-bound box with high bandwidth, f32 is faster. Know your bound before quantizing.
Caveats
- Naive int8 kernel (no SDOT/VNNI) — leaves speed on the table; the next lever to reach ~4×.
- No rerank — recall is 0.983, not the ~1.0 a two-stage funnel would give. Rerank + the binary (32×) path are the next entries.
- Virtualized box → no PMU; bound inferred from the working-set detector + this cross-machine flip.