Experiment 037
bf16 rerank store: not a Pareto win
Perf record: 037-bf16-rerank.json.
Cohere v3, 1M × 1024, cosine, Granite box. --quant binary --rotate 2 --batch 16 --rerank C [--rerank-store bf16].
The hypothesis
036 pointed at the tiled funnel's rerank being random-gather bandwidth-bound: C cache-cold 4 KB f32 rows per query. So store the rerank docs as bf16 (top 16 bits of f32 — kept exponent, 7-bit mantissa), halving the gather bytes, query left full precision (asymmetric). Expectation: QPS up, recall ~unchanged.
Result — no speed, and recall drops
| C | store | recall@10 | QPS | p50 |
|---|---|---|---|---|
| 1000 | f32 | 0.9960 | 849 | 11.10 ms |
| 1000 | bf16 | 0.9918 | 862 | 11.29 ms |
| 2000 | f32 | 0.9990 | 735 | 12.63 ms |
| 2000 | bf16 | 0.9945 | 728 | 13.34 ms |
| 4000 | f32 | 0.9998 | 580 | 15.47 ms |
| 4000 | bf16 | 0.9953 | 560 | 15.45 ms |
QPS is flat-to-worse (+1.4% / −1% / −3.4%) and recall drops ~0.4 pts. Net negative.
Why it didn't work
- The bytes saved aren't free to use. Each bf16 element needs a decode
(
<<16→ f32) before the subtract — that per-element compute offsets the halved memory traffic, especially since after rotation+tiling the rerank kernel is already efficient. - Rerank isn't the whole cost. Even in the tiled funnel the binary scan is a comparable or larger share, so halving part of the rerank's bytes moves the total only single digits — and the decode eats that.
- 7-bit mantissa costs recall (~0.4 pts) — a real sacrifice, which the goal forbids.
Takeaway
Two routes now exhausted (036 fewer-flops, 037 fewer-bytes): the binary funnel's rerank can't be made lossless-faster from this angle. Combined with the scan being at the popcount/bandwidth floor (012/016/023/024), the funnel is effectively Pareto-optimal — the only remaining no-sacrifice lever is exact-equivalent tile tuning (038). bf16 remains useful purely as a memory option (rerank store 3.9 → 1.95 GB) when RAM, not recall, is the constraint.