Experiment 036
Tiling + ADSampling rerank: not a Pareto win
Perf record: 036-tiling-plus-adsampling-rerank.json.
Cohere v3, 1M × 1024, cosine, Granite box. --quant binads --batch 16 vs --quant binary --rotate 2 --batch 16.
The hypothesis
Tiling (016) and ADSampling rerank (034) are orthogonal stages, so stacking them
should lift QPS at no recall cost — the clean Pareto improvement the goal asks for.
Built knn_binary_funnel_tiled_ads (tiled scan + ADSampling-pruned rerank) and
compared to tiled + plain rerank, both rotated, at high C.
Result — recall held, but QPS drops 5–8%
| C | rerank | recall@10 | QPS | p50 |
|---|---|---|---|---|
| 2000 | plain | 0.9990 | 735 | 12.49 ms |
| 2000 | ADSampling eps0=3.0 | 0.9990 | 687 | 12.46 ms |
| 4000 | plain | 0.9998 | 579 | 15.69 ms |
| 4000 | ADSampling eps0=3.0 | 0.9998 | 532 | 14.58 ms |
Recall is identical and latency is marginally better, but batch QPS is lower — so it sacrifices QPS. Not a Pareto win.
Why it flipped (vs 034, where ADS rerank won untiled)
The 023 lesson, one layer up. ADSampling rerank only helped in 034 because that was
untiled — the scan still dominated, so a slightly-slower-but-pruning rerank was
hidden. Once tiling makes the scan cheap, the rerank becomes the dominant cost,
and now the rerank's own efficiency matters: plain l2_sq is a tight,
fully-autovectorized FMA loop (006), while ADSampling reranks in 32-dim batches with
an early-exit branch that defeats that vectorization. The flops saved by pruning
don't repay the lost SIMD throughput. Same trap as the register-tiled kernel (023).
The redirect
This rules out fewer flops as the rerank lever and pinpoints the real one: in the tiled funnel the rerank is random-gather bandwidth-bound — C cache-cold 4 KB f32 rows per query. To improve QPS losslessly, cut the gather bytes, not the flops → a narrower rerank store (bf16). Tested in 037.
Caveats
eps0=2.5recovers some (705 @ C=2000) but still below plain; lossless throughout.- tile=16, reps=4, CV < 0.5%.