Experiment 018
Combined recall↔QPS Pareto (tiling × prefix × C)
Perf record: 018-combined-pareto.json.
Cohere v3, 1M × 1024, cosine, Granite box. --quant binary --batch 16 --scan-bits N --rerank C.
What we did
Compose the two batch levers — tiling (016, free) and prefix truncation (014, recall-for-bandwidth) — and sweep rerank C, to draw the actual recall-vs-QPS frontier and pick the best operating point at each recall target. All runs at tile=16.
The frontier
| scan-bits | C | recall@10 | QPS | on Pareto |
|---|---|---|---|---|
| 1024 | 2000 | 0.9975 | 731 | ● |
| 1024 | 1000 | 0.9931 | 850 | ● |
| 768 | 2000 | 0.9864 | 762 | |
| 768 | 1000 | 0.9703 | 893 | ● |
| 768 | 4000 | 0.9947 | 588 | |
| 512 | 2000 | 0.9303 | 829 | ● |
| 512 | 4000 | 0.9601 | 626 | |
| 384 | 4000 | 0.8999 | 599 |
Conclusions
- Tiling is the universal lever; prefix is a low-recall lever. At high recall (≳0.97) the full 1024-bit scan + tiling dominates — 0.993 @ 850, 0.9975 @ 731. Truncation only wins below ~0.97 (768/C=1000 → 0.970 @ 893), because at high recall you must crank C so hard to recover the lost bits that the prefix's bandwidth saving is eaten by extra rerank. So: always tile; truncate only when you're willing to live under ~0.97 recall for more QPS.
- The whole effort beats the 009 baseline on both axes. Original 009 was 0.994 @ 585 QPS. With tiling we now hit 0.9975 @ 731 (higher recall and +25% QPS) or 0.993 @ 850 (+45% QPS at the same recall tier). Recommended default: tile=16, full bits, C=1000 → 0.993 @ 850.
- Higher C only buys recall at high bits. 768/C=4000 reaches 0.9947 but at 588 QPS — dominated by 1024/C=1000 (0.993 @ 850) and 1024/C=2000 (0.9975 @ 731). Over-retrieving on a truncated index is the worst of both.
Caveats
- Batch path (rayon over queries) — these are throughput numbers; single-request serving latency is the separate axis (017, and the prefix latency knob next).
- One back-to-back run, CV < 0.3%; spot-level absolutes drift between entries.