Experiment 016
Tiled binary scan: +73% QPS at identical recall
Perf record: 016-tiled-binary-scan.json.
Cohere v3, 1M × 1024, cosine, Granite box. --quant binary --batch T.
The idea
014/015 nailed down that the binary scan is the bottleneck and it's bandwidth- bound: each query independently streams the whole 122 MB binary corpus. So do what 007 did for f32 — tile the queries: load each doc's 128-bit code once and compare it against a tile of T queries before moving to the next doc. The base streams once per tile instead of once per query, cutting base bandwidth ~T×. The result is exact-equivalent (same candidates, just reordered work).
Result — large throughput win, recall untouched
| tile | scan-bits | recall@10 | QPS |
|---|---|---|---|
| 1 | 1024 | 0.9931 | 485 |
| 4 | 1024 | 0.9931 | 797 |
| 8 | 1024 | 0.9931 | 838 (+73%) |
| 16 | 1024 | 0.9931 | 850 |
| 32 | 1024 | 0.9931 | 858 (+77%) |
| 8 | 512 | 0.8842 | 958 |
| 16 | 512 | 0.8842 | 984 |
Conclusions
- Tiling is a pure win here: +73% QPS at tile=8, recall bit-identical. This is the single biggest throughput gain of the funnel work, and unlike prefix truncation (014) it costs nothing in recall — it only reorders computation to reuse each doc across T queries.
- The gain saturates ~tile=16–32. Once base bandwidth is amortized, the scan re-prices back toward compute-bound (the popcount work, which tiling doesn't reduce — it's still T×N hammings). The knee at tile=8 captures most of it; past tile=16 returns are small. This is the 003→007 re-pricing cascade again, now on the binary kernel: kill the bandwidth wall and the compute wall reappears.
- It composes with prefix truncation (014). tile=16 + 512-bit → 984 QPS (at the prefix's lower recall). The two levers stack: tiling cuts re-reads of the base, prefix cuts the size of the base.
Notes
- Applies to the batch/throughput path only — a single-query request (the serving model) can't tile, so this raises batch QPS, not single-query latency. The serving entry measures the single-query side.
--batchdefault is 1 (per-query); tiling is opt-in. Heap selection + f32 rerank; the tiled path already respects--scan-bits(bbase is prefix-packed).- recall reads 0.9931 here (2000-query slice) vs 0.9943 elsewhere (1000-query) — different query subset, same ballpark.