notes

Experiment 016

Tiled binary scan: +73% QPS at identical recall

Perf record: 016-tiled-binary-scan.json. Cohere v3, 1M × 1024, cosine, Granite box. --quant binary --batch T.

The idea

014/015 nailed down that the binary scan is the bottleneck and that it's bandwidth-bound: each query independently streams the whole 122 MB binary corpus. So do what 007 did for f32 and tile the queries: load each doc's 128-byte (1024-bit) code once and compare it against a tile of T queries before moving to the next doc. The base streams once per tile instead of once per query, cutting base bandwidth ~T×. The result is exact-equivalent (same candidates, just reordered work).
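The loop reordering can be sketched in a few lines. This is an illustrative pure-Python model, not the kernel measured here: codes are held as Python ints, and the names `scan_per_query`, `scan_tiled`, and `demo` are hypothetical. It only shows that swapping the loop nest leaves the top-k candidates unchanged.

```python
import heapq
import random


def hamming(a: int, b: int) -> int:
    # Popcount of the XOR; codes are plain Python ints in this sketch.
    return bin(a ^ b).count("1")


def scan_per_query(docs, queries, k):
    """Baseline: every query streams the whole corpus on its own."""
    results = []
    for q in queries:
        scored = [(hamming(d, q), i) for i, d in enumerate(docs)]
        results.append(heapq.nsmallest(k, scored))
    return results


def scan_tiled(docs, queries, k, tile):
    """Tiled: each doc code is read once per tile of queries."""
    results = [None] * len(queries)
    for start in range(0, len(queries), tile):
        block = queries[start:start + tile]
        scored = [[] for _ in block]
        for i, d in enumerate(docs):        # doc code loaded once ...
            for j, q in enumerate(block):   # ... and reused for each query in the tile
                scored[j].append((hamming(d, q), i))
        for j in range(len(block)):
            results[start + j] = heapq.nsmallest(k, scored[j])
    return results


def demo(n_docs=500, n_queries=10, bits=1024, k=10, tile=8, seed=0):
    # Random codes stand in for the real corpus (an assumption of this sketch).
    rng = random.Random(seed)
    docs = [rng.getrandbits(bits) for _ in range(n_docs)]
    queries = [rng.getrandbits(bits) for _ in range(n_queries)]
    # True when both loop orders return the same (distance, doc_id) lists.
    return scan_per_query(docs, queries, k) == scan_tiled(docs, queries, k, tile)
```

Both variants perform the same T×N Hamming computations; only the order changes, so a tile size that does not divide the query count simply leaves a shorter final tile.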

Result — large throughput win, recall untouched

tile   scan-bits   recall@10   QPS
1      1024        0.9931      485
4      1024        0.9931      797
8      1024        0.9931      838 (+73%)
16     1024        0.9931      850
32     1024        0.9931      858 (+77%)
8      512         0.8842      958
16     512         0.8842      984
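As a quick check on the quoted percentages, the gains relative to tile=1 can be recomputed from the 1024-bit rows of the table. The row values are the table's own; the names `ROWS` and `gain_pct` are just for this sketch.

```python
# (tile, scan_bits, recall_at_10, qps), copied from the 1024-bit rows of the table.
ROWS = [
    (1, 1024, 0.9931, 485),
    (4, 1024, 0.9931, 797),
    (8, 1024, 0.9931, 838),
    (16, 1024, 0.9931, 850),
    (32, 1024, 0.9931, 858),
]


def gain_pct(qps, baseline):
    # Relative QPS gain over the baseline, rounded to a whole percent.
    return round(100.0 * (qps / baseline - 1.0))


baseline_qps = ROWS[0][3]  # tile=1 is the untiled reference
gains = {tile: gain_pct(qps, baseline_qps) for tile, _, _, qps in ROWS}
```

With these rows, `gains[8]` and `gains[32]` reproduce the +73% and +77% annotations, and tile=4 alone already accounts for roughly +64%.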

Conclusions

  1. Tiling is a pure win here: +73% QPS at tile=8, recall bit-identical. This is the single biggest throughput gain of the funnel work, and unlike prefix truncation (014) it costs nothing in recall — it only reorders computation to reuse each doc across T queries.
  2. The gain saturates around tile=16–32: going tile=8 → 16 → 32 moves QPS only 838 → 850 → 858. Once base bandwidth is amortized, the scan re-prices back toward compute-bound (the popcount work, which tiling doesn't reduce: it's still T×N Hamming distances). The knee at tile=8 captures most of the gain; past tile=16 returns are small. This is the 003→007 re-pricing cascade again, now on the binary kernel: kill the bandwidth wall and the compute wall reappears.
  3. It composes with prefix truncation (014). tile=16 + 512-bit → 984 QPS, at the prefix's lower recall (recall@10 0.8842 vs 0.9931). The two levers stack: tiling cuts re-reads of the base, prefix cuts the size of the base.
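The way the two levers stack can be put into a back-of-envelope traffic model. This is a rough sketch under stated assumptions: exactly 1,000,000 docs, codes packed at `scan_bits / 8` bytes with no padding or metadata, and perfect reuse within a tile. The function names are hypothetical.

```python
def base_bytes_per_query(n_docs: int, scan_bits: int, tile: int) -> float:
    # Bytes of the base streamed from memory, amortized over the queries in a tile.
    return n_docs * (scan_bits // 8) / tile


def hammings_per_query(n_docs: int) -> int:
    # Tiling does not change the compute: one Hamming distance per doc per query.
    return n_docs


N = 1_000_000  # assumed exact corpus size
untiled = base_bytes_per_query(N, 1024, 1)         # 128,000,000 bytes, about 122 MiB
tiled = base_bytes_per_query(N, 1024, 8)           # tiling divides the re-reads by 8
tiled_prefix = base_bytes_per_query(N, 512, 16)    # prefix halves the code, tile=16 amortizes it
```

Under this model the base traffic per query drops from 128 MB to 16 MB at tile=8 and to 4 MB at tile=16 with a 512-bit prefix, while the popcount work per query stays at N. That is consistent with the measured QPS flattening well before the modelled traffic does.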

Notes