notes

Experiment 013

Counting selection for the binary top-C (a latency/throughput tradeoff)

Perf record: 013-counting-selection.json. Cohere v3, 1M × 1024, cosine, Granite box. --quant binary --select heap|count.

The idea

009's binary scan keeps the top-C candidates with a bounded max-heap — O(n log C) with a branchy compare-and-maybe-sift on every doc. But Hamming distance is a small bounded integer (∈ [0, dim]), so selection doesn't need comparisons at all: histogram the distances (O(n), branch-free), find the threshold bucket holding the C-th smallest, and collect every id at or below it. New --select count picks this; --select heap is the original.

Candidate order isn't preserved, which is fine — the rerank stage rescores and re-sorts anyway.

Result — lower latency, lower batch throughput

Cselectrecall@10batch QPSp50 (single-query)
0heap0.468158713.87 ms
0count0.468156112.61 ms
200heap0.941958414.33 ms
200count0.941954912.85 ms
1000heap0.994355015.32 ms
1000count0.994351113.16 ms (−14%)

Recall is identical (same candidate set up to ties). Counting cuts single-query latency ~9–14%, but loses ~5–7% batch QPS.

Why the split — and why counting still matters for serving

The two passes measure different things:

This matters because the serving model (002) is one single-threaded search per request, where server throughput = cores ÷ per-query latency. By that measure counting is the win: 15.3 → 13.2 ms/query is ~16% more server throughput per core, even though the all-cores batch number drops. The batch benchmark and the serving benchmark reward opposite choices here.

Conclusions

  1. Kept both; default stays heap. Heap maximizes the batch-QPS headline. --select count is the choice for latency-bound / serving workloads (revisited in the serving entry).
  2. Selection was never the scan's main cost — the popcount scan dominates, so swapping the selector moves things only single-digit percent. The bigger levers are the scan's bandwidth and the rerank tier (next entries).

Caveats