Experiment 021
Intra-query parallelism: 4.4× lower single-query latency
Perf record: 021-intra-query-parallel.json.
Cohere v3, 1M × 1024, cosine, Granite box (8 vCPU). --query-threads N.
The idea
Single-query latency (~13 ms) is one core scanning all 1M docs. But the scan is embarrassingly parallel — split the doc range across N rayon shards, each keeps a local top-C, then merge. Recall is unchanged (the global top-C is contained in the union of the shards' top-Cs). This is the opposite knob from the serving model (002), which keeps each query single-threaded to maximize throughput; here we spend all cores on one request to minimize its latency.
Result — near-halving per doubling, recall flat
| query-threads | recall@10 | p50 | p95 | p99 |
|---|---|---|---|---|
| 1 | 0.983 | 13.08 ms | 13.32 | 13.56 |
| 2 | 0.983 | 7.87 ms | 8.14 | 8.27 |
| 4 | 0.983 | 5.28 ms | 5.63 | 5.92 |
| 8 | 0.983 | 2.99 ms | 4.41 | 4.89 |
Conclusions
- 4.4× lower single-query latency (13 → 3 ms) at identical recall. A clean latency win for the case where one query at a time matters (interactive, low-QPS, or tail-latency-critical).
- Sub-linear scaling, for the expected reason. 8× the cores gives 4.4×, not 8×: the scan is bandwidth-bound, so 8 cores hammering one query saturate the memory bus before the compute scales out — plus the merge and the single-threaded rerank don't parallelize. The same memory wall that tiling exploits (016) caps this.
- It's a trade, not a free win. One request now occupies all cores, so aggregate throughput falls to ~1 query at a time. This is the dual of the serving model: use it when latency per request beats requests per second.
Caveats
- Latency pass only (
--query-threadsis wired into the single-query path); the batch QPS pass still parallelizes over queries. - recall reads 0.983 here (100-query throughput slice for the recall estimate); the point is it's identical across thread counts — the merge is exact.