Experiment 022
Parallel rerank: trim the latency tail
Perf record: 022-parallel-rerank.json.
Cohere v3, 1M × 1024, cosine, Granite box. --query-threads 8 --rerank-par.
The idea
021 parallelized the scan across cores (13 → 3 ms) but left the rerank serial:
after an 8-shard scan, one core still rescores all C candidates from f32. At C=1000
that's 1000 L2 distance computations plus 1000 random gathers on a single thread.
Parallelize the rerank too (rerank_par: rayon over candidates, then a serial top-k)
and pair it with the 8-shard scan to minimize single-query latency.
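A minimal sketch of the rerank shape described above: rescore candidates in parallel, then run a serial top-k. It is not the code that produced the numbers in this record. To stay dependency-free it uses std::thread::scope with fixed chunks as a stand-in for rayon, and the function signature, the flat row-major `vectors` layout, and the squared-L2 score are all assumptions made for illustration.

```rust
use std::thread;

/// Squared L2 distance between two equal-length f32 vectors.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Rescore `candidates` (row ids into `vectors`) against `query` on up to
/// `threads` workers, then pick the `k` nearest serially.
/// Assumed layout: `vectors` is a flat row-major f32 matrix with `dim` columns.
fn rerank_par(
    query: &[f32],
    vectors: &[f32],
    dim: usize,
    candidates: &[u32],
    k: usize,
    threads: usize,
) -> Vec<(u32, f32)> {
    if candidates.is_empty() {
        return Vec::new();
    }
    // Ceiling division so every candidate lands in exactly one chunk.
    let t = threads.max(1);
    let chunk = (candidates.len() + t - 1) / t;

    // Parallel stage: each worker gathers its rows and scores them.
    let mut scored: Vec<(u32, f32)> = thread::scope(|s| {
        let handles: Vec<_> = candidates
            .chunks(chunk)
            .map(|ids| {
                s.spawn(move || {
                    ids.iter()
                        .map(|&id| {
                            let start = id as usize * dim;
                            (id, l2_sq(query, &vectors[start..start + dim]))
                        })
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("rerank worker panicked"))
            .collect()
    });

    // Serial stage: top-k over the rescored candidates.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Spawning threads per query costs more than handing work to a long-lived pool, which is presumably why the experiment uses rayon; the chunking here only shows where the parallel and serial boundaries sit.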
Result — small p50, bigger tail improvement
| C | rerank | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|
| 1000 | serial | 3.16 | 4.51 | 4.75 |
| 1000 | parallel | 2.87 | 3.87 | 3.98 |
| 2000 | serial | 4.38 | 5.62 | 5.88 |
| 2000 | parallel | 3.80 | 4.65 | 4.90 |
Conclusions
- Parallel rerank shaves p50 by ~9% and p99 by ~16% at C=1000 (~13% and ~17% at C=2000); the p50 gain grows with C, where rerank is a bigger slice of the request. The combined latency stack (8-shard scan + parallel rerank, C=1000) lands at 2.87 ms p50 / 3.98 ms p99, a p50 about 4.5× lower than the 13 ms single-thread baseline.
- The win is mostly the tail. Because the scan still dominates the request, parallelizing rerank moves p50 only a little, but it removes the serial-rerank variance that fattened p99 — useful when tail latency is the SLA.
- Diminishing returns confirm the scan is the floor. Once both stages run on all cores, single-query latency is bounded by the bandwidth-bound scan (021), not the rerank. To go lower you must scan fewer bytes, i.e. prefix (next/capstone), not more parallelism.
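The percentages in these conclusions can be rechecked directly from the result table. The sketch below hard-codes the table's p50 and p99 values plus the 13 ms baseline from 021; the helper names are illustrative, not part of the benchmark harness.

```rust
/// Percent reduction going from `before` to `after` (same units).
fn pct_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// How many times lower `after` is than `before`.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

/// (C, serial p50, parallel p50, serial p99, parallel p99), all in ms,
/// copied from the result table.
const ROWS: [(u32, f64, f64, f64, f64); 2] = [
    (1000, 3.16, 2.87, 4.75, 3.98),
    (2000, 4.38, 3.80, 5.88, 4.90),
];

/// One summary line per C value.
fn summarize() -> Vec<String> {
    ROWS.iter()
        .map(|&(c, s50, p50, s99, p99)| {
            format!(
                "C={c}: p50 -{:.1}%, p99 -{:.1}%",
                pct_reduction(s50, p50),
                pct_reduction(s99, p99)
            )
        })
        .collect()
}
```

This gives roughly 9.2% (p50) and 16.2% (p99) at C=1000, and 13.2% and 16.7% at C=2000, and `speedup(13.0, 2.87)` is about 4.5. The p99 gain is nearly flat across C while the p50 gain grows.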
Caveats
- Latency path only; pairs with --query-threads. Uses all cores per request (the latency-vs-throughput trade from 021).
- reps=1, 300 latency queries; p99 from 300 samples is indicative, not tight.
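To see why a p99 from 300 samples is loose, it helps to count how many samples sit above the percentile cut. The sketch below assumes a nearest-rank percentile; this record does not say which definition the harness uses, so treat the function names and the method as illustrative.

```rust
/// 1-based nearest-rank position for an integer percentile `pct` (1..=100)
/// over `n` samples: ceil(pct * n / 100), computed in integers.
fn nearest_rank(n: usize, pct: usize) -> usize {
    assert!(n > 0 && (1..=100).contains(&pct));
    (pct * n + 99) / 100
}

/// Nearest-rank percentile of `samples` (assumed definition, see lead-in).
fn percentile(samples: &[f64], pct: usize) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    sorted[nearest_rank(sorted.len(), pct) - 1]
}

/// Number of samples strictly above the nearest-rank position.
fn samples_above(n: usize, pct: usize) -> usize {
    n - nearest_rank(n, pct)
}
```

With n=300, `nearest_rank(300, 99)` is 297, so only three samples lie above the p99 cut and a single slow query can shift the reported value. More reps or more latency queries would tighten it.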