Experiment 040

Sharded carousel: lowest latency floor, lower capacity

Perf record: 040-sharded-carousel.json. Cohere v3, 1M × 1024, cosine, Granite box (8 vCPU). carousel --mode sharded.

The fix attempted

039's carousel ran 8 independent full-base carousels, so a revolution was the whole base → a high latency floor (77 ms at low load). Fix: shard the docs across workers (each owns N/8); a query rides all 8 shards in parallel and merges partial heaps. A revolution is now N/8 docs → expect ~8× lower floor.

Result — floor crushed (4.3 ms), but capacity drops

Rate sweep (seats=8):

rate	throughput	p50	p99
100	97	4.30	6.89
200	190	4.56	10.0
400	385	9.24	47.2
600	573	1052 💥	1775
800	751	2761 💥	4515

Seats sweep at overload (rate 800) — more seats does not help:

seats	throughput	p50
8	751	2790 ms
32	751	3001 ms
128	753	3363 ms

Conclusions

Sharding crushes the latency floor: 4.3 ms p50 at low load — lower than the per-worker carousel (77 ms), per-query (12 ms), and even close to the 021 intra-query-parallel floor. Because each query is spread across all 8 cores (revolution = N/8), which is exactly dynamic intra-query parallelism.
But capacity falls to ~550 QPS and then cliffs. Spreading one query over all cores means queries can't run concurrently across cores — they time-share the same 8 workers. At saturation it's compute/popcount-bound, so adding seats only lengthens the revolution (latency ↑ 2790 → 3363 ms) without raising throughput (flat ~751). The per-worker carousel (039) keeps capacity >800 because its 8 workers serve different queries in parallel.
The two designs are the two ends of the latency↔throughput dial:
- sharded = spread a query across cores → minimum latency, lower capacity.
- per-worker carousel = pack queries onto cores → maximum throughput, higher floor. This is the 021 (intra-query) vs 016 (tiling) duality, now expressed as two carousel topologies.

Toward production

Neither dominates: sharded wins at low/moderate load and tight latency SLAs; per-worker carousel wins under heavy/bursty overload. The production answer is adaptive — spread a query across cores when cores are idle (low load → ~4 ms), and pack queries per core as load rises (high load → bounded latency, full throughput). Synthesized in 041.

Caveats

Plain binary + f32 rerank C=1000; architecture study. Saturation latencies are queue-dominated (offered > capacity); the cliff location is the signal.