Experiment 041
Carousel capstone: fan-out is the dial, adapt it to load
Perf record: 041-carousel-fanout-frontier.json.
Cohere v3, 1M × 1024, cosine, Granite box (8 vCPU). carousel --mode grouped --fan F.
The unifying parameter
039 (per-worker carousel) and 040 (fully sharded) are the two ends of one dial: fan-out F — how many workers cooperate on a single query. F workers shard the base (revolution = N/F) and there are G = workers/F independent groups serving different queries. F=1 packs queries onto cores (throughput); F=8 spreads one query across all cores (latency). Sweeping F × load traces the whole frontier.
Frontier — optimal F drops as load rises (p50 ms)
| offered QPS | F=1 pack | F=2 | F=4 | F=8 shard | best |
|---|---|---|---|---|---|
| 200 | 12.5 | 7.6 | 5.1 | 4.5 | F=8 |
| 400 | 12.3 | 8.0 | 6.5 | 8.3 | F=4 |
| 600 | 14.7 | 11.5 | 11.8 | 872 💥 | F=2 |
| 800 | 20.3 | 19.5 | 367 💥 | 2520 💥 | F=1 |
| 1000 | 34.6 | 283 💥 | 1538 💥 | 4196 💥 | F=1 |
Throughput keeps up with offered load until the cliff in each cell.
The result
- The carousel idea works, and fan-out F is its control knob. The optimal F decreases monotonically with load: spread a query across all 8 cores when they're idle (low load → 4.5 ms), and pack queries one-per-core as load rises (high load → bounded latency, full throughput).
- The optimal-F envelope never cliffs: 4.5 ms @ 200 → 6.5 @ 400 → 11.5 @ 600 → 20 @ 800 → 35 @ 1000 QPS. Compare per-query (017): 12 ms flat until ~500 QPS then a cliff to seconds. The adaptive carousel is ~2.7× lower latency at low load AND has no saturation cliff — strictly better across the whole range.
- Production rule (the "right number"): F* ≈ cores ÷ in-flight-queries. Give each in-flight query an equal share of the cores. At 1 query in flight, F=cores (all-core scan, ~4 ms); at 8 in flight, F=1 (one core each, max throughput). This is elastic intra-query parallelism — it unifies 021 (intra-query parallel = high F) and 016 (tiling = F=1) under a single load-adaptive scheduler.
How to build it in production
A coordinator tracks current in-flight count m; each arriving query is dispatched
with fan-out F = clamp(cores / max(1, m), 1, cores) and rides a carousel of F
shards. As m rises the controller hands out smaller F; as it falls, larger F. No
batch-fill wait (queries attach to a moving scan), no empty-seat waste (the scan is
shared by whoever's aboard), and the latency/throughput operating point tracks the
lower envelope automatically. seats/chunk are second-order (chunk = admission
granularity; seats = per-group backpressure cap).
Conclusion
The idea is validated and understood end to end: cooperative scan-sharing with
load-adaptive fan-out delivers low latency under light/bursty load and graceful,
cliff-free degradation under overload — the production-grade serving model for this
engine. The single tuning parameter is fan-out, and its optimum is cores ÷ in-flight.
Caveats
- Plain binary + f32 rerank C=1000; architecture study (recall ~0.89 fixed). The fan-out principle is orthogonal to the recall tier (rotation/tiling still apply within each shard).
- Cliff latencies are queue-dominated (offered > capacity); the envelope is the deliverable. An explicit adaptive controller (F = cores/in-flight) is the next build to realize the envelope directly.