Experiment 039
Carousel (cooperative scan-sharing) under bursty load
Perf record: 039-carousel-scan-sharing.json.
Cohere v3, 1M × 1024, cosine, Granite box (8 vCPU). src/bin/carousel.rs.
The idea (user's)
Fixed query-tiling (016/038) assumes a full tile. Real traffic is bursty, so a tiled server must either wait to fill a tile (adds latency — "the car waits at the station") or scan a partial tile (wastes the 122 MB base read on empty seats). The proposal: keep each worker continuously cycling the base; an arriving query attaches at the current cursor, rides exactly one revolution (sees all N docs), then leaves with its top-C. Every doc read is shared by whoever is aboard — dynamic tiling with tile = current in-flight count, and no waiting.
This is cooperative scan-sharing (Crescando, IBM Blink). Built as a self-contained benchmark with Poisson (bursty) arrivals, comparing the carousel to the per-query serving model (017).
Result — bounded latency, no cliff; higher floor at low load
| offered QPS | per-query p50 / p99 | carousel p50 / p99 |
|---|---|---|
| 100 | 12.5 / 16 | 77 / 172 |
| 200 | 11.5 / 16 | 45 / 104 |
| 400 | 11.8 / 21 | 32 / 63 |
| 600 | 533 / 910 💥 | 30 / 61 |
| 800 | 2045 / 3367 💥 | 35 / 59 |
(Throughput tracks offered rate for both up to ~750; per-query is saturating at 600–800 — the high latency is a growing queue — while the carousel is not.)
Conclusions
- The idea works for its intended case. Above the per-query capacity (~500 QPS) per-query latency collapses (12 → 533 → 2045 ms) as the queue explodes, while the carousel stays flat at ~30–35 ms p50 / ~60 ms p99 through 800 QPS, and sustains more throughput (scan-sharing). For bursty traffic — where bursts transiently exceed capacity — that's the whole game: the carousel keeps latency bounded during the burst; per-query spikes to seconds.
- The tradeoff is a higher latency floor at low load (77 ms vs 12 ms at 100 QPS): a query waits for a full revolution shared with whoever's aboard, and at light/bursty load there's idle-poll + cold-cache overhead. So below capacity, per-query is better on latency.
- The crossover is the design knob. Per-query wins when you're comfortably under capacity; the carousel wins at/above it. That points at an adaptive server (per-query at low load, carousel under burst) and at lowering the carousel's floor.
The obvious fix to tune next
This version runs 8 independent full-base carousels (one per worker), so a "revolution" is the whole base (N docs) → the latency floor is a full scan. Sharding the docs across workers (each owns N/8, a query rides all shards in parallel and merges) makes a revolution N/8 docs → ~8× lower floor, while keeping the sharing. That's 040.
Caveats
- Plain binary scan + f32 rerank C=1000 (no rotation); architecture comparison, so recall (~0.89) isn't the focus here.
- Warmup samples dropped; Poisson arrivals via a seeded LCG; in-process (HTTP adds <0.4 ms per 017).