notes

Experiment 053

Carousel serves the real engine (best codes), and why fan-out must adapt

Perf record: 053-carousel-best-codes.json. Granite box (8 vCPU). src/bin/carousel.rs. Cohere 1M × 1024, grouped fan=4, RAM rerank C=1000.

The serving target was always the point, so the carousel should serve the real engine, not plain binary. Wired rotation×2 + residual into the carousel's code formation as defaults (rerank stays exact on raw f32). Recall is now ~0.998 (identical code path to 051), not 0.89.

Result

offered QPSthroughputp50p99
2001905.0 ms8.4 ms
4003856.5 ms18.5 ms
60057312.6 ms39.9 ms
800751672 ms 💥1325 ms
10009361762 ms 💥2936 ms

Two findings

  1. Rotation+residual compose into the carousel for free. The latency envelope is identical to plain-binary (041) — now at recall ~0.998 instead of 0.89. They're zero-scan-cost code-formation knobs, fully orthogonal to the serving layer. So the carousel now serves the real engine at no latency cost.
  2. A fixed fan cliffs. fan=4 is great to ~600 QPS (12.6 ms) then collapses at 800 (672 ms) — matching 041, where the optimal fan drops as load rises (spread a query across cores when idle; pack one-per-core under load). A single fixed fan cannot cover the load range without a cliff.

Conclusion

The carousel is now the default best-codes serving path (recall 0.998, ~5–13 ms to ~600 QPS). But the adaptive fan-out controller (F = cores ÷ in-flight, set live) is required, not optional — it's the piece that turns the per-load optimal envelope (no cliff to capacity) into something a single running server achieves automatically. That controller is the last unbuilt piece.

Update — the default (fan=1) is already a no-cliff server

fan=1 (pack), best codes, recall ~0.998:

offered QPSthroughputp50p99
20019012.1 ms16.0 ms
60057315.1 ms31.1 ms
100093749.5 ms78.4 ms
12001115542 ms 💥past capacity

No cliff to ~937 QPS (the 8-core compute capacity); it only breaks at 1200, which is genuinely past what the box can do (a cores problem, not a scheduler one).

Revised conclusion: carousel + best codes IS the default serving engine — recall ~0.998, no-cliff to capacity, 12–49 ms p50, fan=1 the safe default. fan>1 lowers low-load latency (200 QPS: ~4.5 ms at fan=8 vs 12 ms at fan=1) but cliffs under heavy load. The adaptive fan-out controller (F = cores ÷ in-flight) would get the best of both — but it's a low-load latency optimization, not a correctness need.