Experiment 051
The stacked engine: residual wired in, the full frontier
Perf record: 051-stacked-engine.json. Granite box
(Xeon 6975P-C, 8 vCPU). Cohere v3 1M × 1024, in-RAM, reps=5. Stacks every composable
in-RAM win into the main funnel: binary + rotation×2 + residual + tiling (tile=8),
swept over scan-bits and rerank width C. Residual is now wired into vsearch
(--residual); it was previously only in the cell.rs experiment.
Result — the stacked frontier (recall@10 / QPS / p50)
| id | bits | C | residual | recall | QPS | p50 |
|---|---|---|---|---|---|---|
| A | 1024 | 1000 | off | 0.9960 | 882 | 10.6 ms |
| B | 1024 | 1000 | on | 0.9986 | 883 | 10.6 ms |
| I | 1024 | 500 | on | 0.9952 | 963 | 8.0 ms |
| C | 768 | 1000 | on | 0.9936 | 819 | 7.3 ms |
| G | 768 | 500 | on | 0.9816 | 884 | 4.4 ms |
| D | 512 | 1000 | on | 0.9637 | 922 | 4.2 ms |
| H | 512 | 500 | on | 0.9296 | 978 | 3.6 ms |
| E | 512 | 1000 | off | 0.9297 | 908 | 4.3 ms |
Findings
- Residual is a free recall win at full bits. B vs A: 0.9960 → 0.9986 at identical QPS and latency — it only changes how the stage-1 codes are formed. This pushes the whole frontier past the prior best (038: 0.9968 @ 922). New max-recall point: 0.9986 @ 883 QPS.
- At low bits residual matters far more. D vs E (512-bit): 0.9297 → 0.9637 (+3.4 pts) — confirms 046 in the live engine (centering stops sign-bits from being wasted on the shared DC direction).
- Fewer bits is a LATENCY lever, not (much) a QPS lever — correcting the roofline projection. I expected 512-bit → ~1.8× QPS; instead batch QPS barely moved (882 → 922) while p50 latency more than halved (10.6 → 4.2 ms). Why: the tiled batch already amortizes the scan across the tile, and the C-rerank is a real fixed cost — so the batch isn't purely scan-bandwidth-bound. But a single query (latency) IS scan-bound, so halving the code bytes halves it. Separately, dropping C 1000→500 alone gives +9% QPS (B→I) — so C is the QPS lever, bits is the latency lever.
The best stacked operating points
- Max recall: B — 0.9986 @ 883 QPS, 10.6 ms (residual free over 038).
- Best balanced: I — 0.9952 @ 963 QPS, 8.0 ms — beats 038 (0.9968 @ 922) on both QPS and latency at ~equal recall.
- Low latency, high recall: C — 0.9936 @ 819 QPS, 7.3 ms; or G — 0.982 @ 884 QPS, 4.4 ms (58% lower latency than A).
So the stacked engine moves the frontier out on every axis vs the prior best, and opens a low-latency regime (sub-5 ms at ~0.98 recall) that didn't exist before.
What's NOT in these numbers (orthogonal add-ons)
- Adaptive-C (048): marginal in-RAM (rerank is a small batch slice), so left out of the headline; its payoff is on disk, where each rerank = an SSD read.
- Carousel (039–041): the serving layer — stacks on top for bounded tail latency under burst, no throughput change.
- Disk hybrid (045): 32× less RAM (codes in RAM, f32 on SSD) — same warm latency.
Caveats
- In-RAM, single box, 8 cores, one dataset (Cohere 1024). Recall vs the file's GT.
- "Cell = full 1M base"; a real k-means IVF cell is tighter around its centroid, so residual should help at least as much.
- 768 bits isn't a power of two but is fine — it's a prefix of the rotated (full-1024) code.