
Experiment 051

The stacked engine: residual wired in, the full frontier

Perf record: 051-stacked-engine.json. Granite box (Xeon 6975P-C, 8 vCPU). Cohere v3 1M × 1024, in-RAM, reps=5. Stacks every composable in-RAM win into the main funnel: binary + rotation×2 + residual + tiling (tile=8), swept over scan-bits and rerank width C. Residual is now wired into vsearch (--residual); it was previously only in the cell.rs experiment.
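For orientation, a minimal sketch of the two-stage funnel these runs exercise: sign-binarize each vector (optionally after subtracting a centroid, which is roughly what the residual step amounts to), Hamming-scan the codes, keep the closest C, then rerank those exactly. This is not the vsearch code. Every name below (encode, hamming, search) is hypothetical, a single global centroid and a dot-product rerank are assumptions, and rotation and tiling are left out.

```rust
// Hypothetical sketch: names and structure are illustrative, not taken from vsearch or cell.rs.

/// Sign-binarize a vector into 64-bit words. If a centroid is given it is
/// subtracted first (the assumed "residual" step). Bit i is set when the
/// (centered) component i is positive.
fn encode(v: &[f32], centroid: Option<&[f32]>) -> Vec<u64> {
    let mut code = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        let centered = match centroid {
            Some(c) => x - c[i],
            None => x,
        };
        if centered > 0.0 {
            code[i / 64] |= 1u64 << (i % 64);
        }
    }
    code
}

/// Hamming distance between two equal-length codes.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Exact score used for the rerank (dot product is an assumption here).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Stage 1: Hamming-scan every code and keep the `c` closest ids.
/// Stage 2: rerank those `c` candidates by exact score and return the top `k` ids.
fn search(
    query: &[f32],
    data: &[Vec<f32>],
    codes: &[Vec<u64>],
    centroid: Option<&[f32]>,
    c: usize,
    k: usize,
) -> Vec<usize> {
    let q = encode(query, centroid);
    let mut cand: Vec<(u32, usize)> = codes
        .iter()
        .enumerate()
        .map(|(id, code)| (hamming(&q, code), id))
        .collect();
    cand.sort(); // ascending Hamming distance, ties broken by id
    cand.truncate(c);

    let mut scored: Vec<(f32, usize)> = cand
        .iter()
        .map(|&(_, id)| (dot(query, &data[id]), id))
        .collect();
    // Descending exact score, ties broken by id.
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap().then(a.1.cmp(&b.1)));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}
```

In a toy case where every component sits near a large shared offset (say all values around 10), the uncentered codes collapse to all-ones and carry no information, while the centered codes stay distinct. In this sketch, scan-bits corresponds to the code width and C to the `c` argument.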

Result — the stacked frontier (recall@10 / QPS / p50)

id  bits  C     residual  recall  QPS  p50
A   1024  1000  off       0.9960  882  10.6 ms
B   1024  1000  on        0.9986  883  10.6 ms
I   1024  500   on        0.9952  963  8.0 ms
C   768   1000  on        0.9936  819  7.3 ms
G   768   500   on        0.9816  884  4.4 ms
D   512   1000  on        0.9637  922  4.2 ms
H   512   500   on        0.9296  978  3.6 ms
E   512   1000  off       0.9297  908  4.3 ms
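One way to read "frontier" off these rows, assuming it means the Pareto set over (recall up, QPS up, p50 down): a row is on the frontier if no other row is at least as good on all three axes and strictly better on one. The sketch below hard-codes the eight measured rows; the type and function names are made up for illustration.

```rust
/// One measured row: id, recall@10, QPS, p50 latency in ms.
#[derive(Debug, Clone, Copy)]
struct Point {
    id: char,
    recall: f64,
    qps: f64,
    p50_ms: f64,
}

/// The eight rows of this experiment (bits, C and residual columns omitted).
fn table() -> Vec<Point> {
    vec![
        Point { id: 'A', recall: 0.9960, qps: 882.0, p50_ms: 10.6 },
        Point { id: 'B', recall: 0.9986, qps: 883.0, p50_ms: 10.6 },
        Point { id: 'I', recall: 0.9952, qps: 963.0, p50_ms: 8.0 },
        Point { id: 'C', recall: 0.9936, qps: 819.0, p50_ms: 7.3 },
        Point { id: 'G', recall: 0.9816, qps: 884.0, p50_ms: 4.4 },
        Point { id: 'D', recall: 0.9637, qps: 922.0, p50_ms: 4.2 },
        Point { id: 'H', recall: 0.9296, qps: 978.0, p50_ms: 3.6 },
        Point { id: 'E', recall: 0.9297, qps: 908.0, p50_ms: 4.3 },
    ]
}

/// `a` dominates `b` if it is no worse on every axis and strictly better on at least one.
fn dominates(a: &Point, b: &Point) -> bool {
    let no_worse = a.recall >= b.recall && a.qps >= b.qps && a.p50_ms <= b.p50_ms;
    let better = a.recall > b.recall || a.qps > b.qps || a.p50_ms < b.p50_ms;
    no_worse && better
}

/// Ids of the non-dominated rows, in table order.
fn frontier(points: &[Point]) -> Vec<char> {
    points
        .iter()
        .filter(|p| !points.iter().any(|q| dominates(q, p)))
        .map(|p| p.id)
        .collect()
}
```

Under that definition, `frontier(&table())` keeps B, I, C, G, D and H. The two dropped rows are exactly the residual-off runs: A is dominated by B, and E by D.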

Findings

  1. Residual is a free recall win at full bits. B vs A: 0.9960 → 0.9986 at identical QPS and latency — it only changes how the stage-1 codes are formed. This pushes the whole frontier past the prior best (038: 0.9968 @ 922). New max-recall point: 0.9986 @ 883 QPS.
  2. At low bits residual matters far more. D vs E (512-bit): 0.9297 → 0.9637 (+3.4 pts) — confirms 046 in the live engine (centering stops sign-bits from being wasted on the shared DC direction).
  3. Fewer bits is a LATENCY lever, not (much) a QPS lever, which corrects the roofline projection. I expected 512-bit → ~1.8× QPS; instead batch QPS barely moved (B → D: 883 → 922, about +4%) while p50 latency more than halved (10.6 → 4.2 ms). Why: the tiled batch already amortizes the scan across the tile, and the C-rerank is a real fixed cost, so the batch isn't purely scan-bandwidth-bound. But a single query (latency) IS scan-bound, so halving the code bytes roughly halves its latency. Separately, dropping C 1000→500 alone gives +9% QPS (B → I: 883 → 963), so C is the QPS lever and bits is the latency lever.
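The arithmetic behind finding 3, as a small sketch. It assumes "1M" means exactly 1,000,000 vectors and that a stage-1 scan touches one code of the given width per vector; the function names are illustrative only, and the measured values are copied from the table.

```rust
/// Assumed corpus size ("1M" taken as exactly one million vectors).
const N: u64 = 1_000_000;

/// Bytes of stage-1 code touched by one full scan: one `bits`-wide code per vector.
fn scan_bytes(n_vectors: u64, bits: u64) -> u64 {
    n_vectors * bits / 8
}

/// Plain ratio of two measurements.
fn ratio(a: f64, b: f64) -> f64 {
    a / b
}

/// The four ratios discussed in finding 3, from rows B, D and I.
fn summary() -> [f64; 4] {
    [
        // Code bytes per scan, 1024-bit vs 512-bit: 128 MB vs 64 MB, exactly 2x.
        scan_bytes(N, 1024) as f64 / scan_bytes(N, 512) as f64,
        // p50 latency, B over D: 10.6 ms / 4.2 ms, about 2.52x.
        ratio(10.6, 4.2),
        // Batch QPS, D over B: 922 / 883, about 1.04x.
        ratio(922.0, 883.0),
        // Batch QPS, I over B (C 1000 -> 500 at 1024 bits): 963 / 883, about 1.09x.
        ratio(963.0, 883.0),
    ]
}
```

Halving the scanned bytes shows up almost entirely in single-query latency (about 2.5×), while batch QPS moves by only about 4%; halving C at fixed bits is what buys the roughly 9% QPS gain.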

The best stacked operating points

So the stacked engine moves the frontier out on every axis vs the prior best, and opens a low-latency regime that didn't exist before: sub-5 ms p50 at ~0.98 recall (G: 0.9816 @ 884 QPS, 4.4 ms).

What's NOT in these numbers (orthogonal add-ons)

Caveats