Experiment 019

Serving latency knob: prefix scan under load

Perf record: 019-serving-latency-knob.json. Cohere v3, 1M × 1024, cosine, Granite box (8 vCPU). HTTP, concurrency sweep. (Recall figures carried from 018; serving doesn't score recall.)

What we did

017 served the funnel at full 1024-bit (per-request compute ~11 ms). Tiling (016) can't help a single request, so the lever for serving latency is the prefix scan (014): scan fewer bits per request. Measured the server at three prefix widths, each with the C that holds its recall tier.
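
A minimal sketch of the prefix-scan funnel described above, not the served implementation. It assumes one sign bit per dimension and stores each code as a Python int; `sign_code`, `prefix_hamming` and `search` are hypothetical names, and the toy sizes are far below 1M × 1024. It is only meant to show that scan-bits and C are per-request parameters over the same stored codes.

```python
import heapq
import math
import random


def sign_code(vec):
    """Pack one sign bit per dimension; dimension 0 becomes the leading bit."""
    code = 0
    for x in vec:
        code = (code << 1) | int(x > 0)
    return code


def prefix_hamming(a, b, total_bits, scan_bits):
    """Hamming distance over only the leading scan_bits of two codes."""
    return bin((a ^ b) >> (total_bits - scan_bits)).count("1")


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))


def search(q_vec, vecs, codes, total_bits, scan_bits, C, k):
    q_code = sign_code(q_vec)
    # Stage 1: heap-select the C nearest codes using the prefix only.
    cand = heapq.nsmallest(
        C, range(len(codes)),
        key=lambda i: prefix_hamming(q_code, codes[i], total_bits, scan_bits))
    # Stage 2: rerank just those C candidates with full-precision cosine.
    return heapq.nlargest(k, cand, key=lambda i: cosine(q_vec, vecs[i]))


# Toy data (hypothetical sizes, much smaller than the experiment's set).
rng = random.Random(0)
DIM, N = 64, 200
vecs = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
codes = [sign_code(v) for v in vecs]
query = vecs[0]
full = search(query, vecs, codes, DIM, scan_bits=DIM, C=N, k=5)
half = search(query, vecs, codes, DIM, scan_bits=DIM // 2, C=50, k=5)
```

Narrowing `scan_bits` shrinks the work in stage 1 only; raising C (as the 512-bit row does) pushes work into stage 2 to recover recall.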

Result — a real latency/throughput knob, biggest under load

scan-bits	C	recall	c=1 compute p50	c=8 QPS	c=8 client p50
1024	1000	0.993	11.4 ms	465	17.1 ms
768	1000	0.970	11.1 ms	634 (+36%)	12.3 ms
512	2000	0.930	7.9 ms	732 (+57%)	10.7 ms
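
The relative figures can be re-derived from the table rows; a small sketch follows. The bandwidth column is an estimate under a loud assumption that the record does not state: every request linearly scans the prefix of all 1M codes, and rerank traffic is ignored. `ROWS`, `N_VECTORS` and `summarize` are illustrative names.

```python
# (scan_bits, C, recall, c1_compute_p50_ms, c8_qps, c8_client_p50_ms), from the table
ROWS = [
    (1024, 1000, 0.993, 11.4, 465, 17.1),
    (768, 1000, 0.970, 11.1, 634, 12.3),
    (512, 2000, 0.930, 7.9, 732, 10.7),
]
N_VECTORS = 1_000_000  # ASSUMPTION: each request scans all of these


def summarize(rows=ROWS):
    base = rows[0]  # full-width row is the baseline
    out = []
    for bits, _C, _recall, c1_ms, qps, p50_ms in rows:
        scan_mb = N_VECTORS * bits / 8 / 1e6  # prefix bytes per request, in MB
        out.append({
            "scan_bits": bits,
            "qps_gain_pct": (qps / base[4] - 1) * 100,
            "c8_p50_change_pct": (p50_ms / base[5] - 1) * 100,
            "c1_compute_change_pct": (c1_ms / base[3] - 1) * 100,
            "scan_mb_per_request": scan_mb,
            "est_scan_gb_per_s_at_c8": scan_mb * qps / 1e3,  # assumption-based
        })
    return out


summary = summarize()
for row in summary:
    print({k: round(v, 1) for k, v in row.items()})
```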

Conclusions

Prefix is the serving-side speed lever, and it pays most under load. At concurrency = cores (c=8) it lifts QPS by 36% (768-bit) or 57% (512-bit). At 512-bit it cuts client p50 from 17.1 ms to 10.7 ms, at the cost of recall falling from 0.993 to 0.930.
Single-request vs under-load tells the bandwidth story again. At c=1 (one core, no contention) 768-bit barely moves compute (11.4 → 11.1 ms): the C=1000 rerank dominates and the scan saving is small in absolute terms. But at c=8 (all cores contending for memory) the same prefix gives +36% — because its real effect is cutting aggregate bandwidth demand, which only bites when every core is scanning at once. 512-bit halves the scan enough to help even at c=1 (11.4 → 7.9 ms).
Pick the width by SLA. Need 0.99 recall → full bits, 465 QPS, 17 ms. Can live with 0.97 → 768, 634 QPS, 12 ms. Latency-critical at 0.93 → 512, 732 QPS, 11 ms. The knob spans the SLA range without touching the index format.
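
The SLA rule above could be expressed as a lookup over the three measured operating points. This is a sketch only: `CONFIGS` and `pick_width` are hypothetical names, and the numbers are copied from the results table (c=8 figures).

```python
# Measured operating points (c=8) from the results table.
CONFIGS = [
    {"scan_bits": 1024, "C": 1000, "recall": 0.993, "qps": 465, "client_p50_ms": 17.1},
    {"scan_bits": 768, "C": 1000, "recall": 0.970, "qps": 634, "client_p50_ms": 12.3},
    {"scan_bits": 512, "C": 2000, "recall": 0.930, "qps": 732, "client_p50_ms": 10.7},
]


def pick_width(min_recall, configs=CONFIGS):
    """Return the highest-QPS config whose recall meets the floor, or None."""
    eligible = [c for c in configs if c["recall"] >= min_recall]
    if not eligible:
        return None  # no measured point satisfies this recall floor
    return max(eligible, key=lambda c: c["qps"])


for floor in (0.99, 0.97, 0.93):
    chosen = pick_width(floor)
    print(floor, "->", chosen["scan_bits"], "bits, C =", chosen["C"])
```

Because only scan-bits and C change between rows, the choice can in principle be made per request or per deployment without rebuilding the index.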

Caveats

Heap selection, f32 rerank, no tiling (single-request path). Recall is the 018 batch number for the same scan-bits/C; the served top-k is identical logic.
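Only 400 requests per concurrency level, on a spot box, so treat the percentiles as approximate. c=16 (not tabulated above) exceeds the 8-core ceiling, so its client p50 mostly measures queuing.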