Experiment 025

Latency floor: the combined single-query stack (021–025 capstone)

Perf record: 025-latency-floor.json. Cohere v3, 1M × 1024, cosine, Granite box (8 vCPU).

What we did

Stack the latency levers from this block — intra-query parallel scan (021, --query-threads 8) + parallel rerank (022) + prefix truncation (014) — and sweep the prefix/C to find the single-query latency floor at each recall tier.

Result — ~2.6 ms p50, ~4.9× below single-thread

config	recall tier	p50	p99
single-thread, full, C=1000	~0.99	12.67 ms	13.19 ms
stack, full, C=1000	~0.99	3.60 ms	4.14 ms
stack, 768-bit, C=2000	~0.97	2.61 ms	3.59 ms
stack, 512-bit, C=2000	lower	2.62 ms	2.87 ms
stack, 512-bit, C=4000	mid	3.88 ms	4.51 ms

(recall is the noisy 100-query slice; canonical recall per 018.)

Conclusions

Single-query latency floor ~2.6 ms (768-bit, C=2000, recall ~0.97) — ~4.9× below the 12.67 ms single-thread baseline, and at recall ~0.99 it's 3.60 ms.
768/C=2000 beats 512/C=4000 (2.61 vs 3.88 ms): once the scan is cheap and parallel, a bigger rerank C costs more than a narrower scan saves — even with parallel rerank. The latency optimum is a moderate prefix + moderate C, not the most aggressive truncation.
We've hit the floor for this design. The scan is bandwidth-bound and at the VPOPCNTDQ ceiling (012/023); rerank prefetch did nothing under saturation (024); parallelism (021/022) cashed the obvious latency win. Lower latency now needs less work, not faster work — i.e. an approximate index (IVF, out of scope) or a Matryoshka embedding to hold recall at fewer bits.

The 021–025 block, summarized

entry	lever	outcome
021	intra-query parallel scan	4.4× lower latency (13 → 3 ms), recall flat
022	parallel rerank	p50 −10%, p99 −16% (tail); stack → 2.87 ms
023	register-tiled kernel	negative −40–60% (defeats VPOPCNTDQ)
024	rerank prefetch	neutral (bandwidth saturated, MLP already there)
025	combined min-latency stack	floor ~2.6 ms p50 at recall ~0.97

Net across the whole effort (012–025): batch 485 → 851 QPS @ 0.993 (020) and single-query latency 12.7 → 2.6 ms (here). The wins came from moving fewer bytes (tiling, prefix) and spending cores per request (021/022); every compute or prefetch micro-opt (012, 023, 024) confirmed the kernel is already at the hardware ceiling.

Caveats

Latency-vs-throughput trade: the stack uses all cores per request (021), so it's for latency-optimized, not throughput-bound, deployment.
reps=1, 300 latency queries; spot box. Compare within the table.