notes

Experiment 020

Capstone: the performance effort (012–020)

Perf record: 020-capstone.json. Cohere v3, 1M × 1024, cosine, Granite box.

Head-to-head (same run, reps=8)

config | recall@10 | batch QPS | p50
--- | --- | --- | ---
009 baseline: tile=1, full-1024, C=1000 | 0.9931 | 484.8 | 11.84 ms
best throughput: tile=16, full, C=1000 | 0.9931 | 851.3 (+76%) | 12.06 ms
best recall: tile=16, full, C=2000 | 0.9975 | 730.2 (+51%) | 13.88 ms

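Tiling alone buys +76% batch QPS at bit-identical recall. Alternatively, spend some of that gain on recall (C=2000) and still land 0.9975 @ 730 QPS: higher recall and +51% throughput vs the baseline. Single-query p50 is essentially unchanged by tiling itself (11.84 ms vs 12.06 ms; tiling is a batch lever); the 13.88 ms in the best-recall row comes with the larger C.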
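As a sanity check on the percentages, here is a minimal sketch that recomputes the gains from the batch QPS figures in the head-to-head table. The variable and function names are made up for illustration; only the three QPS numbers come from the perf record.

```python
# Batch QPS figures copied from the head-to-head table (reps=8).
BASELINE_QPS = 484.8          # 009 baseline: tile=1, C=1000
BEST_THROUGHPUT_QPS = 851.3   # tile=16, C=1000
BEST_RECALL_QPS = 730.2       # tile=16, C=2000


def pct_gain(new_qps: float, old_qps: float) -> float:
    """Relative throughput change of new_qps over old_qps, in percent."""
    return (new_qps / old_qps - 1.0) * 100.0


throughput_gain = pct_gain(BEST_THROUGHPUT_QPS, BASELINE_QPS)
recall_config_gain = pct_gain(BEST_RECALL_QPS, BASELINE_QPS)

# Share of the tiled throughput given up by moving from C=1000 to C=2000.
spent_on_recall = (1.0 - BEST_RECALL_QPS / BEST_THROUGHPUT_QPS) * 100.0

print(f"tiling alone:        {throughput_gain:+.1f}% batch QPS")
print(f"tiling + C=2000:     {recall_config_gain:+.1f}% batch QPS")
print(f"QPS spent on recall: {spent_on_recall:.1f}% of the tiled throughput")
```

Read this way, the best-recall config gives back roughly a seventh of the tiled throughput in exchange for the recall bump from 0.9931 to 0.9975.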

What moved the needle, what didn't

The goal was QPS up / latency down. Across 012–020:

The throughline

Two ideas explain almost every result. (1) This workload is bandwidth-bound on the box, so the levers that win are the ones that move fewer bytes (tiling: fewer re-reads; prefix: fewer bits), not the ones that touch compute (012, 013) or the non-bottleneck (015). (2) Every time a bandwidth lever lands, the scan re-prices toward compute-bound (the 003→007 cascade, seen again at tile≈16), so gains taper: the kernel and the memory system take turns being the bottleneck.
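To make "fewer re-reads" concrete, here is a toy sketch of query tiling over binary codes. It is not the kernel from these experiments: it assumes 1-bit-per-dimension codes stored as Python ints, and it uses a counter of database-code loads as a stand-in for memory traffic. The function name `hamming_scan` and the toy sizes are hypothetical.

```python
import random


def hamming_scan(queries, codes, tile):
    """Hamming distance from every query to every code.

    Queries are processed `tile` at a time, so each database code is
    loaded once per tile rather than once per query. Returns the
    distance matrix and the number of code loads (a traffic proxy).
    """
    dists = [[0] * len(codes) for _ in queries]
    code_loads = 0
    for start in range(0, len(queries), tile):
        block = range(start, min(start + tile, len(queries)))
        for j, code in enumerate(codes):
            code_loads += 1  # one read of this code, shared by the whole tile
            for qi in block:
                dists[qi][j] = bin(queries[qi] ^ code).count("1")
    return dists, code_loads


random.seed(0)
codes = [random.getrandbits(1024) for _ in range(200)]   # toy database
queries = [random.getrandbits(1024) for _ in range(32)]  # toy batch

d1, loads_tile1 = hamming_scan(queries, codes, tile=1)
d16, loads_tile16 = hamming_scan(queries, codes, tile=16)

print("identical distances:", d1 == d16)
print("code loads, tile=1: ", loads_tile1)
print("code loads, tile=16:", loads_tile16)
```

In this model the distances are identical for any tile size, which matches the bit-identical recall in the table, while the load count drops by the tile factor. The measured gain is +76% rather than 16× because, per point (2), the scan stops being bandwidth-bound long before the traffic reduction is fully realized.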

Net: the binary+rerank funnel went from ~485 QPS @ 0.993 (009 baseline) to 851 QPS @ 0.993 or 730 @ 0.9975 in batch, and serves 462 QPS @ 17 ms end-to-end (50× the f32 exact baseline). The remaining headroom is the SIMD asymmetric LUT (parked, see 011) and a Matryoshka-trained embedding (would make prefix hold recall at fewer bits, 014/018).
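For a sense of scale on the "move fewer bytes" argument, a back-of-the-envelope sketch of bytes touched per full scan of the 1M × 1024 set follows. It assumes 1M means 1,000,000 vectors and that the binary code is 1 bit per dimension; the 512-bit prefix is a purely hypothetical example length, not a setting from these experiments.

```python
N_VECTORS = 1_000_000   # assumption: "1M" taken as exactly 10**6
DIMS = 1024


def scan_bytes(n_vectors: int, bits_per_vector: int) -> int:
    """Bytes read when every stored vector is touched once."""
    return n_vectors * bits_per_vector // 8


f32_bytes = scan_bytes(N_VECTORS, DIMS * 32)   # exact f32 scan
binary_bytes = scan_bytes(N_VECTORS, DIMS)     # full 1024-bit binary code
prefix_bytes = scan_bytes(N_VECTORS, 512)      # hypothetical 512-bit prefix

print(f"f32 scan:       {f32_bytes / 1e6:8.0f} MB")
print(f"binary scan:    {binary_bytes / 1e6:8.0f} MB")
print(f"512-bit prefix: {prefix_bytes / 1e6:8.0f} MB")
print(f"f32 / binary:   {f32_bytes // binary_bytes}x fewer bytes")
```

Under these assumptions the binary first stage reads 32× fewer bytes than an f32 scan, and any prefix shrinks that further in proportion to the bits kept, which is why a prefix that holds recall at fewer bits counts as remaining headroom.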