Experiment 015

int8 rerank tier: a memory lever, not a speed lever

Perf record: 015-int8-rerank-tier.json. Cohere v3, 1M × 1024, cosine, Granite box. --quant binary --rerank-quant f32|i8.

The idea

The funnel reranks the top-C candidates by rescoring them against the full f32 store (3.9 GB) via random gathers. An int8 store is 4× smaller (977 MB) and puts 4× less traffic on that gather, so the hypothesis was: cheaper rerank → more QPS, less memory. --rerank-quant i8 reranks by int8 dot instead of f32 L2.
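A minimal sketch of the rerank step described above, on toy data rather than the 1M × 1024 setup. Everything here is illustrative: the function names, the toy sizes, the fixed global scale of 1/127, the random stand-in for the binary scan's top-C, and the choice to quantize the query as well are all assumptions, not the actual implementation. The f32 path scores by dot product on unit vectors for brevity.

```python
import heapq
import random

def unit(vec):
    """Scale a vector to unit length."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def to_int8(vec, scale):
    """Symmetric int8 codes: round(x / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(x / scale))) for x in vec]

def rerank(query, candidate_ids, store, k):
    """Gather each candidate's row from `store`, score by dot product, keep top k."""
    return heapq.nlargest(k, candidate_ids, key=lambda i: dot(query, store[i]))

random.seed(0)
DIM, N, C, K = 32, 200, 50, 10            # toy sizes (hypothetical)
f32_store = [unit([random.gauss(0, 1) for _ in range(DIM)]) for _ in range(N)]
SCALE = 1.0 / 127                          # assumed global scale; unit vectors lie in [-1, 1]
i8_store = [to_int8(v, SCALE) for v in f32_store]

query = unit([random.gauss(0, 1) for _ in range(DIM)])
candidates = random.sample(range(N), C)    # stand-in for the binary scan's top-C

top_f32 = rerank(query, candidates, f32_store, K)
top_i8 = rerank(to_int8(query, SCALE), candidates, i8_store, K)
overlap = len(set(top_f32) & set(top_i8))
print(f"top-{K} overlap between f32 and i8 rerank: {overlap}/{K}")
```

Both paths do the same C gathers; only the row width and the scoring arithmetic differ, which is why the i8 tier can only help if those gathers are a meaningful share of query time.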

Result — no speed win, a recall cost, only memory improves

C	rerank	recall@10	QPS	p50
500	f32	0.9826	431	15.22 ms
500	i8	0.9686	429	15.19 ms
1000	f32	0.9943	420	15.92 ms
1000	i8	0.9783	420	15.72 ms
2000	f32	0.9986	398	17.01 ms
2000	i8	0.9819	395	16.44 ms
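
One way to read the table is to compare i8 against f32 at each C, then ask whether any f32 setting is at least as good as a given i8 setting on both recall and QPS. The sketch below does that using only the numbers in the table; the helper names are illustrative, and it treats the measured values as exact, ignoring run-to-run noise.

```python
# Rows copied from the results table: C -> (recall@10, QPS).
F32 = {500: (0.9826, 431), 1000: (0.9943, 420), 2000: (0.9986, 398)}
I8 = {500: (0.9686, 429), 1000: (0.9783, 420), 2000: (0.9819, 395)}

def deltas(c):
    """(recall points lost, QPS change in percent) for i8 relative to f32 at C=c."""
    r_f, q_f = F32[c]
    r_i, q_i = I8[c]
    return round((r_f - r_i) * 100, 2), round((q_i - q_f) / q_f * 100, 2)

def dominated_by(c):
    """f32 settings at least as good as i8 at C=c on both axes, better on one."""
    r_i, q_i = I8[c]
    return [cf for cf, (r, q) in sorted(F32.items())
            if r >= r_i and q >= q_i and (r > r_i or q > q_i)]

for c in sorted(I8):
    lost, qps_pct = deltas(c)
    print(f"C={c:4d}  recall lost {lost:.2f} pts  QPS {qps_pct:+.2f}%  "
          f"dominated by f32 at C={dominated_by(c)}")
```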

Conclusions

No QPS gain, because rerank was never the bottleneck. i8 QPS is within noise of f32 at every C (429 vs 431, 420 vs 420, 395 vs 398). Experiment 014 already showed the scan dominates and is bandwidth-bound; the rerank tier (C random gathers) is a small slice of the per-query cost, so making it cheaper barely moves the total. This is a useful negative result: it rules out the rerank tier as a throughput lever and points all remaining speed work back at the scan.
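It costs about 1.5 recall points. int8 rescoring reorders the final top-k less accurately than exact f32: recall@10 drops by 1.4 to 1.7 points across the C values tested (0.9943 → 0.9783 at C=1000). To claw that back you'd raise C, which costs QPS, and within this range it is not enough: i8 at C=2000 (0.9819, 395 QPS) still trails f32 at C=500 (0.9826, 431 QPS). So on the recall/QPS plane i8 rerank is dominated.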
The one real benefit is memory. The rerank store shrinks 3.9 GB → 977 MB (4×). Combined with the 122 MB binary scan store, the two stores together drop from roughly 4.0 GB to roughly 1.1 GB (about 3.7×). So i8 rerank is worth it only under memory pressure, accepting the recall hit (or recovering part of it with a larger C and eating the QPS).
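
The store sizes quoted above follow from the dataset shape. The sketch below reproduces them under two assumptions that are not stated in the note: "1M" means exactly 1,000,000 vectors, and the MB figures are binary units (MiB). It counts only the two vector stores and ignores any index metadata or allocator overhead.

```python
# Assumptions (not from the perf record): 1M = 1,000,000 vectors, MB = MiB.
N_VECTORS = 1_000_000
DIM = 1024
MIB = 1024 * 1024

def store_mib(bits_per_dim):
    """Size in MiB of a dense N x DIM store at the given precision."""
    return N_VECTORS * DIM * bits_per_dim / 8 / MIB

scan = store_mib(1)          # binary scan store
rerank_f32 = store_mib(32)   # f32 rerank store
rerank_i8 = store_mib(8)     # int8 rerank store

total_f32 = scan + rerank_f32
total_i8 = scan + rerank_i8

print(f"binary scan store: {scan:7.1f} MiB")
print(f"f32 rerank store:  {rerank_f32:7.1f} MiB")
print(f"i8 rerank store:   {rerank_i8:7.1f} MiB")
print(f"scan + f32 rerank: {total_f32:7.1f} MiB")
print(f"scan + i8 rerank:  {total_i8:7.1f} MiB ({total_f32 / total_i8:.2f}x smaller)")
```

Because the scan store stays the same size in both configurations, the combined saving is somewhat below the 4× of the rerank store alone.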

Caveats

Default is --rerank-quant f32 (exact); i8 is opt-in.
int8 here is a single global-scale symmetric quantizer; per-vector or per-dim scales would likely narrow the recall gap, at the cost of extra store overhead.
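
To illustrate why scale granularity matters, here is a small sketch comparing a single global scale with per-vector scales on toy data. It is not the quantizer used in this experiment: the function names, the max-abs/127 scale rule, and the two synthetic vectors are assumptions chosen to make the effect visible.

```python
import random

def quantize(vec, scale):
    """Symmetric int8: round(x / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(x / scale))) for x in vec]

def dequantize(codes, scale):
    return [c * scale for c in codes]

def max_abs(vec):
    return max(abs(x) for x in vec)

def recon_mse(vec, scale):
    """Mean squared error after a quantize/dequantize round trip."""
    approx = dequantize(quantize(vec, scale), scale)
    return sum((x - y) ** 2 for x, y in zip(vec, approx)) / len(vec)

random.seed(0)
# Toy data (hypothetical): two vectors with very different magnitudes.
big = [random.uniform(-1.0, 1.0) for _ in range(64)]
small = [random.uniform(-0.05, 0.05) for _ in range(64)]
store = [big, small]

# Global scale: one value for the whole store, set by the largest magnitude.
global_scale = max(max_abs(v) for v in store) / 127
# Per-vector scale: each vector carries its own scale (one extra float per row).
per_vec_scales = [max_abs(v) / 127 for v in store]

for name, v, s_vec in zip(("big", "small"), store, per_vec_scales):
    print(f"{name:5s} global-scale MSE={recon_mse(v, global_scale):.2e}  "
          f"per-vector MSE={recon_mse(v, s_vec):.2e}")
```

With a global scale, low-magnitude vectors use only a few of the 255 available levels, so their reconstruction error is much larger than with a per-vector scale; the price is the extra scale stored per row.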