Experiment 017
Serve the winner: binary+rerank through HTTP
Perf record: 017-serving-the-winner.json.
Cohere v3, 1M × 1024, cosine, Granite box (8 vCPU). HTTP/JSON, concurrency sweep.
What we did
002 measured serving latency, but only for the f32 exact search. The winning
algorithm — binary scan + f32 rerank (009) — had only ever been measured
in-process (batch QPS). This entry puts the funnel behind the real server and
measures user-facing latency under load, the model that actually matters: one
single-threaded search per request, Semaphore(cores) so excess load queues.
server.rs now takes --quant binary --rerank C --scan-bits N and runs the
funnel per request (binarize query → top-C Hamming → f32 rerank). server and
loadtest gained --prefix so they serve Cohere instead of only SIFT.
Result — ~50× throughput and ~50× lower latency, end-to-end
| mode | concurrency | QPS | client p50 | client p99 | server compute p50 |
|---|---|---|---|---|---|
| f32 | 1 | 1.4 | 702 ms | 734 ms | 701 ms |
| f32 | 8 | 9.3 | 858 ms | 862 ms | 858 ms |
| f32 | 16 | 9.3 | 1714 ms | 1721 ms | 857 ms |
| f32 | 32 | 9.6 | 3428 ms | 3454 ms | 857 ms |
| binary | 1 | 82.0 | 12.2 ms | 12.7 ms | 11.6 ms |
| binary | 8 | 461.7 | 17.3 ms | 24.2 ms | 16.9 ms |
| binary | 16 | 482.6 | 32.6 ms | 43.1 ms | 16.0 ms |
| binary | 32 | 476.2 | 65.8 ms | 80.1 ms | 16.3 ms |
Conclusions
- The funnel's win holds end-to-end through the network. At concurrency = cores (8), binary+rerank serves 462 QPS at p50 17 ms vs f32's 9.3 QPS at 858 ms — ~50× throughput and ~50× lower latency, measured client-side over HTTP. The single-request floor is 12 ms (binary) vs 700 ms (f32).
- The 002 serving model reproduces exactly. Throughput ceilings at
concurrency = cores (binary plateaus ~462–483 beyond c=8; f32 ~9.3). Past that,
server compute p50 stays flat (binary 16 ms, f32 857 ms) and only the
interface/queue component grows — the rise in client p50 at c=16/32 is pure
queuing, exactly what
Semaphore(cores)is supposed to produce. - HTTP/JSON is not the bottleneck. At c ≤ cores the interface overhead is <0.7 ms; the cost is compute, as in 002. The funnel turned a 700 ms compute into a 12 ms compute, and that is the whole story.
Caveats
- Heap selection, full 1024-bit scan, C=1000, f32 rerank — the serving path uses the in-process winner's config, not yet the prefix/tile speedups (tiling is a batch optimization and doesn't apply to a single request; prefix would lower per-request compute further at a recall cost).
- 400 requests/level; spot box. p99 at low concurrency is tight (CV small); the large client-p50 at high concurrency is queuing, not compute variance.