notes

Experiment 017

Serve the winner: binary+rerank through HTTP

Perf record: 017-serving-the-winner.json. Cohere v3, 1M × 1024, cosine, Granite box (8 vCPU). HTTP/JSON, concurrency sweep.

What we did

002 measured serving latency, but only for the f32 exact search. The winning algorithm, binary scan + f32 rerank (009), had only ever been measured in-process as batch QPS. This entry puts the funnel behind the real server and measures user-facing latency under load, using the serving model that actually matters: each request runs one single-threaded search, and Semaphore(cores) caps concurrent searches at the core count so excess load queues.
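
To make the serving model concrete, here is a minimal std-only sketch of "one single-threaded search per request, Semaphore(cores)". It is not the code in server.rs: the Semaphore type below is a hypothetical stand-in built from Mutex and Condvar, peak_concurrency is an illustrative helper, and the sleep stands in for one search.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (std only). Hypothetical stand-in for the
/// server's Semaphore(cores); the real server may use a different type.
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    pub fn acquire(&self) {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.available.wait(free).unwrap();
        }
        *free -= 1;
    }

    /// Return a permit and wake one waiting request.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Run `requests` simulated searches on their own threads, at most `cores`
/// at a time, and return the peak number that ran concurrently.
pub fn peak_concurrency(requests: usize, cores: usize, work: Duration) -> usize {
    let sem = Arc::new(Semaphore::new(cores));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..requests)
        .map(|_| {
            let (sem, running, peak) = (sem.clone(), running.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire(); // excess requests queue here
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(work); // stands in for one single-threaded search
                running.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

With 32 requests and cores = 8, the peak never exceeds 8. The other requests wait in acquire, which is the queueing that shows up as client latency once concurrency passes the core count.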

server.rs now takes --quant binary --rerank C --scan-bits N and runs the funnel per request (binarize query → top-C Hamming → f32 rerank). server and loadtest gained --prefix so they serve Cohere instead of only SIFT.
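
The per-request funnel can be sketched as below. This is an illustration under assumptions, not the server.rs implementation: binarization is assumed to be one sign bit per dimension, the function names and the (index, score) return shape are made up for the example, and --scan-bits is not modeled.

```rust
/// Pack the sign bits of `v` into u64 words (bit set when the component is > 0).
/// Assumption: sign-bit binarization; the real quantizer may differ.
pub fn binarize(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Hamming distance between two packed codes of equal length.
pub fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Funnel: binarize query -> top-`c` by Hamming -> rerank those by f32 cosine.
/// Returns up to `k` (index, cosine) pairs, best first.
pub fn funnel_search(
    query: &[f32],
    vectors: &[Vec<f32>],
    codes: &[Vec<u64>],
    c: usize,
    k: usize,
) -> Vec<(usize, f32)> {
    let q_code = binarize(query);
    // Stage 1: cheap scan over the binary codes.
    let mut candidates: Vec<(u32, usize)> = codes
        .iter()
        .enumerate()
        .map(|(i, code)| (hamming(&q_code, code), i))
        .collect();
    candidates.sort_unstable();
    candidates.truncate(c);
    // Stage 2: exact f32 rerank of the surviving candidates only.
    let mut reranked: Vec<(usize, f32)> = candidates
        .iter()
        .map(|&(_, i)| (i, cosine(query, &vectors[i])))
        .collect();
    reranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    reranked.truncate(k);
    reranked
}
```

The full sort in stage 1 keeps the sketch short; a bounded heap or partial selection would be the usual choice at 1M vectors. The point is the shape: the expensive f32 distance runs on C candidates instead of on every vector.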

Result — ~50× throughput and ~50× lower latency, end-to-end

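| mode   | concurrency | QPS   | client p50 | client p99 | server compute p50 |
|--------|-------------|-------|------------|------------|--------------------|
| f32    | 1           | 1.4   | 702 ms     | 734 ms     | 701 ms             |
| f32    | 8           | 9.3   | 858 ms     | 862 ms     | 858 ms             |
| f32    | 16          | 9.3   | 1714 ms    | 1721 ms    | 857 ms             |
| f32    | 32          | 9.6   | 3428 ms    | 3454 ms    | 857 ms             |
| binary | 1           | 82.0  | 12.2 ms    | 12.7 ms    | 11.6 ms            |
| binary | 8           | 461.7 | 17.3 ms    | 24.2 ms    | 16.9 ms            |
| binary | 16          | 482.6 | 32.6 ms    | 43.1 ms    | 16.0 ms            |
| binary | 32          | 476.2 | 65.8 ms    | 80.1 ms    | 16.3 ms            |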

Conclusions

  1. The funnel's win holds end-to-end through the network. At concurrency = cores (8), binary+rerank serves 462 QPS at p50 17 ms vs f32's 9.3 QPS at 858 ms — ~50× throughput and ~50× lower latency, measured client-side over HTTP. The single-request floor is 12 ms (binary) vs 700 ms (f32).
  2. The 002 serving model reproduces exactly. Throughput hits its ceiling at concurrency = cores: binary plateaus at ~462–483 QPS from c=8 onward, f32 at ~9.3–9.6 QPS. Past that point, server compute p50 stays flat (binary ~16 ms, f32 857 ms) and only the interface/queue component grows. The rise in client p50 at c=16/32 is pure queuing, exactly what Semaphore(cores) is supposed to produce.
  3. HTTP/JSON is not the bottleneck. At c ≤ cores the interface overhead (client p50 minus server compute p50) is <0.7 ms, e.g. 12.2 vs 11.6 ms for binary at c=1; the cost is compute, as in 002. The funnel turned a ~700 ms compute into a ~12 ms compute, and that is the whole story.
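
The queueing claim in conclusion 2 can be checked with a back-of-envelope model. The sketch below assumes a closed-loop load test (each client waits for its response before sending the next request) and negligible interface overhead; both function names are illustrative and not part of the codebase.

```rust
/// Predicted client latency (ms) when `concurrency` closed-loop clients share
/// `cores` search slots and each search takes `compute_ms`.
/// Assumes a closed loop and negligible interface overhead.
pub fn predicted_client_ms(concurrency: u32, cores: u32, compute_ms: f64) -> f64 {
    // Below the core count nothing queues; above it, each request waits
    // roughly concurrency / cores service times in total.
    let rounds = (concurrency as f64 / cores as f64).max(1.0);
    rounds * compute_ms
}

/// Throughput ceiling (QPS) once all `cores` slots are busy.
pub fn qps_ceiling(cores: u32, compute_ms: f64) -> f64 {
    cores as f64 * 1000.0 / compute_ms
}
```

Plugging in the measured server compute p50: f32 at c=16 gives 2 × 857 = 1714 ms and at c=32 gives 4 × 857 = 3428 ms, matching the measured client p50 values. Binary gives 2 × 16.0 = 32.0 ms and 4 × 16.3 = 65.2 ms against measured 32.6 ms and 65.8 ms. The f32 ceiling comes out at 8 × 1000 / 857 ≈ 9.3 QPS, in line with the table.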

Caveats