notes

Experiment 002

Networked serving + user-facing latency

Perf record: 002-networked-serving.json

What we did

Added the interface layer and measured what a client actually sees, not just in-process compute.

Focus per the brief: user-facing latency under production-like traffic — not startup/recovery time.

Results (concurrency sweep, 8-core Mac)

concurrency   QPS   client p50   client p99   server compute p50   interface+queue p50
1              20   49 ms        52 ms        49 ms                0.2 ms
8 (= cores)   100   77 ms        145 ms       76 ms                0.3 ms
16            100   157 ms       216 ms       76 ms                79 ms
32            103   310 ms       355 ms       75 ms                234 ms
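A quick consistency check on the sweep, as a minimal sketch: if the load generator is closed-loop (each client sends its next request only after the previous one returns, which is an assumption here, not something the notes state), Little's law says QPS ≈ concurrency / latency. The sketch uses client p50 as a stand-in for mean latency, so the match is only approximate. The names SWEEP, littles_law_qps and check_sweep are made up for illustration.

```python
# Rows copied from the concurrency sweep: (concurrency, measured QPS, client p50 in ms).
SWEEP = [
    (1, 20, 49),
    (8, 100, 77),
    (16, 100, 157),
    (32, 103, 310),
]


def littles_law_qps(concurrency: int, latency_ms: float) -> float:
    """Throughput implied by Little's law for a closed loop: N / latency."""
    return concurrency / (latency_ms / 1000.0)


def check_sweep(rows=SWEEP):
    """Return (concurrency, measured QPS, implied QPS) for each row."""
    return [(n, qps, littles_law_qps(n, p50)) for n, qps, p50 in rows]


if __name__ == "__main__":
    for n, measured, implied in check_sweep():
        print(f"concurrency={n:>2}  measured={measured:>3} QPS  implied={implied:6.1f} QPS")
```

Under those assumptions every row's implied QPS lands within a few percent of the measured QPS, which suggests the latency and throughput columns are telling the same story.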

Conclusions

  1. The interface layer is not the cost; compute and queuing are. Up to core count, HTTP + JSON serialization adds 0.2–0.3 ms (p50) on top of a ~50–77 ms search. The network/API is noise here. This confirms the earlier reasoning: at tens of ms per query, the lever is the algorithm, not the transport.

  2. Throughput plateaus at ~100 QPS once concurrency reaches the core count. Past that, more load buys essentially no extra QPS (100 → 100 → 103), because the 8 cores are already saturated. Adding clients only lengthens the queue.

  3. Beyond the ceiling, latency grows roughly linearly with concurrency, and it's pure queuing. p50 goes 77 → 157 → 310 ms as concurrency goes 8 → 16 → 32 (about 2× per doubling), while server compute stays ~76 ms. The growth is entirely the "interface+queue" term (0.3 → 79 → 234 ms): requests waiting for a core. This is the classic latency-vs-load knee: the system is stable left of the knee (concurrency ≤ cores) and degrades right of it.

  4. Concurrent compute is slower than isolated compute (49 → 76 ms). Even at exactly core-count concurrency, with essentially no queuing (0.3 ms), p50 compute rises from 49 ms (1 client) to 76 ms (8 clients). That's the memory-bandwidth contention from entry 001 surfacing as latency: 8 cores all streaming 488 MB compete for the same memory bus. So the effective throughput ceiling (~100 QPS) is well below 8 × the single-client rate (8 × ~20 ≈ 160 QPS).

  5. Implication for capacity planning: to hold user-facing p50 near the floor (~50–77 ms), keep offered concurrency ≤ cores. To raise the QPS ceiling and drop latency, the answer isn't more cores (scaling is sublinear: 8 cores deliver ~100 QPS, not ~160) or a faster transport (already negligible). It's touching less memory per query, i.e. an approximate index.
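The queuing and scaling arithmetic behind conclusions 3 and 4 can be sketched with a simple closed-loop model. This is a back-of-envelope estimate, not the measurement code: it assumes a fixed number of in-flight requests shared evenly across 8 workers, and the function names are hypothetical.

```python
CORES = 8  # core count of the test machine, per the results heading


def predicted_latency_ms(concurrency: int, compute_ms: float, cores: int = CORES) -> float:
    """Estimate client latency for a closed loop of `concurrency` requests.

    Up to `cores` in flight, every request gets its own core, so latency is
    just the compute time. Beyond that, each request spends roughly
    concurrency / cores compute times in the system (waiting plus running).
    """
    return compute_ms * max(1.0, concurrency / cores)


def predicted_queue_wait_ms(concurrency: int, compute_ms: float, cores: int = CORES) -> float:
    """Portion of the estimated latency spent waiting for a core."""
    return predicted_latency_ms(concurrency, compute_ms, cores) - compute_ms


def scaling_efficiency(measured_qps: float, single_client_qps: float, cores: int = CORES) -> float:
    """Measured throughput as a fraction of perfect linear scaling."""
    return measured_qps / (single_client_qps * cores)


if __name__ == "__main__":
    # (concurrency, server compute p50 in ms) taken from the results table.
    for n, compute in [(8, 76), (16, 76), (32, 75)]:
        total = predicted_latency_ms(n, compute)
        wait = predicted_queue_wait_ms(n, compute)
        print(f"concurrency={n:>2}  predicted p50={total:.0f} ms  predicted wait={wait:.0f} ms")
    print(f"scaling efficiency at 8 cores: {scaling_efficiency(100, 20):.3f}")
```

With the table's compute times, the model gives 152 ms and 300 ms at concurrency 16 and 32, against measured client p50 of 157 ms and 310 ms, and queue waits of 76 ms and 225 ms against the measured 79 ms and 234 ms. The efficiency figure is 100 / (8 × 20) = 0.625, which is the sublinear scaling conclusion 4 attributes to memory-bandwidth contention.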