Experiment 045

Disk-resident vectors: how slow without RAM, and the economics

Perf record: 045-disk-resident-economics.json. Granite box (Xeon 6975P-C, 8 vCPU, 15 GB RAM, EBS gp3 root nvme). src/bin/disk.rs, scripts/disk-bench.sh. Cohere v3 1M × 1024 f32 = 4.096 GB. Single-thread, serial per-query latency (so cross-config ratios are the result, not absolute QPS — the real engine is parallel/tiled, history 038).

Measured cold disk bandwidth: 135 MB/s (cat the 4 GB store cold in 30.24 s).

The question

For exact "matrix" search you eventually need everything in RAM — is that right, and how slow is it if nothing is in RAM? And can we trade RAM (expensive) for SSD (cheap) to scale down cost?

Result 1 — exact search confirms: it MUST be in RAM

config                       RAM resident   p50 latency
exact / RAM                  4096 MB        709 ms
exact / disk cold            0              30,200 ms (30 s!)
exact / disk warm (cached)   (cache)        703 ms

Exact search reads every vector per query. Cold from disk that's the whole 4 GB / 135 MB/s = 30 s per query — exactly bandwidth-bound, ~42× slower than in-RAM. So yes: for full-scan exact search, nothing-in-RAM is a non-starter. (Even in RAM, single-thread exact is 709 ms because it streams 4 GB/query — that cost is the entire reason the funnel exists.)
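
The bandwidth-bound claim can be checked with a few lines of arithmetic. This is a minimal sketch, not code from src/bin/disk.rs; the function name cold_scan_seconds is hypothetical, and the model assumes the scan is limited purely by sequential disk bandwidth (no compute, no seek cost).

```rust
/// Predicted latency in seconds of a scan that must pull `bytes` from a
/// device sustaining `bytes_per_sec`. Assumption: pure bandwidth bound,
/// compute and seek time ignored.
fn cold_scan_seconds(bytes: f64, bytes_per_sec: f64) -> f64 {
    bytes / bytes_per_sec
}

fn main() {
    // Figures taken from the notes above.
    let store_bytes = 1_000_000.0 * 1024.0 * 4.0; // 1M x 1024 f32 = 4.096 GB
    let disk_bw = 135.0e6; // measured cold bandwidth, 135 MB/s
    let measured_cold_s = 30.2; // exact / disk cold p50
    let measured_ram_s = 0.709; // exact / RAM p50

    let predicted = cold_scan_seconds(store_bytes, disk_bw);
    println!("predicted cold scan: {:.1} s", predicted);
    println!("measured cold scan:  {:.1} s", measured_cold_s);
    println!("cold / RAM slowdown: {:.1}x", measured_cold_s / measured_ram_s);
    // Streaming rate implied by the in-RAM number.
    println!(
        "implied RAM stream rate: {:.2} GB/s",
        store_bytes / measured_ram_s / 1e9
    );
}
```

The prediction (about 30.3 s) lands on the measured 30.2 s, which is what "exactly bandwidth-bound" means here: nothing other than pulling bytes off the volume contributes measurably.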

Result 2 — the binary funnel breaks the RAM requirement, cheaply

The funnel scans only the 128 MB of 1-bit codes and reads only the C=200 rerank vectors per query. So the 4 GB of f32 can live on SSD:

config                                  RAM resident   p50 (cold)   p50 (warm)
funnel / RAM                            4224 MB        -            11.7 ms
funnel / hybrid (codes RAM, f32 disk)   128 MB         92 ms        11.5 ms
funnel / all-disk (codes+f32 mmap)      ~0             -            10.2 ms
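
The two-tier access pattern behind these numbers can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: sign-bit quantization, Hamming distance for the first stage, and dot product for the rerank are all assumed, and the names binarize, hamming, dot and funnel_search are hypothetical. The fetch closure stands in for wherever the f32 vectors live (RAM, a positioned file read, or mmap); here it reads an in-memory store so the example stays self-contained.

```rust
/// Pack the sign bits of a vector into u64 words (1 bit per dimension).
fn binarize(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Number of differing bits between two packed codes.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Two-stage search: scan every code (hot tier), then rerank the `c` closest
/// candidates with full-precision vectors obtained through `fetch` (cold tier).
/// Returns the index with the highest dot product, or None if there are no codes.
fn funnel_search<F>(query: &[f32], codes: &[Vec<u64>], c: usize, mut fetch: F) -> Option<usize>
where
    F: FnMut(usize) -> Vec<f32>,
{
    let q_code = binarize(query);
    // Stage 1: Hamming distance to every code. A full sort keeps the sketch
    // short; a partial selection would do less work on large inputs.
    let mut cand: Vec<(u32, usize)> = codes
        .iter()
        .enumerate()
        .map(|(i, code)| (hamming(&q_code, code), i))
        .collect();
    cand.sort_unstable();
    cand.truncate(c);
    // Stage 2: exact rerank, touching only `c` full vectors.
    let mut best: Option<(f32, usize)> = None;
    for &(_, i) in &cand {
        let score = dot(query, &fetch(i));
        if best.map_or(true, |(s, _)| score > s) {
            best = Some((score, i));
        }
    }
    best.map(|(_, i)| i)
}

fn main() {
    // Toy data; the real store is 1M x 1024 f32.
    let store: Vec<Vec<f32>> = vec![
        vec![1.0, 1.0, -1.0, -1.0],
        vec![0.9, 0.2, -0.1, -0.8],
        vec![-1.0, -1.0, 1.0, 1.0],
        vec![1.0, -1.0, 1.0, -1.0],
    ];
    let codes: Vec<Vec<u64>> = store.iter().map(|v| binarize(v)).collect();
    let query = [1.0, 0.1, -0.1, -1.0];
    let mut reads = 0;
    let hit = funnel_search(&query, &codes, 2, |i| {
        reads += 1; // count cold-tier accesses
        store[i].clone()
    });
    println!("best = {:?}, full-vector reads = {}", hit, reads);
}
```

The read counter shows the point of the design: the number of full-vector fetches equals C regardless of how many codes were scanned, so only those fetches ever need to hit disk.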

The scale-out unlock: O(C) vs O(N) disk cost

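This is the crux for cost-driven scaling: cold exact search reads all N vectors from disk on every query (O(N), the full 4 GB here), while the funnel reads only the C=200 rerank vectors (O(C), about 0.8 MB at 4 KB per vector).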

So to cut cost by pushing vectors to SSD, the funnel is the only one of the two that survives: its disk traffic is independent of dataset size.
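
To make the scaling difference concrete, here is a small sketch of per-query disk bytes as the corpus grows. The function names are hypothetical, the vector size (1024 f32) and C=200 come from the notes, and the larger corpus sizes are illustrative inputs rather than measured configurations.

```rust
const DIM: u64 = 1024;
const BYTES_PER_F32: u64 = 4;

/// Cold exact search reads every full-precision vector: O(N).
fn exact_disk_bytes(n: u64) -> u64 {
    n * DIM * BYTES_PER_F32
}

/// The funnel reads only the C rerank vectors from disk: O(C).
fn funnel_disk_bytes(c: u64) -> u64 {
    c * DIM * BYTES_PER_F32
}

fn main() {
    let c = 200;
    // Only the 1M row corresponds to a measured configuration.
    for n in [1_000_000u64, 10_000_000, 100_000_000] {
        println!(
            "N = {:>11}: exact {:>15} B/query, funnel {:>7} B/query",
            n,
            exact_disk_bytes(n),
            funnel_disk_bytes(c)
        );
    }
}
```

The funnel column stays at 819,200 bytes for every N. Note that the hot tier still grows with N (128 bytes of code per vector), so this is a statement about disk traffic only.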

Economics

RAM is ~40–50× more expensive per GB than EBS gp3. Per-vector storage cost for Cohere drops accordingly: full-RAM commits 4.2 GB of RAM; hybrid commits 128 MB of RAM + 4 GB of (cheap) disk, a roughly 20× lower storage bill for the same result. There is no latency penalty if the working set fits in page cache, and a bounded ~90 ms cold penalty (further reducible with parallel reads) when it doesn't. The funnel's two-tier shape (tiny hot codes / big cold vectors) is what makes vectors-on-SSD viable.
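
The roughly 20× figure follows from the price ratio and the two footprints. The sketch below works in units of "the price of one GB of disk", so no actual prices are assumed beyond the 40–50× ratio stated above. The function name relative_cost is hypothetical, and the model deliberately ignores anything other than capacity (IOPS or throughput pricing, any persistent copy the full-RAM setup might also keep).

```rust
/// Storage cost in units of "one GB of disk".
/// `ram_to_disk_price` is how many times more a GB of RAM costs than a GB of disk.
fn relative_cost(ram_gb: f64, disk_gb: f64, ram_to_disk_price: f64) -> f64 {
    ram_gb * ram_to_disk_price + disk_gb
}

fn main() {
    for ratio in [40.0, 50.0] {
        // full-RAM: 4096 MB f32 + 128 MB codes, nothing counted on disk.
        let full_ram = relative_cost(4.224, 0.0, ratio);
        // hybrid: 128 MB codes in RAM, 4096 MB f32 on disk.
        let hybrid = relative_cost(0.128, 4.096, ratio);
        println!(
            "RAM/disk price {:.0}x: full-RAM {:.1}, hybrid {:.1}, saving {:.1}x",
            ratio,
            full_ram,
            hybrid,
            full_ram / hybrid
        );
    }
}
```

Under these assumptions the saving comes out between about 18× and 20× across the stated price range, consistent with the rounded figure in the text.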

Conclusions

  1. Exact/matrix search must be RAM-resident — cold disk is 30 s/query (O(N) bandwidth). The user's intuition is correct for exact search.
  2. The binary funnel removes that requirement: only the 128 MB of codes must be hot; the 4 GB of f32 can sit on SSD, with only C vectors read from it per query. About 33× less committed RAM (4224 MB vs 128 MB), full-RAM speed when cached, bounded cold penalty when not.
  3. Funnel disk traffic is O(C), independent of N — the property that makes cost-driven scale-down (RAM→SSD) actually work as the corpus grows.

Caveats