In-memory vector search · QPS · latency · recall
Pushing vector search
QPS & latency to the ceiling
A first-principles study of how many top-k vector queries per second a single CPU box can really do over modern embeddings — and what actually moves that number.
It works through the techniques behind state-of-the-art vector search — binary quantization, learned rotations, product quantization, SIMD scan kernels, Matryoshka embeddings — building up from a brute-force baseline, measured the ANN-Benchmarks way (recall / QPS / latency), keeping only what the data justified. Sixty-six numbered experiments later, this is the engine that came out best and the lessons that got there. The full experiment log is at /notes.
The headline numbers — Cohere v3 embeddings, 1M × 1024-dim, one 8-core EC2 box:
~100× the throughput of the brute-force baseline, same recall, 128 MB of index. The same engine goes further on other silicon and other embeddings: ~2,310 QPS on 8-core AMD Zen5 (062), ~19,900 QPS on a full 192-core socket before the memory bus saturates (064), and 0.991 recall @ ~5,000 QPS, p50 3.1 ms on 256-bit Matryoshka codes (16 vCPU; 065, 066).
scan all 4 GB / query
1-bit scan + rerank
amortize the base read
make the bits count
Bar length is QPS (linear). The throughput goes ~107× from 9 to 963 — yet the recall column barely moves (≈ 0.99 throughout). That's the whole point: massive speed, ~no recall lost.
Scope — one IVF cell. This is the within-cell engine. A real index is IVF: k-means splits the corpus into thousands of clusters, each with a centroid, and a coarse router sends a query to just a few of them. Everything here is making the scan inside one cluster as fast as possible — the router that picks clusters is a layer above (out of scope). That assumption matters most for the residual below: each cluster has its own centroid to subtract.
The wall is memory, not compute
Brute force — distance to all 1M vectors per query — does ~9 QPS. The instinct is to speed up the kernel (FMA, AVX-512, batching); I did, and it roughly doubled, then stopped. The bottleneck was never compute. It's memory bandwidth: a query scans the whole base, and at 4 KB/vector that's 4 GB moved from RAM every time. You can't out-compute the bus.
That reframes everything into two levers: send fewer bytes, or spend fewer/cheaper instructions on each byte. The rest of the project is those two. (003–008)
memory-bound: 4 GB / query
4× smaller
32× smaller — fits the bus
Bar length is bytes read per query (linear). Scanning 1M vectors moves the whole base from RAM every query — no kernel trick moves 4 GB faster than the bus, so the lever is fewer bytes. 1-bit is a 32× cut.
The funnel
Fewest bytes: keep one sign bit per dimension — 32× smaller. Compare two of these codes with Hamming distance (XOR, count differing bits). It's coarse, so use it as a funnel: scan all N with the 1-bit code for a few-hundred shortlist, then re-rank only those with the real vectors. Recall ~0.99, QPS 9 → ~500 — the biggest single step. (009)
1-bit Hamming scan · 128 B/vec · popcount
exact f32 rerank · 4 KB/vec · only these
the answer
Every vector gets the cheap 1-bit comparison; only the few hundred survivors pay for the exact one. You read 4 KB/vector for 500 vectors, not a million.
Lever two: the instruction matters more than the math
Counting set bits is VPOPCNTDQ — one SIMD instruction over eight 64-bit words. A "smarter" scheme with better recall-per-bit (asymmetric scoring via a lookup table) lost — because a table lookup is a gather the CPU can't vectorize, while popcount stays one instruction. The chart makes it concrete: the same job, three different per-vector instructions, and throughput tracks the instruction, not the math. The representation that wins is the one whose op is a single SIMD instruction — the law that decided nearly every later call. (010–012)
Same job, different instruction. The gather can't vectorize; popcount is one instruction over 8 words; the SIMD lookup does 16 lookups per instruction on codes 16× smaller — 9× the popcount scan. Single-thread, random codes.
Making the bits count
One bit/dim is brutal, so don't waste bits. A random rotation spreads information evenly across dimensions (free recall). The bigger lever is the residual, and it's exactly where the IVF assumption pays off: inside a cluster the vectors hug the cell's centroid, so their raw sign bits just re-encode that shared center — useless for telling them apart. IVF already gives every one of its thousands of cells a k-means centroid; subtract the cell's own centroid and the bits encode each vector's deviation instead. +3–9 recall points, free. Toggle it: (026, 046)
Raw: every point sits in the same quadrant, so every 2-bit code is identical. The bits carry the shared center, not what makes points different.
Recall is a dial, not a constant
Every experiment trades recall against speed and bytes, so I track recall at each operating point rather than chasing one number. The funnel's recall is set by how wide the shortlist is and how good the codes are — rotation and residual buy recall for free; widening the shortlist buys it for QPS. (051)
Throughput: tiling
The scan is bandwidth-bound, so reuse each loaded code: compare it against a tile of queries while it's hot in cache, before moving on. More comparisons per byte read; the base streams once per tile instead of per query. The tile has to fit in cache (L1/L2), which bounds how big it can get — tile≈8 was the sweet spot here. +73% QPS, identical results. (016, 038) — but only while the scan is bandwidth-bound: when the whole code store fits in cache (as the small Matryoshka-256 corpus did), tiling flips to a cost (066).
The frontier: recall vs QPS
Funnel + rotation + residual + tiling gives a clean frontier. Against Product Quantization (the textbook quantizer, fewer bytes), the funnel wins ~7× on QPS at equal recall — lever two again (PQ's op is a gather). (051, 054)
Up-and-right is better. The funnel holds ~0.99 recall at ~900 QPS; PQ reaches the same recall only near ~140 QPS — 7× slower, because its ADC is a gather, not a popcount.
The exact corner: layout + adaptive pruning
The funnel trades a little recall for huge speed. But sometimes you need exact (recall ~1.0) and the lossy codes are off the table — you have to scan full f32. Even there the same two levers apply, and they bought 4×. First, PDX: store vectors column-major (all dimension-0s together, then all dimension-1s…) instead of row-major. Same data, but the scan now autovectorizes cleanly — a free 2×. Then ADSampling: walk a vector's dimensions in rotated order and stop early the moment its partial distance can't beat the current k-th best — most vectors get rejected after a fraction of their dimensions. Together: 9 → 37 QPS at 0.999 recall. The funnel is still far faster when you can spend recall; this is the corner where you can't. (031, 032, 033)
row-major, full distance every vector
column-major — autovectorizes, free 2×
early-stop a vector once it can't win
4× the naive exact scan, recall still ~1.0
This is the exact regime — full f32, recall ~1.0 — not the funnel. Even here the two levers hold: the layout change is free SIMD, and pruning moves fewer bytes per vector. The funnel is still far faster when you can spend recall; this is the corner where you can't.
Serving without a cliff
A naive one-query-at-a-time server cliffs from ms to seconds the moment load exceeds capacity. The carousel has queries ride one shared, continuously-cycling scan — every byte read is shared by everyone aboard. Latency stays bounded: ~5 ms light, ~35 ms at the ~940 QPS ceiling, no cliff. (039–041)
Same throughput ceiling, opposite tails. Past ~500 QPS the naive server cliffs to seconds; the carousel stays bounded (~35 ms at 1000 QPS). Note the log y-axis.
Things I tried — parked or negative
The honest other half of the log. Some of these work and are simply off-mission; some are dead ends worth knowing about so you don't repeat them.
- Disk-resident vectors — measured the economics (codes in RAM, vectors on SSD = 32× less RAM). But warm disk just becomes page-cache-bound, i.e. back to CPU/bandwidth, and the brief here is max QPS without a cost constraint — so the engine stays fully in RAM. Works, parked. (045)
- PQ-prune, a 3-tier funnel — binary → a product-quantization rerank → exact, cutting reads ~8× (1000 → 128 candidates) at 0.989 recall. A real win, but it only pays off in the disk regime above, so it's parked with it. (060)
- Query-adaptive funnel width — set each query's shortlist from its own Hamming-gap "certificate" instead of a fixed 500. Cuts mean rerank work ~34% at matched recall — but in RAM the scan dominates, so it barely moves QPS; it's a latency lever for the disk regime. (048)
- RaBitQ unbiased estimator — the best recall-per-bit of anything tried (0.61 vs 0.46). But its scan is 40× slower: the estimator isn't a single SIMD instruction. Lever two, in one data point. (028)
- ITQ learned rotation — learning the rotation (vs a random one) adds ~1.5 recall points, but the residual already buys +3–9 for free, and a dense learned rotation costs more to apply. Not worth it at 1024 bits — revised at 256 bits in the epilogue below, where it stacks with residual for +0.3 free points. (055, 066)
- Truncated (prefix) scan — scan only the first 768 of 1024 bits for +23% QPS at 0.987 recall; a rotation is what makes the prefix viable (+4.6 pts). A usable low-recall knob, distinct from the bit-floor below which collapsed. (014, 027)
- Dead ends: scalar PQ / OPQ (footprint, not speed — gather-bound), HNSW inside a cell (loses to a flat 1-bit scan at high-D), bit-floor (4× fewer scan bits bought only +8% QPS while recall fell 0.995 → 0.72), bf16/int8 rerank, register-tiling, prefetch (neutral-to-negative). All in /notes.
The model, and scale
After enough data the engine is predictable: throughput ≈ 1.1e9 · cores / N, recall a power law in N, shortlist, and bits — separable axes, accurate to ~1 recall point. And it scales: pushed to 100M codes, scan throughput stays flat across the cache→DRAM transition — compactness is the scaling property. (047, 049)
The surprise
"Popcount always beats gather" held — until I wrote the SIMD version of PQ's lookup: 4-bit codes, an in-register pshufb table (safe portable-SIMD Rust), 16 lookups per instruction. At 8 bytes/vector it stays in cache and runs 2–9× faster than an untiled popcount scan (the green bar above). Gather isn't dead — it had to be vectorized and tiny. Whether it beats the tiled funnel end-to-end is the next thing I'd chase. (059)
The hardware: what to actually buy
The same engine, same data, same flags across seven boxes — three vendors × two families plus an Apple-Silicon ringer, all the advised 8-vCPU instance. Recall stays a flat 0.9952, so the only variables are silicon and price. The winner is decisive: AMD Zen5 does ~2310 QPS — 2.5× the Intel box the engine was tuned on (062).
QPS per $1k/yr, same engine on every box. AMD Zen5 wins both raw and per-dollar — not magic silicon, but 8 real AVX-512 cores at the “8 vCPU” tier vs Intel’s 4 + hyperthreading. Graviton4 has 8 cores too but half-width NEON popcount; Apple M1 is the for-the-LOLs floor. Throughput ≈ physical cores × popcount width.
The funnel is compute-bound on popcount, so throughput is two multiplicative levers — physical cores × popcount width — and the "8 vCPU" label hides both:
- vCPUs aren't cores. At the 8-vCPU tier AMD/Graviton give 8 physical cores; Intel gives 4 + hyperthreading. A scaling sweep settles it: QPS is dead-linear to 4 threads (211→413→850) then the SMT 2nd thread regresses (808). So an AMD vCPU does 289 QPS, an Intel vCPU 117 — half, for the same line item.
- Popcount width is the other lever. Graviton4 has 8 cores too but ~half the QPS — NEON
CNTis 128-bit vs AVX-512VPOPCNTDQ's 512-bit. (Apple M1 is the for-the-LOLs floor.)
That gives a clean buying rule. Throughput ≈ C·P/N (per-core rate × physical cores ÷ vectors scanned), so QPS/$ ≈ C·P / (N·price) → buy the densest AVX-512 physical cores per dollar. And since the hot scan is only 128 MB (the f32 rerank store is ~4 GB, still tiny), you want minimal RAM — compute-optimized, not memory. The 12–28 GB of slack on each box is the IVF lever: RAM holds all the cells, cores scan only the few a query touches — so size RAM for the corpus and cores for the QPS, independently.
How far one box scales — the bandwidth wall
Take the winner (AMD Zen5) and push one family across its whole range, 8 → 192 cores, same engine and recall (0.9952). The funnel is compute-bound only at the small end (~289 QPS/core); then DDR5-6400 saturates and cores stop converting into throughput (064).
QPS climbs ~linearly with cores while compute-bound, then DDR5-6400 saturates. Sub-full-socket boxes share the socket’s 12 channels and pin at ~13.6k (64–96c); only the full-socket 48xlarge gets them all and tops out at ~19.9k (≈358 GB/s, 58% of the 614 GB/s peak). Beyond that the memory bus, not the cores, is the ceiling.
It's a two-tier wall. A sub-full-socket box shares the socket's 12 memory channels with co-tenants and pins at ~13.6k QPS (64c and 96c land at the same number — more cores, no more throughput, ~245 GB/s). Only the full-socket 48xlarge gets all the channels and breaks out to 19,881 QPS (~358 GB/s, 58% of the 614 GB/s peak) — and even that is sublinear (2× cores → 1.47× QPS). The reason is simple: at scale every query streams the whole 128 MB of codes, so once the bus is full, that's the ceiling — achievable_BW ÷ 18 MB/query. Past the knee it's the memory bus, not the cores. (Which is also why small instances win perf-per-buck — scale out, not up.)
The best, in one line
A 1-bit binary funnel — rotation + residual to make the bits count, tiling for throughput (when bandwidth-bound), a carousel so latency never cliffs — does ~0.995 recall in 128 MB, at ~960 QPS on the Intel box it was tuned on and ~2310 QPS on the best-value AMD Zen5 silicon (same engine, the difference is physical cores × popcount width). On Matryoshka-256 codes the same funnel reaches 0.991 recall at ~5,000 QPS in 32 MB. The deepest lesson, end to end: on a CPU, what wins — the representation and the box — is whatever makes the per-vector operation a single wide SIMD instruction.
The field guide — every trick, its price, and when it holds
Sixty-six experiments compress to this table. The last column is the one that matters: almost every verdict is conditional on the regime — how many bytes the scan streams, and whether that fits in cache. Shrink the codes and the verdicts flip (it happened twice in this study: 003→007, and again at 066).
| trick | effect (measured) | holds when |
|---|---|---|
| 1-bit funnel + exact rerank (009) | 62× QPS at ~equal recall — the engine | always (the core design) |
| Random rotation (026) | +2–5 recall pts, free | always (needs pow-2 dims for FWHT) |
| Residual / centroid subtraction (046) | +3–9 recall pts, free | inside an IVF cell (needs a centroid) |
| Learned rotation (ITQ) (055, 066) | +0.3–1 recall pts, free at query time | low bit budgets (256); marginal at 1024 |
| Rerank width C (009, 065) | recall 0.95→0.998, ~linear QPS cost | always — the operating-point dial |
| Query tiling (016, 038, 066) | +73% QPS at 128 MB codes; −24% when the corpus fits in L3 | bandwidth-bound only — re-sweep per machine and per corpus size (codes × N, not just bits) |
| Intra-query parallelism / carousel (021, 039–041) | 4.4× lower p50; no latency cliff under burst | serving; fan-out must shrink as load grows |
| Prefix scan (014, 027) | +23% QPS at −0.8 recall pts | rotated codes; a knob, not a default |
| Matryoshka truncation (065) | 4× smaller codes, recall holds (0.99+) | only MRL-trained embeddings — a non-MRL prefix collapses to 0.72 |
| PDX layout + ADSampling (032, 033) | 4× faster exact scan @ recall ≈ 1.0 | the exact tier only |
| SIMD-ADC PQ4 (059) | 2–9× vs untiled popcount, 8 B/vec | vectorized and cache-tiny; the one gather that wins |
| Scalar PQ/OPQ (054, 056) | ¼ bytes, but 7–15× slower | footprint-bound only — gathers don't vectorize |
| HNSW inside a cell (043) | loses to flat 1-bit scan | only pays at low-D large cells |
| int8/bf16 rerank, register-tiling, prefetch, multi-acc popcount (012, 015, 023, 024) | neutral to −60% | never here — don't fight the autovectorizer |
Three laws survived every experiment:
- The winning representation is the one whose per-vector op is a single wide SIMD instruction (popcount beat every smarter-but-gather scheme, until PQ4 made the gather itself SIMD).
- Every byte reduction re-prices every lever. Faster kernel → memory-bound → batching pays (007). Corpus fits in cache → compute-bound → tiling hurts (066). The boundary is bytes × N vs cache, so the same codes flip verdicts at different corpus sizes — never carry a tuning verdict across regimes.
- Recall loss in a good binary code is ranking error, not identity loss (990k/990k codes unique at 256 bits) — which is why over-retrieve + exact rerank recovers it almost for free, and why the funnel architecture is right.
Epilogue: Matryoshka — the bandwidth wall, attacked from the embedding side
The bandwidth wall (064) said: past the knee, throughput = bus ÷ bytes-per-query. The last arc of the project (065, 066) attacked the bytes from the embedding side, with a model trained to be truncatable — OpenAI's text-embedding-3-large (MRL), sliced to 256 dims → 256-bit codes, 32 B/vector. Earlier, truncating a non-Matryoshka embedding to 256 bits collapsed recall to 0.72 (the bit-floor dead end); a Matryoshka embedding is the counterfactual:
Matryoshka-256 codes (32 B/vec, a 31.7 MB index for 990k vectors): recall dials smoothly to 0.999 because every code is unique — the shortlist just needs to be wide enough. The dashed line is the same 256-bit budget on an embedding not trained for truncation: it caps at ~0.72 no matter the C. The model, not the engine, decides whether 256 bits is enough. (QPS/latency shown are the 065 sweep, tile=8; the final config drops tiling for ~+18% QPS — 0.9907 @ ~4,966 QPS at C=2000, note 066.)
- Recall holds and dials cleanly: 0.947 at C=500 up to 0.998 at C=8000 — no collapse, no plateau. The chosen point: C=2000 → 0.991 recall @ ~5,000 QPS, p50 3.1 ms (16 vCPU Zen5).
- All 990k codes stay unique at 256 bits — recall loss is pure ranking error, exactly what rerank width fixes. (It also means the codes are near-incompressible — entropy per bit is the point.)
- The regime flipped — with a caveat: at 31.7 MB the codes are cache-resident, the scan turns compute-bound, and tiling — +73% on the 1024-bit funnel — now costs 24%. But be honest about the confound: 990k × 32 B is simply a small dataset — the flip is "this corpus fits in L3," not a property of 256-bit codes. At 100M vectors the same codes are 3.2 GB, DRAM-streamed, and tiling should pay again (047 showed exactly this cache→DRAM transition at scale). The real lesson stands either way: every byte reduction re-prices every lever — per corpus size.
- ITQ got un-parked: at 256 bits a learned rotation stacks with the residual for +0.3 recall points, free at query time (055's verdict was a 1024-bit verdict).
A fitting last experiment: the project's two levers — fewer bytes, cheaper instructions — pushed to where the embedding model itself becomes the compression scheme.
Where it stops, honestly
Sixty-six experiments in, the single-node story is characterized: throughput ≈ C·P/N until the bus saturates, recall a separable dial (bits × shortlist × code quality), latency bounded by the carousel, and the buying rule for silicon. The remaining questions are not answerable from a benchmark box, because they depend on a real workload:
- Access patterns — which cells are hot, how skewed the queries are, what's worth caching. Every tiering/placement decision (045, 060) inverts depending on the answer.
- Data size — how big the corpus is, how fast it grows and churns. It decides which side of the cache/DRAM boundary the codes land on — and that boundary flips the tuning verdicts (066) and sets where the bandwidth wall bites (064).
- Model capabilities — the embedding model is now part of the index design. A Matryoshka-trained model made 256-bit codes viable (0.99+ recall); the same truncation on a non-MRL model collapsed to 0.72. What you can store, and how fast you can scan it, depends on what the model was trained to survive (065).
- The recall that actually matters — 0.99 vs 0.999 is a product question (what does a miss cost?), and it moves every operating point in this writeup.
Cluster-scale directions — the IVF router over thousands of cells, elastic placement, RAM⇄disk tiering, latency-driven autoscaling, recall that floats with utilization — are all designable on paper from the numbers here. But designing them without a workload would just be speculation with extra steps. So this is where the study ends: a recursive loop of measure → explain → re-measure, run until the limits of vector search on one box were mapped — and the honest finding that the next limits belong to real usage, not the lab.
→ Browse the full trail: all 66 experiments