In-memory vector search · QPS · latency · recall

Pushing vector search
QPS & latency to the ceiling

A first-principles study of how many top-k vector queries per second a single CPU box can really do over modern embeddings — and what actually moves that number.

It works through the techniques behind state-of-the-art vector search — binary quantization, learned rotations, product quantization, SIMD scan kernels, Matryoshka embeddings — building up from a brute-force baseline, measured the ANN-Benchmarks way (recall / QPS / latency), keeping only what the data justified. Sixty-six numbered experiments later, this is the engine that came out best and the lessons that got there. The full experiment log is at /notes.

The headline numbers — Cohere v3 embeddings, 1M × 1024-dim, one 8-core EC2 box:

0.995

recall@10

~960

QPS

8 ms

p50 latency

~100× the throughput of the brute-force baseline, same recall, 128 MB of index. The same engine goes further on other silicon and other embeddings: ~2,310 QPS on 8-core AMD Zen5 (062), ~19,900 QPS on a full 192-core socket before the memory bus saturates (064), and 0.991 recall @ ~5,000 QPS, p50 3.1 ms on 256-bit Matryoshka codes (16 vCPU; 065, 066).

01how we got here — QPS at held recall

001 · exact brute force9 QPS1×

recall 1.000

scan all 4 GB / query

009 · binary funnel516 QPS57×

recall 0.993

1-bit scan + rerank

038 · + query tiling922 QPS102×

recall 0.997

amortize the base read

051 · + rotation + residual963 QPS107×

recall 0.995

make the bits count

Bar length is QPS (linear). The throughput goes ~107× from 9 to 963 — yet the recall column barely moves (≈ 0.99 throughout). That's the whole point: massive speed, ~no recall lost.

Scope — one IVF cell. This is the within-cell engine. A real index is IVF: k-means splits the corpus into thousands of clusters, each with a centroid, and a coarse router sends a query to just a few of them. Everything here is making the scan inside one cluster as fast as possible — the router that picks clusters is a layer above (out of scope). That assumption matters most for the residual below: each cluster has its own centroid to subtract.

The wall is memory, not compute

Brute force — distance to all 1M vectors per query — does ~9 QPS. The instinct is to speed up the kernel (FMA, AVX-512, batching); I did, and it roughly doubled, then stopped. The bottleneck was never compute. It's memory bandwidth: a query scans the whole base, and at 4 KB/vector that's 4 GB moved from RAM every time. You can't out-compute the bus.

That reframes everything into two levers: send fewer bytes, or spend fewer/cheaper instructions on each byte. The rest of the project is those two. (003–008)

02lever one — fewer bytes per query

float32 — 4 KB/vec4.0 GB/query

memory-bound: 4 GB / query

int8 — 1 KB/vec1.0 GB/query

4× smaller

1-bit code — 128 B/vec128 MB/query

32× smaller — fits the bus

Bar length is bytes read per query (linear). Scanning 1M vectors moves the whole base from RAM every query — no kernel trick moves 4 GB faster than the bus, so the lever is fewer bytes. 1-bit is a 32× cut.

The funnel

Fewest bytes: keep one sign bit per dimension — 32× smaller. Compare two of these codes with Hamming distance (XOR, count differing bits). It's coarse, so use it as a funnel: scan all N with the 1-bit code for a few-hundred shortlist, then re-rank only those with the real vectors. Recall ~0.99, QPS 9 → ~500 — the biggest single step. (009)

03the funnel — coarse filters, exact ranks

1,000,000 vectors

1-bit Hamming scan · 128 B/vec · popcount

↓

~500 candidates

exact f32 rerank · 4 KB/vec · only these

↓

top 10

the answer

Every vector gets the cheap 1-bit comparison; only the few hundred survivors pay for the exact one. You read 4 KB/vector for 500 vectors, not a million.

Lever two: the instruction matters more than the math

Counting set bits is VPOPCNTDQ — one SIMD instruction over eight 64-bit words. A "smarter" scheme with better recall-per-bit (asymmetric scoring via a lookup table) lost — because a table lookup is a gather the CPU can't vectorize, while popcount stays one instruction. The chart makes it concrete: the same job, three different per-vector instructions, and throughput tracks the instruction, not the math. The representation that wins is the one whose op is a single SIMD instruction — the law that decided nearly every later call. (010–012)

04same job, different instruction (scan throughput)

scalar PQ — table gather0.14 Gcmp/s

binary — popcount (VPOPCNTDQ)0.20 Gcmp/s

SIMD PQ4 — pshufb (8 B/vec)1.83 Gcmp/s

Same job, different instruction. The gather can't vectorize; popcount is one instruction over 8 words; the SIMD lookup does 16 lookups per instruction on codes 16× smaller — 9× the popcount scan. Single-thread, random codes.

Making the bits count

One bit/dim is brutal, so don't waste bits. A random rotation spreads information evenly across dimensions (free recall). The bigger lever is the residual, and it's exactly where the IVF assumption pays off: inside a cluster the vectors hug the cell's centroid, so their raw sign bits just re-encode that shared center — useless for telling them apart. IVF already gives every one of its thousands of cells a k-means centroid; subtract the cell's own centroid and the bits encode each vector's deviation instead. +3–9 recall points, free. Toggle it: (026, 046)

05residual — subtract the cell's center

distinct codes: 2/6

Raw: every point sits in the same quadrant, so every 2-bit code is identical. The bits carry the shared center, not what makes points different.

Recall is a dial, not a constant

Every experiment trades recall against speed and bytes, so I track recall at each operating point rather than chasing one number. The funnel's recall is set by how wide the shortlist is and how good the codes are — rotation and residual buy recall for free; widening the shortlist buys it for QPS. (051)

Throughput: tiling

The scan is bandwidth-bound, so reuse each loaded code: compare it against a tile of queries while it's hot in cache, before moving on. More comparisons per byte read; the base streams once per tile instead of per query. The tile has to fit in cache (L1/L2), which bounds how big it can get — tile≈8 was the sweet spot here. +73% QPS, identical results. (016, 038) — but only while the scan is bandwidth-bound: when the whole code store fits in cache (as the small Matryoshka-256 corpus did), tiling flips to a cost (066).

The frontier: recall vs QPS

Funnel + rotation + residual + tiling gives a clean frontier. Against Product Quantization (the textbook quantizer, fewer bytes), the funnel wins ~7× on QPS at equal recall — lever two again (PQ's op is a gather). (051, 054)

06recall vs QPS — funnel vs PQ

Up-and-right is better. The funnel holds ~0.99 recall at ~900 QPS; PQ reaches the same recall only near ~140 QPS — 7× slower, because its ADC is a gather, not a popcount.

The exact corner: layout + adaptive pruning

The funnel trades a little recall for huge speed. But sometimes you need exact (recall ~1.0) and the lossy codes are off the table — you have to scan full f32. Even there the same two levers apply, and they bought 4×. First, PDX: store vectors column-major (all dimension-0s together, then all dimension-1s…) instead of row-major. Same data, but the scan now autovectorizes cleanly — a free 2×. Then ADSampling: walk a vector's dimensions in rotated order and stop early the moment its partial distance can't beat the current k-th best — most vectors get rejected after a fraction of their dimensions. Together: 9 → 37 QPS at 0.999 recall. The funnel is still far faster when you can spend recall; this is the corner where you can't. (031, 032, 033)

07exact scan — layout + pruning (recall ~1.0)

naive f32 scan9.2 QPS1.0×

recall 1.000

row-major, full distance every vector

+ PDX vertical layout18.0 QPS2.0×

recall 1.000

column-major — autovectorizes, free 2×

+ ADSampling pruning37.4 QPS4.1×

recall 0.999

early-stop a vector once it can't win

4× the naive exact scan, recall still ~1.0

This is the exact regime — full f32, recall ~1.0 — not the funnel. Even here the two levers hold: the layout change is free SIMD, and pruning moves fewer bytes per vector. The funnel is still far faster when you can spend recall; this is the corner where you can't.

Serving without a cliff

A naive one-query-at-a-time server cliffs from ms to seconds the moment load exceeds capacity. The carousel has queries ride one shared, continuously-cycling scan — every byte read is shared by everyone aboard. Latency stays bounded: ~5 ms light, ~35 ms at the ~940 QPS ceiling, no cliff. (039–041)

08latency vs load — naive cliffs, carousel holds

per-query (naive)carousel

Same throughput ceiling, opposite tails. Past ~500 QPS the naive server cliffs to seconds; the carousel stays bounded (~35 ms at 1000 QPS). Note the log y-axis.

Things I tried — parked or negative

The honest other half of the log. Some of these work and are simply off-mission; some are dead ends worth knowing about so you don't repeat them.

Disk-resident vectors — measured the economics (codes in RAM, vectors on SSD = 32× less RAM). But warm disk just becomes page-cache-bound, i.e. back to CPU/bandwidth, and the brief here is max QPS without a cost constraint — so the engine stays fully in RAM. Works, parked. (045)
PQ-prune, a 3-tier funnel — binary → a product-quantization rerank → exact, cutting reads ~8× (1000 → 128 candidates) at 0.989 recall. A real win, but it only pays off in the disk regime above, so it's parked with it. (060)
Query-adaptive funnel width — set each query's shortlist from its own Hamming-gap "certificate" instead of a fixed 500. Cuts mean rerank work ~34% at matched recall — but in RAM the scan dominates, so it barely moves QPS; it's a latency lever for the disk regime. (048)
RaBitQ unbiased estimator — the best recall-per-bit of anything tried (0.61 vs 0.46). But its scan is 40× slower: the estimator isn't a single SIMD instruction. Lever two, in one data point. (028)
ITQ learned rotation — learning the rotation (vs a random one) adds ~1.5 recall points, but the residual already buys +3–9 for free, and a dense learned rotation costs more to apply. Not worth it at 1024 bits — revised at 256 bits in the epilogue below, where it stacks with residual for +0.3 free points. (055, 066)
Truncated (prefix) scan — scan only the first 768 of 1024 bits for +23% QPS at 0.987 recall; a rotation is what makes the prefix viable (+4.6 pts). A usable low-recall knob, distinct from the bit-floor below which collapsed. (014, 027)
Dead ends: scalar PQ / OPQ (footprint, not speed — gather-bound), HNSW inside a cell (loses to a flat 1-bit scan at high-D), bit-floor (4× fewer scan bits bought only +8% QPS while recall fell 0.995 → 0.72), bf16/int8 rerank, register-tiling, prefetch (neutral-to-negative). All in /notes.

The model, and scale

After enough data the engine is predictable: throughput ≈ 1.1e9 · cores / N, recall a power law in N, shortlist, and bits — separable axes, accurate to ~1 recall point. And it scales: pushed to 100M codes, scan throughput stays flat across the cache→DRAM transition — compactness is the scaling property. (047, 049)

The surprise

"Popcount always beats gather" held — until I wrote the SIMD version of PQ's lookup: 4-bit codes, an in-register pshufb table (safe portable-SIMD Rust), 16 lookups per instruction. At 8 bytes/vector it stays in cache and runs 2–9× faster than an untiled popcount scan (the green bar above). Gather isn't dead — it had to be vectorized and tiny. Whether it beats the tiled funnel end-to-end is the next thing I'd chase. (059)

The hardware: what to actually buy

The same engine, same data, same flags across seven boxes — three vendors × two families plus an Apple-Silicon ringer, all the advised 8-vCPU instance. Recall stays a flat 0.9952, so the only variables are silicon and price. The winner is decisive: AMD Zen5 does ~2310 QPS — 2.5× the Intel box the engine was tuned on (062).

09perf per dollar across 7 boxes (recall held constant)

recall 0.9952 — held constant

c8a.2xlarge · 8 vCPU (8 phys) · AVX-512612$3,776/yr

m8a.2xlarge · 8 vCPU (8 phys) · AVX-512542$4,265/yr

c8g.2xlarge · 8 vCPU (8 phys) · NEON353$2,794/yr

m8g.2xlarge · 8 vCPU (8 phys) · NEON312$3,145/yr

c8i.2xlarge · 8 vCPU (4 phys) · AVX-512284$3,283/yr

m8i.2xlarge · 8 vCPU (4 phys) · AVX-512249$3,709/yr

mac2.metal · 8 vCPU (8 phys) · NEON125$5,681/yr

QPS per $1k/yr, same engine on every box. AMD Zen5 wins both raw and per-dollar — not magic silicon, but 8 real AVX-512 cores at the “8 vCPU” tier vs Intel’s 4 + hyperthreading. Graviton4 has 8 cores too but half-width NEON popcount; Apple M1 is the for-the-LOLs floor. Throughput ≈ physical cores × popcount width.

The funnel is compute-bound on popcount, so throughput is two multiplicative levers — physical cores × popcount width — and the "8 vCPU" label hides both:

vCPUs aren't cores. At the 8-vCPU tier AMD/Graviton give 8 physical cores; Intel gives 4 + hyperthreading. A scaling sweep settles it: QPS is dead-linear to 4 threads (211→413→850) then the SMT 2nd thread regresses (808). So an AMD vCPU does 289 QPS, an Intel vCPU 117 — half, for the same line item.
Popcount width is the other lever. Graviton4 has 8 cores too but ~half the QPS — NEON CNT is 128-bit vs AVX-512 VPOPCNTDQ's 512-bit. (Apple M1 is the for-the-LOLs floor.)

That gives a clean buying rule. Throughput ≈ C·P/N (per-core rate × physical cores ÷ vectors scanned), so QPS/$ ≈ C·P / (N·price) → buy the densest AVX-512 physical cores per dollar. And since the hot scan is only 128 MB (the f32 rerank store is ~4 GB, still tiny), you want minimal RAM — compute-optimized, not memory. The 12–28 GB of slack on each box is the IVF lever: RAM holds all the cells, cores scan only the few a query touches — so size RAM for the corpus and cores for the QPS, independently.

How far one box scales — the bandwidth wall

Take the winner (AMD Zen5) and push one family across its whole range, 8 → 192 cores, same engine and recall (0.9952). The funnel is compute-bound only at the small end (~289 QPS/core); then DDR5-6400 saturates and cores stop converting into throughput (064).

10c8a scaling — QPS vs cores, until DDR5 saturates

QPS climbs ~linearly with cores while compute-bound, then DDR5-6400 saturates. Sub-full-socket boxes share the socket’s 12 channels and pin at ~13.6k (64–96c); only the full-socket 48xlarge gets them all and tops out at ~19.9k (≈358 GB/s, 58% of the 614 GB/s peak). Beyond that the memory bus, not the cores, is the ceiling.

It's a two-tier wall. A sub-full-socket box shares the socket's 12 memory channels with co-tenants and pins at ~13.6k QPS (64c and 96c land at the same number — more cores, no more throughput, ~245 GB/s). Only the full-socket 48xlarge gets all the channels and breaks out to 19,881 QPS (~358 GB/s, 58% of the 614 GB/s peak) — and even that is sublinear (2× cores → 1.47× QPS). The reason is simple: at scale every query streams the whole 128 MB of codes, so once the bus is full, that's the ceiling — achievable_BW ÷ 18 MB/query. Past the knee it's the memory bus, not the cores. (Which is also why small instances win perf-per-buck — scale out, not up.)

The best, in one line

A 1-bit binary funnel — rotation + residual to make the bits count, tiling for throughput (when bandwidth-bound), a carousel so latency never cliffs — does ~0.995 recall in 128 MB, at ~960 QPS on the Intel box it was tuned on and ~2310 QPS on the best-value AMD Zen5 silicon (same engine, the difference is physical cores × popcount width). On Matryoshka-256 codes the same funnel reaches 0.991 recall at ~5,000 QPS in 32 MB. The deepest lesson, end to end: on a CPU, what wins — the representation and the box — is whatever makes the per-vector operation a single wide SIMD instruction.

The field guide — every trick, its price, and when it holds

Sixty-six experiments compress to this table. The last column is the one that matters: almost every verdict is conditional on the regime — how many bytes the scan streams, and whether that fits in cache. Shrink the codes and the verdicts flip (it happened twice in this study: 003→007, and again at 066).

trick	effect (measured)	holds when
1-bit funnel + exact rerank (009)	62× QPS at ~equal recall — the engine	always (the core design)
Random rotation (026)	+2–5 recall pts, free	always (needs pow-2 dims for FWHT)
Residual / centroid subtraction (046)	+3–9 recall pts, free	inside an IVF cell (needs a centroid)
Learned rotation (ITQ) (055, 066)	+0.3–1 recall pts, free at query time	low bit budgets (256); marginal at 1024
Rerank width C (009, 065)	recall 0.95→0.998, ~linear QPS cost	always — the operating-point dial
Query tiling (016, 038, 066)	+73% QPS at 128 MB codes; −24% when the corpus fits in L3	bandwidth-bound only — re-sweep per machine and per corpus size (codes × N, not just bits)
Intra-query parallelism / carousel (021, 039–041)	4.4× lower p50; no latency cliff under burst	serving; fan-out must shrink as load grows
Prefix scan (014, 027)	+23% QPS at −0.8 recall pts	rotated codes; a knob, not a default
Matryoshka truncation (065)	4× smaller codes, recall holds (0.99+)	only MRL-trained embeddings — a non-MRL prefix collapses to 0.72
PDX layout + ADSampling (032, 033)	4× faster exact scan @ recall ≈ 1.0	the exact tier only
SIMD-ADC PQ4 (059)	2–9× vs untiled popcount, 8 B/vec	vectorized and cache-tiny; the one gather that wins
Scalar PQ/OPQ (054, 056)	¼ bytes, but 7–15× slower	footprint-bound only — gathers don't vectorize
HNSW inside a cell (043)	loses to flat 1-bit scan	only pays at low-D large cells
int8/bf16 rerank, register-tiling, prefetch, multi-acc popcount (012, 015, 023, 024)	neutral to −60%	never here — don't fight the autovectorizer

Three laws survived every experiment:

The winning representation is the one whose per-vector op is a single wide SIMD instruction (popcount beat every smarter-but-gather scheme, until PQ4 made the gather itself SIMD).
Every byte reduction re-prices every lever. Faster kernel → memory-bound → batching pays (007). Corpus fits in cache → compute-bound → tiling hurts (066). The boundary is bytes × N vs cache, so the same codes flip verdicts at different corpus sizes — never carry a tuning verdict across regimes.
Recall loss in a good binary code is ranking error, not identity loss (990k/990k codes unique at 256 bits) — which is why over-retrieve + exact rerank recovers it almost for free, and why the funnel architecture is right.

Epilogue: Matryoshka — the bandwidth wall, attacked from the embedding side

The bandwidth wall (064) said: past the knee, throughput = bus ÷ bytes-per-query. The last arc of the project (065, 066) attacked the bytes from the embedding side, with a model trained to be truncatable — OpenAI's text-embedding-3-large (MRL), sliced to 256 dims → 256-bit codes, 32 B/vector. Earlier, truncating a non-Matryoshka embedding to 256 bits collapsed recall to 0.72 (the bit-floor dead end); a Matryoshka embedding is the counterfactual:

11matryoshka-256 — recall vs shortlist, against the bit-floor

rerank C2000

0.9907

recall@10

4,227

QPS

3.09 ms

p50

3.15 ms

p99

Matryoshka-256 codes (32 B/vec, a 31.7 MB index for 990k vectors): recall dials smoothly to 0.999 because every code is unique — the shortlist just needs to be wide enough. The dashed line is the same 256-bit budget on an embedding not trained for truncation: it caps at ~0.72 no matter the C. The model, not the engine, decides whether 256 bits is enough. (QPS/latency shown are the 065 sweep, tile=8; the final config drops tiling for ~+18% QPS — 0.9907 @ ~4,966 QPS at C=2000, note 066.)

Recall holds and dials cleanly: 0.947 at C=500 up to 0.998 at C=8000 — no collapse, no plateau. The chosen point: C=2000 → 0.991 recall @ ~5,000 QPS, p50 3.1 ms (16 vCPU Zen5).
All 990k codes stay unique at 256 bits — recall loss is pure ranking error, exactly what rerank width fixes. (It also means the codes are near-incompressible — entropy per bit is the point.)
The regime flipped — with a caveat: at 31.7 MB the codes are cache-resident, the scan turns compute-bound, and tiling — +73% on the 1024-bit funnel — now costs 24%. But be honest about the confound: 990k × 32 B is simply a small dataset — the flip is "this corpus fits in L3," not a property of 256-bit codes. At 100M vectors the same codes are 3.2 GB, DRAM-streamed, and tiling should pay again (047 showed exactly this cache→DRAM transition at scale). The real lesson stands either way: every byte reduction re-prices every lever — per corpus size.
ITQ got un-parked: at 256 bits a learned rotation stacks with the residual for +0.3 recall points, free at query time (055's verdict was a 1024-bit verdict).

A fitting last experiment: the project's two levers — fewer bytes, cheaper instructions — pushed to where the embedding model itself becomes the compression scheme.

Where it stops, honestly

Sixty-six experiments in, the single-node story is characterized: throughput ≈ C·P/N until the bus saturates, recall a separable dial (bits × shortlist × code quality), latency bounded by the carousel, and the buying rule for silicon. The remaining questions are not answerable from a benchmark box, because they depend on a real workload:

Access patterns — which cells are hot, how skewed the queries are, what's worth caching. Every tiering/placement decision (045, 060) inverts depending on the answer.
Data size — how big the corpus is, how fast it grows and churns. It decides which side of the cache/DRAM boundary the codes land on — and that boundary flips the tuning verdicts (066) and sets where the bandwidth wall bites (064).
Model capabilities — the embedding model is now part of the index design. A Matryoshka-trained model made 256-bit codes viable (0.99+ recall); the same truncation on a non-MRL model collapsed to 0.72. What you can store, and how fast you can scan it, depends on what the model was trained to survive (065).
The recall that actually matters — 0.99 vs 0.999 is a product question (what does a miss cost?), and it moves every operating point in this writeup.

Cluster-scale directions — the IVF router over thousands of cells, elastic placement, RAM⇄disk tiering, latency-driven autoscaling, recall that floats with utilization — are all designable on paper from the numbers here. But designing them without a workload would just be speculation with extra steps. So this is where the study ends: a recursive loop of measure → explain → re-measure, run until the limits of vector search on one box were mapped — and the honest finding that the next limits belong to real usage, not the lab.

→ Browse the full trail: all 66 experiments