In-memory vector search · QPS · latency · recall

Pushing vector search
QPS & latency to the ceiling

A first-principles study of how many top-k vector queries per second a single CPU box can really do over modern embeddings — and what actually moves that number.

It works through the techniques behind state-of-the-art vector search — binary quantization, learned rotations, product quantization, SIMD scan kernels, Matryoshka embeddings — building up from a brute-force baseline, measured the ANN-Benchmarks way (recall / QPS / latency), keeping only what the data justified. Sixty-six numbered experiments later, this is the engine that came out best and the lessons that got there. The full experiment log is at /notes.

The headline numbers — Cohere v3 embeddings, 1M × 1024-dim, one 8-core EC2 box:

0.995
recall@10
~960
QPS
8 ms
p50 latency

~100× the throughput of the brute-force baseline, same recall, 128 MB of index. The same engine goes further on other silicon and other embeddings: ~2,310 QPS on 8-core AMD Zen5 (062), ~19,900 QPS on a full 192-core socket before the memory bus saturates (064), and 0.991 recall @ ~5,000 QPS, p50 3.1 ms on 256-bit Matryoshka codes (16 vCPU; 065, 066).

01how we got here — QPS at held recall
001 · exact brute force9 QPS1×
recall 1.000

scan all 4 GB / query

009 · binary funnel516 QPS57×
recall 0.993

1-bit scan + rerank

038 · + query tiling922 QPS102×
recall 0.997

amortize the base read

051 · + rotation + residual963 QPS107×
recall 0.995

make the bits count

Bar length is QPS (linear). The throughput goes ~107× from 9 to 963 — yet the recall column barely moves (≈ 0.99 throughout). That's the whole point: massive speed, ~no recall lost.

Scope — one IVF cell. This is the within-cell engine. A real index is IVF: k-means splits the corpus into thousands of clusters, each with a centroid, and a coarse router sends a query to just a few of them. Everything here is making the scan inside one cluster as fast as possible — the router that picks clusters is a layer above (out of scope). That assumption matters most for the residual below: each cluster has its own centroid to subtract.

The wall is memory, not compute

Brute force — distance to all 1M vectors per query — does ~9 QPS. The instinct is to speed up the kernel (FMA, AVX-512, batching); I did, and it roughly doubled, then stopped. The bottleneck was never compute. It's memory bandwidth: a query scans the whole base, and at 4 KB/vector that's 4 GB moved from RAM every time. You can't out-compute the bus.

That reframes everything into two levers: send fewer bytes, or spend fewer/cheaper instructions on each byte. The rest of the project is those two. (003–008)

02lever one — fewer bytes per query
float32 — 4 KB/vec4.0 GB/query

memory-bound: 4 GB / query

int8 — 1 KB/vec1.0 GB/query

4× smaller

1-bit code — 128 B/vec128 MB/query

32× smaller — fits the bus

Bar length is bytes read per query (linear). Scanning 1M vectors moves the whole base from RAM every query — no kernel trick moves 4 GB faster than the bus, so the lever is fewer bytes. 1-bit is a 32× cut.

The funnel

Fewest bytes: keep one sign bit per dimension — 32× smaller. Compare two of these codes with Hamming distance (XOR, count differing bits). It's coarse, so use it as a funnel: scan all N with the 1-bit code for a few-hundred shortlist, then re-rank only those with the real vectors. Recall ~0.99, QPS 9 → ~500 — the biggest single step. (009)

03the funnel — coarse filters, exact ranks
1,000,000 vectors

1-bit Hamming scan · 128 B/vec · popcount

~500 candidates

exact f32 rerank · 4 KB/vec · only these

top 10

the answer

Every vector gets the cheap 1-bit comparison; only the few hundred survivors pay for the exact one. You read 4 KB/vector for 500 vectors, not a million.

Lever two: the instruction matters more than the math

Counting set bits is VPOPCNTDQ — one SIMD instruction over eight 64-bit words. A "smarter" scheme with better recall-per-bit (asymmetric scoring via a lookup table) lost — because a table lookup is a gather the CPU can't vectorize, while popcount stays one instruction. The chart makes it concrete: the same job, three different per-vector instructions, and throughput tracks the instruction, not the math. The representation that wins is the one whose op is a single SIMD instruction — the law that decided nearly every later call. (010–012)

04same job, different instruction (scan throughput)
scalar PQ — table gather0.14 Gcmp/s
binary — popcount (VPOPCNTDQ)0.20 Gcmp/s
SIMD PQ4 — pshufb (8 B/vec)1.83 Gcmp/s

Same job, different instruction. The gather can't vectorize; popcount is one instruction over 8 words; the SIMD lookup does 16 lookups per instruction on codes 16× smaller — the popcount scan. Single-thread, random codes.

Making the bits count

One bit/dim is brutal, so don't waste bits. A random rotation spreads information evenly across dimensions (free recall). The bigger lever is the residual, and it's exactly where the IVF assumption pays off: inside a cluster the vectors hug the cell's centroid, so their raw sign bits just re-encode that shared center — useless for telling them apart. IVF already gives every one of its thousands of cells a k-means centroid; subtract the cell's own centroid and the bits encode each vector's deviation instead. +3–9 recall points, free. Toggle it: (026, 046)

05residual — subtract the cell's center
distinct codes: 2/6
cell centroid+++-+++++-++

Raw: every point sits in the same quadrant, so every 2-bit code is identical. The bits carry the shared center, not what makes points different.

Recall is a dial, not a constant

Every experiment trades recall against speed and bytes, so I track recall at each operating point rather than chasing one number. The funnel's recall is set by how wide the shortlist is and how good the codes are — rotation and residual buy recall for free; widening the shortlist buys it for QPS. (051)

Throughput: tiling

The scan is bandwidth-bound, so reuse each loaded code: compare it against a tile of queries while it's hot in cache, before moving on. More comparisons per byte read; the base streams once per tile instead of per query. The tile has to fit in cache (L1/L2), which bounds how big it can get — tile≈8 was the sweet spot here. +73% QPS, identical results. (016, 038) — but only while the scan is bandwidth-bound: when the whole code store fits in cache (as the small Matryoshka-256 corpus did), tiling flips to a cost (066).

The frontier: recall vs QPS

Funnel + rotation + residual + tiling gives a clean frontier. Against Product Quantization (the textbook quantizer, fewer bytes), the funnel wins ~7× on QPS at equal recall — lever two again (PQ's op is a gather). (051, 054)

06recall vs QPS — funnel vs PQ
0.900.951.00501002505001000QPS (log) →recall@10 →M=16M=32M=64C=200C=500C=1000

Up-and-right is better. The funnel holds ~0.99 recall at ~900 QPS; PQ reaches the same recall only near ~140 QPS — 7× slower, because its ADC is a gather, not a popcount.

The exact corner: layout + adaptive pruning

The funnel trades a little recall for huge speed. But sometimes you need exact (recall ~1.0) and the lossy codes are off the table — you have to scan full f32. Even there the same two levers apply, and they bought . First, PDX: store vectors column-major (all dimension-0s together, then all dimension-1s…) instead of row-major. Same data, but the scan now autovectorizes cleanly — a free 2×. Then ADSampling: walk a vector's dimensions in rotated order and stop early the moment its partial distance can't beat the current k-th best — most vectors get rejected after a fraction of their dimensions. Together: 9 → 37 QPS at 0.999 recall. The funnel is still far faster when you can spend recall; this is the corner where you can't. (031, 032, 033)

07exact scan — layout + pruning (recall ~1.0)
naive f32 scan9.2 QPS1.0×
recall 1.000

row-major, full distance every vector

+ PDX vertical layout18.0 QPS2.0×
recall 1.000

column-major — autovectorizes, free 2×

+ ADSampling pruning37.4 QPS4.1×
recall 0.999

early-stop a vector once it can't win

4× the naive exact scan, recall still ~1.0

This is the exact regime — full f32, recall ~1.0 — not the funnel. Even here the two levers hold: the layout change is free SIMD, and pruning moves fewer bytes per vector. The funnel is still far faster when you can spend recall; this is the corner where you can't.

Serving without a cliff

A naive one-query-at-a-time server cliffs from ms to seconds the moment load exceeds capacity. The carousel has queries ride one shared, continuously-cycling scan — every byte read is shared by everyone aboard. Latency stays bounded: ~5 ms light, ~35 ms at the ~940 QPS ceiling, no cliff. (039–041)

08latency vs load — naive cliffs, carousel holds
per-query (naive)carousel
10ms100ms1000ms2006001000offered QPS →p50 latency (log) →

Same throughput ceiling, opposite tails. Past ~500 QPS the naive server cliffs to seconds; the carousel stays bounded (~35 ms at 1000 QPS). Note the log y-axis.

Things I tried — parked or negative

The honest other half of the log. Some of these work and are simply off-mission; some are dead ends worth knowing about so you don't repeat them.

The model, and scale

After enough data the engine is predictable: throughput ≈ 1.1e9 · cores / N, recall a power law in N, shortlist, and bits — separable axes, accurate to ~1 recall point. And it scales: pushed to 100M codes, scan throughput stays flat across the cache→DRAM transition — compactness is the scaling property. (047, 049)

The surprise

"Popcount always beats gather" held — until I wrote the SIMD version of PQ's lookup: 4-bit codes, an in-register pshufb table (safe portable-SIMD Rust), 16 lookups per instruction. At 8 bytes/vector it stays in cache and runs 2–9× faster than an untiled popcount scan (the green bar above). Gather isn't dead — it had to be vectorized and tiny. Whether it beats the tiled funnel end-to-end is the next thing I'd chase. (059)

The hardware: what to actually buy

The same engine, same data, same flags across seven boxes — three vendors × two families plus an Apple-Silicon ringer, all the advised 8-vCPU instance. Recall stays a flat 0.9952, so the only variables are silicon and price. The winner is decisive: AMD Zen5 does ~2310 QPS — 2.5× the Intel box the engine was tuned on (062).

09perf per dollar across 7 boxes (recall held constant)
recall 0.9952 — held constant
c8a.2xlarge · 8 vCPU (8 phys) · AVX-512612$3,776/yr
m8a.2xlarge · 8 vCPU (8 phys) · AVX-512542$4,265/yr
c8g.2xlarge · 8 vCPU (8 phys) · NEON353$2,794/yr
m8g.2xlarge · 8 vCPU (8 phys) · NEON312$3,145/yr
c8i.2xlarge · 8 vCPU (4 phys) · AVX-512284$3,283/yr
m8i.2xlarge · 8 vCPU (4 phys) · AVX-512249$3,709/yr
mac2.metal · 8 vCPU (8 phys) · NEON125$5,681/yr

QPS per $1k/yr, same engine on every box. AMD Zen5 wins both raw and per-dollar — not magic silicon, but 8 real AVX-512 cores at the “8 vCPU” tier vs Intel’s 4 + hyperthreading. Graviton4 has 8 cores too but half-width NEON popcount; Apple M1 is the for-the-LOLs floor. Throughput ≈ physical cores × popcount width.

The funnel is compute-bound on popcount, so throughput is two multiplicative levers — physical cores × popcount width — and the "8 vCPU" label hides both:

That gives a clean buying rule. Throughput ≈ C·P/N (per-core rate × physical cores ÷ vectors scanned), so QPS/$ ≈ C·P / (N·price) → buy the densest AVX-512 physical cores per dollar. And since the hot scan is only 128 MB (the f32 rerank store is ~4 GB, still tiny), you want minimal RAM — compute-optimized, not memory. The 12–28 GB of slack on each box is the IVF lever: RAM holds all the cells, cores scan only the few a query touches — so size RAM for the corpus and cores for the QPS, independently.

How far one box scales — the bandwidth wall

Take the winner (AMD Zen5) and push one family across its whole range, 8 → 192 cores, same engine and recall (0.9952). The funnel is compute-bound only at the small end (~289 QPS/core); then DDR5-6400 saturates and cores stop converting into throughput (064).

10c8a scaling — QPS vs cores, until DDR5 saturates
5k10k15k20k81632486496192physical cores (log) →funnel QPS →compute-boundshared-socket plateau ~13.6kfull socket 19.9k

QPS climbs ~linearly with cores while compute-bound, then DDR5-6400 saturates. Sub-full-socket boxes share the socket’s 12 channels and pin at ~13.6k (64–96c); only the full-socket 48xlarge gets them all and tops out at ~19.9k (≈358 GB/s, 58% of the 614 GB/s peak). Beyond that the memory bus, not the cores, is the ceiling.

It's a two-tier wall. A sub-full-socket box shares the socket's 12 memory channels with co-tenants and pins at ~13.6k QPS (64c and 96c land at the same number — more cores, no more throughput, ~245 GB/s). Only the full-socket 48xlarge gets all the channels and breaks out to 19,881 QPS (~358 GB/s, 58% of the 614 GB/s peak) — and even that is sublinear (2× cores → 1.47× QPS). The reason is simple: at scale every query streams the whole 128 MB of codes, so once the bus is full, that's the ceiling — achievable_BW ÷ 18 MB/query. Past the knee it's the memory bus, not the cores. (Which is also why small instances win perf-per-buck — scale out, not up.)

The best, in one line

A 1-bit binary funnel — rotation + residual to make the bits count, tiling for throughput (when bandwidth-bound), a carousel so latency never cliffs — does ~0.995 recall in 128 MB, at ~960 QPS on the Intel box it was tuned on and ~2310 QPS on the best-value AMD Zen5 silicon (same engine, the difference is physical cores × popcount width). On Matryoshka-256 codes the same funnel reaches 0.991 recall at ~5,000 QPS in 32 MB. The deepest lesson, end to end: on a CPU, what wins — the representation and the box — is whatever makes the per-vector operation a single wide SIMD instruction.

The field guide — every trick, its price, and when it holds

Sixty-six experiments compress to this table. The last column is the one that matters: almost every verdict is conditional on the regime — how many bytes the scan streams, and whether that fits in cache. Shrink the codes and the verdicts flip (it happened twice in this study: 003→007, and again at 066).

trickeffect (measured)holds when
1-bit funnel + exact rerank (009)62× QPS at ~equal recall — the enginealways (the core design)
Random rotation (026)+2–5 recall pts, freealways (needs pow-2 dims for FWHT)
Residual / centroid subtraction (046)+3–9 recall pts, freeinside an IVF cell (needs a centroid)
Learned rotation (ITQ) (055, 066)+0.3–1 recall pts, free at query timelow bit budgets (256); marginal at 1024
Rerank width C (009, 065)recall 0.95→0.998, ~linear QPS costalways — the operating-point dial
Query tiling (016, 038, 066)+73% QPS at 128 MB codes; −24% when the corpus fits in L3bandwidth-bound only — re-sweep per machine and per corpus size (codes × N, not just bits)
Intra-query parallelism / carousel (021, 039–041)4.4× lower p50; no latency cliff under burstserving; fan-out must shrink as load grows
Prefix scan (014, 027)+23% QPS at −0.8 recall ptsrotated codes; a knob, not a default
Matryoshka truncation (065)4× smaller codes, recall holds (0.99+)only MRL-trained embeddings — a non-MRL prefix collapses to 0.72
PDX layout + ADSampling (032, 033)4× faster exact scan @ recall ≈ 1.0the exact tier only
SIMD-ADC PQ4 (059)2–9× vs untiled popcount, 8 B/vecvectorized and cache-tiny; the one gather that wins
Scalar PQ/OPQ (054, 056)¼ bytes, but 7–15× slowerfootprint-bound only — gathers don't vectorize
HNSW inside a cell (043)loses to flat 1-bit scanonly pays at low-D large cells
int8/bf16 rerank, register-tiling, prefetch, multi-acc popcount (012, 015, 023, 024)neutral to −60%never here — don't fight the autovectorizer

Three laws survived every experiment:

  1. The winning representation is the one whose per-vector op is a single wide SIMD instruction (popcount beat every smarter-but-gather scheme, until PQ4 made the gather itself SIMD).
  2. Every byte reduction re-prices every lever. Faster kernel → memory-bound → batching pays (007). Corpus fits in cache → compute-bound → tiling hurts (066). The boundary is bytes × N vs cache, so the same codes flip verdicts at different corpus sizes — never carry a tuning verdict across regimes.
  3. Recall loss in a good binary code is ranking error, not identity loss (990k/990k codes unique at 256 bits) — which is why over-retrieve + exact rerank recovers it almost for free, and why the funnel architecture is right.

Epilogue: Matryoshka — the bandwidth wall, attacked from the embedding side

The bandwidth wall (064) said: past the knee, throughput = bus ÷ bytes-per-query. The last arc of the project (065, 066) attacked the bytes from the embedding side, with a model trained to be truncatable — OpenAI's text-embedding-3-large (MRL), sliced to 256 dims → 256-bit codes, 32 B/vector. Earlier, truncating a non-Matryoshka embedding to 256 bits collapsed recall to 0.72 (the bit-floor dead end); a Matryoshka embedding is the counterfactual:

11matryoshka-256 — recall vs shortlist, against the bit-floor
0.9907
recall@10
4,227
QPS
3.09 ms
p50
3.15 ms
p99
0.700.800.901.005001000200040008000rerank shortlist C (log) →recall@10 →non-Matryoshka @ 256 bits ≈ 0.72 (bit-floor)0.9907

Matryoshka-256 codes (32 B/vec, a 31.7 MB index for 990k vectors): recall dials smoothly to 0.999 because every code is unique — the shortlist just needs to be wide enough. The dashed line is the same 256-bit budget on an embedding not trained for truncation: it caps at ~0.72 no matter the C. The model, not the engine, decides whether 256 bits is enough. (QPS/latency shown are the 065 sweep, tile=8; the final config drops tiling for ~+18% QPS — 0.9907 @ ~4,966 QPS at C=2000, note 066.)

A fitting last experiment: the project's two levers — fewer bytes, cheaper instructions — pushed to where the embedding model itself becomes the compression scheme.

Where it stops, honestly

Sixty-six experiments in, the single-node story is characterized: throughput ≈ C·P/N until the bus saturates, recall a separable dial (bits × shortlist × code quality), latency bounded by the carousel, and the buying rule for silicon. The remaining questions are not answerable from a benchmark box, because they depend on a real workload:

Cluster-scale directions — the IVF router over thousands of cells, elastic placement, RAM⇄disk tiering, latency-driven autoscaling, recall that floats with utilization — are all designable on paper from the numbers here. But designing them without a workload would just be speculation with extra steps. So this is where the study ends: a recursive loop of measure → explain → re-measure, run until the limits of vector search on one box were mapped — and the honest finding that the next limits belong to real usage, not the lab.

→ Browse the full trail: all 66 experiments