Experiment 062
Hardware cost right-sizing: throughput is physical cores × popcount width
The engine had only ever run on one box (Intel Granite Rapids, the old ~960-QPS headline). This entry runs the same engine across seven instances — three vendors × two families plus an Apple-Silicon ringer — to answer one question: at a fixed spec, which silicon gives the most QPS per dollar?
Method — everything held constant except the silicon
Identical committed code, identical Cohere v3 (1M × 1024) data, identical flags on every
box: the shipped 1-bit funnel (--rotate 2 --residual --batch 8 --rerank 500),
built with -C target-cpu=native so each CPU autovectorizes to its
widest popcount (AVX-512 VPOPCNTDQ on Intel/AMD, NEON CNT on Graviton/Apple —
verified in the disassembly). Every box is the advised .2xlarge (8 vCPU) instance —
exactly the unit AWS sells; the physical-core counts below are just what's under that
8-vCPU label. Because nothing varies but the chip, recall is a constant 0.9952 across
all seven — the comparison is purely QPS, p50, and QPS-per-dollar. ($/hr = us-east-2
on-demand, one region for all, to strip the regional pricing of the AMD boxes — which only
run in Tokyo — from the comparison.)
Result
All are 8-vCPU instances; "phys" is what's under that vCPU label (Intel = 4 cores + SMT). Cost is us-east-2 on-demand, annualized (24×365h).
| instance (8 vCPU) | vendor | vCPU | phys | popcount | funnel QPS | p50 | $/yr | QPS per $1k/yr | QPS/vCPU |
|---|---|---|---|---|---|---|---|---|---|
| c8a.2xlarge | AMD Zen5 | 8 | 8 | AVX-512 | 2310 | 3.1 ms | $3,776 | 612 | 289 |
| m8a.2xlarge | AMD Zen5 | 8 | 8 | AVX-512 | 2313 | 3.1 ms | $4,265 | 542 | 289 |
| c8g.2xlarge | Graviton4 | 8 | 8 | NEON | 986 | 6.6 ms | $2,785 | 354 | 123 |
| m8g.2xlarge | Graviton4 | 8 | 8 | NEON | 981 | 6.8 ms | $3,145 | 312 | 123 |
| c8i.2xlarge | Intel Granite | 8 | 4 | AVX-512 | 934 | 10.6 ms | $3,283 | 285 | 117 |
| m8i.2xlarge | Intel Granite | 8 | 4 | AVX-512 | 924 | 11.1 ms | $3,709 | 249 | 116 |
| mac2.metal | Apple M1 | 8 | 8 | NEON | 712 | 5.7 ms | $5,681 | 125 | 89 |
The kicker is QPS/vCPU: you pay per vCPU, and an AMD vCPU does 289 QPS while an Intel vCPU does 117 — 2.5× less for the same line item — because half of Intel's vCPUs are SMT threads sharing four real cores.
AMD Zen5 wins outright — best throughput, best latency, best $/QPS — and the old Intel Granite headline (934, matching c8i exactly) is 2.5× off the frontier. Compute vs general (c vs m) is a wash within every vendor: same silicon, and the ~4 GB working set fits both, so the extra RAM buys nothing.
Why — two multiplicative levers, no magic
The funnel is compute-bound on popcount, so:
funnel QPS ≈ C_isa · P / N
with P = physical cores, N = vectors scanned per query, and C_isa the per-core
rate measured here (1024-dim, with rerank): AVX-512 core ≈ 2.9e8 (AMD) / 2.3e8
(Intel, it downclocks a touch under heavy AVX-512); NEON core ≈ 1.2e8 (Graviton4) /
0.9e8 (M1). Two findings make the constant concrete:
1. P is physical cores, not vCPUs — SMT is a mirage. The "8 vCPU" tier is 8 real
cores on AMD/Graviton but 4 cores + hyperthreading on Intel. A thread-scaling sweep
on c8i settled it:
1 thread → 211 2 → 413 (1.96×) 4 → 850 (4.03×) 8 → 808 (−5%)
Dead-linear to 4 physical cores, then the SMT second thread regresses — it contends for the same execution units the kernel already saturates. So Intel's 8-vCPU box does the work of 4 cores; AMD's does 8. That alone is the 2× gap.
2. Popcount width is the other factor. Graviton4 has the same 8 cores as AMD yet
~half the funnel QPS (986 vs 2310) — because NEON CNT is 128-bit vs AVX-512
VPOPCNTDQ's 512-bit. Per-core: AVX-512 ≈ 2× NEON. This is "lever two" (the instruction,
not the math) showing up as a purchasing decision.
So QPS ≈ cores × popcount-width. AMD wins by maxing both (8 × 512-bit). Intel:
4 × 512. Graviton4: 8 × 128. Apple M1 (the for-the-LOLs floor): 8 × 128 on a narrower
consumer core, and on a dedicated host its $/QPS is 5× worse than c8a.
The right-sizing rule
QPS/$ ≈ C_isa · P / (N · price) → maximize cores × popcount-width per dollar:
- Buy AVX-512 physical cores, not vCPUs. The densest-real-cores-per-dollar instance
with
VPOPCNTDQwins. Here that's the AMD c-family. - Don't buy RAM you won't use. The hot scan footprint is 128 MB (codes); even with the f32 rerank store it's ~4 GB, far under any 2xlarge. Compute-optimized (least RAM/core, cheapest) beats general/memory — the spare DRAM on m/r is wasted spend.
- The RAM slack is the IVF lever. A box is CPU-bound with ~12–28 GB idle, so size the
two axes independently: RAM holds the whole shard's cells (all resident, cheap),
cores scan only the
nprobecells a query touches. Capacity is RAM-bound (huge); throughput is core-bound (P = QPS_target · N_scanned / C_isa). e.g. 10k QPS at N=1M on AVX-512 → ~35 cores ≈ 5× c8a, whose 80 GB could hold ~19× more corpus than it can scan hot. Scale cores for QPS; let spare RAM absorb more cells.
This updates the project's headline: on the best-value box the shipped engine does ~2310 QPS at 3.1 ms / 0.995 recall — the old ~960 was an Intel-core-count artifact, not the engine's ceiling.