notes

Experiment 062

Hardware cost right-sizing: throughput is physical cores × popcount width

The engine had only ever run on one box (Intel Granite Rapids, the old ~960-QPS headline). This entry runs the same engine across seven instances — three vendors × two families plus an Apple-Silicon ringer — to answer one question: at a fixed spec, which silicon gives the most QPS per dollar?

Method — everything held constant except the silicon

Identical committed code, identical Cohere v3 (1M × 1024) data, identical flags on every box: the shipped 1-bit funnel (--rotate 2 --residual --batch 8 --rerank 500), built with -C target-cpu=native so each CPU autovectorizes to its widest popcount (AVX-512 VPOPCNTDQ on Intel/AMD, NEON CNT on Graviton/Apple — verified in the disassembly). Every box is the advised .2xlarge (8 vCPU) instance — exactly the unit AWS sells; the physical-core counts below are just what's under that 8-vCPU label. Because nothing varies but the chip, recall is a constant 0.9952 across all seven — the comparison is purely QPS, p50, and QPS-per-dollar. ($/hr = us-east-2 on-demand, one region for all, to strip the regional pricing of the AMD boxes — which only run in Tokyo — from the comparison.)

Result

All are 8-vCPU instances; "phys" is what's under that vCPU label (Intel = 4 cores + SMT). Cost is us-east-2 on-demand, annualized (24×365h).

instance (8 vCPU)vendorvCPUphyspopcountfunnel QPSp50$/yrQPS per $1k/yrQPS/vCPU
c8a.2xlargeAMD Zen588AVX-51223103.1 ms$3,776612289
m8a.2xlargeAMD Zen588AVX-51223133.1 ms$4,265542289
c8g.2xlargeGraviton488NEON9866.6 ms$2,785354123
m8g.2xlargeGraviton488NEON9816.8 ms$3,145312123
c8i.2xlargeIntel Granite84AVX-51293410.6 ms$3,283285117
m8i.2xlargeIntel Granite84AVX-51292411.1 ms$3,709249116
mac2.metalApple M188NEON7125.7 ms$5,68112589

The kicker is QPS/vCPU: you pay per vCPU, and an AMD vCPU does 289 QPS while an Intel vCPU does 117 — 2.5× less for the same line item — because half of Intel's vCPUs are SMT threads sharing four real cores.

AMD Zen5 wins outright — best throughput, best latency, best $/QPS — and the old Intel Granite headline (934, matching c8i exactly) is 2.5× off the frontier. Compute vs general (c vs m) is a wash within every vendor: same silicon, and the ~4 GB working set fits both, so the extra RAM buys nothing.

Why — two multiplicative levers, no magic

The funnel is compute-bound on popcount, so:

funnel QPS  ≈  C_isa · P / N

with P = physical cores, N = vectors scanned per query, and C_isa the per-core rate measured here (1024-dim, with rerank): AVX-512 core ≈ 2.9e8 (AMD) / 2.3e8 (Intel, it downclocks a touch under heavy AVX-512); NEON core ≈ 1.2e8 (Graviton4) / 0.9e8 (M1). Two findings make the constant concrete:

1. P is physical cores, not vCPUs — SMT is a mirage. The "8 vCPU" tier is 8 real cores on AMD/Graviton but 4 cores + hyperthreading on Intel. A thread-scaling sweep on c8i settled it:

1 thread → 211   2 → 413 (1.96×)   4 → 850 (4.03×)   8 → 808 (−5%)

Dead-linear to 4 physical cores, then the SMT second thread regresses — it contends for the same execution units the kernel already saturates. So Intel's 8-vCPU box does the work of 4 cores; AMD's does 8. That alone is the 2× gap.

2. Popcount width is the other factor. Graviton4 has the same 8 cores as AMD yet ~half the funnel QPS (986 vs 2310) — because NEON CNT is 128-bit vs AVX-512 VPOPCNTDQ's 512-bit. Per-core: AVX-512 ≈ 2× NEON. This is "lever two" (the instruction, not the math) showing up as a purchasing decision.

So QPS ≈ cores × popcount-width. AMD wins by maxing both (8 × 512-bit). Intel: 4 × 512. Graviton4: 8 × 128. Apple M1 (the for-the-LOLs floor): 8 × 128 on a narrower consumer core, and on a dedicated host its $/QPS is 5× worse than c8a.

The right-sizing rule

QPS/$ ≈ C_isa · P / (N · price)maximize cores × popcount-width per dollar:

This updates the project's headline: on the best-value box the shipped engine does ~2310 QPS at 3.1 ms / 0.995 recall — the old ~960 was an Intel-core-count artifact, not the engine's ceiling.