Experiment 063
Accelerators (GPU, Trainium, FPGA): more headroom, worse economics — staying on CPU
The hardware sweep (062) crowned a CPU winner: c8a.2xlarge, 2310 QPS at 0.995 recall for $3,776/yr on-demand. The obvious next question — would specialized silicon push further? The honest answer for this engine (the within-cell 1-bit funnel): each one can add headroom, but none of them pays. This entry records why, with real prices, so the door is closed deliberately rather than left as a vague "maybe GPUs are faster."
Costs are us-east-1 on-demand, annualized ($/hr × 8760); the funnel maps to each as noted.
| class | instance | $/yr | fit to the funnel |
|---|---|---|---|
| CPU (baseline) | c8a.2xlarge | $3,776 | native — this is the engine |
| GPU | g6.xlarge (L4) | $7,050 | different engine (dense GEMM) |
| GPU | g5.xlarge (A10G) | $8,813 | different engine |
| GPU | g6e.xlarge (L40S) | $16,302 | different engine |
| GPU | p4d.24xlarge (8×A100) | $192,349 | absurd at this scale |
| GPU | p5.48xlarge (8×H100) | $482,150 | absurd at this scale |
| Inferentia | inf2.xlarge | $6,642 | poor (matmul ASIC) |
| Trainium | trn1.2xlarge | $11,771 | poor (training ASIC) |
| Trainium | trn1.32xlarge | $188,340 | poor |
| FPGA | f1.2xlarge | $14,454 | best fit, worst ROI |
| FPGA | f2.6xlarge | $17,345 | best fit, worst ROI |
GPU — the only credible win, and it's a different product
Tensor cores don't do binary popcount, so you don't run the funnel — you run dense
exact (or int8) brute-force as a batched GEMM (cuVS / FAISS-GPU). Batched, a single
L4/A10G-class card plausibly does ~10–40k QPS exact, which on raw throughput-per-dollar
can edge past the CPU. But the caveats are the whole story: it only wins batched
(a single query is a GEMV at <5% utilization), top-K selection is off-engine overhead, and
you're buying exactness you may not need in place of a 0.995-recall funnel. So a GPU
adds headroom by changing the engine, not by accelerating ours. It's worth it only for
huge-batch, exact workloads — the honest evaluation ($/QPS at fixed recall, batch sweep) is
parked in questions/accelerator-gemm.md, not built here.
Trainium / Inferentia — wrong shape
These are matmul ASICs for large, dense GEMMs. The funnel is the opposite of a GEMM (its whole point is doing less work), binary codes would have to inflate to int8 — 8× the bytes, which deletes the advantage that won the project — top-K doesn't map, and it's a Neuron-SDK rewrite. You'd pay $6.6k–$11.8k/yr for a strictly worse fit than a $3.8k CPU. Nothing here recommends it.
FPGA — perfect for the algorithm, terrible for the budget
This is the one accelerator that fits the math: XOR + popcount is native to FPGA fabric (bit-parallel, thousands of popcounts per cycle), so the funnel maps beautifully. But the scan is still bandwidth-bound feeding codes from DRAM — the same wall as the CPU — so the fabric advantage is capped, while the costs are not: $14k–17k/yr (≈4× the CPU) plus months of HDL/HLS engineering. It only amortizes at hyperscale with custom silicon, which is a different company than this one.
Verdict
The CPU funnel at $3,776/yr is the perf-per-buck floor, and nothing here clearly beats it for our workload at our scale:
- GPU beats it only by switching to a different engine (dense batched exact) for a different workload (huge batch, exact) — a real but separate product, parked.
- Trainium/Inferentia are the wrong architecture and cost more for less.
- FPGA fits the algorithm but loses on bandwidth, dollars, and engineering-months.
Every one of them "could push performance" — and every one fails the only test that
matters here, $/QPS at the recall we ship. So this is a deliberate stop, not an
oversight: the engine stays on CPU, and the accelerator question is closed (the GPU
path remains parked, by $/QPS, in questions/accelerator-gemm.md).