notes

♫ Note

Search-as-GEMM on an accelerator (GPU / Trainium / Inferentia)

Status: parked. Out of scope for "fastest CPU box." Recorded so the direction is on record without building it. Triggered by asking whether an ML ASIC (e.g. trn1.2xlarge) can accelerate this engine.

The test (from the Exa concepts notes)

prep/exa/concepts.md §"hardware lottery": when evaluating a new architecture, ask first — does it reduce to multiply-add? For vector search the answer is yes:

So it passes the test — an ASIC can run it. The catch is the re-pricing cascade (concepts.md §16): the ASIC rewards a different engine, not ours.

Why it's a different engine, not an acceleration of the funnel

Our CPU engine (funnel)What the ASIC wants
Strategydo less — 1-bit scan, rerank ~500do all of it — dense Q×N GEMM, fast
Bytesfewer (128 MB — the whole win)int8/f16 in HBM; doesn't reward 1-bit
Latencysingle query is finesingle query = GEMV, <5% utilization — must batch
Top-Kcheap heapnot a matmul — argmax/partial-sort is awkward off-engine

The funnel's "don't compute all the distances" is the opposite of feeding a systolic array. On an ASIC you'd discard the funnel and do dense brute-force GEMM, betting raw throughput beats cleverness — and it might:

Back-of-envelope: 1M × 1024-dim exact ≈ 2 GFLOP/query. A Trainium chip is ~O(100+) TFLOP. Even at 20–30% MFU + selection overhead, exact brute force could land 10k–50k QPS — an order of magnitude past our ~960. The ASIC makes "all the work" so cheap that exact brute force out-competes the clever index. That inverts the premise of this project (which exists because doing all the work was expensive on a CPU).

Why trn1 specifically is the wrong member of the family

What would decide it (if ever un-parked)

  1. Dense int8 brute-force GEMM on a GPU (g/p family) or inf2 — measure exact-search QPS and cost-per-1k-QPS vs the CPU funnel. The honest comparison is $/QPS, not QPS.
  2. Whether on-device top-K (vector-engine WarpSelect-style) keeps the GEMM win or eats it.
  3. Whether any available ASIC has native sub-int8 (1-bit/4-bit) matmul, which would let the binary code ride the tensor engine without the 8× byte penalty.

How to set it up (if un-parked)

The honest experiment is an afternoon for the baseline, a day or two for the fair fight. The setup, in order:

  1. Integration point is the ANN-Benchmarks contract, not the Rust knn_batch trait. A GPU run is a separate harness (Python is fine) that reads the same SIFT1M fvecs base/query + ground-truth ivecs and reports recall@k / QPS / latency the same way. Nothing in database/ changes — it's a sibling experiment, like any history/ entry.

  2. FAISS-GPU for the dense exact baselineIndexFlatL2index_cpu_to_gpu. Turnkey, and it solves on-device top-K (WarpSelect). Raw PyTorch (Q @ B.T + torch.topk) is the ~10-line fallback to see the GEMM directly.

  3. Instance — pick on memory bandwidth, not generation. The scan is bandwidth-bound, so the older card can win:

    GPUMem BWNote
    g5.2xlargeA10G (Ampere)~600 GB/sdefault first data point
    g6.2xlargeL4 (Ada, 72 W)~300 GB/sbetter $/QPS & INT8, bandwidth-starved
    g6e.2xlargeL40S (Ada)~864 GB/s, 48 GBwhere the GPU actually wins throughput

    (g7 unconfirmed as of 2026-06 — verify it exists before assuming.) Pull live spot + on-demand prices before deciding; $/QPS is the metric, and it moves with spot capacity.

  4. Infra change is one file. Extend infra/lib/spot-stack.ts to take a g-family instanceType and swap the AMI from Amazon Linux 2023 to a Deep Learning AMI (CUDA + drivers preinstalled — hand-installing CUDA is the only annoying part). Spot/SSH/teardown plumbing carries over unchanged.

  5. Bake in the honesty caveats (see the table above — the GPU runs a different engine): report $/QPS at fixed recall, not raw QPS, and sweep batch size (single query = GEMV at <5% utilization; the GPU only wins batched). Note both or the comparison misleads.

Done = FAISS-GPU exact on SIFT1M on a g-family spot box, recall/QPS/latency across a batch sweep, $/QPS vs the CPU funnel at fixed recall, recorded as the next history/NNN-* pair.

Related