Notes · 74 entries
Mostly numbered, measured experiments — wins and dead ends alike. The writeup curates what worked; this is the raw trail, in order. A few ♫ notes are parked directions and external references, documented but not measured.
- 001 Exact brute-force baseline
- 002 Networked serving + user-facing latency
- 003 Query batching, and the bandwidth myth
- 004 Cross-arch baseline: AVX-512 doesn't flip it either
- 005 Bound detector: compute-bound on both NEON and AVX-512
- 006 FMA + parallel-accumulator kernel; the width experiment
- 007 Granite Rapids: the cascade closes (batching finally pays)
- 008 int8 quantization: a bandwidth win (so it depends on the machine)
- 009 Binary quantization + two-stage rerank (the Exa funnel)
- 010 Asymmetric scoring (full-precision query × binary doc)
- 011 Asymmetric LUT (precompute + indexed lookup)
- 012 Multi-accumulator hamming: a negative result (popcount is already SIMD)
- 013 Counting selection for the binary top-C (a latency/throughput tradeoff)
- 014 Truncated (prefix) binary scan: trade recall for QPS
- 015 int8 rerank tier: a memory lever, not a speed lever
- 016 Tiled binary scan: +73% QPS at identical recall
- 017 Serve the winner: binary+rerank through HTTP
- 018 Combined recall↔QPS Pareto (tiling × prefix × C)
- 019 Serving latency knob: prefix scan under load
- 020 Capstone: the performance effort (012–020)
- 021 Intra-query parallelism: 4.4× lower single-query latency
- 022 Parallel rerank: trim the latency tail
- 023 Register-tiled hamming kernel: another negative result
- 024 Rerank candidate prefetch: neutral
- 025 Latency floor: the combined single-query stack (021–025 capstone)
- 026 Random rotation before binarization (RaBitQ/ITQ): free recall
- 027 Rotation rescues prefix truncation (Matryoshka-like)
- 028 RaBitQ unbiased estimator: best recall-per-bit (slow scan)
- 029 Rotated combined Pareto: research payoff, deployable
- 030 Research capstone (RaBitQ block, 026–030)
- 031 ADSampling: faster exact scan via early-terminated distances
- 032 PDX vertical layout: 2× faster exact scan, free
- 033 ADSampling on PDX: pruning speedup restored
- 034 ADSampling rerank, stacked on the binary funnel
- 035 Capstone: training-free research arc (031–035)
- 036 Tiling + ADSampling rerank: not a Pareto win
- 037 bf16 rerank store: not a Pareto win
- 038 Tile re-tuning: +7.5% QPS, free (a clean Pareto win)
- 039 Carousel (cooperative scan-sharing) under bursty load
- 040 Sharded carousel: lowest latency floor, lower capacity
- 041 Carousel capstone: fan-out is the dial, adapt it to load
- 042 IVF is the premise, not a component (scope entry)
- 043 Binary funnel vs HNSW *inside one cell*: dimensionality decides
- 044 Quantization verdict: the rotated binary funnel is the frontier
- 045 Disk-resident vectors: how slow without RAM, and the economics
- 046 Cell size × residual encoding: centering is a free recall gain
- 047 Scale to 100M: the binary scan has no memory cliff
- 048 Query-adaptive funnel width + a per-query certificate
- 049 Capstone: a predictive roofline model for the funnel (QPS & recall)
- 050 SIMD step 1 (safe hints): the binary scan is memory-bound, not popcount-bound
- 051 The stacked engine: residual wired in, the full frontier
- 052 Carousel × disk: the scan shares, the rerank doesn't
- 053 Carousel serves the real engine (best codes), and why fan-out must adapt
- 054 Product Quantization: wins recall-per-byte, loses recall-per-QPS
- 055 ITQ: a learned rotation beats random, but only by ~1.5 pts
- 056 OPQ: learned rotation lifts PQ recall, but build cost is brutal & QPS unchanged
- 057 Capstone: the best stacked implementation
- 058 Bit-floor: fewer scan bits is NOT a QPS lever (negative)
- 059 SIMD ADC (PQ4 + swizzle_dyn): gather isn't dead; it ties/beats popcount
- 060 3-tier funnel (PQ-prune): ~8× fewer disk reads at 0.99 recall
- 061 Final capstone: the best of the best
- 062 Hardware cost right-sizing: throughput is physical cores × popcount width
- 063 Accelerators (GPU, Trainium, FPGA): more headroom, worse economics; staying on CPU
- 064 c8a scaling spectrum: where the funnel hits the DDR5 bandwidth wall
- 065 Matryoshka-256 binary funnel + rerank (OpenAI text-embedding-3)
- 066 Matryoshka-256: learned rotation, code uniqueness, and the regime flip
- ♫ Search-as-GEMM on an accelerator (GPU / Trainium / Inferentia)
- ♫ ANN-Benchmarks
- ♫ Binary-quant search in production (our funnel, shipped) + SIMD kernels
- ♫ Search on an FPGA (AWS F2 / AMD Virtex UltraScale+ HBM)
- ♫ HNSW (graph ANN) & the `hnsw_rs` crate
- ♫ PDX & ADSampling (the dimension-pruning / layout frontier)
- ♫ RaBitQ (and the binary-quantization frontier)
- ♫ SIMD ADC fast-scan & score-aware quantization (FAISS PQ4, ScaNN)