Notes · 74 entries

Mostly numbered, measured experiments — wins and dead ends alike. The writeup curates what worked; this is the raw trail, in order. A few ♫ notes are parked directions and external references, documented but not measured.

001 Exact brute-force baseline
002 Networked serving + user-facing latency
003 Query batching, and the bandwidth myth
004 Cross-arch baseline: AVX-512 doesn't flip it either
005 Bound detector: compute-bound on both NEON and AVX-512
006 FMA + parallel-accumulator kernel; the width experiment
007 Granite Rapids: the cascade closes (batching finally pays)
008 int8 quantization: a bandwidth win (so it depends on the machine)
009 Binary quantization + two-stage rerank (the Exa funnel)
010 Asymmetric scoring (full-precision query × binary doc)
011 Asymmetric LUT (precompute + indexed lookup)
012 Multi-accumulator hamming: a negative result (popcount is already SIMD)
013 Counting selection for the binary top-C (a latency/throughput tradeoff)
014 Truncated (prefix) binary scan: trade recall for QPS
015 int8 rerank tier: a memory lever, not a speed lever
016 Tiled binary scan: +73% QPS at identical recall
017 Serve the winner: binary+rerank through HTTP
018 Combined recall↔QPS Pareto (tiling × prefix × C)
019 Serving latency knob: prefix scan under load
020 Capstone: the performance effort (012–020)
021 Intra-query parallelism: 4.4× lower single-query latency
022 Parallel rerank: trim the latency tail
023 Register-tiled hamming kernel: another negative result
024 Rerank candidate prefetch: neutral
025 Latency floor: the combined single-query stack (021–025 capstone)
026 Random rotation before binarization (RaBitQ/ITQ): free recall
027 Rotation rescues prefix truncation (Matryoshka-like)
028 RaBitQ unbiased estimator: best recall-per-bit (slow scan)
029 Rotated combined Pareto: research payoff, deployable
030 Research capstone (RaBitQ block, 026–030)
031 ADSampling: faster exact scan via early-terminated distances
032 PDX vertical layout: 2× faster exact scan, free
033 ADSampling on PDX: pruning speedup restored
034 ADSampling rerank, stacked on the binary funnel
035 Capstone: training-free research arc (031–035)
036 Tiling + ADSampling rerank: not a Pareto win
037 bf16 rerank store: not a Pareto win
038 Tile re-tuning: +7.5% QPS, free (a clean Pareto win)
039 Carousel (cooperative scan-sharing) under bursty load
040 Sharded carousel: lowest latency floor, lower capacity
041 Carousel capstone: fan-out is the dial, adapt it to load
042 IVF is the premise, not a component (scope entry)
043 Binary funnel vs HNSW *inside one cell*: dimensionality decides
044 Quantization verdict: the rotated binary funnel is the frontier
045 Disk-resident vectors: how slow without RAM, and the economics
046 Cell size × residual encoding: centering is a free recall gain
047 Scale to 100M: the binary scan has no memory cliff
048 Query-adaptive funnel width + a per-query certificate
049 Capstone: a predictive roofline model for the funnel (QPS & recall)
050 SIMD step 1 (safe hints): the binary scan is memory-bound, not popcount-bound
051 The stacked engine: residual wired in, the full frontier
052 Carousel × disk: the scan shares, the rerank doesn't
053 Carousel serves the real engine (best codes), and why fan-out must adapt
054 Product Quantization: wins recall-per-byte, loses recall-per-QPS
055 ITQ: a learned rotation beats random, but only by ~1.5 pts
056 OPQ: learned rotation lifts PQ recall, but build cost is brutal & QPS unchanged
057 Capstone: the best stacked implementation
058 Bit-floor: fewer scan bits is NOT a QPS lever (negative)
059 SIMD ADC (PQ4 + swizzle_dyn): gather isn't dead — it ties/beats popcount
060 3-tier funnel (PQ-prune): ~8× fewer disk reads at 0.99 recall
061 Final capstone: the best of the best
062 Hardware cost right-sizing: throughput is physical cores × popcount width
063 Accelerators (GPU, Trainium, FPGA): more headroom, worse economics — staying on CPU
064 c8a scaling spectrum: where the funnel hits the DDR5 bandwidth wall
065 Matryoshka-256 binary funnel + rerank (OpenAI text-embedding-3)
066 Matryoshka-256: learned rotation, code uniqueness, and the regime flip
♫ Search-as-GEMM on an accelerator (GPU / Trainium / Inferentia)
♫ ANN-Benchmarks
♫ Binary-quant search in production (our funnel, shipped) + SIMD kernels
♫ Search on an FPGA (AWS F2 / AMD Virtex UltraScale+ HBM)
♫ HNSW (graph ANN) & the `hnsw_rs` crate
♫ PDX & ADSampling (the dimension-pruning / layout frontier)
♫ RaBitQ (and the binary-quantization frontier)
♫ SIMD ADC fast-scan & score-aware quantization (FAISS PQ4, ScaNN)