Notes · 74 entries

Mostly numbered, measured experiments — wins and dead ends alike. The writeup curates what worked; this is the raw trail, in order. A few notes are parked directions and external references, documented but not measured.

  1. 001 Exact brute-force baseline
  2. 002 Networked serving + user-facing latency
  3. 003 Query batching, and the bandwidth myth
  4. 004 Cross-arch baseline: AVX-512 doesn't flip it either
  5. 005 Bound detector: compute-bound on both NEON and AVX-512
  6. 006 FMA + parallel-accumulator kernel; the width experiment
  7. 007 Granite Rapids: the cascade closes (batching finally pays)
  8. 008 int8 quantization: a bandwidth win (so it depends on the machine)
  9. 009 Binary quantization + two-stage rerank (the Exa funnel)
  10. 010 Asymmetric scoring (full-precision query × binary doc)
  11. 011 Asymmetric LUT (precompute + indexed lookup)
  12. 012 Multi-accumulator hamming: a negative result (popcount is already SIMD)
  13. 013 Counting selection for the binary top-C (a latency/throughput tradeoff)
  14. 014 Truncated (prefix) binary scan: trade recall for QPS
  15. 015 int8 rerank tier: a memory lever, not a speed lever
  16. 016 Tiled binary scan: +73% QPS at identical recall
  17. 017 Serve the winner: binary+rerank through HTTP
  18. 018 Combined recall↔QPS Pareto (tiling × prefix × C)
  19. 019 Serving latency knob: prefix scan under load
  20. 020 Capstone: the performance effort (012–020)
  21. 021 Intra-query parallelism: 4.4× lower single-query latency
  22. 022 Parallel rerank: trim the latency tail
  23. 023 Register-tiled hamming kernel: another negative result
  24. 024 Rerank candidate prefetch: neutral
  25. 025 Latency floor: the combined single-query stack (021–025 capstone)
  26. 026 Random rotation before binarization (RaBitQ/ITQ): free recall
  27. 027 Rotation rescues prefix truncation (Matryoshka-like)
  28. 028 RaBitQ unbiased estimator: best recall-per-bit (slow scan)
  29. 029 Rotated combined Pareto: research payoff, deployable
  30. 030 Research capstone (RaBitQ block, 026–030)
  31. 031 ADSampling: faster exact scan via early-terminated distances
  32. 032 PDX vertical layout: 2× faster exact scan, free
  33. 033 ADSampling on PDX: pruning speedup restored
  34. 034 ADSampling rerank, stacked on the binary funnel
  35. 035 Capstone: training-free research arc (031–035)
  36. 036 Tiling + ADSampling rerank: not a Pareto win
  37. 037 bf16 rerank store: not a Pareto win
  38. 038 Tile re-tuning: +7.5% QPS, free (a clean Pareto win)
  39. 039 Carousel (cooperative scan-sharing) under bursty load
  40. 040 Sharded carousel: lowest latency floor, lower capacity
  41. 041 Carousel capstone: fan-out is the dial, adapt it to load
  42. 042 IVF is the premise, not a component (scope entry)
  43. 043 Binary funnel vs HNSW *inside one cell*: dimensionality decides
  44. 044 Quantization verdict: the rotated binary funnel is the frontier
  45. 045 Disk-resident vectors: how slow without RAM, and the economics
  46. 046 Cell size × residual encoding: centering is a free recall gain
  47. 047 Scale to 100M: the binary scan has no memory cliff
  48. 048 Query-adaptive funnel width + a per-query certificate
  49. 049 Capstone: a predictive roofline model for the funnel (QPS & recall)
  50. 050 SIMD step 1 (safe hints): the binary scan is memory-bound, not popcount-bound
  51. 051 The stacked engine: residual wired in, the full frontier
  52. 052 Carousel × disk: the scan shares, the rerank doesn't
  53. 053 Carousel serves the real engine (best codes), and why fan-out must adapt
  54. 054 Product Quantization: wins recall-per-byte, loses recall-per-QPS
  55. 055 ITQ: a learned rotation beats random, but only by ~1.5 pts
  56. 056 OPQ: learned rotation lifts PQ recall, but build cost is brutal & QPS unchanged
  57. 057 Capstone: the best stacked implementation
  58. 058 Bit-floor: fewer scan bits is NOT a QPS lever (negative)
  59. 059 SIMD ADC (PQ4 + swizzle_dyn): gather isn't dead — it ties/beats popcount
  60. 060 3-tier funnel (PQ-prune): ~8× fewer disk reads at 0.99 recall
  61. 061 Final capstone: the best of the best
  62. 062 Hardware cost right-sizing: throughput is physical cores × popcount width
  63. 063 Accelerators (GPU, Trainium, FPGA): more headroom, worse economics — staying on CPU
  64. 064 c8a scaling spectrum: where the funnel hits the DDR5 bandwidth wall
  65. 065 Matryoshka-256 binary funnel + rerank (OpenAI text-embedding-3)
  66. 066 Matryoshka-256: learned rotation, code uniqueness, and the regime flip
  67. Search-as-GEMM on an accelerator (GPU / Trainium / Inferentia)
  68. ANN-Benchmarks
  69. Binary-quant search in production (our funnel, shipped) + SIMD kernels
  70. Search on an FPGA (AWS F2 / AMD Virtex UltraScale+ HBM)
  71. HNSW (graph ANN) & the `hnsw_rs` crate
  72. PDX & ADSampling (the dimension-pruning / layout frontier)
  73. RaBitQ (and the binary-quantization frontier)
  74. SIMD ADC fast-scan & score-aware quantization (FAISS PQ4, ScaNN)