notes

Experiment 047

Scale to 100M: the binary scan has no memory cliff

Perf record: 047-scale-hierarchy-cliff.json. Granite box (Xeon 6975P-C, 8 vCPU, 15 GB RAM; L3 480 MB). src/bin/scale.rs. Direction 8 of the breakout loop. Synthetic random binary codes (1024 bits = 128 B/vector), scan-only (knn_binary_funnel_tiled, c=0); an f32 store is infeasible at 100M (400 GB). q = cores × tile so every config runs exactly 8 base-passes across all 8 cores (fair, saturated).

The question

L3 here is 480 MB → holds ~3.75M of our 128 B codes. Past that the scan reads from DRAM. Does the funnel cliff at the L3→DRAM boundary, and does "tiling wins / carousel never cliffs" survive to 100M?

Result — no cliff; throughput is flat across 1000× in N

Scan compute rate (Gcmp/s = vector-comparisons/s) and real DRAM traffic (GB/s):

NGB codestile=1 Gcmp/s · GB/stile=8 Gcmp/sQPS t1 → t8
100k0.010.76 · 981.127649 → 11246
1M0.130.53 · 681.13530 → 1130
3M0.380.60 · 771.15200 → 384
10M1.280.65 · 831.1465 → 114
30M3.840.60 · 761.1520 → 38
100M12.800.62 · 791.126.2 → 11.2

Tiling's win survives at every scale (and saturates at 8)

tile=1 → tile=8 is a consistent ~1.8× QPS at every N (100k through 100M). tile=16/32 add nothing over tile=8 (Gcmp/s ~1.1 flat) — matching the tile=8 sweet spot from 038. Tiling converts the tile=1 load-bound regime (0.6 Gcmp/s) into the compute-bound regime (1.1 Gcmp/s) by reusing each loaded doc-word across the tile; once compute-bound, more reuse can't help. The factor is constant, not growing with N, because the scan never becomes more memory-bound as it grows.

Why this is the whole scaling thesis

This is exactly why binary scales and f32 doesn't. An f32 store is 4 KB/vec — 32× the bytes — so its scan is hard DRAM-bound and would cliff (and is anyway infeasible to hold at 100M). The binary code's compactness keeps the scan compute-bound, so it rides flat through the cache hierarchy. The "carousel never cliffs" serving property (039–041) survives at scale for the same reason: the shared scan it rides on doesn't cliff.

Absolute: a single 8-core box scans 100M × 1024-bit in ~90 ms (tile=8, 11 QPS) — one (very large) cell, exact-Hamming, codes-only at 12.8 GB RAM.

Conclusions

  1. The binary scan has no L3→DRAM cliff — throughput is flat within ~20% across 100k→100M because it's compute(popcount)-bound, not memory-bound, thanks to the 128 B/vec representation.
  2. Tiling's ~1.8× holds at every scale and saturates at tile=8 — the 038 sweet spot is scale-invariant.
  3. Compactness is the scaling property: f32 would cliff and is infeasible; binary stays compute-bound to 100M. Validates binary-scan-only as the path to scale.

Caveats