Experiment 047
Scale to 100M: the binary scan has no memory cliff
Perf record: 047-scale-hierarchy-cliff.json.
Granite box (Xeon 6975P-C, 8 vCPU, 15 GB RAM; L3 480 MB). src/bin/scale.rs.
Direction 8 of the breakout loop. Synthetic random binary codes (1024 bits =
128 B/vector), scan-only (knn_binary_funnel_tiled, c=0); an f32 store is
infeasible at 100M (400 GB). q = cores × tile so every config runs exactly 8
base-passes across all 8 cores (fair, saturated).
The question
L3 here is 480 MB → holds ~3.75M of our 128 B codes. Past that the scan reads from DRAM. Does the funnel cliff at the L3→DRAM boundary, and does "tiling wins / carousel never cliffs" survive to 100M?
Result — no cliff; throughput is flat across 1000× in N
Scan compute rate (Gcmp/s = vector-comparisons/s) and real DRAM traffic (GB/s):
| N | GB codes | tile=1 Gcmp/s · GB/s | tile=8 Gcmp/s | QPS t1 → t8 |
|---|---|---|---|---|
| 100k | 0.01 | 0.76 · 98 | 1.12 | 7649 → 11246 |
| 1M | 0.13 | 0.53 · 68 | 1.13 | 530 → 1130 |
| 3M | 0.38 | 0.60 · 77 | 1.15 | 200 → 384 |
| 10M | 1.28 | 0.65 · 83 | 1.14 | 65 → 114 |
| 30M | 3.84 | 0.60 · 76 | 1.15 | 20 → 38 |
| 100M | 12.80 | 0.62 · 79 | 1.12 | 6.2 → 11.2 |
- No cliff. Gcmp/s is flat within ~15–20% from 100k to 100M (a 1000× span): ~0.6 at tile=1, ~1.1 at tile≥8. L3-resident (100k) gives ~98 GB/s; DRAM (≥10M) settles at ~78 GB/s — a gentle ~20% step, not a collapse.
- Why: the 128 B/vec codes are compact enough that DRAM bandwidth (~80 GB/s aggregate) keeps the popcount units fed even at 12.8 GB. The scan is load-bound at tile=1, popcount-compute-bound at tile≥8 — and compute doesn't care where the data lives. Crossing L3→DRAM just costs the ~20% extra load latency that tiling already hides.
Tiling's win survives at every scale (and saturates at 8)
tile=1 → tile=8 is a consistent ~1.8× QPS at every N (100k through 100M). tile=16/32 add nothing over tile=8 (Gcmp/s ~1.1 flat) — matching the tile=8 sweet spot from 038. Tiling converts the tile=1 load-bound regime (0.6 Gcmp/s) into the compute-bound regime (1.1 Gcmp/s) by reusing each loaded doc-word across the tile; once compute-bound, more reuse can't help. The factor is constant, not growing with N, because the scan never becomes more memory-bound as it grows.
Why this is the whole scaling thesis
This is exactly why binary scales and f32 doesn't. An f32 store is 4 KB/vec — 32× the bytes — so its scan is hard DRAM-bound and would cliff (and is anyway infeasible to hold at 100M). The binary code's compactness keeps the scan compute-bound, so it rides flat through the cache hierarchy. The "carousel never cliffs" serving property (039–041) survives at scale for the same reason: the shared scan it rides on doesn't cliff.
Absolute: a single 8-core box scans 100M × 1024-bit in ~90 ms (tile=8, 11 QPS) — one (very large) cell, exact-Hamming, codes-only at 12.8 GB RAM.
Conclusions
- The binary scan has no L3→DRAM cliff — throughput is flat within ~20% across 100k→100M because it's compute(popcount)-bound, not memory-bound, thanks to the 128 B/vec representation.
- Tiling's ~1.8× holds at every scale and saturates at tile=8 — the 038 sweet spot is scale-invariant.
- Compactness is the scaling property: f32 would cliff and is infeasible; binary stays compute-bound to 100M. Validates binary-scan-only as the path to scale.
Caveats
- Random codes (scan-performance study, not recall) — uniform data may give a best-case popcount-branch pattern; real codes' heap-update rate differs slightly but the scan cost (Hamming over all N) dominates.
- DRAM BW (~80 GB/s) is this instance's ceiling; bigger boxes / more channels raise the absolute numbers but not the flat-vs-N shape.
- Single timed pass at the largest N (one iteration); smaller N averaged over many. The cliff verdict rests on the flat Gcmp/s, which is stable across the iteration counts.