Experiment 066
Matryoshka-256: learned rotation, code uniqueness, and the regime flip
Perf record: 066-matryoshka-learned-rotation-regime-flip.json.
c8a.4xlarge (Zen5, 16 vCPU), continuing 065 on the OpenAI Matryoshka-256 dataset.
Three results, each of which closes a question.
1. Learned rotation (ITQ) stacks with residual — +0.3 pts, free at query time
055 shelved ITQ on Cohere-1024 (+1.5 pts, "dense rotation too costly"). Re-tested at
256 bits with residual, via itq --residual (the bin's random column reproduces the
shipped funnel's numbers exactly, validating the comparison):
| rerank C | random+residual | ITQ+residual | delta |
|---|---|---|---|
| 500 | 0.9474 | 0.9541 | +0.0067 |
| 1000 | 0.9731 | 0.9780 | +0.0049 |
| 2000 | 0.9878 | 0.9907 | +0.0029 |
| 4000 | 0.9944 | 0.9968 | +0.0024 |
| 8000 | 0.9979 | 0.9992 | +0.0013 |
The two levers overlap: ITQ alone was worth +0.55 pts at C=2000, but only +0.29 pts on top of residual. Both improve code quality, so there is less left to gain. The 055 cost objection dissolves on a static index: the 256×256 rotation is applied once at build, and the query-time scan is byte-identical popcount. Free recall: smaller than hoped, but real.
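A minimal sketch of the pipeline described above (residual → rotate → sign at build, Hamming top-C scan, exact rerank), for illustration only. Everything here is an assumption rather than the funnel's actual code: the function names are hypothetical, the residual is taken to be subtraction of the dataset mean (the note does not define it), the rotation is an identity placeholder where a learned ITQ or random orthogonal matrix would go, and the query is assumed to be encoded through the same function as the corpus.

```python
import heapq
import math
import random

DIM = 16  # toy size; the experiment uses 256 bits


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def encode(vec, mean, rotation):
    """Residual -> rotate -> sign, packed into one int (1 bit per dimension)."""
    residual = [v - m for v, m in zip(vec, mean)]
    code = 0
    for row in rotation:
        projected = sum(r * x for r, x in zip(row, residual))
        code = (code << 1) | (1 if projected >= 0.0 else 0)
    return code


def hamming(a, b):
    """XOR + popcount; the scan does not care how the rotation was obtained."""
    return bin(a ^ b).count("1")


def search(query, mean, rotation, codes, vectors, c, k):
    """Stage 1: Hamming top-C over the codes. Stage 2: exact rerank of those C."""
    q_code = encode(query, mean, rotation)
    candidates = heapq.nsmallest(
        c, range(len(codes)), key=lambda i: hamming(q_code, codes[i])
    )

    def dot(i):
        return sum(q * v for q, v in zip(query, vectors[i]))

    return heapq.nlargest(k, candidates, key=dot)


rng = random.Random(0)
vectors = [unit([rng.gauss(0.0, 1.0) for _ in range(DIM)]) for _ in range(300)]
mean = [sum(col) / len(vectors) for col in zip(*vectors)]
# Placeholder rotation (identity). Swapping in a different matrix changes only
# the codes produced at build time, not the scan below.
rotation = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
codes = [encode(v, mean, rotation) for v in vectors]  # build-time cost only
top = search(vectors[17], mean, rotation, codes, vectors, c=50, k=5)
```

The point of the sketch is structural: `hamming` and `search` never see the rotation matrix except through `encode`, which is why a better rotation can raise recall without touching the scan loop.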
2. All 990k codes are unique — recall loss is ranking error, not identity loss
Counted exact duplicates of the funnel's stage-1 codes (residual → rotate → sign): 990,000 / 990,000 unique, zero collisions (with or without residual). Expected — 2^256 code space, birthday bound ~2^128 — but worth pinning: the 1-bit code loses no identity. The recall gap is purely ranking (true neighbors landing a few Hamming bits outside the top-C), which is exactly what rerank width and better rotations fix, and why the funnel dials smoothly to 0.999.
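The uniqueness check and the birthday-bound arithmetic can be sketched as follows. This is an illustration under assumptions, not the tool that produced the count: the helper names are hypothetical, the toy codes are uniform random integers standing in for the real store, and the expected-pairs formula assumes uniformly distributed codes, which real sign codes only approximate.

```python
import random


def count_unique(codes):
    """Return (unique, total) for an iterable of hashable codes."""
    codes = list(codes)
    return len(set(codes)), len(codes)


def expected_colliding_pairs(n, bits):
    """Expected equal pairs among n uniform random codes: C(n, 2) / 2^bits."""
    return (n * (n - 1) // 2) / (2 ** bits)


# Toy stand-in for the code store: uniform random 256-bit integers.
rng = random.Random(0)
toy_codes = [rng.getrandbits(256) for _ in range(10_000)]
unique, total = count_unique(toy_codes)

pairs_256 = expected_colliding_pairs(990_000, 256)  # on the order of 1e-66
pairs_32 = expected_colliding_pairs(990_000, 32)    # on the order of 100, for contrast
```

At 256 bits the expected number of colliding pairs among 990k codes is vanishingly small, which matches the measured zero; the 32-bit figure is included only to show where collisions would start to matter.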
It also answers "would sorted/compressed codes help?": no. Rotation maximizes entropy per bit by design, so sorted codes share only ~log2(N) ≈ 20 leading bits (~8% of 256 bits compressible, best case), and any decode adds instructions to a scan loop that costs only ~4 instructions.
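A small simulation of the sorted-prefix argument, as a sketch under stated assumptions: the codes are uniform random 256-bit integers rather than the real store, the helper names are hypothetical, and the toy corpus is 4,096 codes so it runs instantly. The last two lines are the arithmetic behind the ~20 bits and ~8% figures.

```python
import math
import random

BITS = 256


def shared_prefix_bits(a, b, bits=BITS):
    """Number of leading bits two `bits`-wide codes have in common."""
    return bits - (a ^ b).bit_length()


def mean_adjacent_prefix(codes, bits=BITS):
    """Average shared prefix between neighbours after sorting."""
    ordered = sorted(codes)
    shared = [shared_prefix_bits(x, y, bits) for x, y in zip(ordered, ordered[1:])]
    return sum(shared) / len(shared)


rng = random.Random(0)
n = 4096
toy_codes = [rng.getrandbits(BITS) for _ in range(n)]
measured = mean_adjacent_prefix(toy_codes)  # roughly log2(n), within a couple of bits

predicted_990k = math.log2(990_000)         # just under 20 bits
best_case_saving = predicted_990k / BITS    # just under 8% of each code
```

For high-entropy codes the shared prefix grows only with the logarithm of the corpus size, so the remaining ~236 bits per code stay incompressible by sorting alone.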
3. The regime flip: at 32 MB of codes, tiling hurts — the scan is compute-bound
The Cohere funnel (122 MB codes, DRAM-streamed) needed tiling: +73% (016). At 256 bits the store is 31.7 MB ≈ L3-resident on Zen5, and tiling costs:
| batch | scan-only QPS | C=2000 QPS |
|---|---|---|
| 1 | 8,861 | 4,966 |
| 8 | 6,792 | 4,148 |
| 32 | 6,251 | — |
Same lesson as 038 (sweep the tile per machine), inverted: when codes fit in cache, the right tile is no tile. The 003→007 re-pricing cascade again — every byte reduction slides the workload toward compute-bound, and levers must be re-measured.
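The sizing and slowdown arithmetic behind this section can be sketched as below. The helper names are hypothetical, and `HYPOTHETICAL_CACHE_BYTES` is a placeholder chosen only to make the comparison concrete; it is not a measured L3 size for this instance and should be replaced with the real figure for the machine being swept.

```python
def code_store_bytes(n_vectors, bits_per_code):
    """Size of the packed 1-bit code store in bytes."""
    return n_vectors * bits_per_code // 8


def fits_in_cache(store_bytes, cache_bytes):
    """Crude regime test: cache-resident scans tend to be compute-bound."""
    return store_bytes <= cache_bytes


def relative_change(baseline_qps, qps):
    """Fractional QPS change versus the baseline (negative means slower)."""
    return qps / baseline_qps - 1.0


# Placeholder only, NOT a measured value for the c8a.4xlarge.
HYPOTHETICAL_CACHE_BYTES = 32 * 1024 * 1024

small = code_store_bytes(990_000, 256)      # 31,680,000 bytes (31.7 MB)
large = code_store_bytes(100_000_000, 256)  # 3,200,000,000 bytes (3.2 GB)

# Tiling cost from the table above, relative to batch=1.
scan_batch8 = relative_change(8_861, 6_792)   # about -23%
scan_batch32 = relative_change(8_861, 6_251)  # about -29%
c2000_batch8 = relative_change(4_966, 4_148)  # about -16%
```

A byte count like this only says when to re-sweep the tile; the table above, not the predicate, is what establishes that tiling hurts on this box.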
Final operating point (Matryoshka-256, this box)
ITQ + residual + batch=1 + C=2000 → recall@10 0.9907, ~4,966 QPS, p50 3.1 ms, p99 3.3 ms, 31.7 MB codes + 1.0 GB f32 store, 990k vectors, 16 vCPU.
Caveats
- The regime flip is a corpus-size result, not a code-size law. 990k × 32 B = 31.7 MB happens to fit in L3 on this box — that's why the scan reads compute-bound and tiling hurts. At 100M vectors the same 256-bit codes are 3.2 GB, DRAM-streamed, and tiling should pay again (047 measured exactly that cache→DRAM transition). This dataset is too small to say more than: re-sweep the tile when bytes-per-corpus changes.
- QPS on 16 vCPU (4xlarge) — the 062/064 sweep used 8 vCPU; don't compare across.
- ITQ recall measured in the itqbench bin; not yet wired into vsearch/server as a flag (the scan+rerank workload is identical, but end-to-end serving numbers with ITQ codes were not separately measured).
- One embedding (OpenAI text-embedding-3-large, closed model / open vectors). The fully-open reproduction path is Nomic v1.5 (MRL, Apache-2.0), unbuilt.