Experiment 065

Matryoshka-256 binary funnel + rerank (OpenAI text-embedding-3)

Perf record: 065-matryoshka-256-openai-funnel.json. c8a.4xlarge (Zen5, 16 vCPU) spot, us-east-2. First run on a genuine Matryoshka embedding — closes the question left open by the prefix experiments (014/027): Cohere v3 isn't Matryoshka, so truncating it to 256 collapsed recall to ~0.72. Here we use an embedding trained to be truncatable.

Setup

Dataset: OpenAI text-embedding-3-large, precomputed (Qdrant's dbpedia-entities 1M set, so there is no embedding step to run). text-embedding-3-large is MRL-trained (256 is a documented operating point). We slice the dataset's 1536-d vectors (the model's full output is 3072-d) to their first 256 dimensions and L2-renormalize (OpenAI's Matryoshka recipe). 990k base + 10k queries, exact GT.
Engine: the shipped 1-bit binary funnel: sign-bit codes (256 bits = 32 B/vec), rotation ×2 + residual (both are free recall gains), tile=8, followed by exact f32 rerank of the top-C shortlist. Same code path as the Cohere runs; only the data differs.
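
The data prep and the two-stage search described above can be sketched in NumPy. This is a minimal illustration, not the shipped engine: it omits the rotation, residual, and tiling steps, assumes the rerank score is the inner product on the renormalized vectors, and every function name here is hypothetical.

```python
import numpy as np

def truncate_and_renormalize(x: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` coordinates of each row and rescale to unit L2 norm."""
    head = np.ascontiguousarray(x[:, :dim], dtype=np.float32)
    norms = np.linalg.norm(head, axis=1, keepdims=True)
    return head / np.maximum(norms, 1e-12)

def sign_codes(x: np.ndarray) -> np.ndarray:
    """One bit per dimension (1 where the coordinate is positive), packed 8 per byte.
    For 256 dimensions this yields 32 bytes per vector."""
    return np.packbits(x > 0, axis=1)

# Per-byte popcount table, used to turn XORed code bytes into Hamming distances.
POPCOUNT = np.unpackbits(
    np.arange(256, dtype=np.uint8)[:, None], axis=1
).sum(axis=1).astype(np.uint16)

def hamming_shortlist(q_code: np.ndarray, base_codes: np.ndarray, c: int) -> np.ndarray:
    """Indices of the c base codes closest to q_code in Hamming distance (unordered)."""
    dist = POPCOUNT[np.bitwise_xor(base_codes, q_code)].sum(axis=1)
    c = min(c, dist.shape[0])
    return np.argpartition(dist, c - 1)[:c]

def search(q, q_code, base, base_codes, k: int = 10, c: int = 2000) -> np.ndarray:
    """Stage 1: Hamming shortlist of width c. Stage 2: exact f32 rerank of that shortlist.
    Assumption: rerank uses inner product, which equals cosine on unit vectors."""
    cand = hamming_shortlist(q_code, base_codes, c)
    scores = base[cand] @ q
    order = np.argsort(-scores)[:k]
    return cand[order]
```

The shortlist width c is the only knob in this sketch: widening it makes the exact rerank scan more candidates, which costs throughput and raises the chance that the true neighbours survive stage 1.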

Result — recall dials from 0.95 to 0.998 on the rerank width

rerank C	recall@10	QPS	p50	p99
500	0.9474	6041	2.05 ms	2.10 ms
1000	0.9731	5275	2.41 ms	2.49 ms
2000	0.9878	4227	3.09 ms	3.15 ms
4000	0.9944	3092	4.40 ms	4.58 ms
8000	0.9979	2026	6.54 ms	6.88 ms
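
For reference, the recall@10 column can be read with the usual definition: the fraction of the exact top-10 ground-truth neighbours that appear in the returned top-10, averaged over queries. The sketch below assumes that definition (the record does not spell it out) and uses a hypothetical function name.

```python
import numpy as np

def recall_at_k(found: np.ndarray, truth: np.ndarray, k: int = 10) -> float:
    """Mean overlap between the returned top-k ids and the exact top-k ids, per query."""
    hits = 0
    for f, t in zip(found, truth):
        hits += len(set(f[:k].tolist()) & set(t[:k].tolist()))
    return hits / (len(truth) * k)
```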

Chosen operating point: C=2000, with recall 0.9878, 4227 QPS, p50 3.09 ms, p99 3.15 ms. (+4 recall points over C=500 for ~30% fewer QPS and ~1 ms more latency.)
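
The trade-off between any two rerank widths can be recomputed directly from the table rows. The helper below is a small sketch with a hypothetical name; the numbers are copied from the table and nothing is re-measured.

```python
# rerank C -> (recall@10, QPS, p50 ms, p99 ms), copied from the results table
ROWS = {
    500: (0.9474, 6041, 2.05, 2.10),
    1000: (0.9731, 5275, 2.41, 2.49),
    2000: (0.9878, 4227, 3.09, 3.15),
    4000: (0.9944, 3092, 4.40, 4.58),
    8000: (0.9979, 2026, 6.54, 6.88),
}

def tradeoff(c_from: int, c_to: int) -> dict:
    """Change in recall (percentage points), QPS (percent), and p50 latency (ms)."""
    r0, q0, p0, _ = ROWS[c_from]
    r1, q1, p1, _ = ROWS[c_to]
    return {
        "recall_points": (r1 - r0) * 100,
        "qps_change_pct": (q1 / q0 - 1) * 100,
        "p50_delta_ms": p1 - p0,
    }

print(tradeoff(500, 2000))
```

Calling tradeoff(2000, 8000) in the same way shows what the last point of recall costs relative to the chosen operating point.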

What it shows

Matryoshka holds, no collapse. A non-Matryoshka 256-prefix capped out low (0.72, the "bit-floor" dead end). Here recall climbs smoothly to 0.998: the 256-bit code + rotation/residual produces a high-quality Hamming shortlist, so nearly all true neighbours land in the top-C and rerank recovers them. 256-bit + rerank is a real, tunable operating point on a Matryoshka embedding.
256 bits is compact and fast — 31.7 MB of codes for 990k vectors; even at 0.998 recall it holds ~2000 QPS at ~6.5 ms p50.
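
The code footprint follows from the vector count and the code width. The sketch below reproduces the 31.7 MB figure (decimal megabytes) and, for scale, works out the size of the same 256-d vectors stored as f32. That second number is arithmetic, not a measurement from this run, and the variable names are illustrative.

```python
N_VECTORS = 990_000   # base vectors in this run
CODE_BITS = 256       # one sign bit per retained dimension

code_bytes = N_VECTORS * CODE_BITS // 8    # 32 bytes per vector
f32_bytes = N_VECTORS * CODE_BITS * 4      # same 256 dimensions at 4 bytes each

print(f"binary codes: {code_bytes / 1e6:.2f} MB")
print(f"f32 vectors:  {f32_bytes / 1e6:.2f} MB")
print(f"ratio:        {f32_bytes // code_bytes}x")
```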

Caveats

Box size differs from the hardware sweep. This ran on a 16-vCPU c8a.4xlarge; the Cohere sweep (062/064) used 8-vCPU .2xlarge. So the QPS here is not directly comparable to the ~2310 QPS Zen5 Cohere-1024 number. Recall is CPU-independent, so the recall figures are final; the throughput needs an equal-box re-run to compare.
One dataset/embedding (OpenAI text-embedding-3-large). The model is closed; only the vectors are open. A fully-open reproduction would embed with Nomic v1.5 (MRL).
Only the 256-dim point was measured here; the full 512/768/1024/1536 recall-vs-bits curve on the same box was not run.