Commit 4a73033
Document Run 4 (PR1851 + 9 hparams + wd_strong + AR) — best q_ttt yet
Run 4 results, single seed s42:
  pre      = 1.06331 (best pre of session, beats Run 0's 1.06429 by 0.00098)
  q        = 1.07239 (q_gap 0.00908, tightest gap of session)
  q_ttt    = 1.05950 (best q_ttt of session, beats PR openai#1855 published s42
             1.05989 by 0.00039)
  artifact = 16,140,607 B (BUSTS 16 MB cap by 140,607 B with brotli;
             PR openai#1855's pergroup compressor saves ~280 KB,
             which is needed for this hparam stack to fit)
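The deltas quoted in the results block can be re-derived from the raw figures. A minimal Python sketch: variable names are illustrative (not taken from the training scripts), and the 16,000,000 B cap is an assumption inferred from the stated 140,607 B overshoot.

```python
# Re-derive the Run 4 deltas from the figures in this message.
# All names are illustrative; they do not come from the training code.
RUN0_PRE = 1.06429
RUN4_PRE = 1.06331
RUN4_Q = 1.07239
RUN4_Q_TTT = 1.05950
PR1855_S42_Q_TTT = 1.05989
ARTIFACT_BYTES = 16_140_607
CAP_BYTES = 16_000_000  # assumption: "16 MB" means decimal megabytes

pre_gain = round(RUN0_PRE - RUN4_PRE, 5)             # gain over Run 0 pre
q_gap = round(RUN4_Q - RUN4_PRE, 5)                  # quantization gap
q_ttt_gain = round(PR1855_S42_Q_TTT - RUN4_Q_TTT, 5) # gain over PR #1855 s42
over_cap = ARTIFACT_BYTES - CAP_BYTES                # bytes over the cap

print(pre_gain, q_gap, q_ttt_gain, over_cap)
```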
Three findings:
1. The 9 hparams transfer cleanly through to final EMA model quality.
Contrast with paired-head Muon NS (Run 3), which also gave a striking
mid-train signal (-0.0046 at step 4000) but whose gain had washed out
by pre-quant time (+0.00038 vs Run 0). Run 4's mid-train gain
(-0.0059) carried through to pre-quant (-0.00098). Mechanism: the
9 hparams change *what's actually being trained* (tighter clipping
preserves outliers, longer warmdown reshapes convergence, tuned
TTT-LoRA reshapes recovery), not just the optimizer's update
direction.
2. Tightest quant gap of the session (0.00908). Tighter MLP/EMBED
clipping (11.5/14.0) preserves outliers that LQER asymmetric int4
rank-4 correction can exploit, on top of AR's narrowing.
3. Artifact busts cap with brotli alone — confirms PR openai#1855's claim
that their pergroup compressor saves ~280 KB on this stack. With
brotli, even PR openai#1855 itself would land ~16,180,000 B. They needed
pergroup; we need pergroup.
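The size argument in finding 3 reduces to simple byte arithmetic. A rough sketch, assuming the ~280 KB pergroup saving carries over one-for-one to this hparam stack (not measured in Run 4) and reading "KB" and "MB" as decimal units; names are illustrative.

```python
# Project the Run 4 artifact size if the pergroup compressor is swapped in.
# Assumption: the ~280 KB saving transfers unchanged to this stack.
CAP_BYTES = 16_000_000
RUN4_BROTLI_BYTES = 16_140_607
PERGROUP_SAVING_BYTES = 280_000  # "~280 KB", approximate

projected = RUN4_BROTLI_BYTES - PERGROUP_SAVING_BYTES
headroom = CAP_BYTES - projected  # positive means the artifact fits

print(f"projected = {projected:,} B, headroom = {headroom:,} B")
```

Under these assumptions the artifact would fit with roughly 139 KB to spare; the real margin depends on how pergroup behaves on the Run 5 weights.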
This run made the case to pivot to the PR openai#1855 base for Run 5. The
earlier session's choice of PR openai#1851 (yesterday's "no lrzip dispute"
reasoning) is overturned by Run 4's evidence: PR openai#1855 is 0.00037 BPB
ahead at 3-seed mean, ships the pergroup compressor we need to fit the cap,
and the 9 hparams we manually applied transfer cleanly.
Run 5 (queued, auto-launch when Run 4 GPUs free) = PR openai#1855's full env
stack + our wd_strong + AR + COMPRESSOR=pergroup. Expected q_ttt
~1.0590-1.0595 single-seed; 3-seed mean ~1.0593 ± 0.001.
Honest acceptance-bar math:
  SOTA            = 1.06108 (PR openai#1855 3-seed mean)
  Bar             = SOTA - 0.005 nats ≈ 1.0588
  Run 4 single    = 1.05950, +0.00070 short of bar
  Run 5 predicted = 1.0590-1.0595, still 0.0002-0.0007 short
Even best-case Run 5 likely just misses the record bar, by roughly
0.2-0.7 sigma across the predicted range (taking sigma ≈ 0.001).
Best plausible outcome is a non-record submission with documented findings.
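The acceptance-bar math can be checked the same way. A minimal sketch that takes the 1.0588 bar as stated: the nats-to-BPB conversion behind it is not given in this message, so the script only reports the margin the stated numbers imply. SIGMA reuses the ± 0.001 from the Run 5 prediction, which is an assumption about the relevant noise level.

```python
# Shortfall of Run 4 and predicted Run 5 against the stated record bar.
SOTA = 1.06108   # PR #1855 3-seed mean
BAR = 1.0588     # as stated; conversion from 0.005 nats is not reproduced here
RUN4_Q_TTT = 1.05950
RUN5_LO, RUN5_HI = 1.0590, 1.0595
SIGMA = 0.001    # assumption: the "± 0.001" quoted for Run 5

implied_margin = round(SOTA - BAR, 5)   # margin the stated bar implies
run4_short = round(RUN4_Q_TTT - BAR, 5)
run5_short = (round(RUN5_LO - BAR, 5), round(RUN5_HI - BAR, 5))
run5_short_sigma = tuple(round(s / SIGMA, 2) for s in run5_short)

print(implied_margin, run4_short, run5_short, run5_short_sigma)
```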
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 611b598 commit 4a73033
3 files changed
Lines changed: 5793 additions & 0 deletions