Non-record: 11L + 30-Epoch Legal TTT (BPB 1.14252)#526
Conversation
…58 BPB)
- 11-layer depth-recurrence GPT (10 unique BlockCores) with legal score-first TTT
- Novel high-yield TTT recipe: SGD+momentum(0.9), 3 epochs/chunk, freeze first 2 blocks; delivers 2.4× the TTT gain (−0.0165 BPB) of single-epoch AdamW (−0.0068)
- Partial RoPE (16/64 dims) with NTK-aware scaling for better length generalization (see the sketch after this list)
- Value Embeddings (128d) on deep layers 9-10 for richer value representations
- Layer-Norm depth scaling (1/sqrt(layer+1)) for stable deep training
- XSA on the last 4 layers, BigramHash(2048), SmearGate, U-Net skips, SWA, late QAT
- Int6+zstd quantization: 14.79 MB total (1.2 MB headroom under the 16 MB limit)
- Trained on 4xA100-40GB, 5200 steps (~41 min)
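As a rough illustration of the partial-RoPE bullet above, a minimal sketch, assuming 64-dim heads with only the first 16 dims rotated and the common NTK-aware base adjustment base' = base · s^(d/(d−2)); the exact scaling used in this PR may differ:

```python
import torch

def ntk_rope_cache(rot_dims=16, base=10000.0, scale=4.0, seq_len=2048):
    # NTK-aware scaling (assumed form): stretch the rotary base so low
    # frequencies interpolate while high frequencies stay nearly intact.
    ntk_base = base * scale ** (rot_dims / (rot_dims - 2))
    inv_freq = 1.0 / ntk_base ** (torch.arange(0, rot_dims, 2).float() / rot_dims)
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)          # (seq_len, rot_dims // 2)
    return freqs.cos(), freqs.sin()

def apply_partial_rope(q, cos, sin, rot_dims=16):
    # Rotate only the first `rot_dims` of each 64-dim head; the remaining
    # 48 dims pass through with no positional rotation.
    q_rot, q_pass = q[..., :rot_dims], q[..., rot_dims:]
    q1, q2 = q_rot[..., 0::2], q_rot[..., 1::2]
    out = torch.stack((q1 * cos - q2 * sin, q1 * sin + q2 * cos), dim=-1)
    return torch.cat((out.flatten(-2), q_pass), dim=-1)
```

The same rotation would apply to keys; leaving 48 of 64 head dims position-free is what the 16/64 split refers to.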
Force-pushed dbf99ce → dd15641
- Same 11-layer architecture as PR openai#461; only change: TTT_EPOCHS 3 → 30 (see the config sketch after this list)
- TTT gain of −0.0184 BPB (1.1609 → 1.1425), 2.7× the 1-epoch AdamW gain and 12% more than the 3-epoch baseline (−0.0165)
- Systematic epoch sweep over 3/5/10/20/30 epochs: BPB improves overall, though not strictly monotonically
- SGD+momentum(0.9) outperforms AdamW by 0.027 BPB for legal TTT
- 15.48 MB total (520 KB headroom under the 16 MB limit)
- Trained on 4xA100-40GB; eval 3662 s on 1xA100
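Since the diff is a single constant, the change presumably looks roughly like this (module layout and constant names other than TTT_EPOCHS are assumptions):

```python
# ttt_config.py (hypothetical layout; the PR states the only change
# vs. openai#461 is the per-chunk epoch count)
TTT_OPTIMIZER = "sgd"     # SGD + momentum(0.9), per the recipe above
TTT_MOMENTUM = 0.9
TTT_FREEZE_BLOCKS = 2     # first two blocks frozen during TTT
TTT_EPOCHS = 30           # was 3 in PR openai#461
```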
Force-pushed dd15641 → 6edbca9
Community Review — Non-record: 11L + 30-Epoch Legal TTT (BPB 1.14252)

BPB: 1.14252 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 6edbca9): the TTT path at line 842 implements the score-first-per-chunk pattern: each chunk is scored under the current weights before the adapter trains on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=71738 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
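For context, a CPU smoke test of the kind quoted above can be as small as the following sketch; the module name, `build_model` factory, and config attributes are assumptions, not the actual CT2038 proteus-engine harness:

```python
import importlib
import os
import time

t0 = time.perf_counter()
mod = importlib.import_module("model")        # assumed module name
import_s = time.perf_counter() - t0

net = mod.build_model()                       # assumed factory function
cfg = net.config                              # assumed config attributes
code_bytes = os.path.getsize(mod.__file__)

print(f"import OK in {import_s:.2f}s, dim={cfg.dim}, layers={cfg.n_layers}, "
      f"vocab={cfg.vocab_size}, code={code_bytes} B")
assert (cfg.dim, cfg.n_layers, cfg.vocab_size) == (512, 11, 1024)
print("SMOKE_TEST_PASS")
```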
11L Depth Recurrence + 30-Epoch Legal Score-First TTT
val_bpb = 1.14252 | Pre-TTT: 1.1609 | TTT gain: −0.0184 | Artifact: 15.48 MB
Key Finding: SGD with 30 epochs per chunk yields large TTT gains
This submission builds on PR #461 (BPB 1.14458) with a single change: increasing TTT epochs per chunk from 3 to 30. A sweep from 3 to 30 epochs showed general BPB improvement, though not strictly monotonic at every point.
All results are single runs (no error bars). 40-epoch and 50-epoch runs are in progress at time of submission.
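For reference, a sweep like this takes only a few lines to script; a minimal sketch, where `run_eval_with_ttt` is a hypothetical entry point standing in for the actual eval harness:

```python
import json

# Sweep the per-chunk TTT epoch count; everything else held fixed.
results = {}
for epochs in (3, 5, 10, 20, 30):
    bpb = run_eval_with_ttt(ttt_epochs=epochs)   # hypothetical eval entry point
    results[epochs] = bpb
    print(f"TTT_EPOCHS={epochs}: val_bpb={bpb:.5f}")

with open("epoch_sweep.json", "w") as f:
    json.dump(results, f, indent=2)
```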
Why does more SGD help for legal TTT?
Legal score-first TTT applies SGD per chunk (32K tokens): score → train → advance. With frozen early layers (freeze=2), the remaining 19.9M parameters benefit from extended optimization. The cosine LR schedule across chunks provides natural regularization, preventing overfitting even at 30 epochs.
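To make the mechanics concrete, here is a minimal sketch of a score-first-per-chunk TTT eval loop, assuming a model with a `blocks` attribute that returns `(batch, seq, vocab)` logits; hyperparameter names (`base_lr`, `chunk_len`) and the exact cosine schedule are assumptions, not the PR's verbatim code:

```python
import math
import torch
import torch.nn.functional as F

def legal_ttt_eval(model, tokens, chunk_len=32768, epochs=30,
                   base_lr=1e-3, freeze=2):
    """Score-first-per-chunk TTT sketch: every token is scored before the
    weights update on it (the Issue #402 / #677 legality condition)."""
    # Freeze the first `freeze` blocks; train everything else.
    for i, block in enumerate(model.blocks):
        for p in block.parameters():
            p.requires_grad_(i >= freeze)
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(trainable, lr=base_lr, momentum=0.9)

    chunks = tokens.split(chunk_len)
    total_nll, total_tok = 0.0, 0
    for ci, chunk in enumerate(chunks):
        x, y = chunk[:-1], chunk[1:]
        # 1) SCORE first, under the current (pre-update) weights.
        with torch.no_grad():
            logits = model(x.unsqueeze(0))
            total_nll += F.cross_entropy(logits.squeeze(0), y,
                                         reduction="sum").item()
            total_tok += y.numel()
        # 2) THEN train on the scored chunk; cosine LR decays across chunks.
        lr = base_lr * 0.5 * (1 + math.cos(math.pi * ci / max(len(chunks) - 1, 1)))
        for g in opt.param_groups:
            g["lr"] = lr
        for _ in range(epochs):
            opt.zero_grad(set_to_none=True)
            loss = F.cross_entropy(model(x.unsqueeze(0)).squeeze(0), y)
            loss.backward()
            opt.step()
        # 3) ADVANCE to the next chunk.
    # Nats per token -> bits per token (scale by tokens/byte for true BPB).
    return total_nll / total_tok / math.log(2)
```

The legality property is that step 1 never sees weights that were updated on the chunk being scored; the 30-epoch inner loop only runs after scoring.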
Key experimental findings:
- 30 epochs/chunk with SGD+momentum(0.9) gives a TTT gain of −0.0184 BPB (1.1609 → 1.1425)
- SGD+momentum(0.9) outperforms AdamW by 0.027 BPB for legal TTT
- The epoch sweep (3/5/10/20/30) improves BPB overall, though not strictly monotonically
Comparison to prior submissions
The TTT gain of −0.0184 BPB is 2.7× the gain of our prior 1-epoch AdamW approach (−0.0068 in PR #456) and 12% more than the 3-epoch SGD baseline (−0.0165 in PR #461).
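As a quick sanity check, both ratios follow directly from the reported gains:

```python
gain_30ep_sgd  = 0.0184   # this PR
gain_1ep_adamw = 0.0068   # PR #456
gain_3ep_sgd   = 0.0165   # PR #461

print(gain_30ep_sgd / gain_1ep_adamw)  # ~2.71 -> "2.7x"
print(gain_30ep_sgd / gain_3ep_sgd)    # ~1.12 -> "+12%"
```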
Architecture (unchanged from PR #461)
Identical to PR #461: 11-layer depth-recurrence GPT (10 unique BlockCores), partial RoPE (16/64 dims) with NTK-aware scaling, 128d value embeddings on layers 9-10, LayerNorm depth scaling (1/sqrt(layer+1)), XSA on the last 4 layers, BigramHash(2048), SmearGate, U-Net skips, SWA, late QAT, and Int6+zstd quantization.
Credits
This submission builds on work from many contributors to the parameter-golf competition.
Built on the parameter-golf starter code by Beren Millidge & Keller Jordan.