Non-record: 11L + 30-Epoch Legal TTT (BPB 1.14252) #526

Open

Christopher-Lee-McClendon wants to merge 2 commits into openai:main from Christopher-Lee-McClendon:submission/11L-ve128-partial-rope-legal-ttt-30ep

Conversation


@Christopher-Lee-McClendon (Contributor) commented on Mar 23, 2026

11L Depth Recurrence + 30-Epoch Legal Score-First TTT

val_bpb = 1.14252 | Pre-TTT: 1.1609 | TTT gain: −0.0184 | Artifact: 15.48 MB

Key Finding: SGD with 30 epochs per chunk yields large TTT gains

This submission builds on PR #461 (BPB 1.14458) with a single change: increasing TTT epochs from 3 to 30. A sweep from 3→30 epochs showed general BPB improvement, though not strictly monotonic at every point:

| TTT Epochs | BPB     | Δ vs 3 ep | Notes                             |
|-----------:|---------|-----------|-----------------------------------|
| 3          | 1.14458 | baseline  | PR #461                           |
| 5          | 1.14399 | −0.00059  |                                   |
| 7          | 1.14378 | −0.00080  |                                   |
| 10         | 1.14295 | −0.00163  |                                   |
| 15         | 1.14335 | −0.00123  | Non-monotonic (worse than 10 ep)  |
| 20         | 1.14292 | −0.00166  |                                   |
| 30         | 1.14252 | −0.00206  | This submission                   |

All results are single runs (no error bars). 40-epoch and 50-epoch runs are in progress at time of submission.

Why does more SGD help for legal TTT?

Legal score-first TTT applies SGD per chunk (32K tokens): score → train → advance. With frozen early layers (freeze=2), the remaining 19.9M parameters benefit from extended optimization. The cosine LR schedule across chunks provides natural regularization, preventing overfitting even at 30 epochs.
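To make the loop concrete, here is a minimal sketch of the score-then-train chunk loop described above. It assumes a `base_model` with a `blocks` list whose forward call returns the mean next-token loss, and `chunks` as a list of (inputs, targets) pairs; all names and hyperparameter values are illustrative, not the submission's actual `train_gpt.py` code.

```python
import math
import torch

def legal_ttt(base_model, chunks, epochs=30, base_lr=1e-3):
    """Score each chunk BEFORE adapting on it: score -> train -> advance."""
    # freeze=2: the first two blocks stay frozen; only later parameters adapt
    for block in base_model.blocks[:2]:
        for p in block.parameters():
            p.requires_grad_(False)
    trainable = [p for p in base_model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(trainable, lr=base_lr, momentum=0.9)

    total_loss, total_tokens = 0.0, 0
    n = len(chunks)
    for ci, (inputs, targets) in enumerate(chunks):
        # 1) Score: chunk ci is evaluated under weights adapted only on chunks 0..ci-1
        base_model.eval()
        with torch.no_grad():
            loss = base_model(inputs, targets)
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()

        # is_last_chunk guard: nothing is scored after the final chunk,
        # so adapting on it would be wasted work
        if ci == n - 1:
            break

        # 2) Train: adapt on the chunk that was just scored; cosine LR decay
        # across chunks provides the regularization mentioned above
        lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(n - 1, 1)))
        for group in opt.param_groups:
            group["lr"] = lr
        base_model.train()
        for _ in range(epochs):
            opt.zero_grad(set_to_none=True)
            base_model(inputs, targets).backward()
            opt.step()

    # mean token loss; converting to BPB additionally needs the eval set's
    # tokens-per-byte ratio and a 1/ln(2) factor
    return total_loss / max(total_tokens, 1)
```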

Key experimental findings:

  • SGD >> AdamW for legal TTT (0.027 BPB gap): Adam's moment estimates don't have time to converge in the ~30 steps available per chunk (optimizer sketch after this list)
  • freeze=2 is essential: freeze=0 causes catastrophic per-chunk overfitting in legal TTT
  • 40-epoch and 50-epoch runs are in progress; results were pending at time of submission
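
For reference, a sketch of the two optimizer configurations being compared; the learning rates, betas, and weight decay below are assumptions, not the swept values:

```python
import torch

def make_ttt_optimizer(params, kind="sgd"):
    # SGD + momentum takes useful steps from the very first update, which
    # matters when each chunk affords only a handful of optimizer steps
    if kind == "sgd":
        return torch.optim.SGD(params, lr=1e-3, momentum=0.9)
    # AdamW's bias-corrected second-moment estimates are still warming up
    # over so few steps, consistent with the 0.027 BPB gap reported above
    return torch.optim.AdamW(params, lr=1e-3, betas=(0.9, 0.95), weight_decay=0.0)
```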

Comparison to prior submissions

The TTT gain of −0.0184 represents 2.7× more TTT improvement than our prior 1-epoch AdamW approach (−0.0068 in PR #456), and 12% more than the 3-epoch SGD baseline (−0.0165 in PR #461).
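
For transparency, those multipliers follow directly from the reported deltas:

$$
\frac{0.0184}{0.0068} \approx 2.7, \qquad \frac{0.0184}{0.0165} \approx 1.12 \;(\text{+12\%})
$$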

Architecture (unchanged from PR #461)

  • 11 logical layers (10 unique BlockCores with depth recurrence)
  • dim=512, 8 heads (64 dim/head), 4 KV heads (GQA)
  • MLP 3× expansion (1536), ReLU² activation; SmearGate
  • Partial RoPE (16/64 dims), Value Embeddings (128d on layers 9-10)
  • Layer-Norm Scale, XSA last 4, BigramHash(2048)
  • SWA, Late QAT, int6+zstd quantization (packing sketch after this list)
  • 15,479,992 bytes total (520KB headroom under 16MB limit)
  • Trained on 4×A100-40GB, 5200 steps (~41 min), eval 3662s on 1×A100
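
As a rough illustration of the int6+zstd step named in the list above, here is a minimal pack-and-compress sketch; the per-tensor symmetric quantization scheme, the zstandard level, and the byte layout are assumptions, not the submission's actual artifact format.

```python
import numpy as np
import zstandard as zstd  # pip install zstandard

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor quantization to 6-bit codes in [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def pack6(q: np.ndarray) -> bytes:
    """Pack 6-bit codes tightly: every 4 weights occupy 3 bytes."""
    u = (q.astype(np.int16) + 32).astype(np.uint8)         # shift to [1, 63]
    bits = np.unpackbits(u.reshape(-1, 1), axis=1)[:, 2:]  # keep the low 6 bits
    return np.packbits(bits.reshape(-1)).tobytes()

# One hypothetical MLP matrix at the listed sizes (dim=512, 3x expansion)
w = np.random.randn(512, 1536).astype(np.float32)
q, scale = quantize_int6(w)
blob = zstd.ZstdCompressor(level=19).compress(pack6(q))
print(f"fp32 {w.nbytes} B -> int6+zstd {len(blob)} B")
```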

Credits

This submission builds on work from many contributors to the parameter-golf competition:

Built on the parameter-golf starter code by Beren Millidge & Keller Jordan.

…58 BPB)

- 11-layer depth-recurrence GPT (10 unique BlockCores) with legal score-first TTT
- Novel high-yield TTT recipe: SGD+momentum(0.9), 3 epochs/chunk, freeze first 2 blocks
  delivers 2.4x more TTT gain (-0.0165 BPB) than single-epoch AdamW (-0.0068)
- Partial RoPE (16/64 dims) with NTK-aware scaling for better length generalization
- Value Embeddings (128d) on deep layers 9-10 for richer value representations
- Layer-Norm depth scaling (1/sqrt(layer+1)) for stable deep training
- XSA last 4, BigramHash(2048), SmearGate, U-Net skips, SWA, Late QAT
- Int6+zstd quantization: 14.79MB total (1.2MB headroom under 16MB limit)
- Trained on 4xA100-40GB, 5200 steps (~41 min)
- Same 11-layer architecture as PR openai#461, only change: TTT_EPOCHS 3 -> 30
- TTT gain of -0.0184 BPB (1.1609 -> 1.14252), 2.7x the single-epoch AdamW gain
- Systematic epoch sweep: 3/5/7/10/15/20/30 epochs, broadly improving but not strictly monotonic
- SGD+momentum(0.9) outperforms AdamW by 0.027 BPB for legal TTT
- 15.48MB total (520KB headroom under 16MB limit)
- Trained on 4xA100-40GB, eval 3662s on 1xA100
@Christopher-Lee-McClendon force-pushed the submission/11L-ve128-partial-rope-legal-ttt-30ep branch from dd15641 to 6edbca9 on March 23, 2026 at 15:35
@MatoTeziTanka

Community Review — Non-record: 11L + 30-Epoch Legal TTT (BPB 1.14252)

BPB: 1.14252 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 6edbca9ad123, file records/track_non_record_16mb/2026-03-22_11L_VE128_PartialRoPE_LegalTTT/train_gpt.py):

The TTT path at line 842 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
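
To make that invariant checkable, here is an illustrative validator for the ordering just described; the event-log format is hypothetical, not something the submission's code emits:

```python
def is_score_first(events):
    """events: ordered ("score", i) / ("train", i) tuples from a TTT run."""
    scored, trained = set(), set()
    for kind, i in events:
        if kind == "score":
            if i in trained:        # scored after training on it: illegal
                return False
            scored.add(i)
        else:
            if i not in scored:     # adapting on an unscored chunk: illegal
                return False
            trained.add(i)
    last = max(i for _, i in events)
    return last not in trained      # is_last_chunk guard: final chunk never adapted

assert is_score_first([("score", 0), ("train", 0), ("score", 1), ("train", 1), ("score", 2)])
assert not is_score_first([("train", 0), ("score", 0), ("score", 1)])
```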

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=71738 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
