
Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB)#456

Open
Christopher-Lee-McClendon wants to merge 2 commits into openai:main from Christopher-Lee-McClendon:submission/depth-recurrence-legal-ttt-10L

Conversation

@Christopher-Lee-McClendon (Contributor) commented Mar 22, 2026

Legal Score-First TTT (10L, 1.1532 BPB)

10-layer GPT with competition-legal score-first full-model test-time training,
mixed int5/int6 quantization, and community-standard architecture components.

| Metric | Value |
| --- | --- |
| val_bpb | 1.15321496 |
| Pre-TTT val_bpb | 1.1600 |
| Training | 5,200 steps, 2,283 s on 4×A100-40GB |
| Eval + TTT | 458 s |
| Artifact | 15,980,085 / 16,000,000 bytes |

What's novel

The main contribution is competition-legal full-model TTT integrated into
sliding-window evaluation. Prior legal TTT work (PR #77) used per-document
LoRA adapters with resets. This submission replaces that with a chunked
score-first loop over all 25.5 M parameters — no LoRA, no adapter resets
between documents — giving the model persistent memory across the entire
validation set.

eval_val_sliding_ttt() divides the validation set into 32,768-token chunks, scores
each chunk first (satisfying the "already graded" rule), then takes one
AdamW step per chunk. Cosine LR decay across chunks prevents catastrophic
forgetting. Improvement: 1.1600 → 1.1532 BPB (−0.0068).
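The legality of this loop rests on an ordering invariant: chunk i is always scored under weights that were adapted only on chunks 0..i-1. A minimal pure-Python sketch of that invariant (the function name and callbacks here are illustrative, not the submission's actual code, which operates on model weights):

```python
def score_first_ttt(chunks, score, adapt):
    """Score-first TTT loop: each chunk is graded *before* the model
    adapts on it, so chunk i is always evaluated under weights updated
    only on chunks 0..i-1 (the "already graded" rule)."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # grade under the current weights
        adapt(chunk)                 # then one optimizer step on that chunk
    return losses
```

Note that, unlike per-document LoRA schemes, nothing is reset between chunks, which is what gives the model persistent memory across the validation set.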

Architecture summary

10 layers, d_model=512, 8 heads / 4 KV heads (GQA 2:1), 3× relu² MLP,
BigramHash(10 240), SmearGate, XSA on last 3 layers, U-Net skip connections.
Depth recurrence infrastructure exists in the code but is not active
(unique_layers = num_layers = 10).
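The numbers above can be collected into a config sketch. This is a hypothetical dataclass mirroring the stated hyperparameters; the field names are illustrative and need not match the submission's actual identifiers:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """Illustrative config for the 10L architecture described above."""
    num_layers: int = 10
    d_model: int = 512
    n_head: int = 8
    n_kv_head: int = 4        # GQA 2:1 — two query heads share each KV head
    mlp_mult: int = 3         # 3x relu^2 MLP
    bigram_hash_size: int = 10_240
    xsa_last_layers: int = 3  # XSA on the last 3 layers

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_head  # 512 / 8 = 64
```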

Training recipe

Muon + AdamW, lr 0.025/0.035/0.025 (matrices/embeddings/scalars),
786,432 tokens/step, 20-step warmup → 3,000-step warmdown, SWA from step 4,650,
late QAT, GPTQ-lite on 75% of layers, zstd level-22 compression.
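The "20 warmup → 3,000 warmdown" figures suggest a trapezoidal LR shape, as is common in speedrun-style training code. A sketch under that assumption (the actual schedule may differ in the ramp shapes):

```python
def lr_multiplier(step, total_steps=5200, warmup=20, warmdown=3000):
    """Assumed trapezoidal LR multiplier: linear ramp up over `warmup`
    steps, flat plateau, then linear ramp down to zero over the final
    `warmdown` steps."""
    if step < warmup:
        return step / warmup
    if step > total_steps - warmdown:
        return max((total_steps - step) / warmdown, 0.0)
    return 1.0
```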

TTT details

  • Score-first chunked loop (32,768 tokens/chunk, 1 epoch each)
  • AdamW lr=0.0005, full model unfrozen, cosine decay across chunks
  • Persistent adaptation (no resets between documents)
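The cosine decay of the per-chunk learning rate can be sketched as follows (a hypothetical helper, assuming a standard cosine curve from the stated base lr of 5e-4 toward zero at the final chunk):

```python
import math

def chunk_lr(chunk_idx, num_chunks, base_lr=5e-4):
    """Cosine decay of the per-chunk TTT learning rate: the full
    base_lr on the first chunk, approaching zero by the last, which
    damps late updates and limits catastrophic forgetting."""
    progress = chunk_idx / max(num_chunks - 1, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```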

Credits

This submission builds on work from many contributors to the parameter-golf competition:

Built on the parameter-golf starter code by Beren Millidge & Keller Jordan.

…10L)

- 10-layer GPT with depth recurrence, BigramHash, SmearGate, XSA, U-Net skips
- Mixed int5/int6 quantization + zstd-22 compression (15.9MB artifact)
- Competition-legal score-first TTT: scores each chunk before training on it
- val_bpb: 1.1532 (pre-TTT: 1.1600)
- Trained on 4xA100-40GB, 5200 steps, 2283s training + 458s eval
@Christopher-Lee-McClendon Christopher-Lee-McClendon force-pushed the submission/depth-recurrence-legal-ttt-10L branch from ec44c24 to f5e802b Compare March 23, 2026 15:39
@MatoTeziTanka

Community Review — Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB)

BPB: 1.1532 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA f5e802ba4cf4, file records/track_non_record_16mb/2026-03-22_DepthRecurrence_AggressiveTTT_10L/train_gpt.py):

The TTT path at line 757 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=10, vocab=1024, code=66874 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.


2 participants