
Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB)#456

Open
Christopher-Lee-McClendon wants to merge 2 commits into openai:main from Christopher-Lee-McClendon:submission/depth-recurrence-legal-ttt-10L

Conversation

@Christopher-Lee-McClendon (Contributor) commented Mar 22, 2026

Legal Score-First TTT (10L, 1.1532 BPB)

10-layer GPT with competition-legal score-first full-model test-time training,
mixed int5/int6 quantization, and community-standard architecture components.

| Metric | Value |
| --- | --- |
| val_bpb | 1.15321496 |
| Pre-TTT val_bpb | 1.1600 |
| Training | 5,200 steps, 2,283 s on 4×A100-40GB |
| Eval + TTT | 458 s |
| Artifact | 15,980,085 / 16,000,000 bytes |

What's novel

The main contribution is competition-legal full-model TTT integrated into
sliding-window evaluation. Prior legal TTT work (PR #77) used per-document
LoRA adapters with resets. This submission replaces that with a chunked
score-first loop over all 25.5 M parameters — no LoRA, no adapter resets
between documents — giving the model persistent memory across the entire
validation set.

eval_val_sliding_ttt() divides the validation set into 32,768-token chunks, scores
each chunk first (satisfying the "already graded" rule), then takes one
AdamW step per chunk. Cosine LR decay across chunks prevents catastrophic
forgetting. Improvement: 1.1600 → 1.1532 BPB (−0.0068).
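The legality of this loop rests on an ordering invariant: chunk i is always scored under weights that were adapted only on chunks 0..i-1. A minimal pure-Python sketch of that invariant (the function name and callbacks here are illustrative, not the submission's actual code, which operates on model weights):

```python
def score_first_ttt(chunks, score, adapt):
    """Score-first TTT loop: each chunk is graded *before* the model
    adapts on it, so chunk i is always evaluated under weights updated
    only on chunks 0..i-1 (the "already graded" rule)."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # grade under the current weights
        adapt(chunk)                 # then one optimizer step on that chunk
    return losses
```

Note that, unlike per-document LoRA schemes, nothing is reset between chunks, which is what gives the model persistent memory across the validation set.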

Architecture summary

10 layers, d_model=512, 8 heads / 4 KV heads (GQA 2:1), 3× relu² MLP,
BigramHash(10 240), SmearGate, XSA on last 3 layers, U-Net skip connections.
Depth recurrence infrastructure exists in the code but is not active
(unique_layers = num_layers = 10).
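The numbers above can be collected into a config sketch. This is a hypothetical dataclass mirroring the stated hyperparameters; the field names are illustrative and need not match the submission's actual identifiers:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """Illustrative config for the 10L architecture described above."""
    num_layers: int = 10
    d_model: int = 512
    n_head: int = 8
    n_kv_head: int = 4        # GQA 2:1 — two query heads share each KV head
    mlp_mult: int = 3         # 3x relu^2 MLP
    bigram_hash_size: int = 10_240
    xsa_last_layers: int = 3  # XSA on the last 3 layers

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_head  # 512 / 8 = 64
```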

Training recipe

Muon + AdamW, lr 0.025/0.035/0.025 (matrices/embeddings/scalars),
786,432 tokens/step, 20-step warmup → 3,000-step warmdown, SWA from step 4,650,
late QAT, GPTQ-lite on 75% of layers, zstd level-22 compression.
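The "20 warmup → 3,000 warmdown" figures suggest a trapezoidal LR shape, as is common in speedrun-style training code. A sketch under that assumption (the actual schedule may differ in the ramp shapes):

```python
def lr_multiplier(step, total_steps=5200, warmup=20, warmdown=3000):
    """Assumed trapezoidal LR multiplier: linear ramp up over `warmup`
    steps, flat plateau, then linear ramp down to zero over the final
    `warmdown` steps."""
    if step < warmup:
        return step / warmup
    if step > total_steps - warmdown:
        return max((total_steps - step) / warmdown, 0.0)
    return 1.0
```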

TTT details

  • Score-first chunked loop (32,768 tokens/chunk, 1 epoch each)
  • AdamW lr=0.0005, full model unfrozen, cosine decay across chunks
  • Persistent adaptation (no resets between documents)
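The cosine decay of the per-chunk learning rate can be sketched as follows (a hypothetical helper, assuming a standard cosine curve from the stated base lr of 5e-4 toward zero at the final chunk):

```python
import math

def chunk_lr(chunk_idx, num_chunks, base_lr=5e-4):
    """Cosine decay of the per-chunk TTT learning rate: the full
    base_lr on the first chunk, approaching zero by the last, which
    damps late updates and limits catastrophic forgetting."""
    progress = chunk_idx / max(num_chunks - 1, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```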

Credits

This submission builds on work from many contributors to the parameter-golf competition:

Built on the parameter-golf starter code by Beren Millidge & Keller Jordan.

…10L)

- 10-layer GPT with depth recurrence, BigramHash, SmearGate, XSA, U-Net skips
- Mixed int5/int6 quantization + zstd-22 compression (15.9MB artifact)
- Competition-legal score-first TTT: scores each chunk before training on it
- val_bpb: 1.1532 (pre-TTT: 1.1600)
- Trained on 4xA100-40GB, 5200 steps, 2283s training + 458s eval
@Christopher-Lee-McClendon Christopher-Lee-McClendon force-pushed the submission/depth-recurrence-legal-ttt-10L branch from ec44c24 to f5e802b Compare March 23, 2026 15:39
@MatoTeziTanka

Community Review — Non-record submission: Depth Recurrence + Legal Score-First TTT (10L, 1.1532 BPB)

BPB: 1.1532 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA f5e802ba4cf4, file records/track_non_record_16mb/2026-03-22_DepthRecurrence_AggressiveTTT_10L/train_gpt.py):

The TTT path at line 757 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=10, vocab=1024, code=66874 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.


2 participants