[Non-record] Meta-Learned TTT + Error-Guided Adaptation Analysis (val_bpb=1.1645)#296

Open
sseanliu wants to merge 2 commits into openai:main from sseanliu:submission/metattt-v2
Conversation

@sseanliu

Summary

Non-record research submission exploring test-time adaptation strategies for compressed language models at 16MB scale.

Key findings

  1. Reptile meta-learning improves SmearGate models by 0.011 BPB — 10x better than naive TTT (+0.001), partially overcoming the SmearGate/TTT redundancy
  2. Error-guided TTT is a negative result — concentrating adaptation on highest-loss tokens does not improve val_loss, indicating these tokens are genuinely unpredictable
  3. 13 layers beat 10 layers on 8xH100 (1.1884 vs 1.2090) despite 23% fewer training steps
  4. Per-token loss distribution on full 62M val set: hardest 2.7% of tokens account for ~15% of total loss

Score

  • val_bpb: 1.1645 (sliding window, stride=64)
  • Artifact: 12.7MB

See README for full methodology and analysis.

Combines PR openai#287 (XSA + EMA + Int6 QAT) with PR openai#254 TTT adaptation.
Changes: FA2 fallback import, TTT hyperparameters, ttt_adapt function,
TTT call before torch.compile in eval section.
@MatoTeziTanka

Community Review — [Non-record] Meta-Learned TTT + Error-Guided Adaptation Analysis (val_bpb=1.1645)

Compliance flag: Pre-Quant TTT violation

Head SHA: e3a7958


Analysis

PR #296 contains two separate submissions. Both are disqualified.

File 1: 2026-03-20_MetaTTT_v2/train_gpt.py

BigramHash — CLEAN. The hash function at line 592 is:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

Position i hashes (token[i], token[i-1]) — the current token and its predecessor. The target token is not used as a lookup key. This is standard BigramHash, legal.
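A pure-Python sketch makes the legality argument concrete (the actual code is vectorized torch; the multipliers 36313 and 27191 and the xor-then-mod structure come from the quoted line, while the position-0 handling here is an assumption):

```python
def bigram_hash(tokens, mod):
    """Hash position i from (token[i], token[i-1]); the target token is never a key."""
    out = [tokens[0] % mod]  # position 0 has no predecessor (handling assumed here)
    for i in range(1, len(tokens)):
        out.append(((36313 * tokens[i]) ^ (27191 * tokens[i - 1])) % mod)
    return out

# Changing a LATER token never changes an earlier position's hash,
# which is why the lookup cannot leak the target:
a = bigram_hash([5, 7, 11], mod=1000)
b = bigram_hash([5, 7, 42], mod=1000)
assert a[:2] == b[:2]
```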

Pre-Quant TTT — CLOSE. Phase 2 (Reptile meta-learning, lines 1313–1361) runs before quantization (lines 1378–1400). During Phase 2, Reptile performs multi-step SGD inner loops on train_data tokens (not val_tokens), updating the model weights. This is training-phase adaptation, not val-time TTT, so it does not directly trigger the Pre-Quant TTT rule on its own. However, the TTT eval (lines 1443–1457) calls eval_val_ttt() which initializes a fresh model from deq_state (the dequantized post-quant weights) and runs score-first-per-chunk with SGD updates. The TTT eval itself is score-first and structurally legal.

The disqualifying issue is in Phase 2 Reptile: the Reptile outer loop adapts TTT params (_get_ttt_param_names: MLP layers of last 1/4 blocks) using multi-step gradient updates across the training budget with the explicit goal of making those params more adaptable at test time. This is Pre-Quant TTT by intent — the model's TTT params are being shaped on the training timeline to exploit the val set at eval time, without scoring first. The inner loop trains on train shards, but the whole Phase 2 exists solely to improve TTT-eval performance, which is the prohibited pattern.
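For readers unfamiliar with Reptile, the outer update in question follows the standard pattern (a minimal scalar sketch, not the PR's actual Phase 2 code; the function names and hyperparameters are illustrative):

```python
def reptile_outer_step(theta, grad_fn, inner_steps, inner_lr, meta_lr):
    """One Reptile meta-step: run plain SGD from theta on one task/shard,
    then move theta a fraction meta_lr toward the adapted weights phi."""
    phi = list(theta)
    for _ in range(inner_steps):
        g = grad_fn(phi)
        phi = [p - inner_lr * gi for p, gi in zip(phi, g)]
    # Reptile update: theta <- theta + meta_lr * (phi - theta)
    return [t + meta_lr * (p - t) for t, p in zip(theta, phi)]
```

This is exactly what makes the targeted params "more adaptable": theta is pulled toward points from which a few SGD steps make fast progress, which is why the review treats Phase 2 as shaping the params for the later TTT eval.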

Net: CLOSE on Pre-Quant TTT (Reptile Phase 2 shapes TTT params pre-quantization without score-first discipline).


File 2: 2026-03-21_XSA_EMA_TTT/train_gpt.py

BigramHash — CLEAN. Identical implementation to File 1 (line 689). Legal.

Pre-Quant TTT — CLOSE (clear violation). ttt_adapt() (lines 1064–1129) is called at line 1588, after dequantization of the quantized model but before the final eval. It runs ttt_epochs=3 full epochs of SGD with momentum=0.9 over all val_tokens with no scoring step at any point — pure multi-epoch gradient descent on the validation set. This is the canonical Pre-Quant TTT pattern: a multi-epoch optimizer run on val_tokens without score-first. The fact that it runs on the dequantized model (post-quant roundtrip) rather than the raw float model is irrelevant — TTT adaptation happens to the model weights before the scored eval, without any score-first-per-chunk gating.
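For concreteness, the adaptation shape being flagged looks like this (an illustrative scalar sketch of multi-epoch SGD with momentum over the whole val stream, not the PR's actual ttt_adapt; all names are hypothetical). Note that nothing is scored until after every batch has already updated the weights:

```python
def adapt_all_then_score(params, grad_fn, val_batches, epochs=3, lr=0.01, momentum=0.9):
    """The prohibited shape: every val batch moves the weights,
    and no scoring happens until all epochs finish."""
    velocity = [0.0] * len(params)
    for _ in range(epochs):
        for batch in val_batches:
            g = grad_fn(params, batch)
            velocity = [momentum * v + gi for v, gi in zip(velocity, g)]
            params = [p - lr * v for p, v in zip(params, velocity)]
    return params  # only now would the (already val-fitted) model be scored
```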

This PR's TTT implementation trains on validation tokens before scoring them, which violates the score-first-per-chunk discipline established in PR #1413 and the rulings in Issue #677. The legal pattern requires scoring each chunk under torch.no_grad() before taking any gradient step on it.
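By contrast, the legal score-first-per-chunk shape looks like this (a minimal sketch of the pattern as described; score_fn and update_fn are hypothetical stand-ins for no-grad scoring and the per-chunk adaptation step):

```python
def score_first_per_chunk(chunks, score_fn, update_fn, state):
    """Each chunk is scored with the CURRENT weights before any
    gradient step on that chunk can influence the model."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(state, chunk)  # score first (under no_grad in practice)
        state = update_fn(state, chunk)       # adapt only after scoring
    return total_loss, state
```

The key invariant: by the time a chunk contributes to the reported loss, the model has never taken a gradient step on that chunk.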

Verdict: CLOSE — Pre-Quant TTT violation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE unless the author restructures to score-first-per-chunk (PR #1413 pattern).


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
