[Non-record] Meta-Learned TTT + Error-Guided Adaptation Analysis (val_bpb=1.1645)#296

Open
sseanliu wants to merge 2 commits into openai:main from sseanliu:submission/metattt-v2
Conversation

@sseanliu

Summary

Non-record research submission exploring test-time adaptation strategies for compressed language models at 16MB scale.

Key findings

  1. Reptile meta-learning improves SmearGate models by 0.011 BPB — 10x better than naive TTT (+0.001), partially overcoming the SmearGate/TTT redundancy
  2. Error-guided TTT is a negative result — concentrating adaptation on highest-loss tokens does not improve val_loss, indicating these tokens are genuinely unpredictable
  3. 13 layers beat 10 layers on 8xH100 (1.1884 vs 1.2090) despite 23% fewer training steps
  4. Per-token loss distribution on full 62M val set: hardest 2.7% of tokens account for ~15% of total loss

Score

  • val_bpb: 1.1645 (sliding window, stride=64)
  • Artifact: 12.7MB

See README for full methodology and analysis.

Combines PR openai#287 (XSA + EMA + Int6 QAT) with PR openai#254 TTT adaptation.
Changes: FA2 fallback import, TTT hyperparameters, ttt_adapt function,
TTT call before torch.compile in eval section.
@MatoTeziTanka

Community Review — [Non-record] Meta-Learned TTT + Error-Guided Adaptation Analysis (val_bpb=1.1645)

Compliance flag: Pre-Quant TTT violation

Head SHA: e3a7958


Analysis

PR #296 contains two separate submissions. Both are disqualified.

File 1: 2026-03-20_MetaTTT_v2/train_gpt.py

BigramHash — CLEAN. The hash function at line 592 is:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

Position i hashes (token[i], token[i-1]) — the current token and its predecessor. The target token is not used as a lookup key. This is standard BigramHash, legal.
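A pure-Python sketch makes the legality argument concrete (the actual code is vectorized torch; the multipliers 36313 and 27191 and the xor-then-mod structure come from the quoted line, while the position-0 handling here is an assumption):

```python
def bigram_hash(tokens, mod):
    """Hash position i from (token[i], token[i-1]); the target token is never a key."""
    out = [tokens[0] % mod]  # position 0 has no predecessor (handling assumed here)
    for i in range(1, len(tokens)):
        out.append(((36313 * tokens[i]) ^ (27191 * tokens[i - 1])) % mod)
    return out

# Changing a LATER token never changes an earlier position's hash,
# which is why the lookup cannot leak the target:
a = bigram_hash([5, 7, 11], mod=1000)
b = bigram_hash([5, 7, 42], mod=1000)
assert a[:2] == b[:2]
```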

Pre-Quant TTT — CLOSE. Phase 2 (Reptile meta-learning, lines 1313–1361) runs before quantization (lines 1378–1400). During Phase 2, Reptile performs multi-step SGD inner loops on train_data tokens (not val_tokens), updating the model weights. This is training-phase adaptation, not val-time TTT, so it does not directly trigger the Pre-Quant TTT rule on its own. However, the TTT eval (lines 1443–1457) calls eval_val_ttt() which initializes a fresh model from deq_state (the dequantized post-quant weights) and runs score-first-per-chunk with SGD updates. The TTT eval itself is score-first and structurally legal.

The disqualifying issue is in Phase 2 Reptile: the Reptile outer loop adapts TTT params (_get_ttt_param_names: MLP layers of last 1/4 blocks) using multi-step gradient updates across the training budget with the explicit goal of making those params more adaptable at test time. This is Pre-Quant TTT by intent — the model's TTT params are being shaped on the training timeline to exploit the val set at eval time, without scoring first. The inner loop trains on train shards, but the whole Phase 2 exists solely to improve TTT-eval performance, which is the prohibited pattern.
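For readers unfamiliar with Reptile, the outer update in question follows the standard pattern (a minimal scalar sketch, not the PR's actual Phase 2 code; the function names and hyperparameters are illustrative):

```python
def reptile_outer_step(theta, grad_fn, inner_steps, inner_lr, meta_lr):
    """One Reptile meta-step: run plain SGD from theta on one task/shard,
    then move theta a fraction meta_lr toward the adapted weights phi."""
    phi = list(theta)
    for _ in range(inner_steps):
        g = grad_fn(phi)
        phi = [p - inner_lr * gi for p, gi in zip(phi, g)]
    # Reptile update: theta <- theta + meta_lr * (phi - theta)
    return [t + meta_lr * (p - t) for t, p in zip(theta, phi)]
```

This is exactly what makes the targeted params "more adaptable": theta is pulled toward points from which a few SGD steps make fast progress, which is why the review treats Phase 2 as shaping the params for the later TTT eval.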

Net: CLOSE on Pre-Quant TTT (Reptile Phase 2 shapes TTT params pre-quantization without score-first discipline).


File 2: 2026-03-21_XSA_EMA_TTT/train_gpt.py

BigramHash — CLEAN. Identical implementation to File 1 (line 689). Legal.

Pre-Quant TTT — CLOSE (clear violation). ttt_adapt() (lines 1064–1129) is called at line 1588, after dequantization of the quantized model but before the final eval. It runs ttt_epochs=3 full epochs of SGD with momentum=0.9 over all val_tokens with no scoring step at any point — pure multi-epoch gradient descent on the validation set. This is the canonical Pre-Quant TTT pattern: a multi-epoch optimizer run on val_tokens without score-first. The fact that it runs on the dequantized model (post-quant roundtrip) rather than the raw float model is irrelevant — TTT adaptation happens to the model weights before the scored eval, without any score-first-per-chunk gating.
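For concreteness, the adaptation shape being flagged looks like this (an illustrative scalar sketch of multi-epoch SGD with momentum over the whole val stream, not the PR's actual ttt_adapt; all names are hypothetical). Note that nothing is scored until after every batch has already updated the weights:

```python
def adapt_all_then_score(params, grad_fn, val_batches, epochs=3, lr=0.01, momentum=0.9):
    """The prohibited shape: every val batch moves the weights,
    and no scoring happens until all epochs finish."""
    velocity = [0.0] * len(params)
    for _ in range(epochs):
        for batch in val_batches:
            g = grad_fn(params, batch)
            velocity = [momentum * v + gi for v, gi in zip(velocity, g)]
            params = [p - lr * v for p, v in zip(params, velocity)]
    return params  # only now would the (already val-fitted) model be scored
```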

This PR's TTT implementation trains on validation tokens before scoring them, which violates the score-first-per-chunk discipline established in PR #1413 and the rulings in Issue #677. The legal pattern requires scoring each chunk under torch.no_grad() before taking any gradient step on it.
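By contrast, the legal score-first-per-chunk shape looks like this (a minimal sketch of the pattern as described; score_fn and update_fn are hypothetical stand-ins for no-grad scoring and the per-chunk adaptation step):

```python
def score_first_per_chunk(chunks, score_fn, update_fn, state):
    """Each chunk is scored with the CURRENT weights before any
    gradient step on that chunk can influence the model."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(state, chunk)  # score first (under no_grad in practice)
        state = update_fn(state, chunk)       # adapt only after scoring
    return total_loss, state
```

The key invariant: by the time a chunk contributes to the reported loss, the model has never taken a gradient step on that chunk.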

Verdict: CLOSE — Pre-Quant TTT violation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE unless the author restructures to score-first-per-chunk (PR #1413 pattern).


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
