[Non-record] Meta-Learned TTT + Error-Guided Adaptation Analysis (val_bpb=1.1645)#294

Closed
sseanliu wants to merge 12 commits into openai:main from sseanliu:main

Conversation

@sseanliu

Summary

Non-record research submission exploring test-time adaptation strategies for compressed language models at 16MB scale.

Key findings

  1. Reptile meta-learning improves SmearGate models by 0.011 BPB, 10x the gain of naive TTT (+0.001), partially overcoming the SmearGate/TTT redundancy reported in the competition
  2. Error-guided TTT is a negative result: concentrating the adaptation budget on the highest-loss tokens does not improve val_loss, indicating those tokens are genuinely unpredictable rather than under-adapted
  3. 13 layers beat 10 layers on 8xH100 (val_bpb 1.1884 vs 1.2090) despite 23% fewer training steps
  4. Per-token loss distribution analysis on the full 62M-token val set: the hardest 2.7% of tokens (loss > 7.0) account for ~15% of total loss
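The loss-concentration analysis in finding 4 reduces to a simple computation over per-token losses: count the tokens above a threshold and measure their share of the summed loss. A minimal sketch (the function name and the toy data are illustrative, not from the submission):

```python
def loss_concentration(token_losses, threshold=7.0):
    """Return (fraction of tokens above threshold, their share of total loss)."""
    total = sum(token_losses)
    hard = [l for l in token_losses if l > threshold]
    return len(hard) / len(token_losses), sum(hard) / total

# Toy illustration; the real analysis runs over the full validation set.
losses = [0.5] * 97 + [10.0] * 3   # a few very hard tokens dominate
frac_tokens, frac_loss = loss_concentration(losses)
```

Here 3% of the tokens carry roughly 38% of the loss, the same shape of skew the submission reports (2.7% of tokens, ~15% of loss).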

Score

  • val_bpb: 1.1645 (sliding window, stride=64)
  • Artifact: 12.7MB (well under 16MB)
  • Hardware: 8x H100 SXM, 600s training
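The sliding-window score with stride=64 means each evaluation window contributes only its last 64 tokens to the loss, so every scored token keeps a long left context; summed losses in nats then convert to bits per byte. A sketch under assumptions (the window length of 1024 and both function names are hypothetical; only stride=64 is stated above):

```python
import math

def score_windows(n_tokens, window=1024, stride=64):
    """Yield (ctx_start, end, score_from): each window is scored only on
    tokens from `score_from` to `end`, reusing up to window-stride tokens
    of left context that were already scored by earlier windows."""
    pos = 0
    while pos < n_tokens:
        yield max(0, pos + stride - window), min(pos + stride, n_tokens), pos
        pos += stride

def bpb(total_loss_nats, total_bytes):
    """Bits per byte from a summed next-token loss measured in nats."""
    return total_loss_nats / (math.log(2) * total_bytes)
```

The stride trades evaluation cost for context quality: stride=64 with a 1024 window runs ~16 forward passes per window-length of text but never scores a token with a truncated context.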

Methodology

  • Base: the PR #198 recipe, "11-Layer Int6 + WD=0.04 + SWA + FA3" (val_bpb: 1.1318): 11L, int6+zstd, 3x MLP, SmearGate, BigramHash, SWA, Muon WD=0.04
  • Reptile meta-learning: last 20% of training time, 1576 meta-steps on last 3 blocks' MLPs
  • Error-guided TTT: two-pass eval with rank-4 LoRA on top 2% highest-loss windows
  • Inspired by TTT-E2E (Sun et al., 2025) and SIFT (ICLR 2025 Best Paper)
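The Reptile meta-step used in the methodology above has a very small core: adapt a copy of the weights with a few inner updates, then interpolate the original weights toward the adapted copy. A minimal sketch on flat weight lists, with `inner_update` and both hyperparameter values as placeholder assumptions (the submission applies this to the last 3 blocks' MLPs):

```python
def reptile_step(weights, inner_update, chunks, meta_lr=0.1):
    """One Reptile meta-step: run inner updates on a weight copy, then move
    the original weights a fraction meta_lr toward the adapted copy."""
    adapted = list(weights)
    for chunk in chunks:
        adapted = inner_update(adapted, chunk)   # e.g. an SGD step on one chunk
    return [w + meta_lr * (a - w) for w, a in zip(weights, adapted)]
```

Unlike naive TTT, which commits fully to the adapted weights, the interpolation keeps the base weights in a region from which a few gradient steps adapt well, which is consistent with the 10x gap over naive TTT reported in the findings.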

Files

  • records/track_10min_16mb/2026-03-20_MetaTTT_v2/train_gpt.py — Training with Reptile
  • eval_error_guided_ttt.py — Error-guided TTT evaluation
  • records/track_10min_16mb/2026-03-20_MetaTTT_v2/README.md — Full analysis
  • records/track_10min_16mb/2026-03-20_MetaTTT_v2/submission.json — Metadata

See README for detailed methodology, results, and theoretical context.
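The two-pass error-guided evaluation can be sketched independently of the LoRA details: pass 1 scores every window with the base model, pass 2 re-scores only the top fraction of highest-loss windows after adaptation. The function name and the `adapt_and_score` callback are hypothetical stand-ins for the rank-4 LoRA adaptation described above:

```python
def error_guided_eval(windows, base_loss, adapt_and_score, top_frac=0.02):
    """Two-pass eval: score all windows, then spend the adaptation budget
    only on the top `top_frac` highest-loss windows."""
    losses = [base_loss(w) for w in windows]
    k = max(1, int(len(losses) * top_frac))
    hardest = sorted(range(len(losses)), key=losses.__getitem__, reverse=True)[:k]
    for i in hardest:
        losses[i] = adapt_and_score(windows[i])  # pass 2: adapted re-score
    return sum(losses) / len(losses)
```

The negative result in finding 2 says exactly that the pass-2 re-scores do not come back lower: the highest-loss windows stay hard even after targeted adaptation.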

sseanliu added 12 commits March 19, 2026 17:38
…x 4 recycled = 12 effective layers

Architecture: 3 unique blocks at dim=768 (12 heads, 6 KV heads) recycled 4x each
for 12 effective layers with per-iteration scale/mix params and U-Net skip connections.

13.2M unique params in ~12MB compressed (3.9MB headroom vs 16MB cap).
50% wider representations + 20% more effective depth vs SOTA's 10x512.
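The recycling scheme in this commit, weight-tied blocks reused via modular indexing with cheap per-iteration scale parameters, can be sketched as follows (function and argument names are illustrative; the U-Net skip connections and mix params from the commit message are omitted for brevity):

```python
def recycled_forward(x, blocks, n_cycles=4, scales=None):
    """Apply len(blocks) unique blocks n_cycles times each (weight tying),
    giving len(blocks) * n_cycles effective layers. Each effective layer
    gets its own scalar scale, so depth costs scalars, not full blocks."""
    depth = len(blocks) * n_cycles
    scales = scales or [1.0] * depth
    for layer in range(depth):
        block = blocks[layer % len(blocks)]   # direct modular indexing
        x = x + scales[layer] * block(x)      # residual with per-iteration scale
    return x
```

This is why 13.2M unique parameters can present as 12 effective layers: only the per-iteration scalars grow with effective depth.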
- Extract _apply_block helper for cleaner per-iteration logic
- Remove _block_for_layer, use direct modular indexing
- Reduce code from 1325 to 1321 lines
Three files:
- program.md: Instructions for the AI agent (experiment loop, logging, directions)
- prepare.py: Fixed utilities (data loading, evaluation, quantization, size checking)
- train.py: Modifiable baseline (SOTA architecture, the only file the agent edits)

Based on Karpathy's autoresearch framework, adapted for parameter-golf constraints
(16MB artifact limit, fixed FineWeb dataset, SentencePiece 1024 vocab).
@sseanliu sseanliu closed this Mar 21, 2026