Record: XSA + LoRA TTT (val_bpb=1.1070) #1254
Conversation
Author: Elar Wei (@Elarwei001)
val_bpb: 1.1070
Model size: 14.4 MB
Hardware: 8×H100 SXM

Techniques:
- XSA (Exclusive Self Attention) on all 11 layers
- LoRA TTT (Test-Time Training) with rank=8
- QAT Int6 quantization
- BPE-8192 tokenizer

Attribution:
- @sproos (BPE-8192 tokenizer)
- @LoquiAuris, @MatoTeziTanka (LoRA TTT)
- @jfprincz, @unnir (XSA)
- @abaybektursun (LeakyReLU)
- @signalrush (Int6 QAT)
- @raahilshah, @thwu1 (Training stack)
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors; see the sketch below)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, the frontier is ~1.09 BPB (PR openai#1260).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
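To make approach 01 concrete, here is a minimal sketch of a rank-128 factored MLP block. The widths (`d_model=512`, `hidden=2048`), the GELU activation, and the class name are placeholder assumptions for illustration, not the approach's documented settings.

```python
import torch.nn as nn

class FactoredMLP(nn.Module):
    """Transformer MLP whose up- and down-projections are each stored as a
    rank-128 factor pair, so a d x 4d weight costs (d + 4d) * r parameters
    instead of 4 * d * d, letting more layers fit in the 16 MB budget."""
    def __init__(self, d_model: int = 512, hidden: int = 2048, rank: int = 128):
        super().__init__()
        self.up = nn.Sequential(
            nn.Linear(d_model, rank, bias=False),
            nn.Linear(rank, hidden, bias=False),
        )
        self.act = nn.GELU()
        self.down = nn.Sequential(
            nn.Linear(hidden, rank, bias=False),
            nn.Linear(rank, d_model, bias=False),
        )

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```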
- LoRA rank=4 on Q/K/V/O projections of last 4 layers (blocks 7-10)
- SGD momentum=0.9, lr=0.002 with cosine decay across chunks
- Per-block discriminative LR: block 7 at 0.6x, blocks 8-10 at 1.0x
- Score-first: score chunk under inference_mode before training LoRA
- 2 epochs per chunk, ~57K LoRA params total
- Based on PR openai#1254 LoRA pattern + PR openai#549 score-first loop
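A rough sketch of that per-chunk loop follows. `LoRALinear`, `ttt_score_first`, the chunk iterator, and the bits-per-token bookkeeping are illustrative assumptions, not the submission's actual code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable rank-4 update; the B factor
    starts at zero, so the wrapper is an exact no-op before adaptation."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.lora_a), self.lora_b)

def ttt_score_first(model, chunks, lora_groups, epochs=2):
    """Score-first loop: each chunk is scored under inference_mode with the
    current weights, and only then used to train the LoRA parameters."""
    opt = torch.optim.SGD(lora_groups, lr=2e-3, momentum=0.9)
    base_lrs = [g["lr"] for g in opt.param_groups]
    total_nll, total_tokens = 0.0, 0
    for i, (x, y) in enumerate(chunks):
        with torch.inference_mode():                 # score before adapting
            logits = model(x)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  y.view(-1), reduction="sum")
            total_nll += nll.item()
            total_tokens += y.numel()
        # cosine decay of the per-group learning rates across chunks
        scale = 0.5 * (1.0 + math.cos(math.pi * i / max(len(chunks) - 1, 1)))
        for g, lr0 in zip(opt.param_groups, base_lrs):
            g["lr"] = lr0 * scale
        for _ in range(epochs):                      # adapt on the scored chunk
            opt.zero_grad(set_to_none=True)
            out = model(x)
            loss = F.cross_entropy(out.view(-1, out.size(-1)), y.view(-1))
            loss.backward()
            opt.step()
    return total_nll / total_tokens / math.log(2)    # average bits per token

# Illustrative param groups: block 7 LoRA params at 0.6x LR, blocks 8-10 at 1.0x.
# lora_groups = [
#     {"params": block7_lora_params, "lr": 2e-3 * 0.6},
#     {"params": blocks_8_to_10_lora_params, "lr": 2e-3},
# ]
```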
Novel mechanism: zero-initialized nn.Embedding(4096, 512) created at eval time, trained exclusively through the standard score-first TTT loop. Learns document-local bigram patterns without modifying any artifact weights.

Hash: h = (prev_token * 2039 + curr_token) % 4096
Injection: tok_emb(x) + eval_hash_emb(h), before RMSNorm
Compliance: same score-first pattern as openai#549/openai#1413 TTT precedent.
Precedent for eval-time params: LoRA-TTT (openai#1254, openai#1354).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
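A minimal sketch of that mechanism is below. The class name and the choice of hashing the first position against token 0 are assumptions; the record only specifies the hash formula and the injection point.

```python
import torch
import torch.nn as nn

class EvalHashBigramEmbedding(nn.Module):
    """Zero-initialized eval-time table keyed by a hashed (prev, curr) token
    bigram. Created fresh at eval time; these 4096x512 weights are the only
    parameters the TTT loop updates, so the stored artifact is untouched."""
    def __init__(self, num_buckets: int = 4096, dim: int = 512):
        super().__init__()
        self.emb = nn.Embedding(num_buckets, dim)
        nn.init.zeros_(self.emb.weight)      # starts as an exact no-op
        self.num_buckets = num_buckets

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0                       # assumption: first position hashes against token 0
        h = (prev * 2039 + tokens) % self.num_buckets
        return self.emb(h)

# Injection (per the description): x = tok_emb(idx) + eval_hash_emb(idx),
# applied before the first RMSNorm; the base model's weights stay frozen.
```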
Community Review — Record: XSA + LoRA TTT (val_bpb=1.1070)

Compliance: NEEDS AUTHOR ACTION

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with ModuleNotFoundError: No module named 'modal'. The common pattern behind this class of error in the 2026-04-11 sweep is an unconditional top-level import of a launch-only dependency that is not installed in the smoke-test environment.
Recommendation: could you push a fix so the script imports cleanly, and ping me once it's in? Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'modal'.
Thanks for the careful review. I've fixed the import-time issue in the latest push. The root cause was a top-level import of the modal package, which is not available in the smoke-test environment.

I also re-ran a local compile/import smoke check on the updated training script.

Could you please re-run the compliance audit when convenient? Thank you again.
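For context, one common way to keep a launch-only dependency from breaking a plain CPU import is to guard it; this is a generic sketch, not necessarily the exact fix applied in this PR, and `launch_remote` is an illustrative name.

```python
# Guarded optional import: the cloud-launch dependency is only required when
# the remote path is actually used, so a plain CPU import of the script succeeds.
try:
    import modal          # only needed for the remote-launch entry point
except ImportError:       # e.g. the CPU smoke-test environment
    modal = None

def launch_remote():
    if modal is None:
        raise RuntimeError("The 'modal' package is required for remote launch.")
    ...  # remote-launch logic would go here
```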
Summary
Author: Elar Wei (@Elarwei001)
val_bpb: 1.1070
Artifact size: 14.4 MB (compressed with zlib)
Training time: ~9 min on 8×H100
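For reference, the zlib-compressed artifact size can be checked with a few lines like the following; this is a hedged sketch that assumes the artifact is a serialized state_dict at a path like `artifact.pt`, since the record does not state the exact measurement command.

```python
import io
import zlib
import torch

# Hypothetical size check for a zlib-compressed artifact (names are illustrative).
state = torch.load("artifact.pt", map_location="cpu")
buf = io.BytesIO()
torch.save(state, buf)
compressed = zlib.compress(buf.getvalue(), level=9)
print(f"compressed artifact: {len(compressed) / 1e6:.2f} MB")
```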
Results
Approach
Acknowledgments & Attribution
This submission builds upon the excellent work of the Parameter Golf community; see the attribution list at the top of this PR.
Files
- records/track_10min_16mb/2026-04-02_XSA_LoRA_TTT/README.md — Detailed documentation
- records/track_10min_16mb/2026-04-02_XSA_LoRA_TTT/submission.json — Metadata
- records/track_10min_16mb/2026-04-02_XSA_LoRA_TTT/train_gpt.py — Training script
- records/track_10min_16mb/2026-04-02_XSA_LoRA_TTT/train_seed42.log — Training log

Special thanks to the entire Parameter Golf community for sharing techniques openly!