Record: XSA + LoRA TTT (val_bpb=1.1070) #1254
Conversation
Author: Elar Wei (@Elarwei001)
val_bpb: 1.1070
Model size: 14.4 MB
Hardware: 8×H100 SXM

Techniques:
- XSA (Exclusive Self Attention) on all 11 layers
- LoRA TTT (Test-Time Training) with rank=8
- QAT Int6 quantization
- BPE-8192 tokenizer

Attribution:
- @sproos (BPE-8192 tokenizer)
- @LoquiAuris, @MatoTeziTanka (LoRA TTT)
- @jfprincz, @unnir (XSA)
- @abaybektursun (LeakyReLU)
- @signalrush (Int6 QAT)
- @raahilshah, @thwu1 (Training stack)
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors; see the sketch below)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, the frontier is ~1.09 BPB (PR openai#1260).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
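To make approach 01 concrete, here is a minimal sketch of a rank-128 factored MLP block. The widths (`d_model=512`, `hidden=2048`), the GELU activation, and the class name are placeholder assumptions for illustration, not the approach's documented settings.

```python
import torch.nn as nn

class FactoredMLP(nn.Module):
    """Transformer MLP whose up- and down-projections are each stored as a
    rank-128 factor pair, so a d x 4d weight costs (d + 4d) * r parameters
    instead of 4 * d * d, letting more layers fit in the 16 MB budget."""
    def __init__(self, d_model: int = 512, hidden: int = 2048, rank: int = 128):
        super().__init__()
        self.up = nn.Sequential(
            nn.Linear(d_model, rank, bias=False),
            nn.Linear(rank, hidden, bias=False),
        )
        self.act = nn.GELU()
        self.down = nn.Sequential(
            nn.Linear(hidden, rank, bias=False),
            nn.Linear(rank, d_model, bias=False),
        )

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```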
- LoRA rank=4 on Q/K/V/O projections of last 4 layers (blocks 7-10)
- SGD momentum=0.9, lr=0.002 with cosine decay across chunks
- Per-block discriminative LR: block 7 at 0.6x, blocks 8-10 at 1.0x
- Score-first: score chunk under inference_mode before training LoRA
- 2 epochs per chunk, ~57K LoRA params total
- Based on PR openai#1254 LoRA pattern + PR openai#549 score-first loop
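A rough sketch of that per-chunk loop follows. `LoRALinear`, `ttt_score_first`, the chunk iterator, and the bits-per-token bookkeeping are illustrative assumptions, not the submission's actual code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable rank-4 update; the B factor
    starts at zero, so the wrapper is an exact no-op before adaptation."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.lora_a), self.lora_b)

def ttt_score_first(model, chunks, lora_groups, epochs=2):
    """Score-first loop: each chunk is scored under inference_mode with the
    current weights, and only then used to train the LoRA parameters."""
    opt = torch.optim.SGD(lora_groups, lr=2e-3, momentum=0.9)
    base_lrs = [g["lr"] for g in opt.param_groups]
    total_nll, total_tokens = 0.0, 0
    for i, (x, y) in enumerate(chunks):
        with torch.inference_mode():                 # score before adapting
            logits = model(x)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  y.view(-1), reduction="sum")
            total_nll += nll.item()
            total_tokens += y.numel()
        # cosine decay of the per-group learning rates across chunks
        scale = 0.5 * (1.0 + math.cos(math.pi * i / max(len(chunks) - 1, 1)))
        for g, lr0 in zip(opt.param_groups, base_lrs):
            g["lr"] = lr0 * scale
        for _ in range(epochs):                      # adapt on the scored chunk
            opt.zero_grad(set_to_none=True)
            out = model(x)
            loss = F.cross_entropy(out.view(-1, out.size(-1)), y.view(-1))
            loss.backward()
            opt.step()
    return total_nll / total_tokens / math.log(2)    # average bits per token

# Illustrative param groups: block 7 LoRA params at 0.6x LR, blocks 8-10 at 1.0x.
# lora_groups = [
#     {"params": block7_lora_params, "lr": 2e-3 * 0.6},
#     {"params": blocks_8_to_10_lora_params, "lr": 2e-3},
# ]
```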
Novel mechanism: zero-initialized nn.Embedding(4096, 512) created at eval time, trained exclusively through the standard score-first TTT loop. Learns document-local bigram patterns without modifying any artifact weights.

Hash: h = (prev_token * 2039 + curr_token) % 4096
Injection: tok_emb(x) + eval_hash_emb(h), before RMSNorm
Compliance: same score-first pattern as openai#549/openai#1413 TTT precedent.
Precedent for eval-time params: LoRA-TTT (openai#1254, openai#1354).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
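A minimal sketch of that mechanism is below. The class name and the choice of hashing the first position against token 0 are assumptions; the record only specifies the hash formula and the injection point.

```python
import torch
import torch.nn as nn

class EvalHashBigramEmbedding(nn.Module):
    """Zero-initialized eval-time table keyed by a hashed (prev, curr) token
    bigram. Created fresh at eval time; these 4096x512 weights are the only
    parameters the TTT loop updates, so the stored artifact is untouched."""
    def __init__(self, num_buckets: int = 4096, dim: int = 512):
        super().__init__()
        self.emb = nn.Embedding(num_buckets, dim)
        nn.init.zeros_(self.emb.weight)      # starts as an exact no-op
        self.num_buckets = num_buckets

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0                       # assumption: first position hashes against token 0
        h = (prev * 2039 + tokens) % self.num_buckets
        return self.emb(h)

# Injection (per the description): x = tok_emb(idx) + eval_hash_emb(idx),
# applied before the first RMSNorm; the base model's weights stay frozen.
```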
Community Review — Record: XSA + LoRA TTT (val_bpb=1.1070)

Compliance: NEEDS AUTHOR ACTION

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with ModuleNotFoundError: No module named 'modal'. The common pattern behind this class of error in the 2026-04-11 sweep is an unconditional top-level import of a launch-only dependency that is not installed in the smoke-test environment.
Recommendation: could you push a fix so the script imports cleanly, and ping me once it's in? Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'modal'.
Thanks for the careful review. I've fixed the import-time issue in the latest push. The root cause was a top-level import of the modal package, which is not available in the smoke-test environment.

I also re-ran a local compile/import smoke check on the updated training script.

Could you please re-run the compliance audit when convenient? Thank you again.
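For context, one common way to keep a launch-only dependency from breaking a plain CPU import is to guard it; this is a generic sketch, not necessarily the exact fix applied in this PR, and `launch_remote` is an illustrative name.

```python
# Guarded optional import: the cloud-launch dependency is only required when
# the remote path is actually used, so a plain CPU import of the script succeeds.
try:
    import modal          # only needed for the remote-launch entry point
except ImportError:       # e.g. the CPU smoke-test environment
    modal = None

def launch_remote():
    if modal is None:
        raise RuntimeError("The 'modal' package is required for remote launch.")
    ...  # remote-launch logic would go here
```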
Summary
Author: Elar Wei (@Elarwei001)
val_bpb: 1.1070
Artifact size: 14.4 MB (compressed with zlib)
Training time: ~9 min on 8×H100
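For reference, the zlib-compressed artifact size can be checked with a few lines like the following; this is a hedged sketch that assumes the artifact is a serialized state_dict at a path like `artifact.pt`, since the record does not state the exact measurement command.

```python
import io
import zlib
import torch

# Hypothetical size check for a zlib-compressed artifact (names are illustrative).
state = torch.load("artifact.pt", map_location="cpu")
buf = io.BytesIO()
torch.save(state, buf)
compressed = zlib.compress(buf.getvalue(), level=9)
print(f"compressed artifact: {len(compressed) / 1e6:.2f} MB")
```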
Results
Approach
Acknowledgments & Attribution
This submission builds upon the excellent work of the Parameter Golf community; see the attribution list at the top of this PR.
Files
- records/track_10min_16mb/2026-04-02_XSA_LoRA_TTT/README.md — Detailed documentation
- records/track_10min_16mb/2026-04-02_XSA_LoRA_TTT/submission.json — Metadata
- records/track_10min_16mb/2026-04-02_XSA_LoRA_TTT/train_gpt.py — Training script
- records/track_10min_16mb/2026-04-02_XSA_LoRA_TTT/train_seed42.log — Training log

Special thanks to the entire Parameter Golf community for sharing techniques openly!