Non-record: 11L XSA + SwiGLU + LoRA TTT (val_bpb=1.1573, 1xH100)#2

Open
swapp1990 wants to merge 109 commits into main from
submission/nonrecord-11l-xsa-lora-ttt

Conversation

@swapp1990

Summary

  • val_bpb: 1.1573 (LoRA TTT) | 15.02 MB artifact | 1xH100 PCIe, ~80 min
  • 11-layer transformer: XSA (last 4 layers), SwiGLU 3x MLP, SmearGate, U-Net skips, OrthoInit, Muon WD=0.04, SWA
  • Mixed quantization: int5-MLP + int6-attn + int8-embed + zstd-22
  • Score-then-train LoRA TTT (rank-8, 256-token chunks) brings val_bpb from 1.191 → 1.157
  • 18 experiments over 5 days, from val_bpb=3.10 to 1.1573 (~$50 total compute)
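The rank-8 LoRA TTT update in the summary can be sketched as follows. This is a minimal numpy sketch assuming the standard LoRA parameterization (frozen weight plus a scaled low-rank product); the `alpha` scaling and zero-init convention are common defaults, not taken from this PR's code:

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha=16.0):
    """Frozen weight W adapted by a rank-r LoRA pair: W + (alpha / r) * B @ A."""
    r = A.shape[0]                       # adapter rank, e.g. 8 in this submission
    return W + (alpha / r) * (B @ A)     # B: (d_out, r), A: (r, d_in)

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 32, 8
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))       # trained at test time on token chunks
B = np.zeros((d_out, r))                 # zero-init: adapter starts as a no-op
assert np.allclose(lora_effective_weight(W, A, B), W)
```

Score-then-train means each 256-token chunk is scored with the current weights before the adapter trains on it, so the reported bpb never reflects a chunk the adapter has already seen.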

Why Non-Record

Trained on 1xH100 PCIe with gradient accumulation (~80 min) rather than on 8xH100 in 10 minutes; the architecture is identical to what would run on 8xH100.
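Gradient accumulation here just means averaging micro-batch gradients before each optimizer step, so one GPU reproduces the larger effective batch. A toy sketch with a scalar mean-squared loss (names and the loss are illustrative, not the PR's code):

```python
# For a loss that is a mean over tokens, the mean of per-micro-batch gradients
# equals the full-batch gradient, so 1 GPU with k accumulation steps matches
# k GPUs each seeing one micro-batch per step.
def grad(batch, w):
    # gradient of mean((x - w)**2) with respect to w
    return sum(2 * (w - x) for x in batch) / len(batch)

def accumulated_grad(batches, w):
    return sum(grad(b, w) for b in batches) / len(batches)

full = list(range(8))
micro = [full[i:i + 2] for i in range(0, 8, 2)]   # 4 equal micro-batches
assert abs(grad(full, 0.5) - accumulated_grad(micro, 0.5)) < 1e-12
```

The equivalence holds exactly only for equal-sized micro-batches; unequal final batches need token-weighted averaging.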

Test plan

🤖 Generated with Claude Code

0hq and others added 30 commits March 18, 2026 09:33
MLX Timing Mismatch with Main Script
Fix MLX multi-batch validation memory growth
## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)** — 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization** — int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding which lacks STE fake-quant. Reduces quant penalty from +0.048 to +0.0015 BPB.
3. **Optimized throughput** — seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)** — each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost
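The mixed-precision scheme above can be sketched as symmetric per-row quantization: int6 with 31 levels maps each row to integers in [-15, 15] with a per-row scale, int8 with 127 levels uses [-63, 63]. A hedged numpy sketch — the submission's exact scale and rounding rules may differ:

```python
import numpy as np

def quantize_per_row(W, levels):
    """Symmetric per-row quantization to `levels` integer levels (assumed form)."""
    qmax = (levels - 1) // 2                          # 15 for int6, 63 for int8
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_per_row(q, scale):
    return q * scale

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 256))
q, s = quantize_per_row(W, levels=31)                 # int6-style, 31 levels
assert q.min() >= -15 and q.max() <= 15
# rounding error is bounded by half a quantization step per row
assert np.abs(dequantize_per_row(q, s) - W).max() <= s.max() / 2 + 1e-9
```

Per-row scales cost one float per row but keep outlier rows from blowing up the step size of well-behaved rows, which is why the quant penalty can drop to +0.0015 BPB.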

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |
… 1.2129)

10-layer transformer with mixed-precision export achieving mean val_bpb=1.2129
across 5 seeds on 8xH100 SXM, improving on the naive baseline by 0.0248 nats
(t=34.12, p<<0.001).

Key changes:
- 10 layers (vs 9 baseline)
- Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- FP16 tied embedding export (reduces quant gap)
- Int6 quantization for middle layers 2-7 (fits under 16MB)

Mean artifact size: 15.36MB (under 16MB cap).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aluating the graph after each sub-batch step
Use eager mx.eval() to fix running train script on 16GB Mac devices
Keep tok_emb.weight in fp16 during int8 export (closing the quant gap),
shrink the MLP hidden size to 992 to fit under 16MB, and bump warmdown
to 3600 and the matrix LR to 0.06.

Tested on 8xH100 SXM (2 seeds) and 8xH200 SXM (3 seeds).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* SOTA attempt

* Improve score on SXM

---------

Co-authored-by: spokane-way <spokane@way>
Major upgrade from previous 10L submission (1.2129 -> 1.1652 BPB).

Key changes:
- 9L with MLP_MULT=3 (wider MLP, 3x expansion, 21.8M params)
- QAT: STE fake-quantize simulates int6 during training
- Int6 quantization on all block weights (layers 0-8)
- Sliding window eval (stride=64) for ~0.033 BPB free gain
- FP16 tied embedding + lower LRs (carried over)
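The STE fake-quantize step named above can be sketched as follows. Assumed form: the forward pass sees the int6-rounded weights while the backward pass treats rounding as identity, which autograd frameworks express as `w + stop_gradient(fq(w) - w)`:

```python
import numpy as np

def fake_quantize(w, scale, qmax=15):
    """Forward pass of int6-style fake quantization (31 levels at qmax=15).
    During QAT this is wrapped as w + stop_gradient(fake_quantize(w) - w),
    so gradients flow straight through to the latent fp weights."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.array([0.12, -0.49, 0.50, 1.70])
wq = fake_quantize(w, scale=0.1)        # values snap to the int6 grid,
                                        # and 1.70 clips to qmax * scale
```

Training against the quantized forward values is what shrinks the post-export gap, since the network learns weights that survive rounding.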

5-seed results on 8xH100 SXM:
  Mean slide_bpb: 1.1652 (std=0.0017)
  Mean rt_bpb:    1.1985
  t-statistic:    78.93 (p << 0.001)
  All artifacts under 16MB (mean: 15.64MB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The window_starts filter dropped windows shorter than stride,
silently skipping up to (stride-1) tokens at the end of the
validation set. Now includes all windows with >= 1 scoreable
token, and clamps the score start for short final windows.
Co-authored-by: spokane-way <spokane@way>
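The fixed enumeration described in that commit can be sketched as follows (a minimal sketch with assumed names: it emits every window with at least one scoreable token and keeps the short final window instead of dropping it):

```python
def window_plan(n_tokens, window=1024, stride=64):
    """Return (ctx_start, score_start, end) triples whose scored spans
    tile [0, n_tokens) exactly, including a short final window."""
    plans = [(0, 0, min(window, n_tokens))]        # first window scores everything
    score_start = min(window, n_tokens)
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)  # final window may be < stride
        plans.append((max(0, end - window), score_start, end))
        score_start = end
    return plans

# Every token is scored exactly once, even when n_tokens is not a multiple of
# the stride -- the case the old window_starts filter silently dropped.
plans = window_plan(2500)
scored = [t for _, ss, e in plans for t in range(ss, e)]
assert scored == list(range(2500))
```

With window=1024 and stride=64, every scored token outside the first window sees at least 960 tokens of context, matching the ~960-context claim above.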
abaybektursun and others added 25 commits March 23, 2026 11:27
…ed mean)

LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT
(PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on
openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).
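The activation swap above is relu² with a leaky negative branch: squaring makes the negative side contribute (0.5x)² instead of zero. A scalar sketch (the actual code presumably uses the framework's built-in leaky_relu on tensors):

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) squared: x**2 for x > 0, (slope * x)**2 otherwise."""
    return (x if x > 0 else slope * x) ** 2

assert leaky_relu_sq(2.0) == 4.0      # positive branch: x**2
assert leaky_relu_sq(-2.0) == 1.0     # negative branch: (0.5 * -2)**2
assert leaky_relu_sq(0.0) == 0.0
```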

3-seed results:
  Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB
  Seed 42:   1.1200 bpb, 408s TTT, 15.88 MB
  Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
  Mean:      1.1194 (std 0.0006)

All artifacts under 16MB. All eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Record: 11L + Efficient Partial XSA (val_bpb: 1.1307)
…nt6-mlp3x-wd04-1.1271

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)
…e-lateqat-1.1248

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
…-1.1233

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)
…oard-merged-records

Update README leaderboard with merged record submissions
…u-legal-ttt-1.1183

Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)
…U-Net + NeoMuon + 4xrelu²MLP + Smear + Fact Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8 + Bit-packing LZMA + Stride-16 Eval - 2h (openai#641)

* Notable Non-Record Submission: 1.1239 BPB - 106.2M Binary U-Net (15L 768d 8192BPE relu² 4xMLP FP8 SmearGate, 50k steps)

* Updated README.md for Non-record submission.

---------

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
…What Works, What Doesn't, and Why (openai#363)

* Non-record: depth recurrence + quantization error amplification finding

4 unique blocks × 3 cycles = 12 effective depth, 768d, 3x MLP
BigramHash + XSA + LoRA + Late STE QAT + int8+zstd

Key finding: quantization error amplifies ~900x through recurrence cycles,
making int6 incompatible with weight-sharing architectures. Int8 for shared
blocks reduces the gap from 1.14 to 0.37 bpb.

3-seed mean: 2.0711 bpb (pre-quant), 2.4402 bpb (post-quant int8)
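The amplification mechanism behind that finding can be illustrated with a toy linear loop (illustrative only, not the PR's model): a weight perturbation inside a block that is reused for k cycles compounds roughly like the block's gain to the k-th power, whereas a flat stack sees each perturbed weight only once.

```python
import numpy as np

def loop_output(W, x, cycles):
    # one shared block applied `cycles` times, as in a looped transformer
    for _ in range(cycles):
        x = W @ x
    return x

rng = np.random.default_rng(2)
W = 1.2 * np.eye(16)                         # mildly expansive shared block
x = rng.standard_normal(16)
dW = 1e-3 * rng.standard_normal((16, 16))    # small "quantization" perturbation

err1 = np.linalg.norm(loop_output(W + dW, x, 1) - loop_output(W, x, 1))
err12 = np.linalg.norm(loop_output(W + dW, x, 12) - loop_output(W, x, 12))
assert err12 > 10 * err1   # the same perturbation hurts far more through 12 cycles
```

To first order the 12-cycle error is about 12 · 1.2¹¹ ≈ 89x the single-pass error here; with real depth and gains the compounding can reach the ~900x the writeup reports, which is why int8 (a smaller perturbation) helps shared blocks so much more than int6.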

* docs: comprehensive depth recurrence research writeup

Complete 4-day experimental report on looped transformers in Parameter Golf:
- Controlled flat vs looped comparison: 1.1648 vs 1.1894 bpb (+0.025 gap)
- Noisy QAT: novel technique collapsing quant error from 0.37 to 0.002 bpb
- 3x3 > 2x5 loop finding: more unique blocks with fewer repeats wins
- 12 negative results with specific numbers
- Hyperparameter sweep data (EMA, warmdown, MTP, WD, grad clip)
- Updated training script with all experimental features

* Update README.md

me when I cant write

* fix: remove extra files, update writeup per reviewer feedback

- Remove pr325_train_gpt.py from PR (dev file, not submission)
- Restore original README.md
- Update records/ writeup with v2 content
- Add hyperlink for Ciprian-Florin Ifrim (FIIZiK_)
- Clarify T=0.90 is activation-dependent (relu² specific, found via grid search)

---------

Co-authored-by: Evangeline Kamin <eve@aurora.lan>
…, 1xH100)

11-layer transformer with XSA, SwiGLU, SmearGate, and score-first LoRA TTT.
Trained on 1xH100 PCIe (~80 min). val_bpb: 1.1573, artifact: 15.02 MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
swapp1990 force-pushed the submission/nonrecord-11l-xsa-lora-ttt branch from 1f5091a to 370a048 on March 29, 2026 18:14
swapp1990 and others added 3 commits March 29, 2026 11:17
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
