
11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds) #333

Open

mahsumaktas wants to merge 2 commits into openai:main from mahsumaktas:submission/v2-11L-xsa-swa-1.1538

Conversation

@mahsumaktas

Summary

Mean val_bpb: 1.1565 (3 seeds) | Best: 1.1538 (seed 1337) | Artifact: ~15.9 MB

23 GPU runs on 8xH100 SXM5. Systematic exploration of XSA, EMA vs SWA, depth recurrence, seq curriculum, LR/WD sweep, and MLP scaling.

Techniques

  • 11 transformer layers + XSA on last 4 layers
  • SmearGate + BigramHash(2048) + OrthoInit
  • INT6 per-row quantization + zstd-22 + FP16 tied embedding + Late-K FP16
  • SWA every 50 steps (fp32 accumulation) — bf16 causes catastrophic loss
  • Muon WD=0.04 + grad clip 0.3 + RoPE base 50K
  • Overtone SVD init + Phase-transition residual mixing
  • MLP 2.75x — sweet spot (3x exceeds 16MB with SmearGate at 11L)
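The INT6 per-row export in the list above can be sketched roughly as follows. This is a minimal NumPy sketch and the function names are hypothetical; the PR's actual exporter lives in train_gpt.py and additionally bit-packs the 6-bit integers and compresses with zstd level 22.

```python
import numpy as np

QMAX = 31  # symmetric INT6 range is [-31, 31], i.e. 2**(6-1) - 1

def quantize_int6_per_row(w: np.ndarray):
    """Quantize each row of a weight matrix with its own FP16 scale.

    Hypothetical sketch: the real exporter also packs the integers into
    6-bit words and applies zstd-22 on top.
    """
    scales = np.abs(w).max(axis=1, keepdims=True) / QMAX
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    q = np.clip(np.round(w / scales), -QMAX, QMAX).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_int6_per_row(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct FP32 weights from INT6 codes and per-row FP16 scales."""
    return q.astype(np.float32) * scales.astype(np.float32)
```

Per-row scaling keeps each row's quantization error proportional to that row's own magnitude, which is presumably why it coexists well with the FP16 tied-embedding export noted above.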

Results (3 seeds)

| Seed | Sliding BPB | Post-quant BPB | Artifact |
|------|-------------|----------------|----------|
| 1337 | 1.1538 | 1.1766 | 15.99 MB |
| 42   | 1.1565 | 1.1790 | 15.87 MB |
| 7    | 1.1593 | 1.1820 | 15.93 MB |
| Mean | 1.1565 | | |

Key Findings from 23 Runs

  • EMA(0.997) causes 0.14 BPB quant gap — SWA far better for our stack
  • 11L MLP 3x exceeds 16MB with SmearGate+BigramHash
  • SmearGate removal loses more than MLP 3x gains — bigram context matters
  • XSA needs GQA-compatible v expansion (repeat_interleave, bug found and fixed)
  • Seq curriculum doesn't work — SWA checkpoint incompatibility across seq lengths
  • Depth recurrence works but dim=640 too narrow; dim=768+ exceeds 16MB
  • Higher LR (0.03) improves BPB but worsens compression (larger weights)
  • Late QAT (75%) reduces the quant gap (0.023 -> 0.006) but leaves fewer full-precision training steps
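The GQA-compatible v expansion mentioned in the findings (one value head per query head before XSA reads v) looks roughly like this. NumPy's `np.repeat` on the head axis is the equivalent of torch's `v.repeat_interleave(n_rep, dim=1)`; the function name here is hypothetical.

```python
import numpy as np

def expand_v_for_gqa(v: np.ndarray, n_q_heads: int) -> np.ndarray:
    """Expand grouped value heads so every query head sees a value head.

    v has shape (batch, n_kv_heads, seq, head_dim). Hypothetical sketch:
    in PyTorch this is v.repeat_interleave(n_q_heads // n_kv_heads, dim=1).
    """
    n_kv_heads = v.shape[1]
    assert n_q_heads % n_kv_heads == 0, "query heads must be a multiple of kv heads"
    # Repeat each kv head n_rep times, keeping group members adjacent:
    # kv head order [0, 1] with n_rep=4 becomes [0, 0, 0, 0, 1, 1, 1, 1].
    return np.repeat(v, n_q_heads // n_kv_heads, axis=1)
```

The bug class this guards against is silently broadcasting or reshaping v across mismatched head counts, which pairs each query head with the wrong value group.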

Run command

```shell
NUM_LAYERS=11 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=2048 \
TRAIN_SEQ_LEN=2048 TRAIN_BATCH_TOKENS=524288 MLP_MULT=2.75 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_WEIGHT_DECAY=0.04 WARMDOWN_ITERS=3000 \
SWA_ENABLED=1 SWA_EVERY=50 ROPE_BASE=50000 EVAL_STRIDE=64 \
python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py
```
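The SWA_EVERY=50 averaging with fp32 accumulation amounts to a running mean that never drops to bf16. A hypothetical sketch (the actual implementation is in train_gpt.py): bf16's roughly three decimal digits of mantissa are what make a bf16 accumulator lose the small per-checkpoint increments, the "catastrophic loss" reported above.

```python
import numpy as np

def swa_update(avg: np.ndarray, weights: np.ndarray, n_averaged: int) -> np.ndarray:
    """One running-mean update of the SWA accumulator, kept in float32.

    With a bf16 accumulator the increment (weights - avg) / (n + 1) soon
    falls below bf16 precision and the average stops moving; so the
    accumulator stays fp32 and only the final export is cast down.
    """
    assert avg.dtype == np.float32, "SWA accumulator must stay fp32"
    avg += (weights.astype(np.float32) - avg) / (n_averaged + 1)
    return avg
```

With SWA_EVERY=50, this update would run once every 50 optimizer steps, and only the final averaged weights are cast down for the quantized export.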

Test plan

  • Runs reproducibly on 8xH100 SXM in under 10 minutes
  • Artifact under 16 MB (15.87-15.99 MB)
  • 3-seed validation (mean 1.1565, std 0.0028)
  • Sliding window eval completes within 10 minutes

Built with Claude Code

Mahsum and others added 2 commits March 20, 2026 10:19
…l_bpb=1.1754)

Combines 10 orthogonal improvements over the naive baseline:
- Per-row INT6 quantization + zstd-22 compression (13.98 MB artifact)
- FP16 tied embedding export (near-zero quantization gap)
- MLP 2.5x expansion
- SmearGate + BigramHash bigram-aware modules
- OrthoInit + muP scaling + phase-transition residual mixing
- Muon weight decay (0.02)
- Stochastic Weight Averaging (4 checkpoints)
- Sliding window evaluation (stride=64)
- Tuned hyperparameters (grad_clip=0.3, warmdown=3000)

8xH100 SXM, 9919 steps in 10 minutes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major upgrade from V1 to V2 with 23 GPU runs on 8xH100:
- 11 layers (was 9) + XSA on last 4 layers
- MLP 2.75x (was 2.5x) — sweet spot for 16MB at 11L
- RoPE base 50K, LR 0.025, SmearGate + BigramHash(2048)
- SWA/50 with fp32 accumulation (bf16 catastrophic fix)
- OrthoInit + Overtone SVD + Phase-transition residual mixing
- INT6 + zstd-22 + FP16 tied embed + Late-K FP16

3-seed validation: 1.1538 / 1.1565 / 1.1593 (mean 1.1565, std 0.0028)
Artifact: 15.87-15.99 MB (all under 16MB)

23 runs tested: EMA, depth recurrence, seq curriculum, LR sweep,
WD sweep, QAT, MLP 3x — documented in README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
Two-stage investigation into training data selection for Parameter Golf:

Stage 1 (shard-level): 8 scoring methods, validated M5 (val-CE) as most
reliable (rho=0.984). But all 80 shards have nearly identical bigram
statistics (CE spread: 0.018 bits). Shard reordering: -0.001 BPB (noise).

Stage 2 (chunk-level): Scored 244K chunks at 32K granularity. Within-shard
variance is 535x larger than between-shard. Selected top 12% by bigram CE
and by 17M-param neural proxy. Both made val_bpb worse (+0.007, +0.006).

Curriculum learning (8xH100, 3 seeds): Hardest-first ordering by model
perplexity. Mean delta: -0.0006, one seed regressed. 95% CI spans zero.

Conclusion: On FineWeb (already filtered), hard data selection trades
diversity for match quality, and diversity wins. Corroborated by PRs openai#737,
openai#623, openai#333 and Sachdeva et al. (ICLR 2025).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
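The chunk-level bigram-CE scoring described in that commit can be sketched as follows. This is a hypothetical add-one-smoothed scorer over bytes; the actual scorer in parameter-golf may differ.

```python
import math
from collections import Counter

def bigram_cross_entropy(chunk: bytes, ref_bigrams: Counter,
                         ref_unigrams: Counter, vocab_size: int = 256) -> float:
    """Bits per bigram of `chunk` under a reference bigram model.

    Hypothetical sketch: add-one smoothing keeps unseen bigrams finite,
    and the total is normalized by bigram count so chunks of different
    lengths are comparable.
    """
    total_bits = 0.0
    for a, b in zip(chunk, chunk[1:]):
        p = (ref_bigrams[(a, b)] + 1) / (ref_unigrams[a] + vocab_size)
        total_bits -= math.log2(p)
    return total_bits / max(len(chunk) - 1, 1)
```

Low scores mark chunks whose bigram statistics match the reference corpus; the commit's finding is that keeping only such chunks made val_bpb worse, i.e. diversity beat match quality on already-filtered FineWeb.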
@MatoTeziTanka

Community Review — 11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds)

BPB: 1.1565 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 871fd864e7b3, file records/track_10min_16mb/2026-03-20_CombinedSOTA_INT6_SmearGate_BigramHash_SWA/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=10, vocab=1024, code=66074 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
