
11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds) #333

Open

mahsumaktas wants to merge 2 commits into openai:main from mahsumaktas:submission/v2-11L-xsa-swa-1.1538

Conversation

@mahsumaktas

Summary

Mean val_bpb: 1.1565 (3 seeds) | Best: 1.1538 (seed 1337) | Artifact: ~15.9 MB

23 GPU runs on 8xH100 SXM5. Systematic exploration of XSA, EMA vs SWA, depth recurrence, seq curriculum, LR/WD sweep, and MLP scaling.

Techniques

  • 11 transformer layers + XSA on last 4 layers
  • SmearGate + BigramHash(2048) + OrthoInit
  • INT6 per-row quantization + zstd-22 + FP16 tied embedding + Late-K FP16
  • SWA every 50 steps (fp32 accumulation) — bf16 causes catastrophic loss
  • Muon WD=0.04 + grad clip 0.3 + RoPE base 50K
  • Overtone SVD init + Phase-transition residual mixing
  • MLP 2.75x — sweet spot (3x exceeds 16MB with SmearGate at 11L)
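The INT6 per-row export in the list above can be sketched roughly as follows. This is a minimal NumPy sketch and the function names are hypothetical; the PR's actual exporter lives in train_gpt.py and additionally bit-packs the 6-bit integers and compresses with zstd level 22.

```python
import numpy as np

QMAX = 31  # symmetric INT6 range is [-31, 31], i.e. 2**(6-1) - 1

def quantize_int6_per_row(w: np.ndarray):
    """Quantize each row of a weight matrix with its own FP16 scale.

    Hypothetical sketch: the real exporter also packs the integers into
    6-bit words and applies zstd-22 on top.
    """
    scales = np.abs(w).max(axis=1, keepdims=True) / QMAX
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    q = np.clip(np.round(w / scales), -QMAX, QMAX).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_int6_per_row(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct FP32 weights from INT6 codes and per-row FP16 scales."""
    return q.astype(np.float32) * scales.astype(np.float32)
```

Per-row scaling keeps each row's quantization error proportional to that row's own magnitude, which is presumably why it coexists well with the FP16 tied-embedding export noted above.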

Results (3 seeds)

| Seed | Sliding BPB | Post-quant BPB | Artifact |
|------|-------------|----------------|----------|
| 1337 | 1.1538 | 1.1766 | 15.99 MB |
| 42   | 1.1565 | 1.1790 | 15.87 MB |
| 7    | 1.1593 | 1.1820 | 15.93 MB |
| Mean | 1.1565 | | |

Key Findings from 23 Runs

  • EMA(0.997) causes 0.14 BPB quant gap — SWA far better for our stack
  • 11L MLP 3x exceeds 16MB with SmearGate+BigramHash
  • SmearGate removal loses more than MLP 3x gains — bigram context matters
  • XSA needs GQA-compatible v expansion (repeat_interleave, bug found and fixed)
  • Seq curriculum doesn't work — SWA checkpoint incompatibility across seq lengths
  • Depth recurrence works but dim=640 too narrow; dim=768+ exceeds 16MB
  • Higher LR (0.03) improves BPB but worsens compression (larger weights)
  • Late QAT (75%) reduces the quant gap (0.023 -> 0.006) but leaves fewer full-precision training steps
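The GQA-compatible v expansion mentioned in the findings (one value head per query head before XSA reads v) looks roughly like this. NumPy's `np.repeat` on the head axis is the equivalent of torch's `v.repeat_interleave(n_rep, dim=1)`; the function name here is hypothetical.

```python
import numpy as np

def expand_v_for_gqa(v: np.ndarray, n_q_heads: int) -> np.ndarray:
    """Expand grouped value heads so every query head sees a value head.

    v has shape (batch, n_kv_heads, seq, head_dim). Hypothetical sketch:
    in PyTorch this is v.repeat_interleave(n_q_heads // n_kv_heads, dim=1).
    """
    n_kv_heads = v.shape[1]
    assert n_q_heads % n_kv_heads == 0, "query heads must be a multiple of kv heads"
    # Repeat each kv head n_rep times, keeping group members adjacent:
    # kv head order [0, 1] with n_rep=4 becomes [0, 0, 0, 0, 1, 1, 1, 1].
    return np.repeat(v, n_q_heads // n_kv_heads, axis=1)
```

The bug class this guards against is silently broadcasting or reshaping v across mismatched head counts, which pairs each query head with the wrong value group.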

Run command

```shell
NUM_LAYERS=11 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=2048 \
TRAIN_SEQ_LEN=2048 TRAIN_BATCH_TOKENS=524288 MLP_MULT=2.75 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_WEIGHT_DECAY=0.04 WARMDOWN_ITERS=3000 \
SWA_ENABLED=1 SWA_EVERY=50 ROPE_BASE=50000 EVAL_STRIDE=64 \
python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py
```
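The SWA_EVERY=50 averaging with fp32 accumulation amounts to a running mean that never drops to bf16. A hypothetical sketch (the actual implementation is in train_gpt.py): bf16's roughly three decimal digits of mantissa are what make a bf16 accumulator lose the small per-checkpoint increments, the "catastrophic loss" reported above.

```python
import numpy as np

def swa_update(avg: np.ndarray, weights: np.ndarray, n_averaged: int) -> np.ndarray:
    """One running-mean update of the SWA accumulator, kept in float32.

    With a bf16 accumulator the increment (weights - avg) / (n + 1) soon
    falls below bf16 precision and the average stops moving; so the
    accumulator stays fp32 and only the final export is cast down.
    """
    assert avg.dtype == np.float32, "SWA accumulator must stay fp32"
    avg += (weights.astype(np.float32) - avg) / (n_averaged + 1)
    return avg
```

With SWA_EVERY=50, this update would run once every 50 optimizer steps, and only the final averaged weights are cast down for the quantized export.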

Test plan

  • Runs reproducibly on 8xH100 SXM in under 10 minutes
  • Artifact under 16 MB (15.87-15.99 MB)
  • 3-seed validation (mean 1.1565, std 0.0028)
  • Sliding window eval completes within 10 minutes

Built with Claude Code

Mahsum and others added 2 commits March 20, 2026 10:19
…l_bpb=1.1754)

Combines 10 orthogonal improvements over the naive baseline:
- Per-row INT6 quantization + zstd-22 compression (13.98 MB artifact)
- FP16 tied embedding export (near-zero quantization gap)
- MLP 2.5x expansion
- SmearGate + BigramHash bigram-aware modules
- OrthoInit + muP scaling + phase-transition residual mixing
- Muon weight decay (0.02)
- Stochastic Weight Averaging (4 checkpoints)
- Sliding window evaluation (stride=64)
- Tuned hyperparameters (grad_clip=0.3, warmdown=3000)

8xH100 SXM, 9919 steps in 10 minutes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major upgrade from V1 to V2 with 23 GPU runs on 8xH100:
- 11 layers (was 9) + XSA on last 4 layers
- MLP 2.75x (was 2.5x) — sweet spot for 16MB at 11L
- RoPE base 50K, LR 0.025, SmearGate + BigramHash(2048)
- SWA/50 with fp32 accumulation (bf16 catastrophic fix)
- OrthoInit + Overtone SVD + Phase-transition residual mixing
- INT6 + zstd-22 + FP16 tied embed + Late-K FP16

3-seed validation: 1.1538 / 1.1565 / 1.1593 (mean 1.1565, std 0.0028)
Artifact: 15.87-15.99 MB (all under 16MB)

23 runs tested: EMA, depth recurrence, seq curriculum, LR sweep,
WD sweep, QAT, MLP 3x — documented in README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
Two-stage investigation into training data selection for Parameter Golf:

Stage 1 (shard-level): 8 scoring methods, validated M5 (val-CE) as most
reliable (rho=0.984). But all 80 shards have nearly identical bigram
statistics (CE spread: 0.018 bits). Shard reordering: -0.001 BPB (noise).

Stage 2 (chunk-level): Scored 244K chunks at 32K granularity. Within-shard
variance is 535x larger than between-shard. Selected top 12% by bigram CE
and by 17M-param neural proxy. Both made val_bpb worse (+0.007, +0.006).

Curriculum learning (8xH100, 3 seeds): Hardest-first ordering by model
perplexity. Mean delta: -0.0006, one seed regressed. 95% CI spans zero.

Conclusion: On FineWeb (already filtered), hard data selection trades
diversity for match quality, and diversity wins. Corroborated by PRs openai#737,
openai#623, openai#333 and Sachdeva et al. (ICLR 2025).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
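The chunk-level bigram-CE scoring described in that commit can be sketched as follows. This is a hypothetical add-one-smoothed scorer over bytes; the actual scorer in parameter-golf may differ.

```python
import math
from collections import Counter

def bigram_cross_entropy(chunk: bytes, ref_bigrams: Counter,
                         ref_unigrams: Counter, vocab_size: int = 256) -> float:
    """Bits per bigram of `chunk` under a reference bigram model.

    Hypothetical sketch: add-one smoothing keeps unseen bigrams finite,
    and the total is normalized by bigram count so chunks of different
    lengths are comparable.
    """
    total_bits = 0.0
    for a, b in zip(chunk, chunk[1:]):
        p = (ref_bigrams[(a, b)] + 1) / (ref_unigrams[a] + vocab_size)
        total_bits -= math.log2(p)
    return total_bits / max(len(chunk) - 1, 1)
```

Low scores mark chunks whose bigram statistics match the reference corpus; the commit's finding is that keeping only such chunks made val_bpb worse, i.e. diversity beat match quality on already-filtered FineWeb.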
@MatoTeziTanka

Community Review — 11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds)

BPB: 1.1565 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 871fd864e7b3, file records/track_10min_16mb/2026-03-20_CombinedSOTA_INT6_SmearGate_BigramHash_SWA/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=10, vocab=1024, code=66074 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
