
PR #180 SOTA: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 #1

Open
joey00072 wants to merge 8 commits into main from interleaved-bigram

Conversation

@joey00072
Owner

Reproduces openai/parameter-golf PR openai#180.

Summary

PR openai#180 Techniques

Metrics

Seed    val_bpb
42      1.14271
1337    1.14298
2024    1.14260
Mean    1.14276

🤖 Generated with Claude Code

joey00072 and others added 8 commits March 23, 2026 19:10
…0.04

Reproduce openai/parameter-golf PR openai#180 (val_bpb 1.14276, 3-seed mean).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace every other MLP (layers 0,2,4,6,8) with BigLU — an MLP where the
hidden state is gated by a per-layer bigram embedding (vocab=2048, dim=hidden,
expansion scale=1). Reduce mlp_mult 3.0→1.5 (hidden 1536→768) so total MLP
params stay identical to PR openai#180 (15.73M).

- Muon for up/down weights; AdamW for bigram embed tables (like main bigram)
- bigram.embed excluded from matrix_params to avoid Muon
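
A minimal sketch of what the BigLU block above could look like (class and attribute names, and the assumption that bigram IDs arrive pre-hashed, are illustrative; only the gating structure, the 2048-entry per-layer table, the relu² activation, and the Muon/AdamW split come from the commit):

```python
import torch
import torch.nn as nn

class BigLU(nn.Module):
    """MLP whose hidden state is gated by a per-layer bigram embedding (sketch)."""

    def __init__(self, dim: int, hidden: int, bigram_vocab: int = 2048):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)    # optimized by Muon
        self.down = nn.Linear(hidden, dim, bias=False)  # optimized by Muon
        # per-layer gate table, optimized by AdamW (excluded from matrix_params)
        self.bigram_embed = nn.Embedding(bigram_vocab, hidden)

    def forward(self, x: torch.Tensor, bigram_ids: torch.Tensor) -> torch.Tensor:
        # bigram_ids: (B, T) hashes of (prev_token, token) pairs into [0, 2048)
        gate = self.bigram_embed(bigram_ids)            # (B, T, hidden)
        h = torch.relu(self.up(x)).square() * gate      # relu²; a later commit switches to F.silu
        return self.down(h)
```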

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add BIGLU_MULT env var (default 1.0): biglu_hidden = biglu_mult * dim = 512
  independent of mlp_mult=3.0 (hidden=1536) — per-layer params identical
- Switch BigLU activation from relu² to F.silu
- run.sh/local.sh: drop MLP_MULT override, add BIGLU_MULT=1.0
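
A rough sketch of the env-var wiring described above (variable names and the dim=512 model width are inferred from biglu_mult * dim = 512; the repo's actual config plumbing may differ):

```python
import os

dim = 512                                            # width implied by 1.0 * dim = 512
mlp_mult = float(os.environ.get("MLP_MULT", "3.0"))
biglu_mult = float(os.environ.get("BIGLU_MULT", "1.0"))

mlp_hidden = int(mlp_mult * dim)                     # 1536: plain MLP layers keep mlp_mult
biglu_hidden = int(biglu_mult * dim)                 # 512: BigLU hidden is now independent
```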

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BigLU bigram embed tables use token_lr * biglu_bigram_lr_mult for faster
adaptation of the per-layer bigram gates.
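
One way to express the scaled learning rate, as a sketch (the parameter-name filter, betas, and concrete multiplier value are assumptions; only the token_lr * biglu_bigram_lr_mult scaling for the BigLU tables comes from the commit):

```python
import torch

def build_adamw(model: torch.nn.Module, token_lr: float, biglu_bigram_lr_mult: float):
    """Sketch: BigLU bigram tables get a faster learning rate than the other embeddings."""
    biglu_tables, other_embeds = [], []
    for name, p in model.named_parameters():
        if "bigram" in name and "embed" in name:   # per-layer BigLU gate tables
            biglu_tables.append(p)
        elif "embed" in name:                      # token / main bigram embeddings
            other_embeds.append(p)
        # matrix params are handled by Muon elsewhere and skipped here
    return torch.optim.AdamW([
        {"params": other_embeds, "lr": token_lr},
        {"params": biglu_tables, "lr": token_lr * biglu_bigram_lr_mult},
    ], betas=(0.9, 0.95))
```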

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GRAD_ACCUM_STEPS env var overrides the default 8//world_size, enabling
single-GPU runs that match run.sh's effective batch (786432 tokens/step).
local.sh starts at GRAD_ACCUM_STEPS=64 (12 seqs×1024 micro-batch);
increment to 128/192/384/768 if OOM.
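
A sketch of the override and the token accounting behind it (variable names are assumptions; the numbers come from the commit and local.sh):

```python
import os

world_size = int(os.environ.get("WORLD_SIZE", "1"))
# default matches the multi-GPU run; GRAD_ACCUM_STEPS overrides it for single-GPU use
grad_accum_steps = int(os.environ.get("GRAD_ACCUM_STEPS", str(8 // world_size)))

seq_len = 1024
micro_batch_seqs = 12            # local.sh starting point: 12 seqs per micro-batch
tokens_per_step = world_size * grad_accum_steps * micro_batch_seqs * seq_len
# 1 * 64 * 12 * 1024 = 786432 tokens/step -- the same effective batch as run.sh
```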

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cal.sh

eval_val was incorrectly dividing val batch by grad_accum_steps (a training
concept irrelevant to eval), causing ValueError at high GRAD_ACCUM_STEPS.
local.sh: GRAD_ACCUM_STEPS=128 (6 seqs × 1024 micro-batch, fits in 7.6GB).
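
A minimal sketch of the fix, assuming the evaluation batch is sharded per device (the actual variable names in eval_val may differ):

```python
def val_seqs_per_device(val_batch_size: int, world_size: int) -> int:
    """Eval batching depends only on world_size.

    The buggy version also divided by grad_accum_steps (a training-only knob),
    so a large GRAD_ACCUM_STEPS drove the per-device count to zero and raised
    a ValueError downstream.
    """
    return val_batch_size // world_size
```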

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
FA3 (flash_attn_interface, SM90/H100) → FA2 (flash_attn, taken only when
num_kv_heads == num_heads, i.e. no GQA) → PyTorch SDPA fallback with enable_gqa.
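
A sketch of that backend cascade, assuming inputs in flash-attn's (batch, seq, heads, head_dim) layout; the repo's exact wrapper and the extra num_kv_heads check on the FA2 path are not reproduced here:

```python
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3: SM90/H100 only
    _BACKEND = "fa3"
except ImportError:
    try:
        from flash_attn import flash_attn_func        # FA2
        _BACKEND = "fa2"
    except ImportError:
        flash_attn_func = None
        _BACKEND = "sdpa"

def attend(q, k, v, causal: bool = True):
    if _BACKEND != "sdpa":
        out = flash_attn_func(q, k, v, causal=causal)
        return out[0] if isinstance(out, tuple) else out  # FA3 also returns the LSE
    # SDPA wants (B, H, S, D); enable_gqa lets k/v carry fewer heads than q
    o = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        is_causal=causal, enable_gqa=True,
    )
    return o.transpose(1, 2)
```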

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>