
PR #180 SOTA: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 #1

Open
joey00072 wants to merge 8 commits into main from interleaved-bigram

Conversation

@joey00072
Owner

Reproduces openai/parameter-golf PR openai#180.

Summary

PR openai#180 Techniques

Metrics

Seed    val_bpb
42      1.14271
1337    1.14298
2024    1.14260
Mean    1.14276

🤖 Generated with Claude Code

joey00072 and others added 8 commits March 23, 2026 19:10
…0.04

Reproduce openai/parameter-golf PR openai#180 (val_bpb 1.14276, 3-seed mean).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace every other MLP (layers 0,2,4,6,8) with BigLU — an MLP where the
hidden state is gated by a per-layer bigram embedding (vocab=2048, dim=hidden,
expansion scale=1). Reduce mlp_mult 3.0→1.5 (hidden 1536→768) so total MLP
params stay identical to PR openai#180 (15.73M).

- Muon for up/down weights; AdamW for bigram embed tables (like main bigram)
- bigram.embed excluded from matrix_params to avoid Muon
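
A minimal sketch of what the BigLU block above could look like (class and attribute names, and the assumption that bigram IDs arrive pre-hashed, are illustrative; only the gating structure, the 2048-entry per-layer table, the relu² activation, and the Muon/AdamW split come from the commit):

```python
import torch
import torch.nn as nn

class BigLU(nn.Module):
    """MLP whose hidden state is gated by a per-layer bigram embedding (sketch)."""

    def __init__(self, dim: int, hidden: int, bigram_vocab: int = 2048):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)    # optimized by Muon
        self.down = nn.Linear(hidden, dim, bias=False)  # optimized by Muon
        # per-layer gate table, optimized by AdamW (excluded from matrix_params)
        self.bigram_embed = nn.Embedding(bigram_vocab, hidden)

    def forward(self, x: torch.Tensor, bigram_ids: torch.Tensor) -> torch.Tensor:
        # bigram_ids: (B, T) hashes of (prev_token, token) pairs into [0, 2048)
        gate = self.bigram_embed(bigram_ids)            # (B, T, hidden)
        h = torch.relu(self.up(x)).square() * gate      # relu²; a later commit switches to F.silu
        return self.down(h)
```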

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add BIGLU_MULT env var (default 1.0): biglu_hidden = biglu_mult * dim = 512
  independent of mlp_mult=3.0 (hidden=1536) — per-layer params identical
- Switch BigLU activation from relu² to F.silu
- run.sh/local.sh: drop MLP_MULT override, add BIGLU_MULT=1.0
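
A rough sketch of the env-var wiring described above (variable names and the dim=512 model width are inferred from biglu_mult * dim = 512; the repo's actual config plumbing may differ):

```python
import os

dim = 512                                            # width implied by 1.0 * dim = 512
mlp_mult = float(os.environ.get("MLP_MULT", "3.0"))
biglu_mult = float(os.environ.get("BIGLU_MULT", "1.0"))

mlp_hidden = int(mlp_mult * dim)                     # 1536: plain MLP layers keep mlp_mult
biglu_hidden = int(biglu_mult * dim)                 # 512: BigLU hidden is now independent
```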

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BigLU bigram embed tables use token_lr * biglu_bigram_lr_mult for faster
adaptation of the per-layer bigram gates.
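
One way to express the scaled learning rate, as a sketch (the parameter-name filter, betas, and concrete multiplier value are assumptions; only the token_lr * biglu_bigram_lr_mult scaling for the BigLU tables comes from the commit):

```python
import torch

def build_adamw(model: torch.nn.Module, token_lr: float, biglu_bigram_lr_mult: float):
    """Sketch: BigLU bigram tables get a faster learning rate than the other embeddings."""
    biglu_tables, other_embeds = [], []
    for name, p in model.named_parameters():
        if "bigram" in name and "embed" in name:   # per-layer BigLU gate tables
            biglu_tables.append(p)
        elif "embed" in name:                      # token / main bigram embeddings
            other_embeds.append(p)
        # matrix params are handled by Muon elsewhere and skipped here
    return torch.optim.AdamW([
        {"params": other_embeds, "lr": token_lr},
        {"params": biglu_tables, "lr": token_lr * biglu_bigram_lr_mult},
    ], betas=(0.9, 0.95))
```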

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GRAD_ACCUM_STEPS env var overrides the default 8//world_size, enabling
single-GPU runs that match run.sh's effective batch (786432 tokens/step).
local.sh starts at GRAD_ACCUM_STEPS=64 (12 seqs×1024 micro-batch);
increment to 128/192/384/768 if OOM.
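
A sketch of the override and the token accounting behind it (variable names are assumptions; the numbers come from the commit and local.sh):

```python
import os

world_size = int(os.environ.get("WORLD_SIZE", "1"))
# default matches the multi-GPU run; GRAD_ACCUM_STEPS overrides it for single-GPU use
grad_accum_steps = int(os.environ.get("GRAD_ACCUM_STEPS", str(8 // world_size)))

seq_len = 1024
micro_batch_seqs = 12            # local.sh starting point: 12 seqs per micro-batch
tokens_per_step = world_size * grad_accum_steps * micro_batch_seqs * seq_len
# 1 * 64 * 12 * 1024 = 786432 tokens/step -- the same effective batch as run.sh
```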

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cal.sh

eval_val was incorrectly dividing val batch by grad_accum_steps (a training
concept irrelevant to eval), causing ValueError at high GRAD_ACCUM_STEPS.
local.sh: GRAD_ACCUM_STEPS=128 (6 seqs × 1024 micro-batch, fits in 7.6GB).
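
A minimal sketch of the fix, assuming the evaluation batch is sharded per device (the actual variable names in eval_val may differ):

```python
def val_seqs_per_device(val_batch_size: int, world_size: int) -> int:
    """Eval batching depends only on world_size.

    The buggy version also divided by grad_accum_steps (a training-only knob),
    so a large GRAD_ACCUM_STEPS drove the per-device count to zero and raised
    a ValueError downstream.
    """
    return val_batch_size // world_size
```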

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
FA3 (flash_attn_interface, SM90/H100) → FA2 (flash_attn, taken only when
num_kv_heads == num_heads, i.e. no GQA) → PyTorch SDPA fallback with enable_gqa.
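
A sketch of that backend cascade, assuming inputs in flash-attn's (batch, seq, heads, head_dim) layout; the repo's exact wrapper and the extra num_kv_heads check on the FA2 path are not reproduced here:

```python
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3: SM90/H100 only
    _BACKEND = "fa3"
except ImportError:
    try:
        from flash_attn import flash_attn_func        # FA2
        _BACKEND = "fa2"
    except ImportError:
        flash_attn_func = None
        _BACKEND = "sdpa"

def attend(q, k, v, causal: bool = True):
    if _BACKEND != "sdpa":
        out = flash_attn_func(q, k, v, causal=causal)
        return out[0] if isinstance(out, tuple) else out  # FA3 also returns the LSE
    # SDPA wants (B, H, S, D); enable_gqa lets k/v carry fewer heads than q
    o = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        is_causal=causal, enable_gqa=True,
    )
    return o.transpose(1, 2)
```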

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>