
record: val_bpb=1.1622, NorMuon + int6 STE + SWA + sliding window#89

Open
vmfunc wants to merge 1 commit into openai:main from vmfunc:submission/normuon-int6ste-swa-slidingwindow

Conversation

@vmfunc vmfunc commented Mar 19, 2026

mean val_bpb=1.1622 across 3 seeds on 8xH100 (1.1624, 1.1623, 1.1618). stacks six orthogonal improvements:

- **int6 STE**: fake per-row int6 quantization during training w/ a straight-through estimator, so the model learns to tolerate post-training quantization. quant gap is only +0.002 bpb
- **fp16 embedding passthrough**: the tied embed/logit head stays in fp16 instead of being quantized; it's the most quant-sensitive tensor and gets no STE protection
- **MLP 3x (1536 hidden)**: int6 compression frees enough artifact bytes to fit the wider model
- **NorMuon**: row-normalized Newton-Schulz updates (from modded-nanogpt); second-moment normalization on top of Muon
- **SWA**: stochastic weight averaging over 7 checkpoints during warmdown
- **sliding window eval**: stride=64, so every scored token gets 960 tokens of context; ~0.033 bpb improvement
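the int6 fake-quant step can be sketched as a per-row symmetric quantize/dequantize round trip. this is a minimal numpy sketch of the forward pass only; in training, the straight-through estimator would pass gradients through the rounding unchanged (backward treats the op as identity). function name and details are illustrative, not the submission's actual code:

```python
import numpy as np

def fake_quant_int6_rowwise(w: np.ndarray) -> np.ndarray:
    """Fake-quantize each row of w to a symmetric int6 grid ([-31, 31])
    and immediately dequantize back to float.

    With an STE, the forward pass uses these quantized values while the
    backward pass copies the incoming gradient through unchanged.
    """
    # per-row scale so the largest magnitude in each row maps to 31
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31)  # snap to the int6 grid
    return q * scale                           # dequantize

w = np.array([[0.5, -1.0, 0.25],
              [2.0,  0.0, -2.0]])
wq = fake_quant_int6_rowwise(w)
```

per-row scaling matters here: a single outlier weight only degrades the resolution of its own row, not the whole tensor.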
| run | seed | steps | post-quant bpb | sliding window bpb |
|-----|------|-------|----------------|--------------------|
| 1   | 1337 | 11917 | 1.1956         | 1.1624             |
| 2   | 42   | 11925 | 1.1955         | 1.1623             |
| 3   | 2025 | 11917 | 1.1951         | 1.1618             |

artifact: 15.5MB (code 54KB + int6+zstd model 15.4MB). ~50ms/step, 600s wall clock
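the stride-64 sliding-window eval reduces to index bookkeeping: score the first full window normally, then advance 64 tokens at a time, always conditioning on the trailing 1024-token window. a hypothetical helper sketching that plan (not the submission's eval code):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Plan (ctx_start, score_start, score_end) spans so each token is
    scored exactly once.

    Past the first window, every scored token conditions on at least
    window - stride = 960 tokens of prior context. Hypothetical helper;
    the submission's eval loop may be structured differently.
    """
    spans = []
    scored = 0
    while scored < n_tokens:
        if scored == 0:
            # first window: score everything from the start
            score_end = min(window, n_tokens)
            spans.append((0, 0, score_end))
        else:
            # later windows: score only the last `stride` tokens
            score_end = min(scored + stride, n_tokens)
            spans.append((max(0, score_end - window), scored, score_end))
        scored = score_end
    return spans

spans = sliding_window_spans(2048)
```

the cost tradeoff is window/stride = 16 forward passes per token position versus one for non-overlapping eval, which is where the ~0.033 bpb gain comes from.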

mean val_bpb=1.1622 across 3 seeds (1.1624, 1.1623, 1.1618).
int6 fake quant w/ STE, fp16 embed passthrough, MLP 3x, NorMuon,
stochastic weight averaging during warmdown, sliding window stride=64.
15.5MB artifact, 8xH100, 600s, ~12k steps.
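the warmdown SWA step is a plain elementwise average over the saved checkpoints. a minimal sketch with dicts of numpy arrays standing in for state dicts (not the submission's code):

```python
import numpy as np

def swa_average(checkpoints):
    """Average the parameters of the given checkpoints elementwise.

    The submission averages 7 checkpoints taken during LR warmdown;
    here each checkpoint is a dict mapping parameter name -> array.
    """
    k = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / k
            for name in checkpoints[0]}

# toy example: 7 checkpoints whose single parameter is 0, 1, ..., 6
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = swa_average(ckpts)  # avg["w"] is all 3.0
```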
vmfunc force-pushed the submission/normuon-int6ste-swa-slidingwindow branch from f6a92be to c887ef4 on March 19, 2026 at 15:18
NotADevIAmaMeatPopsicle added a commit to NotADevIAmaMeatPopsicle/parameter-golf that referenced this pull request Mar 19, 2026
NorMuon adds per-row second-moment tracking after Newton-Schulz
orthogonalization, then normalizes and rescales to preserve total
norm. Based on arXiv:2510.05491 and PR openai#89. Expected -0.005 to
-0.010 BPB improvement. Drop-in replacement (same class name).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
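a minimal sketch of the NorMuon post-step that commit describes: an EMA of the per-row second moment of the orthogonalized update, per-row normalization, then a rescale to preserve the update's total norm. beta2/eps values and the Frobenius-norm rescale are assumptions, not read from the PR's code:

```python
import numpy as np

def normuon_step(update, v, beta2=0.95, eps=1e-8):
    """Per-row second-moment normalization applied after Newton-Schulz
    orthogonalization (as in Muon), rescaled to preserve total norm.

    update: orthogonalized gradient, shape (rows, cols)
    v:      running per-row second-moment estimate, shape (rows,)
    beta2/eps are illustrative assumptions, not the PR's values.
    """
    # EMA of the mean squared update per row
    v = beta2 * v + (1 - beta2) * (update ** 2).mean(axis=1)
    # normalize each row by its second-moment estimate
    normed = update / (np.sqrt(v)[:, None] + eps)
    # rescale so the overall Frobenius norm matches the raw update
    normed *= np.linalg.norm(update) / (np.linalg.norm(normed) + eps)
    return normed, v

rng = np.random.default_rng(0)
u = rng.normal(size=(4, 8))
out, v = normuon_step(u, np.zeros(4))
```

since the rescale restores the total norm, this only redistributes update magnitude across rows, which is why it can be a drop-in replacement for the plain Muon step.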
@MatoTeziTanka

Community Review — record: val_bpb=1.1622, NorMuon + int6 STE + SWA + sliding window

BPB: 1.1622 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA c887ef435460, file records/track_10min_16mb/2026-03-19_NorMuon_Int6STE_SWA_SlidingWindow/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=54361 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.


4 participants