Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970) #510
SelfAnush wants to merge 4 commits into openai:main from …
Conversation
… slow TRSM on H100

Non-record: MUD optimizer (arxiv:2603.17970)

Replaces Muon's 5-step Newton-Schulz with MUD's triangular Gram preconditioning. Single seed (42) on 8xH100 SXM.

Results:
- val_bpb: 1.1989 (sliding-window eval, stride=64)
- Steps: 5,087 in 10 min
- step_avg: 118 ms (4.5x slower than Muon's ~26 ms on H100)

Key finding: Strong convergence (within 0.056 BPB of SOTA with 4x fewer steps), but TRSM overhead on H100 CUDA negates the 12x FLOP savings reported in the paper (which was measured on A100/MI250/GH200).

Built on SOTA by @thwu1 (PR openai#180). Paper: https://arxiv.org/abs/2603.17970
Community Review — Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970)

BPB: 1.1989 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA …): Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.09s, dim=512, layers=10, vocab=1024, code=53640 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass; this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it is factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
Summary
Replaces Muon's 5-step Newton-Schulz iteration with MUD's triangular Gram
preconditioning (Algorithm 2 from arxiv:2603.17970, Southworth & Thomas, Mar 2026).
Single seed (42) on 8xH100 SXM. Marked as non-record due to the per-step
throughput regression on H100s.
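For readers skimming the diff, here is a minimal sketch of what the swapped-in whitening step can look like, assuming MUD's triangular Gram preconditioning reduces to a Cholesky factorization of the update's Gram matrix followed by a single triangular solve (the TRSM discussed under Key Finding). Only the name `mud_whiten` comes from this PR; the body, the `eps` jitter, and the short-side transpose are illustrative assumptions, not the paper's Algorithm 2 verbatim.

```python
import torch

def mud_whiten(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Whiten a 2-D update by the inverse Cholesky factor of its Gram matrix (sketch)."""
    transpose = grad.size(0) > grad.size(1)
    G = (grad.T if transpose else grad).float()      # torch.linalg wants fp32/fp64
    A = G @ G.T                                      # Gram matrix, (m, m) with m <= n
    A.diagonal().add_(eps * A.diagonal().mean())     # small jitter so Cholesky never fails
    L = torch.linalg.cholesky(A)                     # lower-triangular factor
    # Single TRSM: solve L @ X = G, i.e. X = L^{-1} @ G, which makes X @ X.T ~ I
    X = torch.linalg.solve_triangular(L, G, upper=False)
    X = X.T if transpose else X
    return X.to(grad.dtype)
```

In this form, the one `solve_triangular` call is the TRSM whose H100 performance accounts for the step-time gap described below.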
Results
Convergence Curve (figure omitted)
vs. Muon SOTA (PR #180) (figure omitted)
Key Finding
MUD achieves strong convergence (1.1989 BPB in only 5,087 steps) but is
4.5x slower per step than Muon on H100s. The paper's throughput claims
(1.3-2.6x over Muon) were measured on A100/MI250/GH200;
`torch.linalg.solve_triangular` on H100 CUDA is not as well-optimized as GEMM on the Hopper architecture.
If MUD could match Muon's step time, extrapolating the convergence curve
suggests it could reach ~1.10 BPB in 20,000 steps.
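To check the GEMM-vs-TRSM gap on your own GPU, a throwaway micro-benchmark along these lines is enough. The matrix size, dtype, and iteration counts are arbitrary illustrative choices, not the shapes this run actually uses.

```python
import time
import torch

def bench_ms(fn, iters=50, warmup=5):
    # Crude CUDA timer: warm up, then average wall-clock time over `iters` calls.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

n = 4096
B = torch.randn(n, n, device="cuda")
M = torch.randn(n, n, device="cuda")
# Build a well-conditioned lower-triangular factor to solve against.
L = torch.linalg.cholesky(M @ M.T + n * torch.eye(n, device="cuda"))

gemm_ms = bench_ms(lambda: M @ B)
trsm_ms = bench_ms(lambda: torch.linalg.solve_triangular(L, B, upper=False))
print(f"GEMM: {gemm_ms:.2f} ms   TRSM: {trsm_ms:.2f} ms   ratio: {trsm_ms / gemm_ms:.1f}x")
```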
What changed
Only the optimizer:
`mud_whiten()` replaces `zeropower_via_newtonschulz5()`. Everything else
(architecture, quantization, training loop) is identical to SOTA by @thwu1 (PR #180).
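For orientation, here is how small the change is at the optimizer level: a minimal Muon-style optimizer with the Newton-Schulz call swapped for the `mud_whiten` sketch above. The hyperparameter defaults, momentum handling, and flattening of non-2-D parameters are generic assumptions about a Muon-like implementation, not this PR's exact code.

```python
import torch

class MudMuon(torch.optim.Optimizer):
    """Illustrative Muon-style optimizer with the preconditioner swapped for mud_whiten."""

    def __init__(self, params, lr=0.02, momentum=0.95, nesterov=True):
        super().__init__(params, dict(lr=lr, momentum=momentum, nesterov=nesterov))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                buf = state.setdefault("momentum_buffer", torch.zeros_like(p.grad))
                buf.mul_(group["momentum"]).add_(p.grad)          # heavy-ball momentum
                g = p.grad.add(buf, alpha=group["momentum"]) if group["nesterov"] else buf
                # Before (Muon, PR #180): g = zeropower_via_newtonschulz5(g, steps=5)
                g2d = g.reshape(g.size(0), -1)                    # flatten conv filters to 2-D
                g = mud_whiten(g2d).view_as(g)                    # MUD (this PR)
                p.add_(g, alpha=-group["lr"])
```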
References
- Southworth & Thomas (Mar 2026), arXiv:2603.17970: https://arxiv.org/abs/2603.17970
- Muon SOTA baseline: PR #180 by @thwu1