Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970) #510
SelfAnush wants to merge 4 commits into openai:main from …
Conversation
… slow TRSM on H100

Non-record: MUD optimizer (arxiv:2603.17970)

Replaces Muon's 5-step Newton-Schulz with MUD's triangular Gram preconditioning. Single seed (42) on 8xH100 SXM.

Results:
- val_bpb: 1.1989 (sliding-window eval, stride=64)
- Steps: 5,087 in 10 min
- step_avg: 118 ms (4.5x slower than Muon's ~26 ms on H100)

Key finding: Strong convergence (within 0.056 BPB of SOTA with 4x fewer steps), but TRSM overhead on H100 CUDA negates the 12x FLOP savings reported in the paper (which was measured on A100/MI250/GH200).

Built on SOTA by @thwu1 (PR openai#180). Paper: https://arxiv.org/abs/2603.17970
Community Review — Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970)

BPB: 1.1989 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA …): Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.09s, dim=512, layers=10, vocab=1024, code=53640 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass; this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it is factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
Summary
Replaces Muon's 5-step Newton-Schulz iteration with MUD's triangular Gram
preconditioning (Algorithm 2 from arxiv:2603.17970, Southworth & Thomas, Mar 2026).
Single seed (42) on 8xH100 SXM. Marked as non-record due to the per-step
throughput regression on H100s.
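For readers skimming the diff, here is a minimal sketch of what the swapped-in whitening step can look like, assuming MUD's triangular Gram preconditioning reduces to a Cholesky factorization of the update's Gram matrix followed by a single triangular solve (the TRSM discussed under Key Finding). Only the name `mud_whiten` comes from this PR; the body, the `eps` jitter, and the short-side transpose are illustrative assumptions, not the paper's Algorithm 2 verbatim.

```python
import torch

def mud_whiten(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Whiten a 2-D update by the inverse Cholesky factor of its Gram matrix (sketch)."""
    transpose = grad.size(0) > grad.size(1)
    G = (grad.T if transpose else grad).float()      # torch.linalg wants fp32/fp64
    A = G @ G.T                                      # Gram matrix, (m, m) with m <= n
    A.diagonal().add_(eps * A.diagonal().mean())     # small jitter so Cholesky never fails
    L = torch.linalg.cholesky(A)                     # lower-triangular factor
    # Single TRSM: solve L @ X = G, i.e. X = L^{-1} @ G, which makes X @ X.T ~ I
    X = torch.linalg.solve_triangular(L, G, upper=False)
    X = X.T if transpose else X
    return X.to(grad.dtype)
```

In this form, the one `solve_triangular` call is the TRSM whose H100 performance accounts for the step-time gap described below.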
Results
Convergence Curve (figure omitted)
vs. Muon SOTA (PR #180) (figure omitted)
Key Finding
MUD achieves strong convergence (1.1989 BPB in only 5,087 steps) but is
4.5x slower per step than Muon on H100s. The paper's throughput claims
(1.3-2.6x over Muon) were measured on A100/MI250/GH200;
`torch.linalg.solve_triangular` on H100 CUDA is not as well-optimized as GEMM on the Hopper architecture.
If MUD could match Muon's step time, extrapolating the convergence curve
suggests it could reach ~1.10 BPB in 20,000 steps.
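To check the GEMM-vs-TRSM gap on your own GPU, a throwaway micro-benchmark along these lines is enough. The matrix size, dtype, and iteration counts are arbitrary illustrative choices, not the shapes this run actually uses.

```python
import time
import torch

def bench_ms(fn, iters=50, warmup=5):
    # Crude CUDA timer: warm up, then average wall-clock time over `iters` calls.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

n = 4096
B = torch.randn(n, n, device="cuda")
M = torch.randn(n, n, device="cuda")
# Build a well-conditioned lower-triangular factor to solve against.
L = torch.linalg.cholesky(M @ M.T + n * torch.eye(n, device="cuda"))

gemm_ms = bench_ms(lambda: M @ B)
trsm_ms = bench_ms(lambda: torch.linalg.solve_triangular(L, B, upper=False))
print(f"GEMM: {gemm_ms:.2f} ms   TRSM: {trsm_ms:.2f} ms   ratio: {trsm_ms / gemm_ms:.1f}x")
```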
What changed
Only the optimizer:
`mud_whiten()` replaces `zeropower_via_newtonschulz5()`. Everything else
(architecture, quantization, training loop) is identical to SOTA by @thwu1 (PR #180).
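For orientation, here is how small the change is at the optimizer level: a minimal Muon-style optimizer with the Newton-Schulz call swapped for the `mud_whiten` sketch above. The hyperparameter defaults, momentum handling, and flattening of non-2-D parameters are generic assumptions about a Muon-like implementation, not this PR's exact code.

```python
import torch

class MudMuon(torch.optim.Optimizer):
    """Illustrative Muon-style optimizer with the preconditioner swapped for mud_whiten."""

    def __init__(self, params, lr=0.02, momentum=0.95, nesterov=True):
        super().__init__(params, dict(lr=lr, momentum=momentum, nesterov=nesterov))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                buf = state.setdefault("momentum_buffer", torch.zeros_like(p.grad))
                buf.mul_(group["momentum"]).add_(p.grad)          # heavy-ball momentum
                g = p.grad.add(buf, alpha=group["momentum"]) if group["nesterov"] else buf
                # Before (Muon, PR #180): g = zeropower_via_newtonschulz5(g, steps=5)
                g2d = g.reshape(g.size(0), -1)                    # flatten conv filters to 2-D
                g = mud_whiten(g2d).view_as(g)                    # MUD (this PR)
                p.add_(g, alpha=-group["lr"])
```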
References
- Southworth & Thomas (Mar 2026), arXiv:2603.17970: https://arxiv.org/abs/2603.17970
- Muon SOTA baseline: PR #180 by @thwu1