
Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970) #510

Open
SelfAnush wants to merge 4 commits into openai:main from SelfAnush:mud-optimizer-submission

Conversation

@SelfAnush

Summary

Replaces Muon's 5-step Newton-Schulz iteration with MUD's triangular Gram
preconditioning (Algorithm 2 from arxiv:2603.17970, Southworth & Thomas, Mar 2026).
Single seed (42) on 8xH100 SXM. Marked as non-record due to throughput issue
on H100s.
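For context, the routine this PR swaps out can be sketched as below. This is an illustrative reconstruction of Muon's 5-step Newton-Schulz orthogonalization using the widely circulated tuned quintic coefficients, not the exact code in this repo:

```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate orthogonalization of G via a quintic Newton-Schulz iteration.

    Sketch: pushes the singular values of G toward 1, so the result behaves
    like U @ V.T from the SVD of G. Costs a handful of GEMMs per step.
    """
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315  # tuned quintic coefficients
    X = G.float()
    transposed = G.size(0) > G.size(1)
    if transposed:                      # iterate on the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)           # bound the spectral norm by 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X
```

Every operation above is a dense matmul, which is exactly the workload Hopper's tensor cores are tuned for; that is the baseline the TRSM-based approach has to beat.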

Results

| Metric | Value |
| --- | --- |
| Final val_bpb | 1.1989 |
| Final val_loss | 2.0243 |
| Steps in 10 min | 5,087 |
| step_avg | 118 ms |
| Peak memory | 18,866 MiB |

Convergence Curve

| Step | val_bpb |
| --- | --- |
| 500 | 1.4604 |
| 1000 | 1.3649 |
| 2000 | 1.3191 |
| 3000 | 1.2647 |
| 4000 | 1.2291 |
| 5000 | 1.1945 |
| Final (post-quant) | 1.1989 |

vs. Muon SOTA (PR #180)

| Metric | Muon | MUD (this) |
| --- | --- | --- |
| step_avg | ~26 ms | 118 ms |
| Steps in 10 min | ~20,000 | 5,087 |
| Final val_bpb | 1.1428 | 1.1989 |

Key Finding

MUD achieves strong convergence (1.1989 BPB in only 5,087 steps) but is
4.5x slower per step than Muon on H100s. The paper's throughput claims
(1.3-2.6x over Muon) were measured on A100/MI250/GH200; on H100,
torch.linalg.solve_triangular (TRSM) is not nearly as well optimized as
GEMM is on the Hopper architecture. If MUD could match Muon's step time,
extrapolating the convergence curve suggests it could reach ~1.10 BPB in
20,000 steps.
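A rough way to reproduce the TRSM-vs-GEMM gap locally is the micro-benchmark below. It is a sketch, not the harness used for this PR; absolute timings and the ratio will vary by device and matrix size, and both calls are standard `torch.linalg` / `torch` APIs:

```python
import time
import torch

def bench(fn, *args, iters: int = 20) -> float:
    """Average wall-clock seconds per call, synchronizing if on CUDA."""
    fn(*args)  # warmup
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
n = 1024
A = torch.randn(n, n, device=device)
# Well-conditioned lower-triangular factor to feed the TRSM.
L = torch.linalg.cholesky(A @ A.T + n * torch.eye(n, device=device))
B = torch.randn(n, n, device=device)

gemm = bench(torch.matmul, L, B)
trsm = bench(lambda a, b: torch.linalg.solve_triangular(a, b, upper=False), L, B)
print(f"GEMM {gemm * 1e3:.2f} ms, TRSM {trsm * 1e3:.2f} ms, ratio {trsm / gemm:.1f}x")
```

Even though TRSM does roughly half the FLOPs of the equivalent GEMM, the sequential dependency along the triangular factor limits how much of the GPU it can keep busy.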

What changed

Only the optimizer: mud_whiten() replaces zeropower_via_newtonschulz5().
Everything else (architecture, quantization, training loop) is identical to
SOTA by @thwu1 (PR #180).
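The PR's actual `mud_whiten()` body isn't quoted here, so the following is a hypothetical sketch of what a triangular-Gram whitening step could look like: Cholesky-factor the Gram matrix, then apply a single triangular solve, which makes the rows of the update orthonormal. The function name matches the PR, but the implementation below is an assumption, not the submitted code:

```python
import torch

def mud_whiten(G: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Hypothetical triangular-Gram whitening (sketch, not the PR's code).

    Factors the Gram matrix A = X X^T as L L^T and returns L^{-1} X, whose
    rows are orthonormal: (L^{-1} X)(L^{-1} X)^T = L^{-1} A L^{-T} = I.
    One Cholesky plus one TRSM replaces Muon's five GEMM-heavy iterations.
    """
    assert G.ndim == 2
    transposed = G.size(0) > G.size(1)
    X = G.T if transposed else G        # work on the wide orientation
    A = X @ X.T
    # Small diagonal jitter so the Cholesky stays stable for rank-deficient X.
    A = A + eps * A.diagonal().mean() * torch.eye(
        A.size(0), device=A.device, dtype=A.dtype
    )
    Lf = torch.linalg.cholesky(A)       # lower-triangular factor
    Y = torch.linalg.solve_triangular(Lf, X, upper=False)  # the TRSM in question
    return Y.T if transposed else Y
```

Under this reading, the per-step FLOP savings come from trading repeated dense matmuls for one factorization and one solve; the H100 slowdown reported above would then be entirely a kernel-efficiency issue in the solve, not an algorithmic one.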

References

… slow TRSM on H100

Non-record: MUD optimizer (arxiv:2603.17970)
Replaces Muon's 5-step Newton-Schulz with MUD's triangular Gram
preconditioning. Single seed (42) on 8xH100 SXM.
Results:
- val_bpb: 1.1989 (sliding window eval, stride=64)
- Steps: 5,087 in 10 min
- step_avg: 118ms (4.5x slower than Muon's ~26ms on H100)
Key finding: Strong convergence (within 0.056 BPB of SOTA with 4x
fewer steps) but TRSM overhead on H100 CUDA negates the 12x FLOP
savings reported in the paper (tested on A100/MI250/GH200).
Built on SOTA by @thwu1 (PR openai#180).
Paper: https://arxiv.org/abs/2603.17970
@MatoTeziTanka

Community Review — Non-record: MUD optimizer — triangular Gram preconditioning (arxiv:2603.17970)

BPB: 1.1989 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 530f77f43048, file records/track_non_record_16mb/2026-03-22_MUD_Int5MLP_BigramHash_SWA/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.09s, dim=512, layers=10, vocab=1024, code=53640 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
