Reformulating gradient descent: Muon is NOT provably optimal in general
Muon solves argmax tr(G^T X) s.t. ||X||₂ ≤ 1. Optimal for THAT sub-problem,
but the sub-problem may be wrong:
- Gradient is noisy (maximizing alignment with G also aligns the update with the noise)
- Spectral norm constraint is arbitrary (a rank-k constraint could be better)
- No curvature info (high-curvature dirs should get SMALL updates)
- Layers treated independently (but they're coupled)
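
For reference, the sub-problem above has a closed-form solution: if G = U S V^T is the thin SVD, the maximizer is X = U V^T and the objective equals the nuclear norm of G. The sketch below checks this numerically. It uses an exact SVD as a stand-in for the Newton-Schulz iteration (NS5), and the function name `spectral_lmo` and the test matrix are illustrative assumptions, not code from this repo.

```python
import numpy as np

def spectral_lmo(G):
    """Maximizer of tr(G^T X) subject to ||X||_2 <= 1 (hypothetical helper).

    Uses an exact thin SVD; Muon approximates the same U V^T factor iteratively.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))            # stand-in for one layer's gradient
X = spectral_lmo(G)

objective = np.trace(G.T @ X)              # value attained by X
nuclear = np.linalg.svd(G, compute_uv=False).sum()  # sum of singular values of G
spec_norm = np.linalg.norm(X, 2)           # largest singular value of X
```

The objective matching the nuclear norm is what "optimal for that sub-problem" means here; it says nothing about whether the sub-problem itself is the right one.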
Two novel reformulations:
1. Low-rank update (rank-32 via power iteration): filters noise,
60× cheaper per step than NS5 if signal is in top-32 directions
2. Spectral momentum: EMA of G^TG covariance → denoised whitening
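
A minimal sketch of one possible reading of the two reformulations, not the code being screened. For (1), block power (subspace) iteration finds the top-k singular subspaces and returns U_k V_k^T. For (2), an EMA of G^T G is used to right-whiten the gradient as G (C + eps I)^(-1/2). The function names, `iters`, `beta`, `eps`, and the matrix shapes are all assumptions made for illustration.

```python
import numpy as np

def low_rank_update(G, k=32, iters=2, seed=0):
    """Rank-k orthogonalized update U_k V_k^T via block power iteration.

    Assumes k <= min(G.shape). Hypothetical helper, not the repo's implementation.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((G.shape[1], k)))  # n x k start basis
    for _ in range(iters):
        P, _ = np.linalg.qr(G @ Q)       # m x k basis for the left subspace
        Q, _ = np.linalg.qr(G.T @ P)     # n x k basis for the right subspace
    # Align the two bases with a small k x k SVD of the projected gradient.
    Ub, _, Vbt = np.linalg.svd(P.T @ G @ Q)
    return (P @ Ub) @ (Q @ Vbt.T).T

def spectral_momentum_step(C, G, beta=0.9, eps=1e-8):
    """Update the EMA of G^T G and return (new C, whitened gradient).

    Whitening uses the symmetric inverse square root of C + eps * I.
    No EMA bias correction is applied in this sketch.
    """
    C = beta * C + (1.0 - beta) * (G.T @ G)
    w, V = np.linalg.eigh(C)                          # C is symmetric PSD
    inv_sqrt = (V / np.sqrt(np.maximum(w, 0.0) + eps)) @ V.T
    return C, G @ inv_sqrt

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 48))        # stand-in for one layer's gradient
X_lr = low_rank_update(G, k=32)          # rank-32 update, singular values all 1
C = np.zeros((48, 48))
C, X_sm = spectral_momentum_step(C, G)   # one spectral-momentum step
```

With `beta=0` the whitening step reduces to G (G^T G)^(-1/2), which is the same U V^T factor Muon targets, so the EMA is the only thing that distinguishes this reading from plain orthogonalization.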
Running 200-step screen.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>