Commit 76b53c1

yuyeon and claude committed
Reformulating gradient descent: Muon is NOT provably optimal in general
Muon solves argmax tr(G^TX) s.t. ||X||₂≤1. Optimal for THAT sub-problem, but the sub-problem may be wrong:
- Gradient is noisy (correlation with noise isn't ideal)
- Spectral norm constraint is arbitrary (rank-k could be better)
- No curvature info (high-curvature dirs should get SMALL updates)
- Layers treated independently (but they're coupled)

Two novel reformulations:
1. Low-rank update (rank-32 via power iteration): filters noise, 60× cheaper per step than NS5 if signal is in top-32 directions
2. Spectral momentum: EMA of G^TG covariance → denoised whitening

Running 200-step screen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 3cbb648 commit 76b53c1

2 files changed

Lines changed: 1668 additions & 7 deletions

docs/novel_optimizers.md

Lines changed: 37 additions & 7 deletions
@@ -96,10 +96,40 @@ Any alternative must be expressible as a SHORT SEQUENCE OF MATRIX
 MULTIPLICATIONS to compete with NS5 on GPU. QR, SVD, Cholesky all
 use sequential operations that GPUs handle poorly.
 
-### What COULD beat Muon (theoretical)
-
-Information sources beyond the current gradient:
-1. Gradient HISTORY (momentum does this, but in element space not spectral space)
-2. Loss landscape curvature (2nd-order info: too expensive)
-3. Cross-layer gradient correlations (unexplored, could help)
-4. Data distribution statistics (batch-level information)
+### Correction: Muon is NOT "provably optimal" in general
+
+Muon solves: argmax tr(G^T X) s.t. ||X||₂ ≤ 1, whose solution is X = UV^T for G = UΣV^T.
+This is optimal for THAT specific objective. But:
+
+1. **The gradient is noisy.** Maximizing correlation with a noisy signal isn't ideal.
+A denoised gradient (running average of spectral structure) might be better.
+2. **The spectral norm constraint is arbitrary.** Why not rank-k? Nuclear norm?
+A rank-k update filters noise: if the gradient signal is in the top-k directions,
+the rest is noise, which Muon scales up to the same unit singular value as the signal.
+3. **Layers are coupled.** Muon treats each layer independently. The optimal update
+for W₁ depends on what update W₂ gets.
+4. **No curvature information.** Muon gives every singular direction the same step size,
+but high-curvature directions should get SMALL updates (Newton direction).
+
+### Reformulating gradient descent: alternative axioms
+
+Muon's axioms: update should be (1) orthogonal, (2) max-correlated with the single-step G, (3) memoryless in spectral space.
+Changing any axiom gives a different optimizer:
+
+**Change axiom 1 (orthogonal → low-rank):**
+Rank-k update via power iteration: U_k V_k^T from the top-k singular vectors.
+Cost: O(k × mn), potentially 60× cheaper than NS5 for k=32.
+Hypothesis: the top-32 singular directions capture the gradient signal; the rest is noise.
+
+**Change axiom 2 (single-step → denoised):**
+Spectral momentum: maintain a running EMA of the G^TG covariance, C_ema.
+Whiten the gradient using the averaged covariance instead of the single-step one.
+This gives G @ C_ema^{-1/2} ≈ denoised UV^T.
+
+**Change axiom 3 (memoryless → history-aware):**
+Track the persistent subspace across steps. Update only in directions
+that are consistently important (appear in the gradient's top-k for multiple steps).
+
+### v3 experiments (running)
+1. **Low-rank (k=32):** Replace NS5 with 3-step power iteration for top-32 SVD.
+2. **Spectral momentum:** EMA of G^TG → NS5 on averaged covariance → whiten G.
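
The three matrix operations discussed in this hunk can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the repository's code: the function names (`orthogonal_target`, `rank_k_update`, `spectral_momentum_step`) and the parameters `n_iter`, `seed`, `beta` and `eps` are hypothetical. Exact SVD, QR and `eigh` stand in for NS5 purely for readability, even though the document notes that such factorizations are a poor fit for GPUs.

```python
import numpy as np


def orthogonal_target(G):
    """Maximizer of tr(G^T X) subject to ||X||_2 <= 1, i.e. U V^T (exact SVD)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def rank_k_update(G, k=32, n_iter=3, seed=0):
    """Approximate U_k V_k^T with block power (subspace) iteration.

    Assumes k <= min(G.shape). QR is used here only to keep the basis
    orthonormal; it is an illustrative choice, not the commit's method.
    """
    rng = np.random.default_rng(seed)
    n = G.shape[1]
    # Random orthonormal start for the right singular subspace (n x k).
    Q = np.linalg.qr(rng.standard_normal((n, k)))[0]
    for _ in range(n_iter):
        # One power step on G^T G, then re-orthonormalize.
        Q = np.linalg.qr(G.T @ (G @ Q))[0]
    # Small SVD of the m x k projection recovers the singular vectors.
    Ub, _, Wt = np.linalg.svd(G @ Q, full_matrices=False)
    V = Q @ Wt.T  # approximate top-k right singular vectors (n x k)
    return Ub @ V.T


def spectral_momentum_step(G, C_ema, beta=0.9, eps=1e-6):
    """Update the EMA of G^T G and whiten G with it: G @ C_ema^{-1/2}."""
    C_ema = beta * C_ema + (1.0 - beta) * (G.T @ G)
    w, V = np.linalg.eigh(C_ema)  # symmetric eigendecomposition
    # eps guards against division by (near-)zero eigenvalues.
    inv_sqrt = (V / np.sqrt(np.maximum(w, 0.0) + eps)) @ V.T
    return G @ inv_sqrt, C_ema
```

With `beta=0` and a zero initial `C_ema`, `spectral_momentum_step` reduces to single-step whitening and returns approximately the same matrix as `orthogonal_target`, which makes a convenient sanity check. With `beta > 0` the whitening uses covariance averaged over steps, which is the denoising idea described under axiom 2.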