research fire openai#10: Patch 18 USE_MUONEQ_R — row-only normalization for Muon optimizer

Takoda Mundy · claude · Takoda Mundy · commit c0c22da1239a · 2026-04-08T02:22:42.000+10:00
From arxiv:2603.28254 "MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration" (Mar 2026). Used in 40+ openai/parameter-golf PRs, top record PR openai#1260 = val_bpb 1.0929 (3-seed mean). Inserts row normalization between Patch 17 Mousse block and Newton-Schulz: row_norm[i] = sqrt(sum_j G[i,j]^2) G[i,j] = G[i,j] / row_norm[i] Distinct from Mousse: Mousse is row+col (G/||row||/||col||), MuonEq-R is row-only (G/||row||). They can stack independently. Gated by USE_MUONEQ_R=1, falls back gracefully when unset. 4 MR experiments queued for validation: MR0_alone, MR1_plus_leaky_ng, MR2_seed42, MR3_mousse_plus_muoneqr This is the second optimizer-side patch in two fires. Both patches fit our train_loss metric so they can validate on cheap GPU loop without H100 escalation. If either lands within champion noise band 3.27-3.30, defensible ship for final stack. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/RESEARCH_LOG.md b/RESEARCH_LOG.md
@@ -809,3 +809,61 @@ This is why I overrode the subagent's cautious PASS. **First proper shippable pa
 ### Validation plan
 
 Loop will pick up the new patch on next git pull (~5 min). MS family experiments will run within the next 2 hours via the runner cycle. Check on next monitor fire (~16:00 UTC) to see if MS1/MS2/MS3 land below 3.30 (within champion range) — if YES, Mousse is validated for H100 escalation bundle. If NO, we have evidence that even the simplified Mousse doesn't help at our 22M scale (a useful negative result either way).
+
+---
+
+## Research Fire #10 — 2026-04-08 (cron min :16, Track A) — Patch 18 USE_MUONEQ_R SHIPPED
+
+**Subject**: Continue the optimizer-side vector after Patch 17 USE_MOUSSE success. Investigate "MuonEq-R" referenced in PR #1423 (1.0791 BPB) and many other top open submissions but never extracted.
+
+### Subagent finding
+
+**MuonEq-R = row-only normalization before Newton-Schulz**. From arxiv:2603.28254 "MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration" (Mar 30, 2026). Used in **40+ openai/parameter-golf PRs**, top record PR #1260 at val_bpb 1.0929 (3-seed mean).
+
+**Formula**:
+```
+row_norm[i] = sqrt(sum_j G[i,j]^2)        # L2 norm of row i
+G_normalized[i,j] = G[i,j] / row_norm[i]  # divide each row by its norm
+```
+Then standard Newton-Schulz on G_normalized. Each row of the result has unit L2 norm.
+
+**Distinct from Patch 17 Mousse**: Mousse is row+col preconditioning (`G/(||row||*||col||)`), MuonEq-R is row-only (`G/||row||`). They are mathematically different and can stack independently. PR #1440 stacks both: Mousse first, then MuonEq-R, then NS5.
+
+### Why I shipped this fire (no override needed — subagent agreed)
+
+1. **Optimizer-side → fits our train_loss metric** (same reasoning as Mousse). We can validate on the cheap-GPU loop within ONE cycle after the runner pulls.
+2. **5 LOC implementation** — same anchor strategy as Patch 17, contained inside the Muon optimizer step body.
+3. **40+ PRs use it** — the highest-confidence port we've found in any research fire. PR #1260 specifically attributes +0.001 BPB to MuonEq-R alone.
+4. **Stacks with Mousse** — we can run them independently, together, or against each other. Four experiments queued.
+5. **Same risk profile as Patch 17** — gated, contained, falls back gracefully.
+
+### Patch 18 USE_MUONEQ_R — code shipped this fire
+
+Inserted between the Mousse block (Patch 17) and the Newton-Schulz call:
+```python
+                    # MUONEQ_R_MARKER: optional row-only normalization (arxiv:2603.28254)
+                    if int(os.environ.get("USE_MUONEQ_R", "0")):
+                        _row_norm = g.norm(dim=-1, keepdim=True).clamp(min=1e-8)
+                        g = g / _row_norm
+                    g = zeropower_via_newtonschulz5(g, steps=backend_steps)
+```
+
+5 lines of actual code. Marker MUONEQ_R_MARKER. Anchored on the same `g = zeropower_via_newtonschulz5(g, steps=backend_steps)` line that Patch 17 ended its block with — string replacement finds the line at the end of Patch 17's block and inserts the MuonEq-R block before it.
+
+### Experiments queued (4 added → queue is now 28)
+
+- **MR0_muoneqr_alone** — pure MuonEq-R without n-gram bias, isolation test
+- **MR1_muoneqr_plus_leaky_ng** — MuonEq-R + leaky_relu + L5 weights (champion config)
+- **MR2_muoneqr_seed42** — multi-seed validation of MR1
+- **MR3_mousse_plus_muoneqr** — STACK BOTH Mousse + MuonEq-R + leaky_relu + L5 weights — measures the additive value vs either alone
+
+The MR3 stacked experiment is the most interesting — if it lands below MS1 (Mousse alone) AND below MR1 (MuonEq-R alone), then the two patches genuinely stack at our scale.
+
+### Two optimizer-side patches in flight
+
+Total optimizer-side experiments now in queue:
+- 4 MS experiments (Patch 17 Mousse) — currently MS1 in flight, MS2/MS3 next
+- 4 MR experiments (Patch 18 MuonEq-R) — will fire after MS family completes
+- 8 experiments × 5 min = 40 min until full validation data
+
+This is the **first time in the autonomous loop that we have two genuinely novel optimizer patches running back-to-back validation**. If either lands within champion noise (3.27-3.30), we have a defensible H100 escalation candidate. If both fail, we've efficiently falsified two paths in <1 hour.
diff --git a/runpod_tests/chore/08_patch_train_gpt.sh b/runpod_tests/chore/08_patch_train_gpt.sh
@@ -1088,6 +1088,38 @@ else:
         content = content.replace(old_loop, new_loop)
         print("  ✓ added WAVELET calls in GPT.forward")
 
+# Patch 18: USE_MUONEQ_R=1 → row-only normalization before Newton-Schulz orthogonalization.
+# From arxiv:2603.28254 "MuonEq: Balancing Before Orthogonalization with Lightweight
+# Equilibration" (Mar 2026). Used in 40+ openai/parameter-golf PRs, top record PR #1260
+# at val_bpb 1.0929 (3-seed mean).
+#
+# Mathematical formulation: for each Muon-managed weight matrix G (after momentum),
+#   row_norm[i] = sqrt(sum_j G[i,j]^2)
+#   G_normalized[i,j] = G[i,j] / row_norm[i]
+# After this, every row has unit L2 norm. Then standard Newton-Schulz.
+#
+# Distinct from Patch 17 Mousse: Mousse is row+col preconditioning (G/(||row||*||col||)),
+# MuonEq-R is row-only (G/||row||). They can stack: Mousse first, then MuonEq-R, then NS5.
+# PR #1440 uses both stacked. Implementation: 5 LOC, same anchor strategy as Patch 17 — but
+# anchored AFTER the Mousse block since Patch 17 runs first in this script.
+#
+# Idempotent via MUONEQ_R_MARKER. Anchored on the same Newton-Schulz call line which is
+# still present after Patch 17 (Patch 17's new_ns ends with that line).
+if "MUONEQ_R_MARKER" in content:
+    print("  ✓ MuonEq-R already applied")
+else:
+    old_ns_eqr = """                    g = zeropower_via_newtonschulz5(g, steps=backend_steps)"""
+    new_ns_eqr = """                    # MUONEQ_R_MARKER: optional row-only normalization (arxiv:2603.28254)
+                    if int(os.environ.get("USE_MUONEQ_R", "0")):
+                        _row_norm = g.norm(dim=-1, keepdim=True).clamp(min=1e-8)
+                        g = g / _row_norm
+                    g = zeropower_via_newtonschulz5(g, steps=backend_steps)"""
+    if old_ns_eqr in content:
+        content = content.replace(old_ns_eqr, new_ns_eqr)
+        print("  ✓ added MUONEQ_R row normalization")
+    else:
+        print("  ✗ MUONEQ_R anchor not found — skipping (MuonEq-R will be no-op)")
+
 # Patch 17: USE_MOUSSE=1 → diagonal Kronecker preconditioning before Newton-Schulz
 # orthogonalization in the Muon optimizer step.
 #
diff --git a/runpod_tests/loop/experiments.json b/runpod_tests/loop/experiments.json
@@ -28,5 +28,10 @@
   {"name": "MS0_mousse_alone", "USE_MOUSSE": "1", "USE_NGRAM_BIAS": "0", "MAX_WALLCLOCK_SECONDS": "300"},
   {"name": "MS1_mousse_plus_leaky_ng", "USE_MOUSSE": "1", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
   {"name": "MS2_mousse_seed42", "USE_MOUSSE": "1", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "SEED": "42", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
-  {"name": "MS3_mousse_plus_engram", "USE_MOUSSE": "1", "USE_ENGRAM_LITE": "1", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "NGRAM_W_BIGRAM": "0.25", "NGRAM_W_TRIGRAM": "0.25", "NGRAM_W_FOURGRAM": "0.20", "MAX_WALLCLOCK_SECONDS": "300"}
+  {"name": "MS3_mousse_plus_engram", "USE_MOUSSE": "1", "USE_ENGRAM_LITE": "1", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "NGRAM_W_BIGRAM": "0.25", "NGRAM_W_TRIGRAM": "0.25", "NGRAM_W_FOURGRAM": "0.20", "MAX_WALLCLOCK_SECONDS": "300"},
+
+  {"name": "MR0_muoneqr_alone", "USE_MUONEQ_R": "1", "USE_NGRAM_BIAS": "0", "MAX_WALLCLOCK_SECONDS": "300"},
+  {"name": "MR1_muoneqr_plus_leaky_ng", "USE_MUONEQ_R": "1", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
+  {"name": "MR2_muoneqr_seed42", "USE_MUONEQ_R": "1", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "SEED": "42", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
+  {"name": "MR3_mousse_plus_muoneqr", "USE_MOUSSE": "1", "USE_MUONEQ_R": "1", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"}
 ]