research fire openai#10: Patch 18 USE_MUONEQ_R — row-only normalization for Muon optimizer
From arxiv:2603.28254 "MuonEq: Balancing Before Orthogonalization with
Lightweight Equilibration" (Mar 2026). Used in 40+ openai/parameter-golf
PRs, top record PR openai#1260 = val_bpb 1.0929 (3-seed mean).
Inserts row normalization between Patch 17 Mousse block and Newton-Schulz:
row_norm[i] = sqrt(sum_j G[i,j]^2)
G[i,j] = G[i,j] / row_norm[i]
Distinct from Mousse: Mousse is row+col (G/||row||/||col||), MuonEq-R is
row-only (G/||row||). They can stack independently. Gated by USE_MUONEQ_R=1,
falls back gracefully when unset.
4 MR experiments queued for validation:
MR0_alone, MR1_plus_leaky_ng, MR2_seed42, MR3_mousse_plus_muoneqr
This is the second optimizer-side patch in two fires. Both patches fit our
train_loss metric so they can validate on cheap GPU loop without H100
escalation. If either lands within champion noise band 3.27-3.30, defensible
ship for final stack.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### Validation plan
Loop will pick up the new patch on next git pull (~5 min). MS family experiments will run within the next 2 hours via the runner cycle. Check on next monitor fire (~16:00 UTC) to see if MS1/MS2/MS3 land below 3.30 (within champion range) — if YES, Mousse is validated for H100 escalation bundle. If NO, we have evidence that even the simplified Mousse doesn't help at our 22M scale (a useful negative result either way).

---

## Research Fire #10 — 2026-04-08 (cron min :16, Track A) — Patch 18 USE_MUONEQ_R SHIPPED
**Subject**: Continue the optimizer-side vector after Patch 17 USE_MOUSSE success. Investigate "MuonEq-R" referenced in PR #1423 (1.0791 BPB) and many other top open submissions but never extracted.
### Subagent finding
**MuonEq-R = row-only normalization before Newton-Schulz**. From arxiv:2603.28254 "MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration" (Mar 30, 2026). Used in **40+ openai/parameter-golf PRs**, top record PR #1260 at val_bpb 1.0929 (3-seed mean).
**Formula**:
```
row_norm[i] = sqrt(sum_j G[i,j]^2)         # L2 norm of row i
G_normalized[i,j] = G[i,j] / row_norm[i]   # divide each row by its norm
```

Then standard Newton-Schulz on G_normalized. Each row of the result has unit L2 norm.
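
As a sanity check of the formula, a minimal numpy sketch (numpy rather than the training code's framework, purely to illustrate the math; variable names match the formula above):

```python
import numpy as np

np.random.seed(0)
G = np.random.randn(4, 8)  # stand-in for a 2-D gradient/update matrix

# row_norm[i] = sqrt(sum_j G[i,j]^2)
row_norm = np.sqrt((G ** 2).sum(axis=1, keepdims=True))

# G_normalized[i,j] = G[i,j] / row_norm[i]
G_normalized = G / row_norm

# Each row of G_normalized now has unit L2 norm.
print(np.linalg.norm(G_normalized, axis=1))  # each entry ≈ 1.0
```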
**Distinct from Patch 17 Mousse**: Mousse is row+col preconditioning (`G/(||row||*||col||)`), MuonEq-R is row-only (`G/||row||`). They are mathematically different and can stack independently. PR #1440 stacks both: Mousse first, then MuonEq-R, then NS5.
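To make the row+col vs row-only distinction concrete, a numpy sketch under one plausible reading of the Mousse formula above (both norms computed on the original G; the `mousse` and `muoneq_r` helper names are illustrative, not the repo's):

```python
import numpy as np

def mousse(G, eps=1e-8):
    # Row+col preconditioning: G / (||row|| * ||col||), per the log's description.
    row = np.sqrt((G ** 2).sum(axis=1, keepdims=True))
    col = np.sqrt((G ** 2).sum(axis=0, keepdims=True))
    return G / (row * col + eps)

def muoneq_r(G, eps=1e-8):
    # Row-only equilibration: G / ||row||.
    row = np.sqrt((G ** 2).sum(axis=1, keepdims=True))
    return G / (row + eps)

np.random.seed(0)
G = np.random.randn(4, 8)

# PR #1440 ordering: Mousse first, then MuonEq-R (then NS5, omitted here).
stacked = muoneq_r(mousse(G))
```

Because MuonEq-R runs last, the stacked result still has unit-norm rows, while the two transforms on their own produce different matrices.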
### Why I shipped this fire (no override needed — subagent agreed)

1. **Optimizer-side → fits our train_loss metric** (same reasoning as Mousse). We can validate on the cheap-GPU loop within ONE cycle after the runner pulls.
2. **5 LOC implementation** — same anchor strategy as Patch 17, contained inside the Muon optimizer step body.
3. **40+ PRs use it** — the highest-confidence port we've found in any research fire. PR #1260 specifically attributes +0.001 BPB to MuonEq-R alone.
4. **Stacks with Mousse** — we can run them independently, together, or against each other. Four experiments queued.
5. **Same risk profile as Patch 17** — gated, contained, falls back gracefully.

### Patch 18 USE_MUONEQ_R — code shipped this fire
Inserted between the Mousse block (Patch 17) and the Newton-Schulz call:
```
# … five MuonEq-R lines elided in this excerpt …
g = zeropower_via_newtonschulz5(g, steps=backend_steps)
```

5 lines of actual code. Marker MUONEQ_R_MARKER. Anchored on the same `g = zeropower_via_newtonschulz5(g, steps=backend_steps)` line that Patch 17 ended its block with — string replacement finds the line at the end of Patch 17's block and inserts the MuonEq-R block before it.
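The anchor-and-insert mechanism described above can be sketched as plain string replacement. The block contents below are an assumption (the real 5 lines are not reproduced in this excerpt), and real optimizer code would need the anchor's indentation preserved:

```python
# The Newton-Schulz call line that Patch 17's block ends on.
ANCHOR = "g = zeropower_via_newtonschulz5(g, steps=backend_steps)"

# Hypothetical shape of the gated 5-line block; the shipped contents differ.
MUONEQ_R_BLOCK = """\
if os.environ.get("USE_MUONEQ_R") == "1":  # MUONEQ_R_MARKER
    _row_norm = g.norm(dim=1, keepdim=True)
    g = g / (_row_norm + 1e-8)
"""

def apply_patch(source: str) -> str:
    # Find the anchor line and insert the MuonEq-R block before it.
    if ANCHOR not in source:
        raise ValueError("anchor line not found; is Patch 17 applied?")
    return source.replace(ANCHOR, MUONEQ_R_BLOCK + ANCHOR, 1)
```

The gate means an unset `USE_MUONEQ_R` leaves the update path byte-for-byte identical to Patch 17's, which is the graceful fallback the commit message claims.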
### Experiments queued (4 added → queue is now 28)
- **MR0_muoneqr_alone** — pure MuonEq-R without n-gram bias, isolation test
- **MR1_plus_leaky_ng** — MuonEq-R plus the leaky_relu n-gram bias
- **MR2_muoneqr_seed42** — multi-seed validation of MR1
- **MR3_mousse_plus_muoneqr** — STACK BOTH: Mousse + MuonEq-R + leaky_relu + L5 weights — measures the additive value vs either alone

The MR3 stacked experiment is the most interesting — if it lands below MS1 (Mousse alone) AND below MR1 (MuonEq-R alone), then the two patches genuinely stack at our scale.
### Two optimizer-side patches in flight
Total optimizer-side experiments now in queue:
- 4 MS experiments (Patch 17 Mousse) — currently MS1 in flight, MS2/MS3 next
- 4 MR experiments (Patch 18 MuonEq-R) — will fire after MS family completes
- 8 experiments × 5 min = 40 min until full validation data
This is the **first time in the autonomous loop that we have two genuinely novel optimizer patches running back-to-back validation**. If either lands within champion noise (3.27-3.30), we have a defensible H100 escalation candidate. If both fail, we've efficiently falsified two paths in <1 hour.