research(2026-04-15): SOTA Day 6 no change; Newton-Muon added; In-Place TTT clarified

claude · claude · commit da1c43051802 · 2026-04-15T17:18:01.000Z
- logs/daily_research.md: Apr 15 entry: merged SOTA 1.0810 Day 6 plateau confirmed (last upstream commit Apr 9); no new PRs or merges detected; Newton-Muon (arXiv:2604.01472) added as new tracked technique (+6% effective steps, drop-in Muon swap); In-Place TTT (arXiv:2604.06169) NTP-aligned loss distinguished from Session 3 reconstruction-loss failure
- CLAUDE.md: Newton-Muon upgraded in technique table with quantified impact (~+288 steps at our scale); Session 14 lessons added (lessons 74–77)

https://claude.ai/code/session_01QEyiKYb75JrpMzgvuGYK5V
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -203,7 +203,7 @@ torchrun --standalone --nproc_per_node=8 train_gpt.py
 | **Cooldown+QAT fusion (arXiv:2509.22935)** | **~-0.002** | **WATCH — LR decay jointly with QAT; no artifact size change** |
 | **LaCT large-chunk TTT (arXiv:2505.23884)** | GPU util 0→70% | WATCH — PR #1560 Doc-TTT may be LaCT-style; dexhunter already implementing |
 | **SGT sparse depth recurrence (arXiv:2603.23998)** | saves FLOP budget | Watch — reduces Triple Loop FLOP overhead |
-| Newton-Muon (arXiv:2604.01472) | ~+4-6% steps | WATCH — Apr 2026, untested |
+| **Newton-Muon (arXiv:2604.01472)** | **~+6% steps (~+288 steps at our scale, ~-0.001 bpb)** | **WATCH — Apr 2, 2026; right-preconditioning via input second moment; 6% fewer iterations and 4% less wall-clock than Muon on nanoGPT speedrun; drop-in Muon swap. Verify additive with MuonEq-R before GPU spend.** |
 | MUD/MomentUm Decorrelation (arXiv:2603.17970) | +20-50% throughput | WATCH — triangular Cholesky whitening; 1.3–2.6× tokens/sec vs Muon |
 | Mousse (arXiv:2603.09697) | ~-0.002 to -0.003 | WATCH — Kronecker-factored preconditioning for Muon; ~12% fewer steps |
 | Infini-gram interpolation (arXiv:2401.17377) | large but legal unclear | WATCH — suffix array ∞-gram, normalized |
@@ -355,3 +355,11 @@ _Updated: 2026-04-13 (v12.2 — merged SOTA 1.0810 confirmed; PR #758 dead; GDN-
 73. **16 days remain. Implement PR #1586 (per-layer GPTQ + int7 emb) before any other change.** -0.01266 bpb, verified 3-seed, zero legality risk. This is the single fastest path to beating the merged SOTA. Do not wait for Casefold or Hedge Mixer rulings before running the GPU experiment.
 
 _Updated: 2026-04-14 (v12.3 — merged SOTA 1.0810 Day 5 no change; PR #1610 PhasingTTT legal but low EV; PRISM arXiv:2602.10796 relevant to recurrence design; Ouroboros arXiv:2604.02051 watch; 16 days remaining)_
+
+### Session 14 (2026-04-15)
+74. **Merged SOTA 1.0810 enters Day 6 plateau — longest since competition acceleration.** `git log upstream/main` confirms last commit was Apr 9 15:22 PDT. No new records or merges detected via git or web search. Eight open PRs remain in range. Expect imminent merge wave.
+75. **Newton-Muon (arXiv:2604.01472, Apr 2) is a drop-in Muon swap worth testing.** Right-preconditioning by input second moment reaches the target loss in 6% fewer iterations and 4% less wall-clock than standard Muon on the nanoGPT speedrun benchmark. At our ~4800-step budget, 6% ≈ +288 effective steps, a small free bpb gain. It was previously listed in the technique table only as untested (~+4-6% steps); the entry is upgraded today with quantified impact. Verify it is additive with MuonEq-R (our current optimizer) before spending GPU; they may be redundant.
+76. **In-Place TTT (arXiv:2604.06169) is NOT the same as Session 3's failed attempt.** Session 3 used reconstruction loss on MLP output projections and saw loss blow up (2.63+). The Apr 7 paper uses an NTP-aligned loss, which is theoretically grounded for autoregressive LM. The "HARMFUL" lesson (#13) should not prevent trying In-Place TTT with NTP-aligned loss on a modern base. Low priority now; revisit after PR #1586 + VarLen+Doc-TTT are confirmed.
+77. **No new open PRs filed Apr 14–15 with competitive scores.** Web search and git log show nothing new. PR #1619 (likely illegal AdamW TTT) and PR #1616 (QK-Gain 5.5) are low-interest. The competitive field is in a holding pattern — same 8 PRs as yesterday.
+
+_Updated: 2026-04-15 (v12.4 — merged SOTA 1.0810 Day 6 no change; Newton-Muon arXiv:2604.01472 added (+6% effective steps, verify vs MuonEq-R); In-Place TTT (2604.06169) NTP-aligned loss distinguishes it from Session 3 failure; 15 days remaining)_
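
A quick check of the step arithmetic in lesson 75, as a minimal sketch rather than anything taken from the paper or the repo. The function name and the two readings of "6% fewer iterations" are assumptions made here for illustration. The linear reading multiplies the budget by 6%, which gives the +288 quoted above. The equivalent-steps reading asks how many standard Muon steps would match the full budget if every run needs 6% fewer iterations to reach the same loss.

```python
def extra_effective_steps(budget_steps: int, fewer_steps_frac: float) -> tuple[float, float]:
    """Return (linear, equivalent) estimates of extra effective steps.

    budget_steps: training steps available (~4800 in the notes above).
    fewer_steps_frac: fraction of iterations saved vs the baseline (0.06 here).
    Both readings are illustrative assumptions, not results from the paper.
    """
    # Linear reading: the saved fraction applied directly to the budget.
    linear = budget_steps * fewer_steps_frac
    # Equivalent-steps reading: N steps of the faster optimizer correspond to
    # N / (1 - frac) steps of the baseline, so the gain is the difference.
    equivalent = budget_steps / (1.0 - fewer_steps_frac) - budget_steps
    return linear, equivalent


linear, equivalent = extra_effective_steps(4800, 0.06)
print(f"linear: +{linear:.0f} steps, equivalent: +{equivalent:.0f} steps")
```

Both readings land near +300 steps on a ~4800-step budget, so the choice between them does not change the conclusion that the gain is small.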
diff --git a/logs/daily_research.md b/logs/daily_research.md
@@ -1,3 +1,94 @@
+# Parameter Golf Daily Research - 2026-04-15
+
+## PR #771 STATUS: CLOSED (REJECTED) — no change
+
+---
+
+## N-GRAM PR STATUS
+
+| PR | Score | Status | Notes |
+|----|-------|--------|-------|
+| #727 | 0.9674 | **CLOSED** (illegal) | Hashed n-gram cache — no change |
+| #741 | 0.9850 | **CLOSED** (illegal) | No change |
+| #758 | 1.0465 | **OPEN** (dead) | No new activity. XOR hash includes target token; effectively dead. |
+| #731 | 1.0400 | **OPEN** | Dense-count tables + Laplace smoothing. Awaiting seeds 1337+2024. No update. |
+
+---
+
+## Leaderboard
+
+**Merged SOTA: 1.0810 (bigbag, PR #1493) — DAY 6 UNCHANGED.**
+
+Last upstream commit: `75700cb 2026-04-09 15:22 PDT` (PR #1511, leaderboard README). Zero new records since Apr 9.
+
+This is the longest plateau since the Apr 5–9 acceleration wave (4 records in 4 days). Either the field is stuck, or a wave of PRs is being prepared for an end-of-month push. **15 days to deadline.**
+
+Best open PRs (no changes from Apr 14):
+
+| PR | Score | Author | Technique | Legal? |
+|----|-------|--------|-----------|--------|
+| #1585 | **1.0639** | codemath3000 | Casefold Tokenizer + Parallel Residuals + Systems Opt | **AWAIT RULING** |
+| #1578 | **1.0668** | mikeapedia | Casefold BPE retrain | **AWAIT RULING** |
+| #1560 | **1.07406** | dexhunter | VarLen Attention + Doc-TTT | **YES** |
+| #1586 | **1.07493** | dexhunter | Per-Layer Adaptive GPTQ + int7 Emb + MLR=0.026 | **YES** |
+| #1584 | **1.0752** | codemath3000 | Systems Opt (fused Muon + batched EMA + loader prealloc) | **YES** |
+| #1555 | **1.07636** | andrewbaggio1 | TMA Megakernel + Tap-In (min_match=1) | Tap-In unconfirmed |
+| #1541 | **1.07785** | bigbag | Improved Parallel Residuals + Muon 0.97 | ⚠️ hash embed flag |
+| #1540 | **1.0777** | aryanbhosale | VarLen + Doc-Independent LoRA TTT rank-96 | **YES** |
+| #1610 | **1.0728** | romeerp | VarLenAttn + PhasingTTT | **YES** (low EV) |
+
+**Target**: ≤1.0760 bpb. 15 days remaining.
+
+---
+
+## What Changed (GitHub — Apr 14–15, 2026)
+
+**No new merges. No new high-priority PRs detected via web search.** Day 6 plateau continues.
+
+Checked via: `git log upstream/main -5` (Apr 9 is most recent) + web search for new submissions.
+
+### PRs to watch for movement:
+- PR #1586 (per-layer GPTQ) — highest probability of merging next given 3-seed confirmation + zero flags
+- PR #1541 (bigbag improved residuals) — hash embed flag must clear first; bigbag is the merged-SOTA author so organizers watch his PRs closely
+- Casefold PRs (#1585, #1578) — ruling pending from @valerio-oai; if ruled legal, would reset our target to ≤1.0589
+
+---
+
+## New Research Papers
+
+| Priority | Paper | arXiv ID | Date | Key Technique | Competition Relevance |
+|----------|-------|----------|------|---------------|----------------------|
+| **Add to plan** | **Newton-Muon Optimizer** | **2604.01472** | Apr 2, 2026 | Right-preconditioning by input second moment; surrogate quadratic model. Reaches target val loss in **6% fewer steps**, 4% less wall-clock vs standard Muon | **NOT YET IN PLAN.** Drop-in Muon replacement. At our budget (~4800 steps), 6% ≈ +288 extra effective steps. Small but free. Likely compatible with the MuonEq-R base, but unverified; check that the two are additive rather than redundant before adding. |
+| Already tracked | In-Place TTT | 2604.06169 | Apr 7, 2026 | MLP final-projection fast weights + NTP-aligned loss + chunk-wise updates | Score-first compatible. Key distinction from Session 3: uses NTP loss not reconstruction loss. Lesson #13 ("HARMFUL") used reconstruction loss on a different model. Could retry with NTP-aligned loss before dismissing permanently. Low priority until base stack is confirmed. |
+| Already tracked | PRISM | 2602.10796 | Feb 2026 | Parallelizable iterative residual correction; 174× vs serial | Architectural inspiration for Triple Loop improvement — read before next recurrence change |
+| Already tracked | Ouroboros | 2604.02051 | Apr 2, 2026 | Hypernetwork-generated per-step LoRA modulation for recursive blocks | 9.2M extra params overhead; likely too expensive for 16MB budget. Watch for competition PR. |
+| Already tracked | Mousse | 2603.09697 | Mar 2026 | Kronecker-factored preconditioning for Muon; ~12% fewer steps | Higher EV than Newton-Muon but more overhead |
+
+---
+
+## HuggingFace / Community
+
+No new relevant blog posts or model releases. A web search for "parameter-golf 1.06 OR 1.05" returned only the PR list page; no scores below 1.06 have surfaced publicly.
+
+---
+
+## Recommended Action
+
+**No strategy change from Apr 14. One addition: add Newton-Muon to technique tracking.**
+
+Priority order:
+1. **Next GPU run: Implement PR #1586** (per-layer GPTQ + int7 emb + MLR=0.026). Expected: ~1.068–1.070 bpb. Config changes only: `clip_sigmas={'mlp': 12.0, 'attn': 13.0, 'emb': 15.0}, MATRIX_LR=0.026, emb_bits=7`.
+2. **Same run: Add VarLen Attention + Doc-TTT (PR #1560 approach).** Combined expected: ~1.062–1.068 bpb.
+3. **Watch PR #1541** — if hash embed flag clears and it merges, new target becomes ≤1.0728.
+4. **Newton-Muon (arXiv:2604.01472)**: Evaluate as a Muon swap in a follow-up run. +288 effective steps at our scale. Check if MuonEq-R and Newton-Muon are additive or redundant before GPU spend.
+5. **Do NOT implement**: Casefold (#1585, await ruling), PR #758 (dead), any AdamW TTT.
+
+---
+
+_Updated: 2026-04-15 (merged SOTA 1.0810 Day 6 no change; no new PRs; Newton-Muon arXiv:2604.01472 added as new tracked technique (+6% effective steps); 15 days remaining)_
+
+---
+
 # Parameter Golf Daily Research - 2026-04-14
 
 ## PR #771 STATUS: CLOSED (REJECTED) — no change