
Commit da1c430

research(2026-04-15): SOTA Day 6 no change; Newton-Muon added; In-Place TTT clarified
- logs/daily_research.md: Apr 15 entry. Merged SOTA 1.0810 Day 6 plateau confirmed (last upstream commit Apr 9); no new PRs or merges detected; Newton-Muon (arXiv:2604.01472) added as new tracked technique (+6% effective steps, drop-in Muon swap); In-Place TTT (arXiv:2604.06169) NTP-aligned loss distinguished from Session 3 reconstruction-loss failure
- CLAUDE.md: Newton-Muon upgraded in technique table with quantified impact (~+288 steps at our scale); Session 14 lessons added (lessons 74–77)

https://claude.ai/code/session_01QEyiKYb75JrpMzgvuGYK5V
1 parent 032c469 commit da1c430

2 files changed

Lines changed: 100 additions & 1 deletion


CLAUDE.md

Lines changed: 9 additions & 1 deletion
@@ -203,7 +203,7 @@ torchrun --standalone --nproc_per_node=8 train_gpt.py
203203
| **Cooldown+QAT fusion (arXiv:2509.22935)** | **~-0.002** | **WATCH — LR decay jointly with QAT; no artifact size change** |
204204
| **LaCT large-chunk TTT (arXiv:2505.23884)** | GPU util 0→70% | WATCH — PR #1560 Doc-TTT may be LaCT-style; dexhunter already implementing |
205205
| **SGT sparse depth recurrence (arXiv:2603.23998)** | saves FLOP budget | Watch — reduces Triple Loop FLOP overhead |
206-
| Newton-Muon (arXiv:2604.01472) | ~+4-6% steps | WATCH — Apr 2026, untested |
206+
| **Newton-Muon (arXiv:2604.01472)** | **~+6% steps (~+288 steps at our scale, ~-0.001 bpb)** | **WATCH — Apr 2, 2026; right-preconditioning via input second moment; 6% fewer iterations + 4% wall-clock vs Muon on nanoGPT speedrun; drop-in Muon swap. Verify additive with MuonEq-R before GPU spend.** |
207207
| MUD/MomentUm Decorrelation (arXiv:2603.17970) | +20-50% throughput | WATCH — triangular Cholesky whitening; 1.3–2.6× tokens/sec vs Muon |
208208
| Mousse (arXiv:2603.09697) | ~-0.002 to -0.003 | WATCH — Kronecker-factored preconditioning for Muon; ~12% fewer steps |
209209
| Infini-gram interpolation (arXiv:2401.17377) | large but legal unclear | WATCH — suffix array ∞-gram, normalized |
@@ -355,3 +355,11 @@ _Updated: 2026-04-13 (v12.2 — merged SOTA 1.0810 confirmed; PR #758 dead; GDN-
355355
73. **16 days remain. Implement PR #1586 (per-layer GPTQ + int7 emb) before any other change.** -0.01266 bpb, verified 3-seed, zero legality risk. This is the single fastest path to beating the merged SOTA. Do not wait for Casefold or Hedge Mixer rulings before running the GPU experiment.
356356

357357
_Updated: 2026-04-14 (v12.3 — merged SOTA 1.0810 Day 5 no change; PR #1610 PhasingTTT legal but low EV; PRISM arXiv:2602.10796 relevant to recurrence design; Ouroboros arXiv:2604.02051 watch; 16 days remaining)_
358+
359+
### Session 14 (2026-04-15)
360+
74. **Merged SOTA 1.0810 enters Day 6 plateau — longest since competition acceleration.** `git log upstream/main` confirms last commit was Apr 9 15:22 PDT. No new records or merges detected via git or web search. Eight open PRs remain in range. Expect imminent merge wave.
361+
75. **Newton-Muon (arXiv:2604.01472, Apr 2) is a drop-in Muon swap worth testing.** Right-preconditioning by input second moment gives 6% fewer iterations + 4% wall-clock vs standard Muon on nanoGPT speedrun benchmark. At our ~4800-step budget, 6% ≈ +288 effective steps ≈ small free bpb gain. Previously listed in the technique table as untested (~+4-6% steps); the entry was upgraded today with the quantified impact. Verify it is additive with MuonEq-R (our current optimizer) before spending GPU; they may be redundant.
362+
76. **In-Place TTT (arXiv:2604.06169) is NOT the same as Session 3's failed attempt.** Session 3 used reconstruction loss on MLP output projections and saw loss blow up (2.63+). The Apr 7 paper uses an NTP-aligned loss, which is theoretically grounded for autoregressive LM. The "HARMFUL" lesson (#13) should not prevent trying In-Place TTT with NTP-aligned loss on a modern base. Low priority now; revisit after PR #1586 + VarLen+Doc-TTT are confirmed.
363+
77. **No new open PRs filed Apr 14–15 with competitive scores.** Web search and git log show nothing new. PR #1619 (likely illegal AdamW TTT) and PR #1616 (QK-Gain 5.5) are low-interest. The competitive field is in a holding pattern — same 8 PRs as yesterday.
364+
365+
_Updated: 2026-04-15 (v12.4 — merged SOTA 1.0810 Day 6 no change; Newton-Muon arXiv:2604.01472 added (+6% effective steps, verify vs MuonEq-R); In-Place TTT (2604.06169) NTP-aligned loss distinguishes it from Session 3 failure; 15 days remaining)_
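
The "+288 effective steps" figure in lesson 75 can be reproduced with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the second estimate (treating "6% fewer iterations" as a 1/(1 - 0.06) multiplier on the budget) is an alternative reading that the log does not use.

```python
def extra_steps_first_order(step_budget: int, fewer_steps_fraction: float) -> int:
    """Estimate used in the log: step budget times the quoted fraction."""
    return round(step_budget * fewer_steps_fraction)


def extra_steps_equivalent(step_budget: int, fewer_steps_fraction: float) -> int:
    """Alternative reading (an assumption, not from the log): if a target is
    reached in (1 - fraction) of the baseline steps, the budget is worth
    budget / (1 - fraction) baseline steps; return the surplus."""
    return round(step_budget / (1.0 - fewer_steps_fraction) - step_budget)


STEP_BUDGET = 4800  # "~4800-step budget" from lesson 75
FEWER_STEPS = 0.06  # "6% fewer iterations" quoted for Newton-Muon

print(extra_steps_first_order(STEP_BUDGET, FEWER_STEPS))  # 288
print(extra_steps_equivalent(STEP_BUDGET, FEWER_STEPS))   # 306
```

Either way the gain is a few hundred steps, which is consistent with the log's description of a small, free improvement.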

logs/daily_research.md

Lines changed: 91 additions & 0 deletions
@@ -1,3 +1,94 @@
1+
# Parameter Golf Daily Research - 2026-04-15
2+
3+
## PR #771 STATUS: CLOSED (REJECTED) — no change
4+
5+
---
6+
7+
## N-GRAM PR STATUS
8+
9+
| PR | Score | Status | Notes |
10+
|----|-------|--------|-------|
11+
| #727 | 0.9674 | **CLOSED** (illegal) | Hashed n-gram cache — no change |
12+
| #741 | 0.9850 | **CLOSED** (illegal) | No change |
13+
| #758 | 1.0465 | **OPEN** (dead) | No new activity. XOR hash includes target token; effectively dead. |
14+
| #731 | 1.0400 | **OPEN** | Dense-count tables + Laplace smoothing. Awaiting seeds 1337+2024. No update. |
15+
16+
---
17+
18+
## Leaderboard
19+
20+
**Merged SOTA: 1.0810 (bigbag, PR #1493) — DAY 6 UNCHANGED.**
21+
22+
Last upstream commit: `75700cb 2026-04-09 15:22 PDT` (PR #1511, leaderboard README). Zero new records since Apr 9.
23+
24+
This is the longest plateau since the Apr 5–9 acceleration wave (4 records in 4 days). Either the field is stuck, or a wave of PRs is being prepared for end-of-month push. **15 days to deadline.**
25+
26+
Best open PRs (no changes from Apr 14):
27+
28+
| PR | Score | Author | Technique | Legal? |
29+
|----|-------|--------|-----------|--------|
30+
| #1585 | **1.0639** | codemath3000 | Casefold Tokenizer + Parallel Residuals + Systems Opt | **AWAIT RULING** |
31+
| #1578 | **1.0668** | mikeapedia | Casefold BPE retrain | **AWAIT RULING** |
32+
| #1560 | **1.07406** | dexhunter | VarLen Attention + Doc-TTT | **YES** |
33+
| #1586 | **1.07493** | dexhunter | Per-Layer Adaptive GPTQ + int7 Emb + MLR=0.026 | **YES** |
34+
| #1584 | **1.0752** | codemath3000 | Systems Opt (fused Muon + batched EMA + loader prealloc) | **YES** |
35+
| #1555 | **1.07636** | andrewbaggio1 | TMA Megakernel + Tap-In (min_match=1) | Tap-In unconfirmed |
36+
| #1541 | **1.07785** | bigbag | Improved Parallel Residuals + Muon 0.97 | ⚠️ hash embed flag |
37+
| #1540 | **1.0777** | aryanbhosale | VarLen + Doc-Independent LoRA TTT rank-96 | **YES** |
38+
| #1610 | **1.0728** | romeerp | VarLenAttn + PhasingTTT | **YES** (low EV) |
39+
40+
**Target**: ≤1.0760 bpb. 15 days remaining.
41+
42+
---
43+
44+
## What Changed (GitHub — Apr 14–15, 2026)
45+
46+
**No new merges. No new high-priority PRs detected via web search.** Day 6 plateau continues.
47+
48+
Checked via: `git log upstream/main -5` (Apr 9 is most recent) + web search for new submissions.
49+
50+
### PRs to watch for movement:
51+
- PR #1586 (per-layer GPTQ) — highest probability of merging next given 3-seed confirmation + zero flags
52+
- PR #1541 (bigbag improved residuals) — hash embed flag must clear first; bigbag is the merged-SOTA author so organizers watch his PRs closely
53+
- Casefold PRs (#1585, #1578) — ruling pending from @valerio-oai; if ruled legal, would reset our target to ≤1.0589
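
The targets quoted in this entry (≤1.0760 now, ≤1.0589 if Casefold is ruled legal, ≤1.0728 if PR #1541 merges) are all consistent with "best merged score minus 0.005 bpb". The margin is inferred from those numbers, not stated in the log, so treat the constant below as an assumption. A minimal sketch using `decimal` to avoid float rounding noise:

```python
from decimal import Decimal, ROUND_DOWN

# Assumption: margin inferred from the targets quoted in the log.
ASSUMED_MARGIN = Decimal("0.005")


def target_for(sota_bpb: str) -> Decimal:
    """Target score implied by a given merged SOTA, truncated to 4 decimals."""
    raw = Decimal(sota_bpb) - ASSUMED_MARGIN
    return raw.quantize(Decimal("0.0001"), rounding=ROUND_DOWN)


scenarios = {
    "merged SOTA (PR #1493)": "1.0810",
    "PR #1585 merged after Casefold ruling": "1.0639",
    "PR #1541 merged": "1.07785",
}

for label, score in scenarios.items():
    print(f"{label}: target <= {target_for(score)}")
```

Truncating rather than rounding is also an assumption; it is chosen here because 1.07785 - 0.005 = 1.07285 and the log quotes 1.0728.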
54+
55+
---
56+
57+
## New Research Papers
58+
59+
| Priority | Paper | arXiv ID | Date | Key Technique | Competition Relevance |
60+
|----------|-------|----------|------|---------------|----------------------|
61+
| **Add to plan** | **Newton-Muon Optimizer** | **2604.01472** | Apr 2, 2026 | Right-preconditioning by input second moment; surrogate quadratic model. Reaches target val loss in **6% fewer steps**, 4% less wall-clock vs standard Muon | **NOT YET IN PLAN.** Drop-in Muon replacement. At our budget (~4800 steps), 6% ≈ +288 extra effective steps. Small but free. Compatible with MuonEq-R base; verify they don't conflict before adding. |
62+
| Already tracked | In-Place TTT | 2604.06169 | Apr 7, 2026 | MLP final-projection fast weights + NTP-aligned loss + chunk-wise updates | Score-first compatible. Key distinction from Session 3: uses NTP loss not reconstruction loss. Lesson #13 ("HARMFUL") used reconstruction loss on a different model. Could retry with NTP-aligned loss before dismissing permanently. Low priority until base stack is confirmed. |
63+
| Already tracked | PRISM | 2602.10796 | Feb 2026 | Parallelizable iterative residual correction; 174× vs serial | Architectural inspiration for Triple Loop improvement — read before next recurrence change |
64+
| Already tracked | Ouroboros | 2604.02051 | Apr 2, 2026 | Hypernetwork-generated per-step LoRA modulation for recursive blocks | 9.2M extra params overhead; likely too expensive for 16MB budget. Watch for competition PR. |
65+
| Already tracked | Mousse | 2603.09697 | Mar 2026 | Kronecker-factored preconditioning for Muon; ~12% fewer steps | Higher EV than Newton-Muon but more overhead |
66+
67+
---
68+
69+
## HuggingFace / Community
70+
71+
No new relevant blog posts or model releases. Web search for "parameter-golf 1.06 OR 1.05" returned only PR list page — no new scores below 1.06 surfacing publicly.
72+
73+
---
74+
75+
## Recommended Action
76+
77+
**No strategy change from Apr 14. One addition: add Newton-Muon to technique tracking.**
78+
79+
Priority order:
80+
1. **Next GPU run: Implement PR #1586** (per-layer GPTQ + int7 emb + MLR=0.026). Expected: ~1.068–1.070 bpb. Config changes only: `clip_sigmas={'mlp': 12.0, 'attn': 13.0, 'emb': 15.0}, MATRIX_LR=0.026, emb_bits=7`.
81+
2. **Same run: Add VarLen Attention + Doc-TTT (PR #1560 approach).** Combined expected: ~1.062–1.068 bpb.
82+
3. **Watch PR #1541** — if hash embed flag clears and it merges, new target becomes ≤1.0728.
83+
4. **Newton-Muon (arXiv:2604.01472)**: Evaluate as a Muon swap in a follow-up run. +288 effective steps at our scale. Check if MuonEq-R and Newton-Muon are additive or redundant before GPU spend.
84+
5. **Do NOT implement**: Casefold (#1585, await ruling), PR #758 (dead), any AdamW TTT.
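
The config changes listed in item 1 could be collected in one place before the run. This is a hedged sketch: `QuantRunConfig` and its field names are hypothetical containers for the three values quoted above, and how `train_gpt.py` or PR #1586 actually consumes them is not shown in this log.

```python
from dataclasses import dataclass, field


@dataclass
class QuantRunConfig:
    """Hypothetical holder for the PR #1586-style settings quoted in item 1."""

    # Per-layer-group clipping thresholds, in standard deviations.
    clip_sigmas: dict = field(
        default_factory=lambda: {"mlp": 12.0, "attn": 13.0, "emb": 15.0}
    )
    matrix_lr: float = 0.026  # MATRIX_LR in the log
    emb_bits: int = 7         # int7 embeddings

    def embedding_levels(self) -> int:
        """Number of distinct values representable with emb_bits bits."""
        return 2 ** self.emb_bits

    def validate(self) -> None:
        """Basic sanity checks before spending GPU time."""
        if not 1 <= self.emb_bits <= 8:
            raise ValueError(f"emb_bits out of range: {self.emb_bits}")
        if self.matrix_lr <= 0:
            raise ValueError(f"matrix_lr must be positive: {self.matrix_lr}")
        missing = {"mlp", "attn", "emb"} - self.clip_sigmas.keys()
        if missing:
            raise ValueError(f"missing clip_sigmas groups: {sorted(missing)}")


cfg = QuantRunConfig()
cfg.validate()
print(cfg.clip_sigmas, cfg.matrix_lr, cfg.emb_bits, cfg.embedding_levels())
```

Keeping the values in a single validated object makes it easier to confirm that a run changed only these three settings relative to the current base.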
85+
86+
---
87+
88+
_Updated: 2026-04-15 (merged SOTA 1.0810 Day 6 no change; no new PRs; Newton-Muon arXiv:2604.01472 added as new tracked technique (+6% effective steps); 15 days remaining)_
89+
90+
---
91+
192
# Parameter Golf Daily Research - 2026-04-14
293

394
## PR #771 STATUS: CLOSED (REJECTED) — no change
