
Commit 671e5b4

research(2026-04-16): SOTA Day 7 no change; PR openai#1667 Attention Output Gate; PR openai#1670 dexhunter 1.05970 casefold pending; PR openai#1647 SLOT-4 risky; Session 15
https://claude.ai/code/session_01VS9iDJJ7C5Qqpk8AAd1Avv
1 parent da1c430 commit 671e5b4

2 files changed

Lines changed: 143 additions & 6 deletions

File tree

CLAUDE.md

Lines changed: 21 additions & 6 deletions
@@ -112,8 +112,10 @@ torchrun --standalone --nproc_per_node=8 train_gpt.py
112112

113113
## Competition Strategy
114114

115-
**Merged leaderboard SOTA**: **1.0810 val_bpb** (bigbag, PR #1493, 2026-04-09) — NO CHANGE (confirmed Apr 13)
116-
**Best open legal PRs (Apr 13 update)**:
115+
**Merged leaderboard SOTA**: **1.0810 val_bpb** (bigbag, PR #1493, 2026-04-09) — NO CHANGE (confirmed Apr 16, Day 7 plateau)
116+
**Best open legal PRs (Apr 16 update)**:
117+
- PR #1670 (dexhunter, **1.05970**): Casefold V4 + Multi-Phase Global SGD TTT — **AWAIT CASEFOLD RULING (Issue #1604)**
118+
- PR #1667 (MarioPaerle, **1.07139**): SmearGate + Attention Output Gate (1,056 params = 12 weights × 8 heads × 11 layers) + Legal TTT — **CLEAN, no reviews, stack on #1586**
117119
- PR #1586 (dexhunter, **1.07493**): Per-Layer Adaptive GPTQ (MLP=12σ, Attn=13σ) + int7 Emb (15σ) + MLR=0.026 — **CLEAN, implement immediately**
118120
- PR #1560 (dexhunter, **1.07406**): VarLen Attention + Triton Fused MLP + Doc-TTT — appears legal (no reviews yet)
119121
- PR #1584 (codemath3000, **1.0752**): Systems-only (fused Muon + batched EMA + loader prealloc), ~20 extra steps
@@ -124,9 +126,10 @@ torchrun --standalone --nproc_per_node=8 train_gpt.py
124126
- PR #1576 (joshkmartinez, **~~1.01671~~**): GDN-Hybrid — **BPB BUG confirmed by reviewer** (space token double-count from PR #1545), actual ~1.16–1.18 BPB. Do NOT track.
125127
- PR #1585 (codemath3000, **1.0639**): Casefold Tokenizer — **LEGALITY DEBATED** (modifying val corpus bytes); await organizer ruling
126128
- PR #1578 (mikeapedia, **1.0668**): Custom Casefold Tokenizer — **LEGALITY DEBATED**; same concern as #1585
127-
**Best open with SLOT**: ~1.0766 val_bpb (PR #1333, aryanbhosale, Causal SLOT-16 on PR #1334 base) — no organizer rejection
129+
- PR #1647 (powerpratik, **1.0616**): SLOT-4 + TTT + 3-Layer Recurrence + Parallel Residuals — ⚠️ standard SLOT, no reviews
130+
**Best open with SLOT**: ~1.0616 val_bpb (PR #1647, powerpratik, SLOT-4) — no reviews yet
128131
**Best open (illegal)**: 1.0632 (PR #1517, RulinShao, Pre-Quant TTT 18ep — same ruling as #1351/#1416)
129-
**Target**: Beat 1.0810 merged SOTA by >=0.005 nats → need **≤1.0760 bpb**. Best reachable: ~1.068–1.075 (legal). With SLOT: ~1.065–1.073. **17 days to deadline (Apr 30).**
132+
**Target**: Beat 1.0810 merged SOTA by >=0.005 nats → need **≤1.0760 bpb**. Best reachable: ~1.068–1.072 (legal stack #1586+#1667+#1560). With casefold if ruled legal: ~1.059. **14 days to deadline (Apr 30).**
130133

131134
**CRITICAL LEGALITY UPDATES**:
132135
- **PR #771 REJECTED (2026-03-27)** — Our AdamW TTT 30ep was train-then-score. All 30-epoch TTT results void.
@@ -156,9 +159,10 @@ torchrun --standalone --nproc_per_node=8 train_gpt.py
156159
11. **Legal Score-First TTT (post-quant, ≤3ep)** — lr=0.005, all blocks
157160
12. **VarLen Attention (per-document causal masking)** — PR #1560, ~-0.007 bpb — **add next**
158161
13. **Doc-TTT (per-document score-first TTT)** — PR #1560, chunk size=48, Muon 0.97 — **add next**
159-
14. **TMA Megakernel (Triton TMA fused MLP)** — PR #1555, +10.5% throughput = ~200 extra steps — add after base validated
162+
14. **Attention Output Gate + SmearGate (PR #1667)** — 1,056 extra params (12 weights × 8 heads × 11 layers); multiplicative per-head gate init to zero; appears legal, no reviews yet; stack with #1586; **evaluate in same run**
163+
15. **TMA Megakernel (Triton TMA fused MLP)** — PR #1555, +10.5% throughput = ~200 extra steps — add after base validated
160164

161-
**Key reference PRs**: #1493 (merged SOTA 1.0810), #1586 (1.07493, per-layer GPTQ — implement now), #1560 (1.07406, best open safe — VarLen+Doc-TTT), #1584 (1.0752, systems opt — fused Muon/EMA/prealloc), #1555 (1.07636, TMA Megakernel+Tap-In), #1333 (1.07660, Causal SLOT-16 — risky), #1437 (1.08091, causal-fixed N-gram Tilt kernel — use this), #1413 (1.08279, SP8192+Legal TTT), #1334 (1.0897, arch reference), #1229 (0.9300, scored-position SLOT, open)
165+
**Key reference PRs**: #1493 (merged SOTA 1.0810), #1670 (1.05970, dexhunter Casefold V4+Multi-Phase TTT — await casefold ruling), #1667 (1.07139, Attention Output Gate+SmearGate — clean, stack on #1586), #1586 (1.07493, per-layer GPTQ — implement now), #1560 (1.07406, best open safe — VarLen+Doc-TTT), #1584 (1.0752, systems opt — fused Muon/EMA/prealloc), #1555 (1.07636, TMA Megakernel+Tap-In), #1333 (1.07660, Causal SLOT-16 — risky), #1437 (1.08091, causal-fixed N-gram Tilt kernel — use this), #1413 (1.08279, SP8192+Legal TTT), #1334 (1.0897, arch reference), #1229 (0.9300, scored-position SLOT, open)
162166

163167
**Abandoned approaches**: Training-time static LoRA TTT (hurts), product quantization (SWA-incompatible), custom Triton kernels (poor EV — REVERTED: PR #1420 shows +10% via Triton TMA, revisit after base works), int4 without QAT (quality-destructive), eval stride=32 (time budget), AdamW TTT 30ep (illegal), n-gram hash cache (illegal), pre-quant TTT any form (illegal), Eval-Time Hash Embedding trained at inference (suspect illegal — same adapt-then-score pattern), Tap-In V6 document-local matching (await ruling), GDN-Hybrid #1576 (BPB bug — actual ~1.17 not 1.01671).
164168
**NOTE**: Doc-Independent LoRA TTT (PR #1540, rank-96, resets per batch, score-first) is categorically DIFFERENT from abandoned LoRA TTT and appears legal — consider adopting.
@@ -178,6 +182,7 @@ torchrun --standalone --nproc_per_node=8 train_gpt.py
178182
| **VarLen Attention + Doc-TTT** | **~-0.007** | **LEGAL — PR #1560 (dexhunter, 1.07406 BPB); per-document causal masking + score-first TTT per-doc; LoRA chunk=48** |
179183
| **TMA Megakernel (Triton Hopper fused MLP)** | **+200 steps (~-0.002)** | **LEGAL — PR #1555; +10.5% throughput; add after base validated** |
180184
| **Tap-In Unigram Matching (min_match=1)** | **~-0.009** | **LEGALITY UNCONFIRMED — PR #1555; 21% activation rate; verify before implementing** |
185+
| **Attention Output Gate + SmearGate (PR #1667)** | **~-0.006 bpb (vs merged SOTA)** | **APPEARS LEGAL — PR #1667 (MarioPaerle, 1.07139 BPB); per-head multiplicative gate (1,056 params, init to zero); SmearGate width=12; no reviews; stack on PR #1586** |
181186
| **Per-Layer Adaptive GPTQ (MLP=12σ, Attn=13σ) + int7 Emb** | **-0.013 nats (-0.0046 bpb)** | **LEGAL — PR #1586 (dexhunter, 1.07493 BPB); MLP tighter clip, Attn looser, int7 emb saves 530KB; MLR=0.026; IMPLEMENT IMMEDIATELY** |
182187
| **Systems Opt (fused Muon + batched EMA + loader prealloc)** | **~+20 steps (~-0.001 bpb)** | **LEGAL — PR #1584 (codemath3000, 1.0752); pure kernel/memory efficiency; no ML changes** |
183188
| **Casefold Tokenizer (NFKC + lowercase BPE retrain)** | **~-0.017 bpb** | **LEGALITY DEBATED — PR #1578 (1.0668), #1585 (1.0639); modifying val corpus byte count raises comparability concern; await @valerio-oai ruling** |
@@ -363,3 +368,13 @@ _Updated: 2026-04-14 (v12.3 — merged SOTA 1.0810 Day 5 no change; PR #1610 Pha
363368
77. **No new open PRs filed Apr 14–15 with competitive scores.** Web search and git log show nothing new. PR #1619 (likely illegal AdamW TTT) and PR #1616 (QK-Gain 5.5) are low-interest. The competitive field is in a holding pattern — same 8 PRs as yesterday.
364369

365370
_Updated: 2026-04-15 (v12.4 — merged SOTA 1.0810 Day 6 no change; Newton-Muon arXiv:2604.01472 added (+6% effective steps, verify vs MuonEq-R); In-Place TTT (2604.06169) NTP-aligned loss distinguishes it from Session 3 failure; 15 days remaining)_
371+
372+
### Session 15 (2026-04-16)
373+
78. **Merged SOTA 1.0810 — Day 7 plateau, longest in competition history.** Seven days since last merge (Apr 9). With 14 days to deadline, the field appears to be preparing a late push. Do not take the plateau as stability — a wave of merges is likely imminent given 8+ open PRs in the 1.062–1.078 range.
374+
79. **PR #1667 (MarioPaerle, 1.07139) is a new clean stackable technique.** Attention Output Gate: 1,056 parameter multiplicative gate on attention output heads (12 weights × 8 heads × 11 layers), initialized to zero so scale starts at 1.0. SmearGate reintroduced (width=12, input-dependent). Legal score-first TTT (3ep, SGD, LR=0.005). Artifact 15.927 MB. No legality flags. Stack this on top of PR #1586 before next GPU run.
375+
80. **PR #1670 (dexhunter, 1.05970) is the new best open PR — but depends on casefold ruling.** Casefold V4 + Multi-Phase Global SGD TTT achieves 1.05970 (std 0.00031, 3-seed). The Casefold legality question (Issue #1604) has no @valerio-oai ruling as of Apr 16. Do NOT implement until ruled. If casefold is approved, this becomes the primary target and resets our goal to ≤1.0499.
376+
81. **PR #1647 (powerpratik, 1.0616) uses standard SLOT-4 — high risk.** Delta-vector logit bias optimized 4 AdamW steps per window. No organizer reviews yet. Standard SLOT (not causal SLOT-16). Risk: @valerio-oai could rule at any time. Only implement if willing to accept rejection.
377+
82. **PR #731 (Hedge Mixer, 1.0400) is close to merge — 2 seeds pending.** Dense-count tables + Laplace smoothing + 5-expert ensemble. Reviewer confirmed score-first per chunk and said "LOOKS CLEAN." Seeds 1337 and 2024 are the only remaining gate. If both seeds confirm ~1.04, this merges and gives us a legal n-gram mixer blueprint.
378+
83. **dexhunter now holds 3 of the top open PRs (#1560, #1586, #1670); #1670 still awaits the casefold ruling.** Highly reliable submitter with zero legality flags across all PRs. Copy techniques from their legal PRs with confidence.
379+
380+
_Updated: 2026-04-16 (v12.5 — merged SOTA 1.0810 Day 7; PR #1667 Attention Output Gate new clean stackable tech; PR #1670 dexhunter 1.05970 best open but casefold pending; PR #1647 SLOT-4 risky; PR #731 seeds pending; 14 days remaining)_

logs/daily_research.md

Lines changed: 122 additions & 0 deletions
@@ -1,3 +1,125 @@
1+
# Parameter Golf Daily Research - 2026-04-16
2+
3+
## PR #771 STATUS: CLOSED (REJECTED) — no change
4+
5+
@valerio-oai ruling (confirmed): "adapting model to eval tokens with TTT for multiple epochs, then reporting val numbers on those same tokens." No appeal path.
6+
7+
---
8+
9+
## N-GRAM PR STATUS
10+
11+
| PR | Score | Status | Notes |
12+
|----|-------|--------|-------|
13+
| #727 | 0.9674 | **CLOSED** (illegal) | Hashed n-gram cache — ruled out Mar 27 |
14+
| #741 | 0.9850 | **CLOSED** (illegal) | Author self-closed, same illegality |
15+
| #758 | 1.0465 | **OPEN** (dead) | XOR hash key includes target token — same violation as #727. No new activity. |
16+
| #731 | 1.0400 | **OPEN** | Dense-count + Laplace smoothing. MatoTeziTanka "LOOKS CLEAN." Seed 42 only; seeds 1337+2024 pending. 6104 steps, 15,999,919 bytes. |
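The #731 row combines count tables, Laplace smoothing, and a score-first expert mixture. A minimal sketch of how those pieces could fit together is below. It is not the PR's code: the class and function names are hypothetical, it uses three experts instead of the PR's five, dictionary counts instead of dense tables, per-token instead of per-chunk scoring, and a standard Hedge (multiplicative-weights) update under log loss, which is an assumption about what "Hedge Mixer" means.

```python
import math
from collections import defaultdict


class LaplaceNGram:
    """Count-based n-gram expert with add-alpha (Laplace) smoothing."""

    def __init__(self, order, vocab_size, alpha=1.0):
        self.order = order
        self.vocab_size = vocab_size
        self.alpha = alpha
        self.counts = defaultdict(int)  # (context, token) -> count
        self.totals = defaultdict(int)  # context -> count

    def _key(self, history):
        # Order 0 ignores the history entirely.
        return tuple(history[-self.order:]) if self.order > 0 else ()

    def prob(self, history, token):
        key = self._key(history)
        num = self.counts.get((key, token), 0) + self.alpha
        den = self.totals.get(key, 0) + self.alpha * self.vocab_size
        return num / den

    def update(self, history, token):
        key = self._key(history)
        self.counts[(key, token)] += 1
        self.totals[key] += 1


def score_then_update(experts, weights, history, token, eta=1.0):
    """Score the token with the current mixture, then adapt (score-first)."""
    probs = [e.prob(history, token) for e in experts]
    mixture = sum(w * p for w, p in zip(weights, probs))
    bits = -math.log2(mixture)  # loss is fixed before any state changes
    # Hedge update under log loss: w_i <- w_i * p_i^eta, then renormalize.
    raw = [w * (p ** eta) for w, p in zip(weights, probs)]
    total = sum(raw)
    weights[:] = [r / total for r in raw]
    for e in experts:
        e.update(history, token)
    return bits


def run_demo():
    tokens = [0, 1] * 20  # toy alternating stream, vocab size 2
    experts = [LaplaceNGram(order, vocab_size=2) for order in (0, 1, 2)]
    weights = [1.0 / len(experts)] * len(experts)
    losses = [
        score_then_update(experts, weights, tokens[:i], tok)
        for i, tok in enumerate(tokens)
    ]
    return losses, weights
```

The ordering inside `score_then_update` is the point of the sketch: the loss for a token is computed before the mixture weights or counts see that token, which is the property the reviewer reportedly checked.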
17+
18+
---
19+
20+
## Leaderboard
21+
22+
**Merged SOTA: 1.0810 (bigbag, PR #1493) — DAY 7 UNCHANGED.**
23+
24+
Last upstream commit: `75700cb` April 9, 2026. Longest plateau since the Apr 5–9 acceleration wave. No new records in 7 days. Expect a merge wave before the April 30 deadline (14 days away).
25+
26+
### Best Open PRs (updated Apr 16)
27+
28+
| PR | Score | Author | Technique | Legal? |
29+
|----|-------|--------|-----------|--------|
30+
| #1670 | **1.05970** | dexhunter | Casefold V4 + Multi-Phase Global SGD TTT | **AWAIT CASEFOLD RULING** |
31+
| #1647 | **1.0616** | powerpratik | SLOT-4 + TTT + 3-Layer Recurrence + Parallel Residuals | ⚠️ SLOT unruled |
32+
| #1585 | **1.0639** | codemath3000 | Casefold Tokenizer + Parallel Residuals + Systems Opt | **AWAIT RULING** |
33+
| #1578 | **1.0668** | mikeapedia | Custom Casefold BPE retrain | **AWAIT RULING** |
34+
| #1667 | **1.07139** | MarioPaerle | SmearGate + Attention Output Gate (1,056 params) + Legal TTT | **YES — no reviews yet, appears clean** |
35+
| #1560 | **1.07406** | dexhunter | VarLen Attention + Doc-TTT | **YES** |
36+
| #1586 | **1.07493** | dexhunter | Per-Layer Adaptive GPTQ + int7 Emb + MLR=0.026 | **YES** |
37+
| #1610 | **1.0728** | romeerp | VarLenAttn + PhasingTTT | YES (low EV) |
38+
| #1584 | **1.0752** | codemath3000 | Systems Opt (fused Muon + batched EMA + loader prealloc) | **YES** |
39+
| #1555 | **1.07636** | andrewbaggio1 | TMA Megakernel + Tap-In (min_match=1) | Tap-In unconfirmed |
40+
| #1541 | **1.07785** | bigbag | Improved Parallel Residuals + Muon 0.97 | ⚠️ hash embed flag |
41+
| #1540 | **1.0777** | aryanbhosale | VarLen + Doc-Independent LoRA TTT rank-96 | **YES** |
42+
43+
**Target**: ≤1.0760 bpb. 14 days remaining (April 30 deadline).
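The target and countdown can be re-derived from figures already in these notes. The sketch below follows the notes in subtracting the 0.005 margin directly from the bpb score (whether the margin is meant in nats or bpb is not settled here), and the variable names are illustrative only.

```python
from datetime import date

MERGED_SOTA_BPB = 1.0810   # PR #1493
REQUIRED_MARGIN = 0.005    # applied directly in bpb, as the notes do
target_bpb = round(MERGED_SOTA_BPB - REQUIRED_MARGIN, 4)

days_left = (date(2026, 4, 30) - date(2026, 4, 16)).days

# Open PRs marked legal ("YES") in the table above, with their reported scores.
legal_open = {
    1560: 1.07406,
    1586: 1.07493,
    1667: 1.07139,
    1610: 1.0728,
    1584: 1.0752,
    1540: 1.0777,
}
meets_target = sorted(
    (pr for pr, bpb in legal_open.items() if bpb <= target_bpb),
    key=lambda pr: legal_open[pr],
)

print(f"target <= {target_bpb:.4f} bpb, {days_left} days left")
print("legal open PRs at or under target:", meets_target)
```

On the table's numbers, every legal open PR except #1540 already sits at or under the target.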
44+
45+
---
46+
47+
## What Changed (GitHub — Apr 15–16, 2026)
48+
49+
### No new merges. Day 7 plateau continues.
50+
51+
### New Open PRs (filed Apr 14–16)
52+
53+
**PR #1670** (dexhunter, **1.05970**, new best open) — ⚠️ AWAIT CASEFOLD RULING
54+
- Casefold V4: lowercase normalization before SP8192 tokenization ("reduces vocabulary entropy")
55+
- Multi-Phase Global SGD TTT: 3 phases across 2000 prefix documents (builds on PR #1626)
56+
- std dev 0.00031 (3-seed), artifact ~15.20 MB
57+
- TTT phase ordering unclear (score-first vs. train-then-score not explicit in docs)
58+
- **Depends on casefold ruling at Issue #1604** (open, no @valerio-oai comment yet)
59+
- **Do NOT implement until casefold ruled legal**
60+
61+
**PR #1667** (MarioPaerle, **1.07139**) — ✅ CLEAN, APPEARS LEGAL
62+
- Attention Output Gate: lightweight per-head multiplicative gate on attention output; 1,056 new params (12 weights × 8 heads × 11 layers); initialized to zero → scale=1.0 at start
63+
- SmearGate: reintroduced with input dependence (Modded Nano GPT style), width=12
64+
- Legal score-first TTT, 3ep, LR=0.005, SGD
65+
- 3-seed mean 1.07139 (std 0.00082), artifact 15.927 MB (max 15.94 MB)
66+
- No organizer feedback; self-certified compliance
67+
- **Stack this on PR #1586 for potential additive improvement**
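To make the parameter count and the zero-init behavior concrete, here is a minimal pure-Python sketch of a per-head output gate. The gate form `2 * sigmoid(w · x)` and the choice to read 12 input features per token are assumptions picked to match "12 weights × 8 heads" and "scale=1.0 at start"; the PR's actual formulation may differ, and all names are hypothetical.

```python
import math

N_LAYERS, N_HEADS, GATE_INPUTS = 11, 8, 12
N_GATE_PARAMS = GATE_INPUTS * N_HEADS * N_LAYERS  # 12 x 8 x 11


def init_gate_weights():
    """One layer's gate: a zero-initialized (heads x inputs) weight table."""
    return [[0.0] * GATE_INPUTS for _ in range(N_HEADS)]


def head_scales(x_slice, gate_weights):
    """Per-head multiplicative scale, assumed here to be 2 * sigmoid(w . x)."""
    scales = []
    for row in gate_weights:
        logit = sum(w * x for w, x in zip(row, x_slice))
        scales.append(2.0 / (1.0 + math.exp(-logit)))
    return scales


def gate_attention_output(head_outputs, x_slice, gate_weights):
    """Scale each head's output vector by its gate before the output projection."""
    scales = head_scales(x_slice, gate_weights)
    return [[s * v for v in head] for s, head in zip(scales, head_outputs)]


# With zero weights every gate is exactly 1.0, so the layer starts as an identity.
x = [0.3 * i for i in range(GATE_INPUTS)]
heads = [[float(h), float(h) + 0.5] for h in range(N_HEADS)]
gated = gate_attention_output(heads, x, init_gate_weights())
```

Because the gate starts as an identity, adding it to an existing checkpoint or config should not change the initial loss, which is one reason it is cheap to try on top of another stack.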
68+
69+
**PR #1647** (powerpratik, **1.0616**) — ⚠️ RISKY (SLOT)
70+
- SLOT-4: per-window delta-vector logit bias, 4 AdamW steps
71+
- Standard SLOT (not Causal SLOT-16)
72+
- No reviews yet from any reviewer
73+
- Do NOT implement until SLOT receives organizer ruling
74+
75+
**PR #1671** (souro26, 1.3827): Token-wise gating — well above baseline, skip
76+
**PR #1666** (mrbese, 1.1531): BESE 288-vocab tokenizer — not competitive
77+
78+
### Issue #1604 (casefold tokenizer legality): Still OPEN
79+
- Filed Apr 13 by mikeapedia; no @valerio-oai comment as of Apr 16
80+
- Core question: does NFKC + lowercase on validation corpus constitute invalid benchmark manipulation?
81+
- Three community members debating; no ruling
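The comparability concern can be illustrated with the standard library alone. The sketch below applies NFKC plus lowercasing to a short string and compares UTF-8 byte counts; the sample string and the bit total are made up for illustration and are not taken from the validation corpus.

```python
import unicodedata


def casefold_normalize(text):
    """NFKC normalization followed by lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()


def utf8_len(text):
    return len(text.encode("utf-8"))


# U+FB01 is the "fi" ligature and U+FF21 is a fullwidth "A"; both are 3 bytes in UTF-8.
raw = "\ufb01ne \uff21BC"
norm = casefold_normalize(raw)

raw_bytes = utf8_len(raw)
norm_bytes = utf8_len(norm)

# bpb = total bits / total bytes, so the same bit total gives a different
# score depending on which byte count is used as the denominator.
total_bits = 80.0  # hypothetical model cost for this string
bpb_on_raw = total_bits / raw_bytes
bpb_on_norm = total_bits / norm_bytes

print(repr(norm), raw_bytes, norm_bytes)
```

Since the byte count is the denominator of val_bpb, any preprocessing that changes it also changes what the reported number measures, which is the question Issue #1604 asks the organizers to settle.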
82+
83+
---
84+
85+
## New Research Papers
86+
87+
| Priority | Paper | arXiv ID | Date | Key Technique | Applicability |
88+
|----------|-------|----------|------|---------------|--------------|
89+
| Watch | Self-Calibrating LMs via TTT Discriminative Distillation (SECL) | 2604.09624 | Apr 2026 | TTT pipeline that reduces ECE via discriminative distillation; score-first compatible | Targets calibration (ECE), not BPB. Low direct impact on our metric. |
90+
| Already tracked | End-to-End TTT for Long Context | 2512.23675 | Dec 2025 | Compresses context to weights at test time via next-token prediction; scales with context length | Relevant to Doc-TTT quality; LaCT (2505.23884) is the higher-EV variant already in plan |
91+
| Already tracked | Newton-Muon | 2604.01472 | Apr 2026 | +6% fewer steps, +4% wall-clock vs standard Muon | Verify additive with MuonEq-R before GPU spend |
92+
| Skip | LieQ (layer-wise quant for small LMs) | 2508.03332 | Aug 2025 | Canonical division of labour across layers for PTQ; 2-bit target | Not applicable — we use int6/int7 GPTQ, not sub-4-bit regime |
93+
94+
No new breakthrough papers today. arXiv:2604.09624 (SECL) is the sole new find; low direct impact.
95+
96+
---
97+
98+
## HuggingFace / Community
99+
100+
No new relevant blog posts. dexhunter filed PR #1670 (1.05970) — their third top-10 PR (#1560, #1586, #1670). MarioPaerle is a new submitter worth watching (PR #1667 technique is clean and implementable).
101+
102+
---
103+
104+
## Recommended Action
105+
106+
**No change to core strategy. Two additions: PR #1667 Attention Output Gate is now a candidate to stack; casefold watch continues.**
107+
108+
Priority order for next GPU run:
109+
1. **Implement PR #1586** (per-layer GPTQ: MLP=12σ, Attn=13σ, Emb int7@15σ; MLR=0.026). Config-level change, -0.01266 nats confirmed, zero legality risk.
110+
2. **Add VarLen Attention + Doc-TTT** (PR #1560 approach): -0.007 bpb. Combined target with #1: ~1.062–1.068 bpb.
111+
3. **Evaluate PR #1667 Attention Output Gate + SmearGate** on same run or follow-up: 1,056 extra params, no legality concerns. If additive with #1586 + #1560, expected combined ~1.065–1.070.
112+
4. **Watch PR #731** — if the two pending seeds (1337, 2024) confirm 1.0400 BPB and it merges, Hedge Mixer (legal n-gram interpolation) is adoptable.
113+
5. **Watch Issue #1604** — if casefold ruled legal, PR #1670 (dexhunter, 1.05970) jumps to highest-EV action; reset target to ≤1.0499.
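Step 1 is described as a config-level change, so a sketch of what a per-group clip config might look like is given below. It shows only the clipping and rounding stage with plain round-to-nearest; it does not implement GPTQ's error-compensated weight updates. The σ multipliers come from the notes, while the int6 width for MLP and attention, the dictionary layout, and the function names are assumptions.

```python
import random
import statistics

# Clip threshold (in units of the tensor's std) and bit width per weight group.
CLIP_CONFIG = {
    "mlp": {"clip_sigma": 12.0, "bits": 6},   # int6 is an assumption
    "attn": {"clip_sigma": 13.0, "bits": 6},  # int6 is an assumption
    "emb": {"clip_sigma": 15.0, "bits": 7},
}


def quantize(weights, clip_sigma, bits):
    """Symmetric clip-and-round quantization; returns integer codes and the scale."""
    sigma = statistics.pstdev(weights)
    qmax = 2 ** (bits - 1) - 1
    scale = clip_sigma * sigma / qmax
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


random.seed(0)
mlp_weights = [random.gauss(0.0, 0.02) for _ in range(1000)]
codes, scale = quantize(mlp_weights, **CLIP_CONFIG["mlp"])
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(mlp_weights, restored))
```

A smaller σ multiplier gives a finer step size at the cost of clipping more outliers, which matches the notes' description of a tighter clip for MLP weights and a looser one for attention.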
114+
115+
**Do NOT implement**: Casefold (#1670, #1585, #1578 — await ruling), SLOT (#1647 — unruled), PR #758 (dead), AdamW multi-epoch TTT, pre-quant TTT.
116+
117+
---
118+
119+
_Updated: 2026-04-16 (merged SOTA 1.0810 Day 7 no change; PR #1667 MarioPaerle new clean PR (1.07139, Attention Output Gate + SmearGate); PR #1670 dexhunter new best open (1.05970) but pending casefold ruling; PR #1647 SLOT-4 (1.0616) risky; casefold Issue #1604 open; 14 days remaining)_
120+
121+
---
122+
1123
# Parameter Golf Daily Research - 2026-04-15
2124

3125
## PR #771 STATUS: CLOSED (REJECTED) — no change
