
Commit 19fb3a0

research(2026-04-12): merged SOTA 1.0810; VarLen+Doc-TTT; GDN-Hybrid 1.01710
Merged SOTA changed from 1.1147 to 1.0810 (PR openai#1493, bigbag, 2026-04-09). Six PRs merged in 5 days (PRs openai#1334, openai#1285, openai#1394, openai#1412, openai#1413, openai#1477, openai#1493). New target: ≤1.0760 val_bpb. 18 days to deadline.

Key findings:
- GDN-Hybrid (PR openai#1564): 1.01710 BPB, no TTT/SLOT; monitor for organizer review
- VarLen Attention + Doc-TTT (PR openai#1560): 1.07406 BPB; implement next
- TMA Megakernel + Tap-In (PR openai#1555): 1.07636 BPB; add after openai#1560
- PR openai#731 n-gram (dense count + Laplace): reviewer says LOOKS CLEAN, awaiting 3rd seed
- PR openai#758: major legality flags, do not implement

Updated CLAUDE.md: Competition Strategy, Technique Reference, Lessons Learned (Session 9).
Updated logs/daily_research.md: new 2026-04-12 entry prepended.

https://claude.ai/code/session_011WyxjcwdigLhMFQDjLL5ss
1 parent 3cce9f5 commit 19fb3a0

2 files changed

Lines changed: 199 additions & 62 deletions


CLAUDE.md

Lines changed: 70 additions & 54 deletions
Original file line number | Diff line number | Diff line change
@@ -113,38 +113,44 @@ torchrun --standalone --nproc_per_node=8 train_gpt.py
113113
## Competition Strategy
114114

115115
**Merged leaderboard SOTA**: **1.0810 val_bpb** (bigbag, PR #1493, 2026-04-09) — UPDATED FROM 1.1147 (was stale)
116-
**Best open legal PRs (Apr 11)**:
117-
- PR #1541 (bigbag, **1.07785**): Improved Parallel Residuals (cross-lane learned scalars) + Muon 0.97 — ⚠️ hash embed flag pending clarification
116+
**Best open legal PRs (Apr 12 update)**:
117+
- PR #1560 (dexhunter, **1.07406**): VarLen Attention + Triton Fused MLP + Doc-TTT — appears legal (most recent best)
118+
- PR #1555 (andrewbaggio1, **1.07636**): TMA Megakernel + Improved Parallel Residuals + Tap-In min_match=1
119+
- PR #1541 (bigbag, **1.07785**): Improved Parallel Residuals (cross-lane learned scalars) + Muon 0.97 — ⚠️ hash embed flag pending
118120
- PR #1540 (aryanbhosale, **1.0777**): VarLen Attention + Doc-Independent LoRA TTT rank-96 + Triton TMA — appears legal
119-
- PR #1523 (EthanYangTW, 1.0778): Triple Recurrence + Banking + Fused MLP + Muon 0.97 — ⚠️ Eval-Time Hash Embedding may be flagged; PR #1514 (dexhunter, 1.07983) is cleaner
121+
- PR #1564 (joshkmartinez, **1.01710**): GDN-Hybrid (Gated DeltaNet + SWA), NO TTT/SLOT — extraordinary if verified; unreviewed
120122
**Best open with SLOT**: ~1.0766 val_bpb (PR #1333, aryanbhosale, Causal SLOT-16 on PR #1334 base) — no organizer rejection
121123
**Best open (illegal)**: 1.0632 (PR #1517, RulinShao, Pre-Quant TTT 18ep — same ruling as #1351/#1416)
122-
**Target**: Beat 1.0810 merged SOTA by >=0.005 nats → need **≤1.0760 bpb**. Best reachable: ~1.074–1.077 (legal stack). With SLOT: ~1.073–1.076.
124+
**Target**: Beat 1.0810 merged SOTA by >=0.005 nats → need **≤1.0760 bpb**. Best reachable: ~1.074–1.077 (legal). With SLOT: ~1.073–1.076. **18 days to deadline (Apr 30).**
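
The target arithmetic above can be checked in a few lines. This is a minimal sketch, not part of any PR: the scores are the ones quoted in these notes, the helper name `margin_vs_target` is hypothetical, and the 0.005 margin is assumed to be in the same units as the quoted scores (which is how the ≤1.0760 figure is derived).

```python
# Scores quoted in these notes (val_bpb, lower is better).
MERGED_SOTA = 1.0810
REQUIRED_MARGIN = 0.005  # minimum improvement over merged SOTA (assumed same units)

# Threshold a new record has to reach.
TARGET = round(MERGED_SOTA - REQUIRED_MARGIN, 4)

# Open PR scores as listed above.
open_prs = {
    "#1560": 1.07406,
    "#1555": 1.07636,
    "#1541": 1.07785,
    "#1540": 1.0777,
}


def margin_vs_target(score: float, target: float = TARGET) -> float:
    """Positive means the score clears the target; negative means it falls short."""
    return round(target - score, 5)


for pr, score in open_prs.items():
    m = margin_vs_target(score)
    status = "clears" if m >= 0 else "misses"
    print(f"PR {pr}: {score:.5f} {status} target {TARGET:.4f} by {abs(m):.5f}")
```

Under these numbers only PR #1560 is below the ≤1.0760 threshold; the other listed open PRs beat merged SOTA but not by the required margin.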
123125

124126
**CRITICAL LEGALITY UPDATES**:
125-
- **PR #771 REJECTED (2026-03-27)** — Our AdamW TTT 30ep was train-then-score, not score-first. All 30-epoch TTT results are void.
126-
- **N-gram hash cache ILLEGAL** — PRs #727, #741 closed. PRs #731, #758 still open but unresolved.
127-
- **N-gram Tilt IS LEGAL (PR #1420)** — Normalized via softmax partition function Z: `p_tilt(t) = p_model(t) · exp(β · 1[t==hint]) / Z`. Causal (backward-looking only). -0.0029 bpb, zero artifact cost. **⚠️ PR #1420's kernel has a causality bug — use PR #1437's corrected implementation.**
128-
- **PR #1423 ILLEGAL (2026-04-07)** — Pre-quant TTT, same ruling as #1351/#1408/#1416.
129-
- **Score-first TTT ≤3 epochs IS LEGAL** — PR #1413: all blocks, lr=0.005, 3ep. -0.003 bpb.
130-
- **Pre-quant TTT ILLEGAL (all variants)** — PR #1351, #1416, #1408. Do NOT use.
131-
- **SLOT δ-vector: Issue #140 CLOSED (Apr 6), NO organizer ban.** @valerio-oai NEVER commented in Issue #140. 9 record PRs use SLOT variants without rejection. @abaybektursun self-removed (causality concern) but no official rule. Causal SLOT-16 (PR #1333, 1.0766 BPB) is the current best open record claim. Scored-position SLOT (PR #1229) reached 0.9300 BPB. **RISK: causality concern unresolved; @valerio-oai could rule at any time on PRs. Implement only if willing to accept rejection risk.**
127+
- **PR #771 REJECTED (2026-03-27)** — Our AdamW TTT 30ep was train-then-score. All 30-epoch TTT results void.
128+
- **N-gram hash cache ILLEGAL** — PRs #727, #741 closed. PR #758 open but has major legality flags. PR #731 open (dense count tables + Laplace smoothing, reviewer says "LOOKS CLEAN", awaiting 3rd seed).
129+
- **N-gram Tilt IS LEGAL (PR #1420)** — Normalized via softmax Z. **⚠️ PR #1420 has causality bug — use PR #1437's corrected implementation.**
130+
- **Score-first TTT IS LEGAL** — ≤3ep confirmed (PR #1413). PR #1557 cites PR #1514 as precedent for 5ep — status uncertain; use ≤3ep to be safe.
131+
- **Pre-quant TTT ILLEGAL (all variants)** — PR #1351, #1416, #1408, #1423. Do NOT use.
132+
- **SLOT δ-vector: Issue #140 CLOSED (Apr 6), NO organizer ban.** @valerio-oai NEVER commented in Issue #140. 9 record PRs use SLOT. Risk remains. Implement only if willing to accept rejection risk.
132133
- **ETLB UNRULED** — PR #1399/#1415; no ruling; -0.0019 bpb standalone. Await before implementing.
133-
134-
**Current approach (PR #1420 stack + legal TTT)**:
135-
1. **SP8192 vocab** — beats SP4096 by ~0.009 bpb (PR #1420 vs #1334)
136-
2. **Triple Loop (17 virtual layers)** — layers 4-5 repeated 3× (not 2×), activated at 0.35× training
137-
3. **Parallel Residuals (layers 7-10)** — GPT-J style, faster forward pass, tighter GPTQ calibration
138-
4. **MuonEq-R optimizer** — arXiv:2603.28254; in PR #1334, #1344, #1420
139-
5. **4× MLP expansion** — vs 3× in older SOTA
140-
6. **XSA on all 11 layers** — exclusive self-attention
141-
7. **GPTQ int6 + WD=0.085** — Hessian-aware quantization; SDClip variant in PR #1420
142-
8. **QK-Gain 5.0** — from PR #1334/#1420
143-
9. **N-gram Tilt** — -0.0029 bpb, legal, zero artifact cost — use **PR #1437 kernel** (not #1420, which has a causality bug)
144-
10. **Legal Score-First TTT (post-quant only)** — all blocks, 3ep, lr=0.005, score-first (PR #1413)
145-
11. **Fused Kernels** — Triton TMA (forward) + CUTLASS 3.x (backward), +10% throughput (+127 steps) — add last, complex
146-
147-
**Key reference PRs**: #1019 (merged SOTA 1.1147), #1333 (1.0766, Causal SLOT-16, open record — best with SLOT), #1437 (1.08091, causal-fixed N-gram Tilt kernel — use this), #1420 (1.08014, N-gram Tilt has causality bug), #1413 (1.08279, SP8192+Legal TTT), #1334 (1.0897, cleanest arch reference), #1229 (0.9300, scored-position SLOT, open), #1370 (1.003, Gated DeltaNet, non-record)
134+
- **GDN-Hybrid (PR #1564)**: No legality concerns (pure architecture, no TTT/SLOT). If the organizers approve it, it becomes the new gold standard at 1.01710.
135+
- **VarLen Attention + Doc-TTT (PR #1560)**: No legality flags — per-document masking is architectural, score-first TTT per-doc.
136+
- **Tap-In unigram matching (PR #1555)**: Legality UNCONFIRMED — verify before implementing (may be similar to n-gram approaches).
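
The "normalized via Z" condition on N-gram Tilt can be illustrated with a toy distribution. This is a minimal sketch of the formula `p_tilt(t) = p_model(t) · exp(β · 1[t==hint]) / Z` quoted in these notes, not the PR #1437 kernel: the function names, the example probabilities, and `beta=0.7` are all hypothetical, and the hint is assumed to come only from earlier tokens.

```python
import math


def tilt_distribution(probs, hint, beta):
    """Boost the hinted token by exp(beta), then renormalize so the result sums to 1."""
    weights = [p * math.exp(beta) if t == hint else p for t, p in enumerate(probs)]
    z = sum(weights)  # partition function Z
    return [w / z for w in weights]


def closed_form_z(p_hint, beta):
    """Z = 1 + p_hint * (exp(beta) - 1), valid when the input probs already sum to 1."""
    return 1.0 + p_hint * (math.exp(beta) - 1.0)


# Toy next-token distribution over a 3-token vocabulary (illustrative values).
p_model = [0.5, 0.3, 0.2]
beta = 0.7   # hypothetical tilt strength
hint = 1     # token index suggested by a backward-looking n-gram match

p_tilt = tilt_distribution(p_model, hint, beta)
print("tilted:", [round(p, 4) for p in p_tilt])
print("sum:", sum(p_tilt))
```

Because only one token is boosted, Z has the closed form shown in `closed_form_z`, so the renormalization needs only the model's probability of the hinted token rather than a second pass over the vocabulary.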
137+
138+
**Current best-stack approach (PR #1493 base + incremental adds)**:
139+
1. **SP8192 vocab** — beats SP4096 by ~0.009 bpb
140+
2. **Triple Loop (17 virtual layers)** — layers 4-5 repeated 3×, activated at 0.35× training
141+
3. **Parallel Residuals (layers 7-10)** — GPT-J style
142+
4. **MuonEq-R optimizer** — arXiv:2603.28254
143+
5. **4× MLP expansion**
144+
6. **GPTQ Embeddings (int8) + SDClip** — saves ~4MB artifact budget (PR #1394)
145+
7. **QK-Gain 5.25** — up from 5.0 (PR #1493)
146+
8. **WD=0.095, EMA=0.9965, warmdown=0.72** — tuned hypers from PR #1493
147+
9. **N-gram Tilt** — use PR #1437 corrected kernel only
148+
10. **Legal Score-First TTT (post-quant, ≤3ep)** — lr=0.005, all blocks
149+
11. **VarLen Attention (per-document causal masking)** — PR #1560, ~-0.007 bpb — **add next**
150+
12. **Doc-TTT (per-document score-first TTT)** — PR #1560, chunk size=48, Muon 0.97 — **add next**
151+
13. **TMA Megakernel (Triton TMA fused MLP)** — PR #1555, +10.5% throughput = ~200 extra steps — add after base validated
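
The score-first ordering behind items 10 and 12 can be shown on a toy model. This is a minimal sketch assuming a one-parameter Bernoulli predictor over binary tokens: `score_first_ttt`, the chunk data, and `lr=0.5` are hypothetical and chosen for illustration only (the notes quote lr=0.005 for the real stack), while `epochs=3` mirrors the ≤3ep limit above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def nll_bits(p_one, tokens):
    """Total negative log-likelihood in bits of binary tokens under P(token=1)=p_one."""
    return -sum(math.log2(p_one if t == 1 else 1.0 - p_one) for t in tokens)


def score_first_ttt(chunks, logit=0.0, lr=0.5, epochs=3):
    """Score each chunk with the current parameter, and only then adapt on it."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Score: the loss is recorded before the model adapts to this chunk.
        total_bits += nll_bits(sigmoid(logit), chunk)
        total_tokens += len(chunk)
        # 2) Adapt: a few gradient steps on the chunk that was just scored.
        for _ in range(epochs):
            p = sigmoid(logit)
            grad = sum(p - t for t in chunk) / len(chunk)  # d(mean NLL)/d(logit)
            logit -= lr * grad
    return total_bits / total_tokens, logit


chunks = [[1, 1, 1, 0]] * 4  # toy stream, 75% ones
bits_per_token, final_logit = score_first_ttt(chunks)
print(f"bits/token: {bits_per_token:.4f}, final logit: {final_logit:.4f}")
```

The point of the ordering is that adaptation on a chunk can only help later chunks: with a single chunk the reported score is identical to the unadapted model's score.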
152+
153+
**Key reference PRs**: #1493 (merged SOTA 1.0810), #1560 (1.07406, best open safe — VarLen+Doc-TTT), #1564 (1.01710, GDN-Hybrid, extraordinary — monitor), #1555 (1.07636, TMA Megakernel+Tap-In), #1333 (1.07660, Causal SLOT-16 — risky), #1437 (1.08091, causal-fixed N-gram Tilt kernel — use this), #1413 (1.08279, SP8192+Legal TTT), #1334 (1.0897, arch reference), #1229 (0.9300, scored-position SLOT, open)
148154

149155
**Abandoned approaches**: Training-time static LoRA TTT (hurts), product quantization (SWA-incompatible), custom Triton kernels (poor EV — REVERTED: PR #1420 shows +10% via Triton TMA, revisit after base works), int4 without QAT (quality-destructive), eval stride=32 (time budget), AdamW TTT 30ep (illegal), n-gram hash cache (illegal), pre-quant TTT any form (illegal), Eval-Time Hash Embedding trained at inference (suspect illegal — same adapt-then-score pattern), Tap-In V6 document-local matching (await ruling).
150156
**NOTE**: Doc-Independent LoRA TTT (PR #1540, rank-96, resets per batch, score-first) is categorically DIFFERENT from abandoned LoRA TTT and appears legal — consider adopting.
@@ -155,42 +161,41 @@ torchrun --standalone --nproc_per_node=8 train_gpt.py
155161

156162
| Technique | Approx Δ bpb | Status |
157163
|-----------|-------------|--------|
158-
| **Pre-quant TTT (any form, before GPTQ)** | | **ILLEGAL: PR #1351, #1408, #1416 all illegal; pre-eval adaptation** |
159-
| **Standard SLOT δ-vector (arXiv:2505.12392)** | **-0.021** | **DE FACTO IN USE — Issue #140 CLOSED (Apr 6); 9 record PRs use SLOT variants; no organizer rejection. @valerio-oai never ruled in #140. @abaybektursun self-removed (causality concern) but no ban.** |
160-
| **Causal SLOT-16 (scored-position delta only)** | **-0.009** | **DE FACTO IN USE — PR #1333 (aryanbhosale, 1.0766 BPB, open record); PR #1229 (scored-position SLOT, 0.9300 BPB). No organizer rejection.** |
164+
| **Pre-quant TTT (any form, before GPTQ)** | | **ILLEGAL: PR #1351, #1408, #1416, #1423 all illegal** |
165+
| **Standard SLOT δ-vector (arXiv:2505.12392)** | **-0.021** | **DE FACTO IN USE — Issue #140 CLOSED (Apr 6); 9 record PRs use SLOT variants; no organizer rejection** |
166+
| **Causal SLOT-16 (scored-position delta only)** | **-0.009** | **DE FACTO IN USE — PR #1333 (aryanbhosale, 1.0766 BPB, open record); PR #1229 (0.9300 BPB). No organizer rejection.** |
161167
| **Scored-Position SLOT (PR #1229)** | **~-0.18 vs base** | **Extraordinary — 0.9300 BPB; no organizer rejection; causality concern still present** |
162-
| **ETLB (Eval-Time Logit Bias)** | **-0.0019** | **UNRULED — PR #1399/#1415; no ruling from @valerio-oai; await before implementing** |
168+
| **ETLB (Eval-Time Logit Bias)** | **-0.0019** | **UNRULED — PR #1399/#1415; await before implementing** |
163169
| **N-gram Tilt (PR #1437 kernel)** | **-0.0029** | **LEGAL — properly normalized via Z; causal; zero artifact cost. PR #1420 has causality bug — use PR #1437** |
164-
| **Triple Loop (3× depth recurrence)** | **~-0.009 vs 2×** | **PRIMARY — PR #1420 (1.08014); 17 virtual layers; activate at 0.35× training** |
165-
| **SP8192 vocab** | **~-0.009 vs SP4096** | **PRIMARY — PR #1420/#1413; use over SP4096** |
166-
| **Fused Kernels (Triton TMA + CUTLASS 3.x)** | **+127 steps (~-0.002)** | **Legal — PR #1420; add last, complex; Triton TMA forward + CUTLASS backward** |
167-
| **Legal Score-First TTT (all blocks, 3ep)** | **-0.003** | **Legal — PR #1413; lr=0.005, inference_mode scoring before update** |
168-
| **Depth Recurrence + Parallel Residuals** | **~-0.015 vs baseline** | **In plan — PR #1334 (1.0897); upgrade to Triple Loop from PR #1420** |
169-
| **MuonEq-R optimizer** | **~-0.005** | **In plan — arXiv:2603.28254; PR #1334, #1420** |
170-
| **QK-Gain 5.0** | **~-0.006** | **In plan — PR #1334, #1420** |
171-
| **4× MLP expansion** | **~-0.01** | **In plan — PR #1218, #1334** |
170+
| **VarLen Attention + Doc-TTT** | **~-0.007** | **LEGAL — PR #1560 (dexhunter, 1.07406 BPB); per-document causal masking + score-first TTT per-doc; LoRA chunk=48** |
171+
| **TMA Megakernel (Triton Hopper fused MLP)** | **+200 steps (~-0.002)** | **LEGAL — PR #1555; +10.5% throughput; add after base validated** |
172+
| **Tap-In Unigram Matching (min_match=1)** | **~-0.009** | **LEGALITY UNCONFIRMED — PR #1555; 21% activation rate; verify before implementing** |
173+
| **GDN-Hybrid Architecture (Gated DeltaNet + SWA)** | **~-0.064 vs merged SOTA** | **LEGAL (safe, no TTT) — PR #1564 (joshkmartinez, 1.01710 BPB, OPEN, unreviewed); 5 GDN layers + SWA; SP1024 base; extraordinary if verified** |
174+
| **Triple Loop (3× depth recurrence)** | **~-0.009 vs 2×** | **IN MERGED SOTA — PR #1493 (1.0810); 17 virtual layers; activate at 0.35× training** |
175+
| **SP8192 vocab** | **~-0.009 vs SP4096** | **IN MERGED SOTA — PR #1493** |
176+
| **GPTQ Embeddings (int8) + SDClip** | **~-0.003 + artifact** | **IN MERGED SOTA — PR #1394; saves ~4MB artifact budget** |
177+
| **QK-Gain 5.25** | **~-0.001 vs 5.0** | **IN MERGED SOTA — PR #1493** |
178+
| **Legal Score-First TTT (all blocks, 3ep)** | **-0.003** | **IN MERGED SOTA — PR #1413; lr=0.005; ≤3ep safe; 5ep cited in PR #1557 (refs #1514) — use ≤3ep to be safe** |
179+
| **Parallel Residuals (layers 7-10)** | **~-0.008** | **IN MERGED SOTA — PR #1493** |
180+
| **MuonEq-R optimizer** | **~-0.005** | **IN MERGED SOTA — arXiv:2603.28254** |
181+
| **4× MLP expansion** | **~-0.01** | **IN MERGED SOTA** |
172182
| SP4096 vocab | ~-0.02 vs SP1024 | Superseded by SP8192 |
173183
| Sliding window eval (stride=64) | -0.032 | In SOTA |
174-
| AR Self-Gen GPTQ calibration | ~-0.005 | In merged SOTA (PR #1019) |
175-
| XSA (all 11 layers) | -0.002 to -0.005 | In merged SOTA |
176-
| EMA decay 0.9965 (vs 0.997) | ~-0.002 | PR #1421 (1.0925); tighter GPTQ calibration |
184+
| AR Self-Gen GPTQ calibration | ~-0.005 | In older merged SOTA (PR #1019) |
185+
| XSA (all 11 layers) | -0.002 to -0.005 | In older merged SOTA |
186+
| EMA decay 0.9965 (vs 0.997) | ~-0.002 | In merged SOTA (PR #1493 uses 0.9965) |
177187
| 3× MLP expansion | -0.015 | In older SOTA |
178188
| Int6 QAT | -0.010 | In SOTA |
179189
| SmearGate + BigramHash(4096) | -0.006 | In older SOTA |
180190
| Value Residual (ResFormer) | -0.005 to -0.017 | In older SOTA |
181-
| 11 layers | -0.003 | In SOTA |
182-
| EMA (0.997) + SWA (every 50) | -0.002 | In SOTA |
183-
| Partial RoPE (16/64) + LN Scale | -0.002 | In SOTA |
184-
| Gated DeltaNet (PR #1370) | ~-0.11 vs baseline | Non-record (>10 min); O(n) linear attention |
185-
| **MuonEq-R (arXiv:2603.28254)** | **~-0.005** | **NOW — drop-in Muon swap; normalize row norms before Newton-Schulz; O(m+n) overhead; zero artifact cost** |
186-
| **Cooldown+QAT fusion (arXiv:2509.22935)** | **~-0.002** | **NOW — do LR decay jointly with QAT activation; no artifact size change; Apple ML Research** |
187-
| **LaCT large-chunk TTT (arXiv:2505.23884)** | GPU util 0→70% | Target — better hardware use for post-quant TTT; code at github.com/a1600012888/LaCT |
188-
| **SGT sparse depth recurrence (arXiv:2603.23998)** | saves FLOP budget | Watch — reduces Triple Loop FLOP overhead 16-20% → 1-3% |
189-
| **Early-exit depth recurrence (arXiv:2509.23314)** | saves eval budget | Watch — skip loop iterations when step-size delta below threshold |
190-
| Newton-Muon (arXiv:2604.01472) | ~+4-6% steps | WATCH — Apr 2026, untested; try after MuonEq-R confirmed |
191-
| MUD/MomentUm Decorrelation (arXiv:2603.17970) | +20-50% throughput | WATCH — Mar 2026; replaces Newton-Schulz with triangular Cholesky whitening; 1.3–2.6× tokens/sec vs Muon; lower per-step quality than MuonEq-R TBD |
192-
| Mousse (arXiv:2603.09697) | ~-0.002 to -0.003 | WATCH — Mar 2026; Kronecker-factored preconditioning for Muon; ~12% fewer steps; overhead risk at H100 scale |
193-
| Infini-gram interpolation (arXiv:2401.17377) | large but legal unclear | WATCH — suffix array ∞-gram, normalized; legal if score-first; high impl cost |
191+
| Gated DeltaNet (PR #1370) | ~-0.11 vs baseline | **Non-record (>10 min) — but PR #1564 GDN-Hybrid claims 10-min compliance at 1.01710** |
192+
| **Cooldown+QAT fusion (arXiv:2509.22935)** | **~-0.002** | **WATCH — LR decay jointly with QAT; no artifact size change** |
193+
| **LaCT large-chunk TTT (arXiv:2505.23884)** | GPU util 0→70% | WATCH — PR #1560 Doc-TTT may be LaCT-style; dexhunter already implementing |
194+
| **SGT sparse depth recurrence (arXiv:2603.23998)** | saves FLOP budget | Watch — reduces Triple Loop FLOP overhead |
195+
| Newton-Muon (arXiv:2604.01472) | ~+4-6% steps | WATCH — Apr 2026, untested |
196+
| MUD/MomentUm Decorrelation (arXiv:2603.17970) | +20-50% throughput | WATCH — triangular Cholesky whitening; 1.3–2.6× tokens/sec vs Muon |
197+
| Mousse (arXiv:2603.09697) | ~-0.002 to -0.003 | WATCH — Kronecker-factored preconditioning for Muon; ~12% fewer steps |
198+
| Infini-gram interpolation (arXiv:2401.17377) | large but legal unclear | WATCH — suffix array ∞-gram, normalized |
194199
| AdamW TTT (30 ep, train-then-score) | | **ILLEGAL (PR #771 rejected)** |
195200
| N-gram hash cache | | **ILLEGAL (normalization, Issue #1017)** |
196201
| LoRA TTT | **+0.004 (HURTS)** | **Abandoned** |
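
The "per-document causal masking" in the VarLen Attention row can be sketched as follows. This is a minimal illustration in plain Python, not PR #1560's kernel: it assumes documents in a packed sequence are delimited by a BOS token, and the names `doc_ids_from_tokens`, `per_document_causal_mask`, `cu_seqlens`, and `bos_id` are hypothetical.

```python
def doc_ids_from_tokens(tokens, bos_id):
    """Assign each position a document index; a new document starts at every bos_id."""
    ids, current = [], -1
    for tok in tokens:
        if tok == bos_id or current < 0:
            current += 1
        ids.append(current)
    return ids


def per_document_causal_mask(doc_ids):
    """mask[q][k] is True when query q may attend to key k: same document and k <= q."""
    n = len(doc_ids)
    return [[k <= q and doc_ids[k] == doc_ids[q] for k in range(n)] for q in range(n)]


def cu_seqlens(doc_ids):
    """Cumulative document boundaries, a layout variable-length kernels commonly take."""
    bounds = [0]
    for i in range(1, len(doc_ids)):
        if doc_ids[i] != doc_ids[i - 1]:
            bounds.append(i)
    bounds.append(len(doc_ids))
    return bounds


# Toy packed sequence: three documents, each starting with BOS (id 0, assumed).
tokens = [0, 5, 6, 0, 7, 0, 8, 9]
ids = doc_ids_from_tokens(tokens, bos_id=0)
mask = per_document_causal_mask(ids)
print("doc ids:   ", ids)
print("cu_seqlens:", cu_seqlens(ids))
```

A dense mask like this costs O(n²) memory; the boundary list carries the same information in O(number of documents), which is why variable-length kernels usually consume boundaries rather than a mask.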
@@ -309,3 +314,14 @@ Every change must answer: "Does this lower val_bpb within the 16MB/10-min constr
309314
53. **MATRIX_LR = 0.03 pairs with Muon momentum 0.97.** Both PRs #1541 and #1523 co-tune these. When reducing momentum from 0.99 → 0.97, also reduce MATRIX_LR. Check whether our base config uses 0.03 or 0.05.
310315

311316
_Updated: 2026-04-11 (v11.5 — PR #1541 bigbag 1.07785 + PR #1540 aryanbhosale 1.0777 new open PRs; doc-independent LoRA TTT appears legal; PR #1545 BPB bug; MATRIX_LR 0.03 pairs with momentum 0.97; no merged SOTA change)_
317+
### Session 11 (2026-04-12)
318+
54. **Merged SOTA jumped from 1.1147 to 1.0810 in 5 days.** Six PRs merged between Apr 4–9 (PRs #1334, #1285, #1394, #1412, #1413, #1477, #1493). The competition accelerated dramatically. Check leaderboard every session before planning — yesterday's target may already be beaten.
319+
55. **The merged SOTA stack is now fully defined: SP8192 + Triple Recurrence + Parallel Residuals + QK-Gain 5.25 + GPTQ Emb (int8) + SDClip + WD=0.095 + EMA 0.9965 + Legal TTT.** PR #1493 (bigbag) at 1.0810. Any new submission must beat this cleanly. Target: ≤1.0760.
320+
56. **VarLen Attention (per-document masking) is the next clear win.** PR #1560 (dexhunter) achieves 1.07406 BPB by adding per-document causal masking + Doc-TTT (per-document score-first LoRA TTT, chunk=48) on top of the PR #1413 stack. That is about -0.009 bpb vs the PR #1413 stack (1.08279) and about -0.007 bpb vs merged SOTA (1.0810). Implement this next.
321+
57. **GDN-Hybrid (PR #1564) at 1.01710 BPB is extraordinary — watch closely.** Gated DeltaNet + SWA architecture, no TTT/SLOT, SP1024. If organizers approve, this represents a ~0.064 bpb architectural leap with no eval-time techniques. Do not implement until organizer review; replicate if approved.
322+
58. **TMA Megakernel (Triton Hopper) gives +200 training steps.** PR #1555 shows +10.5% throughput on H100 via TMA-fused MLP kernel. Worth implementing after VarLen+Doc-TTT is verified. Combined with Tap-In (min_match=1, 21% activation), PR #1555 reaches 1.07636.
323+
59. **Do NOT implement Tap-In before verifying legality.** "Tap-In Unigram Matching" from PR #1555 activates at 21% of positions vs 1.7% at min_match=3. Mechanism involves token-level unigram cache — may be similar to n-gram approaches. Verify it's properly normalized before GPU spend.
324+
60. **PR #731 n-gram is now looking clean.** Dense count tables + Laplace smoothing (not hash caches). Reviewer said "LOOKS CLEAN" — waiting on seeds 1337 and 2024 to confirm 1.0400 BPB. If merged, this gives a legal n-gram mixer alternative.
325+
61. **18 days remain. Prioritize safe incremental improvements over risky architecture rewrites.** VarLen+Doc-TTT (PR #1560 approach) is the lowest-risk path to beating the target. File that first, then consider GDN-Hybrid rewrite if approved.
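
For lesson 60, the difference between a dense count table with Laplace smoothing and a hash cache is that every context row yields a proper distribution. This is a minimal bigram sketch under stated assumptions, not PR #731's implementation: the function names, the tiny vocabulary, and `alpha=1.0` are hypothetical, and counts are updated only after a token has been scored.

```python
import math


def make_counts(vocab_size):
    """Dense vocab x vocab table of bigram counts, initialised to zero."""
    return [[0] * vocab_size for _ in range(vocab_size)]


def laplace_probs(counts, prev, alpha=1.0):
    """Add-alpha smoothed next-token distribution for context `prev`; sums to 1."""
    row = counts[prev]
    total = sum(row) + alpha * len(row)
    return [(c + alpha) / total for c in row]


def score_then_update(tokens, vocab_size, alpha=1.0):
    """Score each token with counts from earlier positions only, then add it to the table."""
    counts = make_counts(vocab_size)
    bits = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p = laplace_probs(counts, prev, alpha)[nxt]
        bits += -math.log2(p)
        counts[prev][nxt] += 1  # update after scoring, so the estimate stays causal
    return bits / (len(tokens) - 1), counts


tokens = [0, 1, 0, 1, 0, 1]  # toy alternating stream
bits_per_token, counts = score_then_update(tokens, vocab_size=2)
print(f"bits/token: {bits_per_token:.4f}")
```

With an empty table the smoothed estimate falls back to uniform, so unseen contexts never receive zero probability and the per-token loss stays finite.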
326+
327+
_Updated: 2026-04-12 (v12.1 — merged SOTA 1.0810 (PR #1493, Apr 9); 6 new merges; GDN-Hybrid 1.01710 open; VarLen+Doc-TTT 1.07406 open; target ≤1.0760; 18 days remaining)_
