- PR #1758 (kilojoules, **1.02840**): PR #1738 + Pre-Quant TTT LR=1e-3 + Unfrozen — **⚠️ LIKELY ILLEGAL** (pre-quant TTT, same adapt-then-score pattern as PR #1351/#1408/#1416/#1735). No reviews. Do NOT track.
- PR #1698 (arsenis-cmd, **~~1.00995~~**): GatedDeltaNet (FLA) — **EFFECTIVELY DEAD**: BPB bug confirmed by dexhunter (double-count in `build_sentencepiece_luts`, actual ~1.189 BPB) + artifact size violation (16.47–16.60MB vs 16MB decimal limit). No organizer response; author lost GPU access. Do NOT track.
- PR #1738 (alertcat, **1.03540**): CaseOps V15 + PR #1735 Pre-Quant TTT — **⚠️ BUILDS ON ILLEGAL PR #1735** (pre-quant AdamW TTT 21ep, flagged by dexhunter). No reviews. Score likely void once PR #1735 is rejected.
- PR #1647 (powerpratik, **1.0616**): SLOT-4 + TTT + 3-Layer Recurrence + Parallel Residuals — ⚠️ standard SLOT, no reviews
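The BPB bug noted for PR #1698 above comes down to the denominator of the metric. A minimal sketch of how bits-per-byte is usually computed, and of why counting some bytes twice makes a score look better than it is. The numbers are hypothetical, and the sketch does not reproduce the actual `build_sentencepiece_luts` code.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical values, chosen only to show the direction of the error.
nll_nats = 1_000_000.0
true_bytes = 1_200_000
inflated_bytes = true_bytes + 216_000  # some bytes counted twice

correct_bpb = bits_per_byte(nll_nats, true_bytes)
buggy_bpb = bits_per_byte(nll_nats, inflated_bytes)

# An inflated byte count always lowers the reported BPB.
print(f"correct={correct_bpb:.4f} buggy={buggy_bpb:.4f}")
```

Because the byte count sits in the denominator, any overcount lowers the reported BPB without the model improving.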
**Best open with SLOT**: ~1.0616 val_bpb (PR #1647, powerpratik, SLOT-4) — no reviews yet
**Best open (illegal)**: 1.0632 (PR #1517, RulinShao, Pre-Quant TTT 18ep — same ruling as #1351/#1416)
-
**Target**: Beat 1.0810 merged SOTA by >=0.005 nats → need **≤1.0760 bpb**. Best reachable (legal, no CaseOps): ~1.068–1.072 (legal stack #1586+#1667+#1727+#1560). With CaseOps if ruled legal: ~1.065 (PR #1736). **11 days to deadline (Apr 30).**
+
**Target**: Beat 1.0810 merged SOTA by >=0.005 nats → need **≤1.0760 bpb**. Best reachable (legal, no CaseOps): ~1.068–1.072 (legal stack #1586+#1667+#1727+#1560). With CaseOps if ruled legal: ~1.065 (PR #1736/#1756). **9 days to deadline (Apr 30). Issue #1604 self-imposed deadline: Apr 24 — act without ruling if no response by then.**
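The threshold arithmetic in the target line can be checked with a short sketch. It applies the 0.005 margin directly to val_bpb, as the target line does. The function name and the epsilon are illustrative assumptions, and the check is purely numerical: it says nothing about legality.

```python
MERGED_SOTA = 1.0810      # merged SOTA val_bpb quoted above
REQUIRED_MARGIN = 0.005   # margin quoted above, applied to val_bpb as in the text

def beats_sota(val_bpb: float,
               sota: float = MERGED_SOTA,
               margin: float = REQUIRED_MARGIN) -> bool:
    """True if val_bpb clears the SOTA by at least the required margin."""
    # The small epsilon guards against float rounding at the boundary.
    return val_bpb <= sota - margin + 1e-12

threshold = round(MERGED_SOTA - REQUIRED_MARGIN, 4)

# Scores quoted in these notes; numerical check only.
candidates = {"#1727": 1.07217, "#1755": 1.07462, "#1756": 1.06505}
clears = {pr: beats_sota(score) for pr, score in candidates.items()}
print(threshold, clears)
```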
**CRITICAL LEGALITY UPDATES**:
- **PR #771 REJECTED (2026-03-27)**: Our AdamW TTT 30ep was train-then-score. All 30-epoch TTT results are void.
93. **Merged SOTA plateau is now 10 days (Apr 9 → Apr 19); the deadline is 11 days away.** With 8+ open PRs in the 1.062–1.078 range, a merge wave is overdue. Implementing #1586 immediately is critical: every day without it is wasted headroom.
_Updated: 2026-04-19 (v14.0 — PR #1698 GDN effectively dead (BPB bug ~1.189 + artifact violation); CaseOps bijective tokenizer new community technique (#1729/#1736/#1738); PR #1735 pre-quant TTT flagged illegal; PR #1727 MP-SGD TTT 4-phase appears legal at 1.07217; merged SOTA 1.0810 Day 10; 11 days remaining)_
### Session 18 (2026-04-21)
94. **Pre-quant TTT pattern (PR #1758, 1.02840) is the 6th attempt and is still illegal.** kilojoules tuned PR #1738's pre-quant TTT with LR=1e-3 and unfrozen blocks to reach 1.02840. No reviews filed; no organizer response. Same adapt-then-score violation as PR #1351/#1408/#1416/#1423/#1735. The community keeps submitting; organizers keep rejecting. Do not track.
95. **Recurrence Depth Curriculum (PR #1756, romeerp, 1.06505) is a new technique with theoretical backing.** Three-phase training schedule: depth 1 (first third) → depth 3 (second third) → depth 4 (final third), with eval fixed at depth 4. Grounded in arXiv:2511.07384 (retrofitted recurrence curriculum). Two issues: (a) it awaits the Issue #1604 CaseOps ruling; (b) `prepare_caseops_data.py` is missing BOS insertion, which causes a ZeroDivisionError in the phased TTT eval path (flagged by @codemath3000 on Apr 21). Watch for a fix. If CaseOps is ruled legal and the bug is fixed, this is a strong stack candidate.
96. **Parcae (arXiv:2604.12946, UCSD + Together AI, Apr 16) directly addresses our Triple Loop instability.** Looped models suffer residual explosion from large spectral norms in injection parameters. Parcae constrains the spectral norm via negative diagonal parameterization, achieving the quality of a transformer **2× the size** at equal parameter count. Our PR #1493 Triple Loop (layers 4-5 × 3, activated at 0.35×) may be leaving performance on the table due to this same instability. If time permits after #1586+#1667+#1560 are validated, investigate adding a Parcae-style spectral norm constraint on loop injection weights. GitHub: github.com/sandyresearch/parcae.
97. **Attention Output Gate (PR #1667) is backed by NeurIPS 2025 research (arXiv:2505.06708).** A head-specific sigmoid gate on the SDPA output breaks the low-rank bottleneck of consecutive Wv/Wo projections, yielding up to 0.2 PPL reduction. The multiplicative form (as in PR #1667) is optimal. This confirms our implementation target is theoretically sound. Implement with #1586.
98. **Issue #1604 (CaseOps/casefold ruling) has been open 8 days with no @valerio-oai response.** Self-impose a deadline: if there is no ruling by Apr 24 (6 days before competition close), proceed with the clean legal stack (#1586+#1667+#1560+#1727) rather than waiting. CaseOps has a stronger legal argument than casefold (bijective, lossless, BPB on original bytes), but the clock is running.
99. **The merged SOTA plateau (Day 12) is now the longest in competition history.** No merges since Apr 9. The 8+ open PRs between 1.062 and 1.078 remain unreviewed. The organizers may be reviewing multiple PRs in batch. Expect a wave when rulings come (especially Issue #1604). Check the leaderboard at the start of every session.
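The distinction drawn in item 98 between a bijective case transform and casefolding can be illustrated with a toy example. This is only a sketch of the general idea: the marker character and both function names are hypothetical and are not taken from any CaseOps PR.

```python
MARK = "\x01"  # hypothetical escape marker, not taken from any PR

def encode_case(text: str) -> str:
    """Lowercase ASCII capitals, recording each one with a marker."""
    out = []
    for ch in text:
        if ch == MARK:
            out.append(MARK + MARK)        # escape literal markers
        elif "A" <= ch <= "Z":
            out.append(MARK + ch.lower())  # marker + lowercase letter
        else:
            out.append(ch)
    return "".join(out)

def decode_case(encoded: str) -> str:
    """Invert encode_case exactly (valid encodings only)."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == MARK:
            nxt = encoded[i + 1]
            out.append(MARK if nxt == MARK else nxt.upper())
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# Casefolding merges distinct inputs; the marker scheme does not.
lossy_collision = "Hello".casefold() == "hello".casefold()
lossless_distinct = encode_case("Hello") != encode_case("hello")
```

Because every input can be recovered exactly, a score computed after decoding still refers to the original bytes, which is the property the item above points to.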
_Updated: 2026-04-21 (v15.1 — merged SOTA 1.0810 Day 12; PR #1758 pre-quant TTT 1.02840 likely illegal; PR #1756 CaseOps+Recurrence Depth Curriculum 1.06505 has BOS bug + Issue #1604 pending; PR #1755 CaseOps+Legal TTT 1.07462 pending Issue #1604; Parcae arXiv:2604.12946 relevant to Triple Loop stability; Attention Output Gate backed by arXiv:2505.06708; Issue #1604 self-deadline Apr 24; 9 days remaining)_
|#758| 1.0465 |**OPEN (effectively dead)**| Apr 12: XOR hash key includes target token, same violation as #727|
|#731| 1.0400 |**OPEN — awaiting seeds 1337 + 2024**| Reviewer "LOOKS CLEAN". Dense count + Laplace, score-first per chunk. No movement since Apr 17. **9 days to deadline — if no seeds by Apr 24, this PR is unlikely to merge.**|
**Issue #1604 (CaseOps/casefold legality)**: STILL OPEN. No @valerio-oai comment as of Apr 21. **Ruling deadline self-imposed: Apr 24.** If no ruling by then, proceed without CaseOps.
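The self-imposed cutoff can be written down as a small decision rule. A minimal sketch: the year is taken from the update stamps in these notes, and the function names and return strings are illustrative only.

```python
from datetime import date
from typing import Optional

DEADLINE = date(2026, 4, 30)       # competition close
RULING_CUTOFF = date(2026, 4, 24)  # self-imposed cutoff for Issue #1604

def days_left(today: date, deadline: date = DEADLINE) -> int:
    """Whole days remaining until the deadline."""
    return (deadline - today).days

def caseops_plan(today: date, ruled_legal: Optional[bool]) -> str:
    """ruled_legal is None while Issue #1604 has no organizer ruling."""
    if ruled_legal is True:
        return "add CaseOps"
    if ruled_legal is False or today >= RULING_CUTOFF:
        return "proceed without CaseOps"
    return "wait for ruling"

print(days_left(date(2026, 4, 21)), caseops_plan(date(2026, 4, 21), None))
```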
### New technique: Recurrence Depth Curriculum (PR #1756)
romeerp introduces a three-phase training schedule for depth recurrence:
- Phase 1 (first third of training): loop depth = 1
- Phase 2 (second third): loop depth = 3
- Phase 3 (final third): loop depth = 4
- Evaluation: always at depth 4
Hypothesis: "teach a useful shallow refinement operator first" before requiring deeper recurrence. This is consistent with arXiv:2511.07384 (retrofitted recurrence curriculum). **If CaseOps is ruled legal and the BOS reproducibility bug is fixed, this technique is worth stacking on our base.**
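The three-phase schedule listed above can be sketched as a step-to-depth lookup. This is a minimal illustration of the schedule as described in these notes, not PR #1756's actual code; `loop_depth` and its arguments are hypothetical names.

```python
EVAL_DEPTH = 4  # evaluation always runs at depth 4

def loop_depth(step: int, total_steps: int, depths=(1, 3, 4)) -> int:
    """Return the training loop depth for a step, using equal-length phases."""
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # Integer arithmetic avoids float drift at the phase boundaries.
    phase = min(step * len(depths) // total_steps, len(depths) - 1)
    return depths[phase]

# Example with a hypothetical 900-step run: boundaries fall at steps 300 and 600.
schedule = [loop_depth(s, 900) for s in (0, 299, 300, 599, 600, 899)]
```

The final training phase uses the same depth as evaluation, so the model ends training in the configuration it is scored in.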
**UCSD + Together AI.** Addresses instability in looped LMs caused by residual explosion (large spectral norms in injection parameters). Solution: constrain spectral norm via "negative diagonal parameterization" of injection parameters, recast as a nonlinear time-variant dynamical system.
Key results:
- 6.3% lower val perplexity vs prior looped models at same parameter count
- Achieves quality of transformer **2× the size**
- At 1.3B params: +2.99/+1.18 CORE/Core-Extended points vs Transformer baseline under fixed budget
- Predicts looping and training data should scale **in tandem** (not independently)
**Relevance to Parameter Golf**: Our Triple Loop architecture (layers 4-5 repeated 3×, activated at 0.35× training, from PR #1493) may suffer from residual explosion instability. Parcae's spectral norm constraint on injection parameters could stabilize our loops and allow deeper/more aggressive recurrence. Implementation complexity: moderate (add norm constraint to loop injection weights). **Watch for competition PRs implementing Parcae stabilization on the SP8192 stack.** GitHub: github.com/sandyresearch/parcae.
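To make the residual explosion concern concrete, here is a toy scalar version of a looped update `h = decay * h + inject`. It is not Parcae's formulation. It only assumes, for illustration, that a decay written as the exponential of a negative number stays strictly inside (0, 1), which keeps the loop bounded however many times it runs. All names are hypothetical.

```python
import math

def softplus(x: float) -> float:
    return math.log1p(math.exp(x))

def constrained_decay(theta: float) -> float:
    """Map any real theta to a decay in (0, 1) via exp of a negative number."""
    return math.exp(-softplus(theta))

def run_loop(decay: float, inject: float, n_loops: int, h0: float = 0.0) -> float:
    """Apply the toy recurrence h = decay * h + inject n_loops times."""
    h = h0
    for _ in range(n_loops):
        h = decay * h + inject
    return h

# Unconstrained coefficient above 1: the state grows geometrically.
exploding = run_loop(decay=1.5, inject=1.0, n_loops=50)

# Constrained coefficient: the state is bounded by inject / (1 - decay).
d = constrained_decay(-5.0)
bounded = run_loop(decay=d, inject=1.0, n_loops=1000)
```

In a real model the coefficient would be a matrix or a per-channel vector, and the quantity to control would be its spectral norm rather than a scalar.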
### arXiv:2511.07384 — Teaching Pretrained LMs to Think Deeper with Retrofitted Recurrence (Nov 2025)
Proposes curriculum over recurrence depth during training (depth increases from shallow to deep). Exactly the mechanism PR #1756 implements. Validates romeerp's approach with theoretical grounding.
**Relevance**: If CaseOps is ruled legal, adopting the Recurrence Depth Curriculum (depth 1→3→4) on our own stack is a natural experiment. The expected gain from the depth curriculum alone is unclear, because PR #1756 bundles it with CaseOps. Low priority until the CaseOps ruling.
- Improves training stability (reduces loss spikes)
- Gating scores are sparse (<0.5 for most heads)
**Relevance**: This is the theoretical backing for PR #1667's Attention Output Gate. Confirms that our target stack element (#1667) is theoretically sound and from published NeurIPS work. Also explains *why* it works: breaks the low-rank bottleneck of consecutive Wv/Wo projections.
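A dependency-free sketch of a head-specific, multiplicative sigmoid gate on attention outputs may help when implementing this. It assumes one scalar gate per head computed from the token's hidden state; PR #1667's exact parameter layout is not described in these notes, so the shapes and names below are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def gate_attention_output(head_outputs, x, gate_w, gate_b):
    """Scale each head's attention output by its own sigmoid gate.

    head_outputs: one vector (list of floats) per head
    x:            the token's hidden state (list of floats)
    gate_w:       one weight vector per head, same length as x
    gate_b:       one scalar bias per head
    """
    gated = []
    for out_h, w_h, b_h in zip(head_outputs, gate_w, gate_b):
        score = sigmoid(sum(w * xi for w, xi in zip(w_h, x)) + b_h)
        gated.append([score * v for v in out_h])  # multiplicative gate
    return gated

# With all-zero gate parameters, every gate is sigmoid(0) = 0.5 in this sketch.
heads = [[1.0, 2.0], [4.0, -2.0]]
x = [0.5, -1.0, 2.0]
zero_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zero_b = [0.0, 0.0]
at_init = gate_attention_output(heads, x, zero_w, zero_b)
```

The gate is applied before the output projection mixes the heads, which is where the low-rank Wv/Wo argument above applies.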

---

## HuggingFace / Community Discoveries
- **Pre-quant TTT pattern continues**: PR #1758 (1.02840) is the 6th pre-quant TTT attempt (after #1351, #1408, #1416, #1423, #1735). Community keeps trying; organizers keep rejecting. Ignore.
- **Recurrence Depth Curriculum is emerging**: PR #1756 is the first competition PR to implement it. It has a reproducibility bug (BOS missing); watch for an author fix.
- **No new GDN attempts with corrected BPB**: PR #1749 (GDN + Full-Hessian GPTQ) from Apr 20 still awaits a full 8xH100 run.
- **Parcae architecture from Together AI** (arXiv:2604.12946) could inspire a stable loop injection technique. It is the first paper to address exactly the instability pattern our depth recurrence faces.
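The ordering issue behind the repeated pre-quant TTT rejections can be shown with a toy adaptive model. The sketch below uses a Laplace-smoothed unigram counter as a stand-in for any test-time-adapted model. It illustrates score-first versus adapt-then-score ordering only and is not any PR's evaluation code.

```python
import math
from collections import Counter

def _chunk_nll(chunk, counts, total, vocab_size, alpha):
    """Summed NLL (nats) of a chunk under Laplace-smoothed unigram counts."""
    return -sum(
        math.log((counts[tok] + alpha) / (total + alpha * vocab_size))
        for tok in chunk
    )

def score_first_eval(chunks, vocab_size, alpha=1.0):
    """Score each chunk with past data only, then adapt on it."""
    counts, total, nll, n = Counter(), 0, 0.0, 0
    for chunk in chunks:
        nll += _chunk_nll(chunk, counts, total, vocab_size, alpha)  # 1) score
        counts.update(chunk)                                        # 2) adapt
        total += len(chunk)
        n += len(chunk)
    return nll / n

def adapt_then_score_eval(chunks, vocab_size, alpha=1.0):
    """Shown for contrast only: adapts on a chunk before scoring it."""
    counts, total, nll, n = Counter(), 0, 0.0, 0
    for chunk in chunks:
        counts.update(chunk)                                        # 1) adapt
        total += len(chunk)
        nll += _chunk_nll(chunk, counts, total, vocab_size, alpha)  # 2) score
        n += len(chunk)
    return nll / n

chunks = [[0, 0, 1], [0, 0, 0], [1, 1, 0]]
legal = score_first_eval(chunks, vocab_size=4)
leaky = adapt_then_score_eval(chunks, vocab_size=4)
```

The adapt-then-score variant reports a lower loss on the same data only because each chunk has already influenced the model that scores it.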

---

## Recommended Actions (priority order)
1. **IMPLEMENT PR #1586 TODAY.** 9 days to deadline. Per-layer GPTQ (MLP=12σ, Attn=13σ, Emb int7@15σ), MLR=0.026. Config-level change, -0.013 nats, zero legality risk. This is critically overdue.
2. **STACK PR #1667 IN THE SAME RUN.** Attention Output Gate (1,056 params, init zero) + SmearGate (w=12). Combined expected: ~-0.019 nats total over merged SOTA base. Backed by NeurIPS 2025 paper (arXiv:2505.06708).
3. **ADD VarLen Attention + Doc-TTT (PR #1560 approach) in the next run.** ~-0.007 bpb vs merged SOTA. Per-document causal masking + score-first LoRA TTT (chunk=48). dexhunter-authored; reliable.
4. **AWAIT Issue #1604 until Apr 24, then act.** If CaseOps is ruled legal before Apr 24: add bijective CaseOps from the PR #1736/PR #1755 stack. If there is no ruling by Apr 24: submit without CaseOps. Do not wait past Apr 24, when 6 days will remain for 3-seed runs.
5. **DO NOT IMPLEMENT**: Pre-quant TTT (#1758/#1735), casefold without ruling, SLOT, GDN without corrected BPB.
6. **INVESTIGATE Parcae stabilization for Triple Loop**: If time permits after #1586+#1667+#1560 are in, look at whether a spectral norm constraint on loop injection parameters can enable a 4th loop depth or earlier activation (currently at 0.35× training). Read github.com/sandyresearch/parcae.
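For action 1, a rough sketch of what a per-layer "kσ" clip setting could look like in plain round-to-nearest quantization. It assumes that "12σ" means the clip range is 12 standard deviations of the tensor, which is a reading of the notation and not something confirmed by PR #1586, and it leaves out GPTQ's error compensation entirely. Only the embedding bit width (int7) comes from the text; all names are hypothetical.

```python
import statistics

# Clip multipliers quoted in action 1; their interpretation here is an assumption.
CLIP_SIGMAS = {"mlp": 12.0, "attn": 13.0, "emb": 15.0}

def quantize_sigma_clip(weights, k_sigma, bits):
    """Symmetric round-to-nearest quantization with a k-sigma clip range."""
    qmax = 2 ** (bits - 1) - 1
    clip = k_sigma * statistics.pstdev(weights)
    if clip == 0.0:
        return [0] * len(weights), 1.0
    scale = clip / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

# Embedding-style example: int7 with the 15-sigma multiplier from the text.
emb = [-0.2, -0.1, 0.0, 0.1, 0.2]
emb_q, emb_scale = quantize_sigma_clip(emb, CLIP_SIGMAS["emb"], bits=7)
emb_rec = dequantize(emb_q, emb_scale)
```

Under this reading, a larger multiplier widens the representable range at the cost of a coarser grid, so the per-layer values trade clipping error against rounding error.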

---

_Updated: 2026-04-21 (v15.1 — Merged SOTA 1.0810 Day 12 plateau (longest ever); PR #1758 pre-quant TTT 1.02840 likely illegal; PR #1756 CaseOps+Recurrence Depth Curriculum 1.06505 awaits BOS fix + Issue #1604; PR #1755 CaseOps+Legal TTT 1.07462 awaits Issue #1604; Parcae stable looped LM paper arXiv:2604.12946 relevant to Triple Loop stability; 9 days to deadline)_