Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean) #1334
…al_bpb 1.0897 (3-seed mean)
Track A (fixed predictor): no TTT, no SLOT, no eval-time adaptation.
Stack: SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0.
3-seed mean: 1.0897 BPB, delta -0.0250 vs merged SOTA.
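As an aside, a minimal sketch of how a "QK-Gain 5.0" could enter attention, assuming (the thread never spells it out) that it is a multiplicative gain on the pre-softmax query-key logits; the function and tensor names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def attention_with_qk_gain(q, k, v, gain: float = 5.0):
    # Causal scaled dot-product attention with an extra multiplicative
    # gain on the logits. This is an assumed reading of "QK-Gain 5.0";
    # the actual PR may apply the gain elsewhere (e.g. on Q/K norms).
    # q, k, v: (batch, heads, seq, head_dim)
    d = q.size(-1)
    logits = gain * (q @ k.transpose(-2, -1)) / d ** 0.5
    causal = torch.triu(torch.ones(logits.shape[-2:], dtype=torch.bool,
                                   device=q.device), diagonal=1)
    logits = logits.masked_fill(causal, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```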
… Parallel Residuals path
- PR openai#771 confirmed CLOSED/REJECTED (train-then-score TTT)
- N-gram PRs openai#727/openai#741 CLOSED (illegal); openai#758/openai#731 open but carry the same risk
- Merged SOTA unchanged at 1.1147
- New high-EV targets: PR openai#1351 (Discriminative TTT, 1.0807) and PR openai#1334 (SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R, 1.0897)
- SLOT still unruled in Issue openai#140 — blocked until @valerio-oai rules
- CLAUDE.md updated to v8.0 with corrected strategy and Session 5 lessons

https://claude.ai/code/session_01X5rVjJpYyqm8DuWTNy2gkt
Comprehensive analysis of current leaderboard state (Apr 4, 2026):
- Non-SLOT frontier at 1.0897 BPB (PR openai#1334)
- Pre-quant TTT adds -0.009 BPB (PR openai#1351, 1.0807 BPB)
- Causal SLOT adds -0.088 BPB (PR openai#1350, 1.0046 BPB)
- GPTQ+TTT incompatibility confirmed post-quant; TTT works pre-quant
- FiLM gap analysis: ~0.05-0.09 BPB behind frontier
- Three strategic paths identified

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hypothesis: Polar Express 4-step minimax NS on top of full PR openai#1334 stack
Expected delta: ~-0.001 to -0.002 BPB from 1.0897 baseline
Key changes vs PR openai#1334:
- Polar Express Newton-Schulz (4-step minimax coefficients, arXiv:2505.16932)
- MATRIX_LR=0.022 (validated for WD=0.090)
- MUON_WD=0.090 (PR openai#1285/1334 optimal for 2-layer recurrence)
- NoPE explicitly disabled (nope_every_n=0) after critique
- Trackio experiment tracking added

Stack: SP4096 vocab + MLP 4x + WD=0.090 + MuonEq-R + QK-Gain 5.0 + Depth recurrence L4-5 (step 3000) + Parallel residuals L7+ + Brotli
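For orientation, a minimal sketch of the quintic Newton-Schulz orthogonalization loop Muon runs on each 2-D gradient. The (a, b, c) triple below is Muon's standard fixed one; the Polar Express change above would instead use per-iteration minimax-optimized coefficients from arXiv:2505.16932, which are not reproduced here:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 4) -> torch.Tensor:
    # Approximates the orthogonal polar factor of G via a quintic
    # Newton-Schulz iteration. Coefficients are Muon's standard fixed
    # triple; Polar Express swaps in a different (a, b, c) per step.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)   # Frobenius scaling bounds spectral norm by 1
    tall = X.size(-2) > X.size(-1)
    if tall:                    # work with the smaller Gram matrix
        X = X.transpose(-2, -1)
    for _ in range(steps):
        A = X @ X.transpose(-2, -1)
        X = a * X + (b * A + c * A @ A) @ X
    return X.transpose(-2, -1) if tall else X
```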
v2 (focal+warmstart+clamp) gives an identical 1.2658 BPB to v1 L-BFGS; L-BFGS converges too fast for these tricks to matter.
Competitiveness analysis:
- FiLM beats SOTA by -0.095 BPB on 1×H100
- Extrapolated 8×H100: ~1.00-1.05 BPB
- Should beat non-SLOT frontier (PR openai#1334: 1.09)
- Uncertain vs causal SLOT frontier (PR openai#1350: 1.00), because our causal SLOT gives -0.035 vs their -0.087

The 8×H100 test is worth running.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
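For context, the standard torch.optim.LBFGS closure pattern; the actual v1 objective isn't shown in this thread, so the quadratic toy loss below is a stand-in:

```python
import torch

# L-BFGS re-evaluates the loss inside a closure and takes many
# line-search-guided steps per .step() call, which leaves little room
# for warm-start / clamping tweaks to change where it converges.
params = [torch.zeros(16, requires_grad=True)]
target = torch.randn(16)
opt = torch.optim.LBFGS(params, lr=1.0, max_iter=20)

def closure():
    opt.zero_grad()
    loss = ((params[0] - target) ** 2).sum()
    loss.backward()
    return loss

opt.step(closure)
```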
…lysis
Novel ideas explored (Bitter Lesson aligned):
- GDN hybrid: KILLED — FA3 is 3-16x faster than GDN on H100
- ACT transformer: KILLED — no training speedup (all iters must run for gradients)
- 3x5 (512d): 517ms/step, 1.893 BPB vs baseline 331ms/step, 1.722 BPB
- 3x5 (768d): 923ms/step, ~2.08 BPB — wider doesn't help
- Root cause: ACT only helps when computation can actually be skipped during training

Competition frontier analysis:
- Legal record frontier: 1.005 BPB (PR openai#1350, L-BFGS causal SLOT)
- Clean base frontier: 1.0897 BPB (PR openai#1334, SP4096+DepthRecur+MuonEq-R)
- SLOT adds -0.087 BPB on top of base

Remaining novel ideas to test: parallel SLOT beams, amortized SLOT, learned weight compression, progressive depth training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applies Cautious Muon (arXiv:2411.16085) to mask Muon optimizer updates where the Newton-Schulz direction disagrees with the raw gradient sign. Built on the PR openai#1334 base with SP4096, depth recurrence, parallel residuals, MuonEq-R, QK-Gain 5.0, and GPTQ INT6 + Brotli.
3-seed mean: 1.1604 BPB (seeds 42, 314, 999)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
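A minimal sketch of the cautious masking described above, applied elementwise to a post-Newton-Schulz update (the helper name is hypothetical, and the PR's exact rescaling may differ):

```python
import torch

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Cautious-optimizer masking (arXiv:2411.16085): zero out every
    # component of the orthogonalized update whose sign disagrees with
    # the raw gradient, then rescale so the mean magnitude is preserved.
    mask = (update * grad > 0).to(update.dtype)
    return update * mask * (mask.numel() / mask.sum().clamp(min=1))

# Inside a Muon step this slots in after Newton-Schulz:
#   p.data.add_(cautious(ns_update, p.grad), alpha=-lr)
```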
primary path
- CRITICAL: PR openai#1351 (Discriminative TTT, 1.0807) self-closed by author on 2026-04-05 — pre-quant AdamW TTT ruled as pre-eval adaptation on val data. Removed pre-quant TTT from technique table and plan.
- Updated strategy to PR openai#1334 (Depth Recur + Parallel Residuals + MuonEq-R, 1.0897) as primary architecture target — zero legality flags.
- Logged new PRs: openai#1379 (0.4162, n-gram mixer), openai#1376 (0.7094, SLOT-24 + pre-quant TTT), openai#1364 (1.1025, pre-quant TTT at risk), openai#1370 (1.003, GDN).
- SLOT and pre-quant TTT both blocked; discriminative TTT post-quant still legal.
- Updated CLAUDE.md Competition Strategy + Technique Reference + Lessons (v9.0).

https://claude.ai/code/session_01RTLvTuYBp9YMtudwrY8mYM
…etermines training length)
…0.024 late)
- LeakyReLU negative_slope 0.5 -> 0.9 (Issue openai#140 sweep evidence)
- Split-LR: layers 0-5 at 0.020, layers 6-10 at 0.024 (PR openai#1179); see the sketch after this list
- WD=0.090 and Brotli-11 already in the openai#1334 base (no change needed)
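A sketch of that Split-LR expressed as optimizer parameter groups; the module layout and the AdamW choice are hypothetical stand-ins (the real runs use Muon-family optimizers for matrix params):

```python
import torch
from torch import nn

# Hypothetical 11-layer stack standing in for the real architecture.
model = nn.ModuleList([nn.Linear(64, 64) for _ in range(11)])

# Split-LR as described above: layers 0-5 at 0.020, layers 6-10 at 0.024.
opt = torch.optim.AdamW(
    [
        {"params": [p for l in model[:6] for p in l.parameters()], "lr": 0.020},
        {"params": [p for l in model[6:] for p in l.parameters()], "lr": 0.024},
    ],
    weight_decay=0.090,
)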
3-seed mean: 1.0925 BPB (sliding window stride=64). Beats merged SOTA (1.1147) by 0.0222 BPB.
Built on PR openai#1334 (@aryanbhosale) depth recurrence architecture with EMA decay tuned to 0.9965 for stabilized post-quantization.
Seeds: 42 (1.0921), 1337 (1.0928), 2024 (1.0926)
All artifacts under 16 MB. 8×H100 SXM, 590s training.
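A minimal sketch of the weight EMA implied by "EMA decay tuned to 0.9965"; the helper name is hypothetical, and the assumption here is that the EMA copy (not the live weights) is what feeds quantization and export:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay: float = 0.9965):
    # Shadow copy of the weights: ema <- decay * ema + (1 - decay) * p,
    # run once per optimizer step.
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```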
….0889
3-seed mean: 1.0889 BPB (sliding window stride=64). Beats merged SOTA (1.1147) by 0.0258 BPB.
Stacks 3-layer recurrence (3,4,5), WD=0.095, MLR=0.022, EMA decay=0.9965, early recurrence (step 2000), and extended warmdown (72%) on the PR openai#1334 architecture.
Seeds: 42 (1.0885), 1337 (1.0894), 2024 (1.0888)
All artifacts under 16 MB. 8×H100 SXM, 590s training.
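Both records evaluate with a sliding window at stride 64. A hypothetical harness, assuming the common convention of scoring only the final `stride` tokens of each window so every scored token sees near-full left context (the bytes-per-token factor is also an assumption):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, window=1024, stride=64,
                       bytes_per_token=1.0):
    # tokens: 1-D LongTensor of token ids; model(x) -> (1, T, vocab) logits.
    assert tokens.size(0) > window
    total_nll, scored = 0.0, 0
    for start in range(0, tokens.size(0) - window, stride):
        chunk = tokens[start:start + window + 1]
        logits = model(chunk[:-1].unsqueeze(0))
        # Score only the last `stride` positions of this window.
        nll = F.cross_entropy(logits[0, -stride:], chunk[-stride:],
                              reduction="sum")
        total_nll += nll.item()
        scored += stride
    # nats/token -> bits/token -> bits/byte.
    return total_nll / scored / math.log(2) / bytes_per_token
```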
…-slot-v4
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)
…1.01710
Merged SOTA changed from 1.1147 to 1.0810 (PR openai#1493, bigbag, 2026-04-09). Seven PRs merged in 5 days (PRs openai#1334, openai#1285, openai#1394, openai#1412, openai#1413, openai#1477, openai#1493). New target: ≤1.0760 val_bpb. 18 days to deadline.
Key findings:
- GDN-Hybrid (PR openai#1564): 1.01710 BPB, no TTT/SLOT — monitor for organizer review
- VarLen Attention + Doc-TTT (PR openai#1560): 1.07406 BPB — implement next
- TMA Megakernel + Tap-In (PR openai#1555): 1.07636 BPB — add after openai#1560
- PR openai#731 n-gram (dense count + Laplace): reviewer says LOOKS CLEAN, awaiting 3rd seed
- PR openai#758: major legality flags, do not implement

Updated CLAUDE.md: Competition Strategy, Technique Reference, Lessons Learned (Session 9). Updated logs/daily_research.md: new 2026-04-12 entry prepended.

https://claude.ai/code/session_011WyxjcwdigLhMFQDjLL5ss
Layers 3,4,5 share MLP weights; attention weights stay unique per layer. Weight decay bumped to 0.09 (from 0.04) to regularize the shared MLP. Based on PRs openai#1334/openai#1344, which report 1.089-1.092 BPB with this setup.
Why this works now when our prior attempt failed:
- Prior: shared ALL layer weights -> quant error amplified 900x
- Now: share ONLY MLP, keep attention unique -> per-layer discrimination
- Higher WD regularizes against per-layer overfitting
- Full Hessian GPTQ correctly accumulates Hessians across sharers

Saves ~6.3 MB of parameters. The reinvest budget is the whole point: wider MLP, larger BigramHash, more unique layers, or higher-precision quantization for critical layers.
GPTQ integration: the forward pass accumulates Hessians under a shared key, quantizes the shared weight once using the combined Hessian, and dedupes in _rebank_state_dict when constructing the export bank (see the sketch below).
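A hedged sketch of that shared-key Hessian bookkeeping; the function name and key scheme are hypothetical, and a real GPTQ pass would also track sample counts and apply damping before inversion:

```python
import torch

# Running Hessians keyed by sharing group rather than by layer, so the
# calibration inputs of layers 3, 4 and 5 all land in one matrix.
hessians: dict[str, torch.Tensor] = {}

def accumulate_hessian(share_key: str, x: torch.Tensor) -> None:
    # x: (n_tokens, in_features) activations entering the shared MLP
    # weight in one sharer's forward pass; H accumulates X^T X.
    h = x.double().T @ x.double()
    if share_key in hessians:
        hessians[share_key] += h
    else:
        hessians[share_key] = h

# GPTQ then quantizes the shared weight once per key with the combined
# Hessian, and the export bank stores a single deduplicated copy.
```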
@aryanbhosale maintainer repro note: we are doing a pass over merged leaderboard rows and I cannot currently reproduce this record closely enough. What I ran:
Current result:
The visible hparams and the train-shard count now match, but the repro is still slower after recurrence / stops earlier and enters GPTQ with a larger int6 artifact. Could you reply here with any missing exact setup details that might explain this? In particular: exact runtime/container, Python/PyTorch/CUDA/FlashAttention versions, dataset snapshot/prep command, or any env vars not captured in the submitted log. If there is an updated record/log bundle that reproduces the submitted numbers, please point us to it or push it so we can rerun.
Hey @cocohearts, thanks for digging into this — and sorry the bundle didn't carry enough to repro cleanly the first time. I think the gap is kernel-level, not config. Here's why I'm fairly sure:
The env actually came from PR #1019 (@abaybektursun's record, which #1334 is built on top of). I ran #1334 on the same pod with that same install untouched.
The reason I'm pretty confident this is environmental: I ran three different SP4096 architectures on April 3–4 against that same install, and seed-42 stopped at almost exactly the same step every time:
If you can pin the audit container to:
…I'd expect seed-42 to land back in the 5440–5454 band and the int6 artifact to fit at 16.00 MB without prune cycles.
One thing worth flagging since you're reading the log directly: the rental pod's gone, so I can't pull its container hash, but I do still have the FA3 wheel artifact I built. Happy to send it directly.
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0
val_bpb = 1.0897 (3-seed mean, std 0.0003) | ~15.99 MB | 8×H100 SXM
Track A — Fixed Predictor (No eval-time adaptation)
3-Seed Results
Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0250 BPB.
Key Techniques
Compliance
Reproduction
Credits
PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee