Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean) #1334
…al_bpb 1.0897 (3-seed mean)
Track A (fixed predictor): no TTT, no SLOT, no eval-time adaptation.
Stack: SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0.
3-seed mean: 1.0897 BPB, delta -0.0250 vs merged SOTA.
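As an aside, a minimal sketch of how a "QK-Gain 5.0" could enter attention, assuming (the thread never spells it out) that it is a multiplicative gain on the pre-softmax query-key logits; the function and tensor names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def attention_with_qk_gain(q, k, v, gain: float = 5.0):
    # Causal scaled dot-product attention with an extra multiplicative
    # gain on the logits. This is an assumed reading of "QK-Gain 5.0";
    # the actual PR may apply the gain elsewhere (e.g. on Q/K norms).
    # q, k, v: (batch, heads, seq, head_dim)
    d = q.size(-1)
    logits = gain * (q @ k.transpose(-2, -1)) / d ** 0.5
    causal = torch.triu(torch.ones(logits.shape[-2:], dtype=torch.bool,
                                   device=q.device), diagonal=1)
    logits = logits.masked_fill(causal, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```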
… Parallel Residuals path
- PR openai#771 confirmed CLOSED/REJECTED (train-then-score TTT)
- N-gram PRs openai#727/openai#741 CLOSED (illegal); openai#758/openai#731 open but carry the same risk
- Merged SOTA unchanged at 1.1147
- New high-EV targets: PR openai#1351 (Discriminative TTT, 1.0807) and PR openai#1334 (SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R, 1.0897)
- SLOT still unruled in Issue openai#140 — blocked until @valerio-oai rules
- CLAUDE.md updated to v8.0 with corrected strategy and Session 5 lessons

https://claude.ai/code/session_01X5rVjJpYyqm8DuWTNy2gkt
Comprehensive analysis of current leaderboard state (Apr 4, 2026):
- Non-SLOT frontier at 1.0897 BPB (PR openai#1334)
- Pre-quant TTT adds -0.009 BPB (PR openai#1351, 1.0807 BPB)
- Causal SLOT adds -0.088 BPB (PR openai#1350, 1.0046 BPB)
- GPTQ+TTT incompatibility confirmed post-quant; TTT works pre-quant
- FiLM gap analysis: ~0.05-0.09 BPB behind frontier
- Three strategic paths identified

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hypothesis: Polar Express 4-step minimax NS on top of full PR openai#1334 stack
Expected delta: ~-0.001 to -0.002 BPB from 1.0897 baseline
Key changes vs PR openai#1334:
- Polar Express Newton-Schulz (4-step minimax coefficients, arXiv:2505.16932)
- MATRIX_LR=0.022 (validated for WD=0.090)
- MUON_WD=0.090 (PR openai#1285/1334 optimal for 2-layer recurrence)
- NoPE explicitly disabled (nope_every_n=0) after critique
- Trackio experiment tracking added

Stack: SP4096 vocab + MLP 4x + WD=0.090 + MuonEq-R + QK-Gain 5.0 + Depth recurrence L4-5 (step 3000) + Parallel residuals L7+ + Brotli
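For orientation, a minimal sketch of the quintic Newton-Schulz orthogonalization loop Muon runs on each 2-D gradient. The (a, b, c) triple below is Muon's standard fixed one; the Polar Express change above would instead use per-iteration minimax-optimized coefficients from arXiv:2505.16932, which are not reproduced here:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 4) -> torch.Tensor:
    # Approximates the orthogonal polar factor of G via a quintic
    # Newton-Schulz iteration. Coefficients are Muon's standard fixed
    # triple; Polar Express swaps in a different (a, b, c) per step.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)   # Frobenius scaling bounds spectral norm by 1
    tall = X.size(-2) > X.size(-1)
    if tall:                    # work with the smaller Gram matrix
        X = X.transpose(-2, -1)
    for _ in range(steps):
        A = X @ X.transpose(-2, -1)
        X = a * X + (b * A + c * A @ A) @ X
    return X.transpose(-2, -1) if tall else X
```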
v2 (focal+warmstart+clamp) gives an identical 1.2658 BPB to v1 L-BFGS; L-BFGS converges too fast for these tricks to matter.
Competitiveness analysis:
- FiLM beats SOTA by -0.095 BPB on 1×H100
- Extrapolated 8×H100: ~1.00-1.05 BPB
- Should beat non-SLOT frontier (PR openai#1334: 1.09)
- Uncertain vs causal SLOT frontier (PR openai#1350: 1.00), because our causal SLOT gives -0.035 vs their -0.087

The 8×H100 test is worth running.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
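For context, the standard torch.optim.LBFGS closure pattern; the actual v1 objective isn't shown in this thread, so the quadratic toy loss below is a stand-in:

```python
import torch

# L-BFGS re-evaluates the loss inside a closure and takes many
# line-search-guided steps per .step() call, which leaves little room
# for warm-start / clamping tweaks to change where it converges.
params = [torch.zeros(16, requires_grad=True)]
target = torch.randn(16)
opt = torch.optim.LBFGS(params, lr=1.0, max_iter=20)

def closure():
    opt.zero_grad()
    loss = ((params[0] - target) ** 2).sum()
    loss.backward()
    return loss

opt.step(closure)
```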
…lysis
Novel ideas explored (Bitter Lesson aligned):
- GDN hybrid: KILLED — FA3 is 3-16x faster than GDN on H100
- ACT transformer: KILLED — no training speedup (all iters must run for gradients)
- 3x5 (512d): 517ms/step, 1.893 BPB vs baseline 331ms/step, 1.722 BPB
- 3x5 (768d): 923ms/step, ~2.08 BPB — wider doesn't help
- Root cause: ACT only helps when computation can actually be skipped during training

Competition frontier analysis:
- Legal record frontier: 1.005 BPB (PR openai#1350, L-BFGS causal SLOT)
- Clean base frontier: 1.0897 BPB (PR openai#1334, SP4096+DepthRecur+MuonEq-R)
- SLOT adds -0.087 BPB on top of base

Remaining novel ideas to test: parallel SLOT beams, amortized SLOT, learned weight compression, progressive depth training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applies Cautious Muon (arXiv:2411.16085) to mask Muon optimizer updates where the Newton-Schulz direction disagrees with the raw gradient sign. Built on the PR openai#1334 base with SP4096, depth recurrence, parallel residuals, MuonEq-R, QK-Gain 5.0, and GPTQ INT6 + Brotli.
3-seed mean: 1.1604 BPB (seeds 42, 314, 999)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
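A minimal sketch of the cautious masking described above, applied elementwise to a post-Newton-Schulz update (the helper name is hypothetical, and the PR's exact rescaling may differ):

```python
import torch

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Cautious-optimizer masking (arXiv:2411.16085): zero out every
    # component of the orthogonalized update whose sign disagrees with
    # the raw gradient, then rescale so the mean magnitude is preserved.
    mask = (update * grad > 0).to(update.dtype)
    return update * mask * (mask.numel() / mask.sum().clamp(min=1))

# Inside a Muon step this slots in after Newton-Schulz:
#   p.data.add_(cautious(ns_update, p.grad), alpha=-lr)
```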
primary path
- CRITICAL: PR openai#1351 (Discriminative TTT, 1.0807) self-closed by author on 2026-04-05 — pre-quant AdamW TTT ruled as pre-eval adaptation on val data. Removed pre-quant TTT from technique table and plan.
- Updated strategy to PR openai#1334 (Depth Recur + Parallel Residuals + MuonEq-R, 1.0897) as primary architecture target — zero legality flags.
- Logged new PRs: openai#1379 (0.4162, n-gram mixer), openai#1376 (0.7094, SLOT-24 + pre-quant TTT), openai#1364 (1.1025, pre-quant TTT at risk), openai#1370 (1.003, GDN).
- SLOT and pre-quant TTT both blocked; discriminative TTT post-quant still legal.
- Updated CLAUDE.md Competition Strategy + Technique Reference + Lessons (v9.0).

https://claude.ai/code/session_01RTLvTuYBp9YMtudwrY8mYM
…etermines training length)
…0.024 late)
- LeakyReLU negative_slope 0.5 -> 0.9 (Issue openai#140 sweep evidence)
- Split-LR: layers 0-5 at 0.020, layers 6-10 at 0.024 (PR openai#1179); see the sketch after this list
- WD=0.090 and Brotli-11 already in the openai#1334 base (no change needed)
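A sketch of that Split-LR expressed as optimizer parameter groups; the module layout and the AdamW choice are hypothetical stand-ins (the real runs use Muon-family optimizers for matrix params):

```python
import torch
from torch import nn

# Hypothetical 11-layer stack standing in for the real architecture.
model = nn.ModuleList([nn.Linear(64, 64) for _ in range(11)])

# Split-LR as described above: layers 0-5 at 0.020, layers 6-10 at 0.024.
opt = torch.optim.AdamW(
    [
        {"params": [p for l in model[:6] for p in l.parameters()], "lr": 0.020},
        {"params": [p for l in model[6:] for p in l.parameters()], "lr": 0.024},
    ],
    weight_decay=0.090,
)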
3-seed mean: 1.0925 BPB (sliding window stride=64). Beats merged SOTA (1.1147) by 0.0222 BPB.
Built on PR openai#1334 (@aryanbhosale) depth recurrence architecture with EMA decay tuned to 0.9965 for stabilized post-quantization.
Seeds: 42 (1.0921), 1337 (1.0928), 2024 (1.0926)
All artifacts under 16 MB. 8×H100 SXM, 590s training.
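A minimal sketch of the weight EMA implied by "EMA decay tuned to 0.9965"; the helper name is hypothetical, and the assumption here is that the EMA copy (not the live weights) is what feeds quantization and export:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay: float = 0.9965):
    # Shadow copy of the weights: ema <- decay * ema + (1 - decay) * p,
    # run once per optimizer step.
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```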
….0889
3-seed mean: 1.0889 BPB (sliding window stride=64). Beats merged SOTA (1.1147) by 0.0258 BPB.
Stacks 3-layer recurrence (3,4,5), WD=0.095, MLR=0.022, EMA decay=0.9965, early recurrence (step 2000), and extended warmdown (72%) on the PR openai#1334 architecture.
Seeds: 42 (1.0885), 1337 (1.0894), 2024 (1.0888)
All artifacts under 16 MB. 8×H100 SXM, 590s training.
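Both records evaluate with a sliding window at stride 64. A hypothetical harness, assuming the common convention of scoring only the final `stride` tokens of each window so every scored token sees near-full left context (the bytes-per-token factor is also an assumption):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, window=1024, stride=64,
                       bytes_per_token=1.0):
    # tokens: 1-D LongTensor of token ids; model(x) -> (1, T, vocab) logits.
    assert tokens.size(0) > window
    total_nll, scored = 0.0, 0
    for start in range(0, tokens.size(0) - window, stride):
        chunk = tokens[start:start + window + 1]
        logits = model(chunk[:-1].unsqueeze(0))
        # Score only the last `stride` positions of this window.
        nll = F.cross_entropy(logits[0, -stride:], chunk[-stride:],
                              reduction="sum")
        total_nll += nll.item()
        scored += stride
    # nats/token -> bits/token -> bits/byte.
    return total_nll / scored / math.log(2) / bytes_per_token
```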
…-slot-v4
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)
…1.01710
Merged SOTA changed from 1.1147 to 1.0810 (PR openai#1493, bigbag, 2026-04-09). Seven PRs merged in 5 days (PRs openai#1334, openai#1285, openai#1394, openai#1412, openai#1413, openai#1477, openai#1493). New target: ≤1.0760 val_bpb. 18 days to deadline.
Key findings:
- GDN-Hybrid (PR openai#1564): 1.01710 BPB, no TTT/SLOT — monitor for organizer review
- VarLen Attention + Doc-TTT (PR openai#1560): 1.07406 BPB — implement next
- TMA Megakernel + Tap-In (PR openai#1555): 1.07636 BPB — add after openai#1560
- PR openai#731 n-gram (dense count + Laplace): reviewer says LOOKS CLEAN, awaiting 3rd seed
- PR openai#758: major legality flags, do not implement

Updated CLAUDE.md: Competition Strategy, Technique Reference, Lessons Learned (Session 9). Updated logs/daily_research.md: new 2026-04-12 entry prepended.

https://claude.ai/code/session_011WyxjcwdigLhMFQDjLL5ss
Layers 3,4,5 share MLP weights; attention weights stay unique per layer. Weight decay bumped to 0.09 (from 0.04) to regularize the shared MLP. Based on PRs openai#1334/openai#1344, which report 1.089-1.092 BPB with this setup.
Why this works now when our prior attempt failed:
- Prior: shared ALL layer weights -> quant error amplified 900x
- Now: share ONLY MLP, keep attention unique -> per-layer discrimination
- Higher WD regularizes against per-layer overfitting
- Full Hessian GPTQ correctly accumulates Hessians across sharers

Saves ~6.3 MB of parameters. The reinvest budget is the whole point: wider MLP, larger BigramHash, more unique layers, or higher-precision quantization for critical layers.
GPTQ integration: the forward pass accumulates Hessians under a shared key, quantizes the shared weight once using the combined Hessian, and dedupes in _rebank_state_dict when constructing the export bank (see the sketch below).
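A hedged sketch of that shared-key Hessian bookkeeping; the function name and key scheme are hypothetical, and a real GPTQ pass would also track sample counts and apply damping before inversion:

```python
import torch

# Running Hessians keyed by sharing group rather than by layer, so the
# calibration inputs of layers 3, 4 and 5 all land in one matrix.
hessians: dict[str, torch.Tensor] = {}

def accumulate_hessian(share_key: str, x: torch.Tensor) -> None:
    # x: (n_tokens, in_features) activations entering the shared MLP
    # weight in one sharer's forward pass; H accumulates X^T X.
    h = x.double().T @ x.double()
    if share_key in hessians:
        hessians[share_key] += h
    else:
        hessians[share_key] = h

# GPTQ then quantizes the shared weight once per key with the combined
# Hessian, and the export bank stores a single deduplicated copy.
```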
@aryanbhosale maintainer repro note: we are doing a pass over merged leaderboard rows and I cannot currently reproduce this record closely enough. What I ran:
Current result:
The visible hparams and the train-shard count now match, but the repro is still slower after recurrence / stops earlier and enters GPTQ with a larger int6 artifact. Could you reply here with any missing exact setup details that might explain this? In particular: exact runtime/container, Python/PyTorch/CUDA/FlashAttention versions, dataset snapshot/prep command, or any env vars not captured in the submitted log. If there is an updated record/log bundle that reproduces the submitted numbers, please point us to it or push it so we can rerun.
Hey @cocohearts, thanks for digging into this — and sorry the bundle didn't carry enough to repro cleanly the first time. I think the gap is kernel-level, not config. Here's why I'm fairly sure:
The env actually came from PR #1019 (@abaybektursun's record, which #1334 is built on top of). I ran #1334 on the same pod with that same install untouched.
The reason I'm pretty confident this is environmental: I ran three different SP4096 architectures on April 3–4 against that same install, and seed-42 stopped at almost exactly the same step every time:
If you can pin the audit container to:
…I'd expect seed-42 to land back in the 5440–5454 band and the int6 artifact to fit at 16.00 MB without prune cycles.
One thing worth flagging since you're reading the log directly: the rental pod's gone, so I can't pull its container hash, but I do still have the FA3 wheel artifact I built. Happy to send it directly.
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0
val_bpb = 1.0897 (3-seed mean, std 0.0003) | ~15.99 MB | 8×H100 SXM
Track A — Fixed Predictor (No eval-time adaptation)
3-Seed Results
Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0250 BPB.
Key Techniques
Compliance
Reproduction
Credits
PR #1218 @clarkkev, PR #1285 @dexhunter, PR #1204 @msisovic, PR #1289 @MatoTeziTanka, PR #1260 @dexhunter, PR #1019 @abaybektursun, PR #1287 @dentity007, PR #1217 @bigbag, PR #493 @parinzee