Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean)#1514

Merged
cocohearts merged 1 commit into openai:main from dexhunter:a2-muon097-3seed
Apr 29, 2026

Conversation

@dexhunter
Contributor

Summary

  • val_bpb: 1.07983 (3-seed mean, std 0.00050) / 2.78932 nats/token
  • Artifact: ~15.99 MB (under 16 MB on all 3 seeds)
  • Delta vs current merged SOTA #1493 (1.0810): 0.00117 bpb / 0.00302 nats/token

Builds on @clarkkev's PR #1394 sp8192 stack and our own PR #1413 legal score-first TTT, adding:

  1. Muon momentum = 0.97 (vs 0.99 default) — single-knob hyperparameter sweep
  2. Causal token n-gram tilt — prefix-only token expert from @abaybektursun's PR #1420 kernel (base_beta=2.0, agree_bonus=0.1); within-word and word-start experts explicitly disabled (within_beta=0, word_beta=0) because they cannot be made fully causal without losing most of the benefit.
  3. Legal score-first TTT — already present in our PR #1413 (Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279, 3-seed mean)
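The momentum change is a single knob on top of the existing warmup. A minimal sketch of such a schedule is below; only the 0.92→0.97 endpoints come from this PR — the linear ramp shape and the `warmup_steps` value are illustrative assumptions, not the repo's actual implementation.

```python
def muon_momentum(step: int, warmup_steps: int,
                  start: float = 0.92, end: float = 0.97) -> float:
    """Muon momentum warmed from `start` to `end`, then held.

    The 0.92 -> 0.97 endpoints are from the PR description; the linear
    ramp and `warmup_steps` are illustrative assumptions.
    """
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```

Lowering the terminal momentum from 0.99 to 0.97 leaves the warmup trajectory unchanged and only shifts where it plateaus.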

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Pre-TTT sliding | Post-TTT bpb | val_loss (nats) | Artifact (bytes) |
|------|-----------------|--------------|-----------------|------------------|
| 0    | 1.08102         | 1.07928      | 2.78790         | 15,993,346       |
| 42   | 1.08167         | 1.07997      | 2.78967         | 15,992,995       |
| 1234 | 1.08194         | 1.08025      | 2.79039         | 15,994,604       |
| mean | 1.08154         | 1.07983      | 2.78932         | 15,993,648       |

std_bpb = 0.00050, std_nats = 0.00128. All 3 seeds fit the 16 MB artifact cap and complete under 600s train + 600s eval.
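The reported mean and std follow directly from the per-seed numbers in the table; a quick reproduction, assuming the std is the sample (n−1) form:

```python
import math

# Per-seed post-TTT bpb from the results table above.
bpb = {0: 1.07928, 42: 1.07997, 1234: 1.08025}

vals = list(bpb.values())
mean = sum(vals) / len(vals)
# Sample standard deviation (n-1 denominator), matching the reported std_bpb.
std = math.sqrt(sum((v - mean) ** 2 for v in vals) / (len(vals) - 1))

print(f"mean={mean:.5f} std={std:.5f}")  # mean=1.07983 std=0.00050
```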

Legality

Test plan

  • 3-seed verification (seeds 0/42/1234)
  • Artifact under 16 MB on all seeds
  • Train under 600s on all seeds (~588s)
  • Eval under 600s on all seeds (<437s)
  • No val-data leakage in training
  • Score-first TTT ordering verified
  • Causal n-gram tilt verified (prefix-only metadata)
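The budget checks in the plan above reduce to a simple pass/fail predicate. A minimal sketch — the decimal reading of the 16 MB cap (16,000,000 bytes) is an assumption, consistent with the ~15.99 MB artifacts reported:

```python
# Caps cited in this PR's test plan; the decimal-MB interpretation of the
# artifact cap is an assumption, not the harness's documented definition.
SIZE_CAP_BYTES = 16 * 1_000_000
TIME_CAP_S = 600.0

def compliant(artifact_bytes: int, train_s: float, eval_s: float) -> bool:
    """True when a seed's run fits the artifact and wall-clock budgets."""
    return (artifact_bytes <= SIZE_CAP_BYTES
            and train_s <= TIME_CAP_S
            and eval_s <= TIME_CAP_S)
```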

…val_bpb 1.07983

3-seed mean val_bpb 1.07983 (std 0.00050) on the PR openai#1394 sp8192 stack.

Changes from PR openai#1394 + PR openai#1413 baseline:
- Muon momentum = 0.97 (vs 0.99 default), warmup 0.92→0.97 unchanged
- Causal token n-gram tilt (base_beta=2.0, agree_bonus=0.1) on top of legal
  score-first TTT; within-word and word-start experts explicitly disabled
  (within_beta=0, word_beta=0) because they cannot be made fully causal.
- 3-seed verification (seeds 0/42/1234)

Seeds:
- seed 0    → 1.07928 bpb / 2.78790 nats / 15,993,346 bytes
- seed 42   → 1.07997 bpb / 2.78967 nats / 15,992,995 bytes
- seed 1234 → 1.08025 bpb / 2.79039 nats / 15,994,604 bytes
- mean      → 1.07983 bpb / 2.78932 nats / 15,993,648 bytes

Delta vs current merged SOTA PR openai#1493 (1.0810):
  0.00117 bpb / 0.00302 nats per token

Credits: @clarkkev (base PR openai#1394 sp8192 stack), @abaybektursun
(n-gram tilt kernel PR openai#1420, causal fix applied), prior legal-TTT
precedent PR openai#549 / PR openai#461.

Platform: 8xH100 80GB SXM, PyTorch 2.9.1+cu128. Training 588s, eval
<437s per seed, both under the 600s budget. Artifact under 16 MB on
all 3 seeds.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026
@dexhunter dexhunter changed the title from "Record: SP8192 + Muon 0.97 + Legal TTT + Causal N-gram Tilt — val_bpb 1.07983 (3-seed)" to "Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean)" Apr 10, 2026
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 11, 2026
…g + Muon 0.97 — val_bpb 1.07747 (3-seed mean)

- 3-seed mean: 1.07747 BPB (std 0.00064) / 2.78321 nats
- ~15.99 MB artifact, 8×H100 SXM, 600s
- VarLen attention (within-document only), doc-independent LoRA TTT
- Parameter banking + triple depth recurrence + parallel residuals
- PyTorch MLP fallback (no Triton/CUTLASS dependency)
- Based on PR openai#1530, PR openai#1523, PR openai#1514
@cocohearts cocohearts merged commit 5223920 into openai:main Apr 29, 2026
TanishGudise pushed a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
Both experts gate on properties of the token being scored (target_i):
- within-doc C gate: !is_boundary[target_i] && !is_new_word[target_i]
  → within_valid[i]=1 only when target is a within-word continuation
- word-start Python gate: starts_new_word_lut[target_i]
  → top_prob[i]>0 only when target IS a word-start token

Both violate C1 causality (hint for position i depends on realized token i).
Token expert is legal: output computed from prefix state [0..i-1] before
tok_i is consumed (token_push runs after ctx_tbl lookup in C process_chunk).

Fix: within_tau=99.0, word_tau=99.0, within_boost=0.0, word_boost=0.0
as defaults so both gates are always False. Token-only is the legal subset
per PR openai#1514 merge precedent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
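The legal ordering described above can be sketched in a few lines of Python, with a toy bigram table standing in for the real kernel state (names like `ngram_tilts` are illustrative, not the C API):

```python
def ngram_tilts(tokens):
    """Token-only n-gram expert, legal ordering (illustrative sketch).

    The tilt for position i is read from prefix state built over tokens
    [0..i-1] only; tok_i is pushed into the state AFTER the lookup,
    mirroring token_push running after the ctx_tbl lookup in process_chunk.
    """
    counts = {}   # prefix bigram table: prev_token -> {next_token: count}
    tilts = []
    prev = None
    for tok in tokens:
        # 1) hint from prefix state only -- legal under C1 causality
        tilts.append(dict(counts.get(prev, {})))
        # 2) only now is tok consumed into the state
        counts.setdefault(prev, {})
        counts[prev][tok] = counts[prev].get(tok, 0) + 1
        prev = tok
    return tilts
```

An illegal gate would additionally branch on a property of the realized token at position i (e.g. whether it starts a new word) before deciding to apply the tilt — exactly the dependence the within-word and word-start experts had.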
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
Recent sweep logs (named):
S55: token-only ngram tilt baseline = 1.05814 (legal per PR openai#1514)
S56: + 3 openai#2060 levers = 1.05790 (-0.00024)
S57: + AsymLogit only = 1.05759 (-0.00055)
S58: full stack = 1.05694 single seed (-0.00120, super-additive +0.00041 synergy)
S59: S58 + EVAL_SEQ_LEN=3072 + NUM_PHASES=1 + WD=1.0 = 1.05657 single seed, eval 567s
S60 OOM: S59 + EMA_DECAY=0.9 + batch=64 = OOM
S60 retry: S58 + EMA_DECAY=0.9 + batch=32 = 1.05795 / 832s NON-COMPLIANT
S61: S59 + TOKEN_BOOST=3.0 = 1.05678 single seed, eval 501s
S62: S58 + NUM_PHASES=2 + WD=2.0 + eval=2816 = 1.05755

Earlier sweep logs (UUID-named): ~83 files covering S15-S54 sprint history.

Key findings:
- AsymLogit Rescale: 2 trainable scalars (softcap_pos, softcap_neg) give -0.00055 via global TTT polish
- Token-only n-gram tilt confirmed legal per PR openai#1514 (within_tau=99, word_tau=99, agree=0)
- 3 openai#2060 env-var levers (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5) stack super-additively
- EMA_DECAY=0.9 didn't transfer to our base
- NUM_PHASES=2 revert costs more pre-quant than it gains in TTT recovery
- Discovered val_tokens=47852544 vs canonical 47853343, need EVAL_INCLUDE_TAIL=1 for clean comparison

Added .gitignore for final_model.pt (130MB - over GitHub limit), .so binaries, pid files.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
…E_OUTSIDE=0

Seed 314: pre-quant 1.06128 / quant 1.06962 / final 1.05701 / eval 571.7s
Compliance: ngram_hint_precompute_outside=False, precompute (166.95s) INSIDE timer per PR openai#1514 precedent.
Token-only tilt: within_gate=0, word_gate=0 - legal per PR openai#1514.
Size 15,943,530 bytes.
Single seed beats openai#2014's 3-seed mean (1.05759).
Validating seeds 42 and 1234.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
Beats PR openai#1855 (merged rank 1, 1.06108) by 0.00438 BPB.
Beats PR openai#2014 (best open, 1.05759) by 0.00089 BPB.
Beats PR openai#2060 (1.05792) by 0.00122 BPB.

Stack:
- Token-only n-gram tilt (PR openai#1514 merged precedent, within/word channels disabled)
- AsymLogit Rescale (2 trainable scalars adapted by global TTT)
- 3 hyperparameter levers from PR openai#2060 (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5)
- PHASED_TTT_NUM_PHASES=1 (matches PR openai#2014)
- NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (precompute INSIDE eval timer per PR openai#1514)

Compliance:
- All seeds eval ≤533.1s (cap 600s, 67-80s margin)
- All artifacts ≤15.95MB (cap 16MB)
- Token-only n-gram channel (within_gate=0, word_gate=0)
- Score-first TTT (per PR openai#402)
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 5, 2026
Clarified explanation of the gate's behavior and updated the description of the fix in PR openai#1514.