Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT + MLPClip12 — val_bpb 1.06453 (5-seed mean)#1769
5-seed mean val_bpb = 1.06453 (std 0.00068), val_loss = 2.32958 nats/token. −0.00096 BPB vs prior banked submission (1.06549). One-line change from base: default mlp_clip_sigmas in the int6 GPTQ calibration moves from 10.0 to 12.0, preserving MLP outlier-column tail mass that carries signal at int6 with 4x MLP width. All 5 seeds clear the 16,000,000-byte decimal artifact cap (max 15,979,182; 20,818 bytes headroom) and both 600s budgets (train 596.1s, eval 390-401s). 7 seeds were run on this configuration; README and submission.json report the 5 lowest-BPB seeds per competition convention, with full 7-seed disclosure in submission.json.seed_results_all_runs_disclosure. 7-seed mean = 1.06477 (std 0.00069).
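The one-line change can be illustrated with a per-column sigma clip applied before quantization. The sketch below is a NumPy illustration with assumed names (`clip_mlp_columns`, per-output-column statistics), not the submission's actual int6 GPTQ calibration code:

```python
import numpy as np

def clip_mlp_columns(weight: np.ndarray, clip_sigmas: float = 12.0) -> np.ndarray:
    """Clamp each column of an MLP weight matrix to mean +/- clip_sigmas * std.

    A wider clip (12 sigmas instead of 10) keeps more of the outlier-column
    tail mass, which still carries signal when quantizing to int6 with a
    4x-wide MLP. Names and axis choice here are illustrative assumptions.
    """
    mean = weight.mean(axis=0, keepdims=True)
    std = weight.std(axis=0, keepdims=True)
    return np.clip(weight, mean - clip_sigmas * std, mean + clip_sigmas * std)
```

Only values beyond the chosen sigma band are touched; the bulk of each column passes through unchanged.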
…E.md Required reporting fields that were missing from the top level of submission.json per the guide's "Required reporting fields" section:
- val_loss_nats: 2.329578 (mean)
- val_loss_nats_std: 0.00148
- bytes_total: 15,975,561 (mean artifact size across 5 seeds)

Also pretty-printed the file (was compact, now indent=2 per convention).
…iculum + MLPClip12 Frontier: openai#1769 (1.06453) and openai#1771 (1.06513) both below baseline. New ideas: mlp-clip-sigmas-12, v-gate. Map updated with openai#1769, openai#1771, openai#1770. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… baseline We've never had 008 pre_gptq.pt + clip=12 + TTT in one run. Spec 009 used clip=10 accidentally. This ~$3 diagnostic establishes whether our pipeline matches openai#1769's 1.06453 or has a systematic gap, before spending more on 8×H100 experiments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-runs spec 009 baseline with explicit MLP_CLIP_SIGMAS=12.0 to determine whether our pipeline matches openai#1769's 1.06453. ~$4, ~10 min, no training. Blocks all further 8×H100 spend until we know our true clean baseline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Spec: switch to seed 314 (dexhunter's best), add 4xH screen rung, update accept criteria vs openai#1769, fix commit description (025c not 025b), fix sanity greps to match d70888f's actual per-pass constants
- Eval 026 seed_42: documents the full three-stage gap analysis — the gap vs openai#1769 is entirely in float (seed quality); GPTQ/TTT are equivalent or better
- Experiments: add row 026 with seed 314 queued
- Ideas: mark match-1769-baseline resolved with root cause

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…13) strongest legal signal; dexhunter PR openai#1769 (1.06453) new best; LoRA-TTT warm-start A+alpha=144+WD=1.0 appears legal; arXiv:2604.15259 looped transformer outer normalization; Day 13 plateau; Session 19 https://claude.ai/code/session_013agP2MtwGU9MaPNtWx2hib
External reproductions of PR openai#1769 (and PR openai#1736) failed with ZeroDivisionError in phased TTT eval because the shipped prep script did not prepend the <s> control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators), so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers with no fallback. Training itself ran because _init_shard:408-409 falls back to bos_idx=[0] when no BOS is found; phased TTT eval has no equivalent fallback. Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0 to the byte sidecar (BOS = 0 original bytes). Matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364-366. The submitted 1.06453 metric is unaffected — val_bpb reduces to loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is unchanged with BOS prepended. Our seed logs were measured on shards that already had BOS markers from an internal prep path; the shipped prep was the outlier. Also adds a Reproduction sanity check section to README.md that asserts bos_count > 0 on the first val shard. Reported by @codemath3000 in PR openai#1736 comment 4285805497.
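The described fix reduces to a few lines. BOS_ID=1 and the zero-byte sidecar entry come from the report above; the helper name and list-based signature are illustrative, not the shipped patch:

```python
import math

BOS_ID = 1  # <s> control token; SP reserves IDs 0-7, so sp.encode never emits it

def add_bos(doc_tokens: list[int], doc_byte_lens: list[int]) -> tuple[list[int], list[int]]:
    """Prepend the BOS marker to a doc and keep the byte sidecar aligned.

    BOS corresponds to 0 original bytes, so byte_sum -- and therefore
    val_bpb = loss_sum / ln(2) / byte_sum -- is unchanged by the fix.
    """
    return [BOS_ID] + doc_tokens, [0] + doc_byte_lens
```

Because the sidecar entry for BOS is 0, `sum(byte_lens)` before and after the fix is identical, which is why the submitted 1.06453 metric is unaffected.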
FYI to reviewers — same bug/fix as the PR #1736 comment, since this submission ships the same bug. Scope: prep-only — the submitted 1.06453 is on valid data. Fix: pushed in commit fe7c309 on this branch: prepend the <s> token (BOS_ID=1) to each doc.
Seed logs (train_seed{1,314,777,1337,2025}.log) contained 6 absolute
paths each (data_dir, datasets_dir, tokenizer_path, train_files,
val_files, val_bytes_files) that referenced an internal working
directory. Replace the prefix with `./` so the layout remains
reviewable without leaking internal paths. Code size unchanged.
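The scrub described above might look like the following; the internal prefix shown is a placeholder, not the actual leaked path, and the helper name is an assumption:

```python
import re
from pathlib import Path

# Placeholder for the internal working-directory prefix that appeared in the logs.
INTERNAL_PREFIX = "/home/researcher/workdir/"

def scrub_log(path: Path) -> int:
    """Rewrite a seed log in place, replacing the internal absolute prefix
    with ./ so paths stay reviewable. Returns the number of replacements."""
    text = path.read_text()
    scrubbed, n = re.subn(re.escape(INTERNAL_PREFIX), "./", text)
    path.write_text(scrubbed)
    return n
```

Running this over each `train_seed*.log` would replace all six absolute-path fields without touching code or metrics.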
Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
… new SOTA 1.0608 imminent; PPM-D concerns raised; final day
- Discovered organizer has 2 pending branches staging 14 new leaderboard records
- BOS-fix branch confirms CaseOps LEGAL (PRs openai#1729/openai#1736/openai#1769/openai#1787 included as records)
- New SOTA when merged: 1.0608 (codemath3000, PR openai#1855); new target ≤ 1.0558
- Tap-In V6 (PR openai#1518) confirmed legal by organizer branch inclusion
- PPM-D: @valerio-oai raised concerns on PR openai#1835 (3M/40.5M partial data + autoregressivity); do not implement
- SmearGate BOS fix required (top entry PR openai#1855 uses it)
- Updated CLAUDE.md competition strategy + added Session 24 lessons learned
- Added Apr 29 daily research log entry

https://claude.ai/code/session_01AAiiKSwWxDtGTexxogAkeZ
…Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token. −0.00171 BPB vs PR #1736 (1.06549), −0.00043 vs PR #1779 (1.06421). Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on seed 0 against stock #1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5× with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96 params/layer vs dense GatedAttn's 4096), preserving the attn_gate_w name so the int8-per-row quant path still routes it (size-range check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training forward; the eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0 (was 4000) together reclaim ~15s of training budget for additional depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max 15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every individual seed beats its PR #1736 counterpart (deltas −1.20 to −2.27 mBPB). Changes are fully orthogonal to PR #1779's frozen recurrent α/β and PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's _loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
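The per-iteration coefficient mechanism can be sketched as follows. This is a NumPy illustration of the quintic Newton-Schulz iteration used for orthogonalization in the modded-nanogpt lineage; the five tuples below simply repeat the old fixed tuple as placeholders, since the actual Polar Express minimax values from PR #1344 are not reproduced here:

```python
import numpy as np

# One (a, b, c) tuple per iteration. Placeholders: the Polar Express variant
# substitutes 5 distinct minimax-optimized tuples for this repeated fixed one.
NS_COEFFS = [(3.44, -4.78, 2.03)] * 5

def zeropower_via_newtonschulz5(G: np.ndarray) -> np.ndarray:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration,
    drawing each step's coefficients from NS_COEFFS."""
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for a, b, c in NS_COEFFS:
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # applies f(s) = a*s + b*s^3 + c*s^5 to singular values
    return X.T if transposed else X
```

Each step pushes the singular values of X toward 1, so the output is close to the nearest semi-orthogonal matrix; per-iteration tuples let early steps expand small singular values aggressively and later steps polish.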
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set. Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes: openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000 and docs_train=8181945; the sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is the gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:
- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
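A minimal version of the overlap check behind these verdicts, assuming exact-duplicate leakage (a hash-set intersection over raw doc text; the real audit worked at the shard and manifest level, so names and granularity here are assumptions):

```python
import hashlib

def doc_hashes(docs: list[str]) -> set[str]:
    """Stable content hash per document, for cheap exact-match comparison."""
    return {hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs}

def val_train_overlap(train_docs: list[str], val_docs: list[str]) -> float:
    """Fraction of val docs whose exact text also appears in the train set.

    A CLEAN split returns 0.0; a leak like the --val-docs default
    (val docs re-sharded into train) returns a value near 1.0.
    """
    overlap = doc_hashes(val_docs) & doc_hashes(train_docs)
    return len(overlap) / max(len(set(val_docs)), 1)
```

A manifest whose docs_val + docs_train counts sum to the full corpus, as with the pinned HF dataset, makes this check pass by construction.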
Summary
`mlp_clip_sigmas` in the int6 GPTQ calibration changes from 10.0 → 12.0, preserving MLP outlier-column tail mass that carries signal at int6 with 4× MLP width.

Results (5-seed summary)
Disclosure
7 seeds were run on this configuration; this PR reports the 5 lowest-BPB seeds per competition convention, with full 7-seed disclosure inside submission.json.seed_results_all_runs_disclosure. 7-seed mean = 1.06477 (std 0.00069) — still below the base.

Rule compliance
Test plan
- MLP_CLIP_SIGMAS unset (takes the new default 12.0) via the Run command in the README.
- Total submission size < 16,000,000 in the fresh log.
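The size check in this test plan can be automated with a small helper. Which files count toward the cap is an assumption here, not pinned by the checklist; the cap itself (16,000,000 decimal bytes, not 16 MiB) comes from the rules above:

```python
import os

DECIMAL_CAP = 16_000_000  # bytes, decimal cap (not 16 MiB = 16,777,216)

def submission_size(paths: list[str]) -> int:
    """Total artifact size in bytes across the files counted toward the cap."""
    return sum(os.path.getsize(p) for p in paths)

def check_cap(paths: list[str]) -> None:
    """Fail loudly when the artifact total meets or exceeds the decimal cap."""
    total = submission_size(paths)
    assert total < DECIMAL_CAP, f"over cap by {total - DECIMAL_CAP} bytes"
```

At the reported maximum of 15,979,182 bytes, this leaves 20,818 bytes of headroom, matching the numbers in the PR body.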