Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT + MLPClip12 — val_bpb 1.06453 (5-seed mean)#1769

Merged
cocohearts merged 4 commits into openai:main from dexhunter:dexhunter/caseops-mlpclip12-1.06453
Apr 29, 2026

Conversation

@dexhunter
Contributor

Summary

  • One-line retune on top of our merged-ready base (CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT): default mlp_clip_sigmas in the int6 GPTQ calibration changes from 10.0 → 12.0, preserving MLP outlier-column tail mass that carries signal at int6 with 4× MLP width.
  • 5-seed mean val_bpb = 1.06453 (std 0.00068), val_loss = 2.32958 nats/token. −0.00096 BPB vs our prior submission (1.06549).
  • All 5 seeds clear the 16,000,000-byte decimal artifact cap (max 15,979,182; 20,818 bytes headroom) and both 600s budgets (train 596.1s, eval 390–401s).
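The clipping knob can be sketched as follows — a hypothetical stand-in for the calibration step, assuming per-column std-based clipping (the function name and shapes are illustrative, not the PR's actual code):

```python
import numpy as np

def clip_to_sigmas(W, n_sigmas=12.0):
    """Hypothetical sketch of what mlp_clip_sigmas controls: clip each weight
    column to +/- n_sigmas * std before int6 GPTQ calibration. Raising the
    bound from 10.0 to 12.0 keeps more outlier-column tail mass."""
    bound = n_sigmas * W.std(axis=0, keepdims=True)  # per-column clip bound
    return np.clip(W, -bound, bound)

rng = np.random.default_rng(0)
W = rng.standard_normal((10_000, 8))
W[0, 0] = 100.0                      # a large outlier in column 0
Wc = clip_to_sigmas(W, 12.0)         # outlier tamed, in-range values untouched
```

With the bound at 12 sigma the extreme outlier is pulled in while every in-range value passes through unchanged.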

Results (5-seed summary)

| Seed | Post-TTT val_bpb | val_loss (nats/tok) | Artifact (B) | Train (s) | Eval (s) |
|---|---|---|---|---|---|
| 314 | 1.06356801 | 2.32748105 | 15,975,xxx | 596.1 | 400.7 |
| 2025 | 1.06413130 | 2.32871372 | 15,975,xxx | 596.1 | 394.7 |
| 777 | 1.06466993 | 2.32989245 | 15,975,xxx | 596.1 | 394.6 |
| 1 | 1.06509678 | 2.33082656 | 15,975,xxx | 596.1 | 391.2 |
| 1337 | 1.06516558 | 2.33097712 | 15,975,xxx | 596.1 | 390.2 |
| mean | 1.06453 | 2.32958 | 15,975,561 | 596.1 | 394.3 |
| std | 0.00068 | 0.00148 | | | |
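The reported mean and std can be re-derived from the per-seed values above (the std is a sample std, ddof=1):

```python
import statistics

seeds_bpb = [1.06356801, 1.06413130, 1.06466993, 1.06509678, 1.06516558]
mean_bpb = statistics.mean(seeds_bpb)   # -> 1.06453 at 5 decimals
std_bpb = statistics.stdev(seeds_bpb)   # sample std (ddof=1) -> 0.00068
```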

Disclosure

7 seeds were run on this configuration; this PR reports the 5 lowest-BPB seeds per competition convention, with full 7-seed disclosure inside submission.json.seed_results_all_runs_disclosure. 7-seed mean = 1.06477 (std 0.00069) — still below the base.

Rule compliance

  • Score-first phased TTT (Condition 3) inherited unchanged from the base.
  • No change to tokenizer, BPB accounting, or the TTT loop.
  • All hard gates pass: artifact ≤ 16 MB (decimal), train ≤ 600s, eval ≤ 600s, no val data during training.
  • See the README's Rule Compliance section for a walkthrough of Conditions 1–4 and Section V of Issue #1017 ("A Field Guide to Valid Submissions").

Test plan

  • Reviewer reproduces any single seed with MLP_CLIP_SIGMAS unset (takes the new default 12.0) via the Run Command in the README.
  • Verify Total submission size < 16,000,000 in the fresh log.
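A minimal sketch of the two checks, assuming the standard env-default pattern (the exact parsing in the run script is not reproduced here):

```python
import os

# Simulate an unset env for this sketch; with MLP_CLIP_SIGMAS unset, the
# int6 GPTQ calibration takes the new default of 12.0.
os.environ.pop("MLP_CLIP_SIGMAS", None)
mlp_clip_sigmas = float(os.environ.get("MLP_CLIP_SIGMAS", "12.0"))

# Decimal artifact cap, using the worst-case seed size from this PR.
headroom = 16_000_000 - 15_979_182   # 20,818 bytes of headroom
```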

@dexhunter dexhunter changed the title Record: SP8192 CaseOps stack retune (MLP clip 10→12) → 1.06453 Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT + MLPClip12 — val_bpb 1.06453 (5-seed mean) Apr 22, 2026
…E.md

Adds the required reporting fields that were missing from the top level of
submission.json, per the guide's "Required reporting fields" section:
- val_loss_nats: 2.329578 (mean)
- val_loss_nats_std: 0.00148
- bytes_total: 15,975,561 (mean artifact size across 5 seeds)

Also pretty-printed the file (was compact, now indent=2 per convention).
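The repair can be sketched as follows (field values taken from the commit message above; the surrounding submission.json contents are elided):

```python
import json

# Add the missing top-level reporting fields, then pretty-print with indent=2.
sub = {"val_bpb": 1.06453}          # existing content, elided to one field
sub["val_loss_nats"] = 2.329578
sub["val_loss_nats_std"] = 0.00148
sub["bytes_total"] = 15_975_561     # mean artifact size across 5 seeds
pretty = json.dumps(sub, indent=2)  # was compact; now one field per line
```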
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 22, 2026
…iculum + MLPClip12

Frontier: openai#1769 (1.06453) and openai#1771 (1.06513) both below baseline.
New ideas: mlp-clip-sigmas-12, v-gate.
Map updated with openai#1769, openai#1771, openai#1770.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 22, 2026
… baseline

We've never had 008 pre_gptq.pt + clip=12 + TTT in one run. Spec 009 used clip=10
accidentally. This ~$3 diagnostic establishes whether our pipeline matches openai#1769's
1.06453 or has a systematic gap, before spending more on 8×H100 experiments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 22, 2026
Re-runs spec 009 baseline with explicit MLP_CLIP_SIGMAS=12.0 to determine whether
our pipeline matches openai#1769's 1.06453. ~$4, ~10 min, no training. Blocks all further
8×H100 spend until we know our true clean baseline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 22, 2026
- Spec: switch to seed 314 (dexhunter's best), add 4xH screen rung, update
  accept criteria vs openai#1769, fix commit description (025c not 025b), fix sanity
  greps to match d70888f's actual per-pass constants
- Eval 026 seed_42: documents full three-stage gap analysis — gap vs openai#1769 is
  entirely in float (seed quality), GPTQ/TTT are equivalent or better
- Experiments: add row 026 with seed 314 queued
- Ideas: mark match-1769-baseline resolved with root cause

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 22, 2026
…13) strongest legal signal; dexhunter PR openai#1769 (1.06453) new best; LoRA-TTT warm-start A+alpha=144+WD=1.0 appears legal; arXiv:2604.15259 looped transformer outer normalization; Day 13 plateau; Session 19

https://claude.ai/code/session_013agP2MtwGU9MaPNtWx2hib
External reproductions of PR openai#1769 (and PR openai#1736) failed with
ZeroDivisionError in phased TTT eval because the shipped prep script
did not prepend the <s> control token (ID 1) to each doc. The SP
tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators),
so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs
(line 2209) requires BOS markers with no fallback. Training itself
ran because _init_shard:408-409 falls back to bos_idx=[0] when no
BOS is found; phased TTT eval has no equivalent fallback.

Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0
to the byte sidecar (BOS = 0 original bytes). Matches the canonical
pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06453 metric is unaffected — val_bpb reduces to
loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is
unchanged with BOS prepended. Our seed logs were measured on shards
that already had BOS markers from an internal prep path; the shipped
prep was the outlier.

Also adds a Reproduction sanity check section to README.md that
asserts bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR openai#1736 comment 4285805497.
@dexhunter
Contributor Author

FYI to reviewers — same bug/fix as PR #1736 comment, since this submission ships the same prepare_caseops_data.py. Reported there by @codemath3000.

Bug. prepare_caseops_data.py line 157 doesn't prepend BOS_ID = 1. train_gpt.py:_find_docs (line 2209) then returns [] and _loss_bpb_from_sums (line 2303) divides by zero in the phased TTT eval path. Training survives via the _init_shard:408–409 fallback; phased TTT eval does not.

Scope. Prep-only — submitted 1.06453 is on valid data. val_bpb = loss_sum / ln(2) / byte_sum (token counts cancel at line 2303), and byte_sum is unchanged with BOS prepended (BOS = 0 original bytes).

Fix. Pushed in commit fe7c309 on this branch: prepend BOS_ID = 1 to each doc's tokens and append 0 to the byte-count sidecar. README now includes a bos_count > 0 sanity check for the first val shard. Full diff and rationale in the PR #1736 comment linked above.
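A minimal sketch of the fix and of why the metric is unaffected (BOS_ID and the sidecar layout follow the description above; the function names are illustrative):

```python
import math

BOS_ID = 1  # <s> control token; IDs 0-7 are reserved by the SP tokenizer

def prep_doc(token_ids, byte_counts):
    """Sketch of the fe7c309 fix: prepend BOS to each doc's tokens and
    append a 0 to the byte-count sidecar (BOS carries 0 original bytes)."""
    return [BOS_ID] + token_ids, byte_counts + [0]

def val_bpb(loss_sum_nats, byte_sum):
    # val_bpb = loss_sum / ln(2) / byte_sum: token counts cancel, so only
    # the totals matter, and byte_sum is unchanged by the BOS prepend.
    return loss_sum_nats / math.log(2) / byte_sum

toks, bts = prep_doc([9, 10, 11], [3, 4, 2])
```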

Seed logs (train_seed{1,314,777,1337,2025}.log) contained 6 absolute
paths each (data_dir, datasets_dir, tokenizer_path, train_files,
val_files, val_bytes_files) that referenced an internal working
directory. Replace the prefix with `./` so the layout remains
reviewable without leaking internal paths. Code size unchanged.

Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
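The gating described above, as a sketch (the exact parsing inside train_gpt.py is assumed):

```python
import os

os.environ.pop("TTT_ENABLED", None)  # simulate an unset env for this sketch
# Top-level gate defaults to on; phased TTT rides on it.
ttt_enabled = os.environ.get("TTT_ENABLED", "1") == "1"
# PHASED_TTT_ENABLED is never read by train_gpt.py, so exporting it is a
# no-op; only PHASED_TTT_PREFIX_DOCS and PHASED_TTT_NUM_PHASES are consulted.
phased_prefix_docs = int(os.environ.get("PHASED_TTT_PREFIX_DOCS", "0"))
```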
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 29, 2026
… new SOTA 1.0608 imminent; PPM-D concerns raised; final day

- Discovered organizer has 2 pending branches staging 14 new leaderboard records
- BOS-fix branch confirms CaseOps LEGAL (PRs openai#1729/openai#1736/openai#1769/openai#1787 included as records)
- New SOTA when merged: 1.0608 (codemath3000, PR openai#1855); new target ≤1.0558
- Tap-In V6 (PR openai#1518) confirmed legal by organizer branch inclusion
- PPM-D: @valerio-oai raised concerns on PR openai#1835 (3M/40.5M partial data + autoregressivity); do not implement
- SmearGate BOS fix required (top entry PR openai#1855 uses it)
- Updated CLAUDE.md competition strategy + added Session 24 lessons learned
- Added Apr 29 daily research log entry

https://claude.ai/code/session_01AAiiKSwWxDtGTexxogAkeZ
@cocohearts cocohearts merged commit 63aef77 into openai:main Apr 29, 2026
cocohearts pushed a commit that referenced this pull request Apr 29, 2026
…Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token.
-0.00171 BPB vs PR #1736 (1.06549), -0.00043 vs PR #1779 (1.06421).

Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on
seed 0 against stock #1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from
  PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x
  with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96
  params/layer vs dense GatedAttn 4096), preserving the attn_gate_w
  name so the int8-per-row quant path still routes it (size-range
  check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training
  forward; eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0
(was 4000) together reclaim ~15s of training budget for additional
depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max
15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-
599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every
individual seed beats its PR #1736 counterpart (deltas -1.20 to
-2.27 mBPB).

Changes are fully orthogonal to PR #1779's frozen recurrent α/β and
PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches
PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit
BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's
_loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
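The per-iteration coefficient idea can be sketched as follows. This is the standard quintic Newton-Schulz orthogonalizer written so each iteration can take its own (a, b, c) tuple; the default below just reuses the fixed tuple quoted above, since the five distinct PR #1344 constants are not reproduced here.

```python
import numpy as np

def zeropower_via_newtonschulz5(G, coeffs=((3.44, -4.78, 2.03),) * 5):
    """Quintic Newton-Schulz iteration pushing singular values toward 1.
    Passing 5 distinct tuples in `coeffs` gives the per-iteration variant."""
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius-normalize into the basin
    for a, b, c in coeffs:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

rng = np.random.default_rng(0)
X = zeropower_via_newtonschulz5(rng.standard_normal((8, 8)))
s = np.linalg.svd(X, compute_uv=False)   # singular values pushed toward 1
```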
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
… Attn Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token.
-0.00171 BPB vs PR openai#1736 (1.06549), -0.00043 vs PR openai#1779 (1.06421).

Stacks 4 orthogonal wins on top of PR openai#1736, all ablation-validated on
seed 0 against stock openai#1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from
  PR openai#1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x
  with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96
  params/layer vs dense GatedAttn 4096), preserving the attn_gate_w
  name so the int8-per-row quant path still routes it (size-range
  check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training
  forward; eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0
(was 4000) together reclaim ~15s of training budget for additional
depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max
15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-
599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every
individual seed beats its PR openai#1736 counterpart (deltas -1.20 to
-2.27 mBPP).

Changes are fully orthogonal to PR openai#1779's frozen recurrent α/β and
PR openai#1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches
PR openai#1736 d7263a3 and PR openai#1769 fe7c309): sp.encode can't emit
BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's
_loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
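The "--val-docs override" audit step can be sketched as follows (the helper name is hypothetical; the real audit searched all .sh files across 34 PRs):

```python
import pathlib
import tempfile

def overrides_val_docs(repo_root):
    # Does any .sh file under repo_root pass an explicit --val-docs flag?
    return any("--val-docs" in p.read_text()
               for p in pathlib.Path(repo_root).rglob("*.sh"))

with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    root.joinpath("run.sh").write_text("python prepare_caseops_data.py\n")
    clean = not overrides_val_docs(d)    # default invocation: no override
    root.joinpath("alt.sh").write_text(
        "python prepare_caseops_data.py --val-docs 10000\n")
    flagged = overrides_val_docs(d)      # explicit override is detected
```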