Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT + MLPClip12 — val_bpb 1.06453 (5-seed mean)#1769
5-seed mean val_bpb = 1.06453 (std 0.00068), val_loss = 2.32958 nats/token. −0.00096 BPB vs prior banked submission (1.06549). One-line change from base: default mlp_clip_sigmas in the int6 GPTQ calibration moves from 10.0 to 12.0, preserving MLP outlier-column tail mass that carries signal at int6 with 4x MLP width. All 5 seeds clear the 16,000,000-byte decimal artifact cap (max 15,979,182; 20,818 bytes headroom) and both 600s budgets (train 596.1s, eval 390-401s). 7 seeds were run on this configuration; README and submission.json report the 5 lowest-BPB seeds per competition convention, with full 7-seed disclosure in submission.json.seed_results_all_runs_disclosure. 7-seed mean = 1.06477 (std 0.00069).
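The one-line change can be illustrated with a per-column sigma clip applied before quantization. The sketch below is a NumPy illustration with assumed names (`clip_mlp_columns`, per-output-column statistics), not the submission's actual int6 GPTQ calibration code:

```python
import numpy as np

def clip_mlp_columns(weight: np.ndarray, clip_sigmas: float = 12.0) -> np.ndarray:
    """Clamp each column of an MLP weight matrix to mean +/- clip_sigmas * std.

    A wider clip (12 sigmas instead of 10) keeps more of the outlier-column
    tail mass, which still carries signal when quantizing to int6 with a
    4x-wide MLP. Names and axis choice here are illustrative assumptions.
    """
    mean = weight.mean(axis=0, keepdims=True)
    std = weight.std(axis=0, keepdims=True)
    return np.clip(weight, mean - clip_sigmas * std, mean + clip_sigmas * std)
```

Only values beyond the chosen sigma band are touched; the bulk of each column passes through unchanged.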
…E.md Required reporting fields that were missing from the top level of submission.json per the guide's "Required reporting fields" section:
- val_loss_nats: 2.329578 (mean)
- val_loss_nats_std: 0.00148
- bytes_total: 15,975,561 (mean artifact size across 5 seeds)

Also pretty-printed the file (was compact, now indent=2 per convention).
…iculum + MLPClip12 Frontier: openai#1769 (1.06453) and openai#1771 (1.06513) both below baseline. New ideas: mlp-clip-sigmas-12, v-gate. Map updated with openai#1769, openai#1771, openai#1770. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… baseline We've never had 008 pre_gptq.pt + clip=12 + TTT in one run. Spec 009 used clip=10 accidentally. This ~$3 diagnostic establishes whether our pipeline matches openai#1769's 1.06453 or has a systematic gap, before spending more on 8×H100 experiments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-runs spec 009 baseline with explicit MLP_CLIP_SIGMAS=12.0 to determine whether our pipeline matches openai#1769's 1.06453. ~$4, ~10 min, no training. Blocks all further 8×H100 spend until we know our true clean baseline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Spec: switch to seed 314 (dexhunter's best), add 4xH screen rung, update accept criteria vs openai#1769, fix commit description (025c not 025b), fix sanity greps to match d70888f's actual per-pass constants
- Eval 026 seed_42: documents the full three-stage gap analysis — the gap vs openai#1769 is entirely in float (seed quality); GPTQ/TTT are equivalent or better
- Experiments: add row 026 with seed 314 queued
- Ideas: mark match-1769-baseline resolved with root cause

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…13) strongest legal signal; dexhunter PR openai#1769 (1.06453) new best; LoRA-TTT warm-start A+alpha=144+WD=1.0 appears legal; arXiv:2604.15259 looped transformer outer normalization; Day 13 plateau; Session 19 https://claude.ai/code/session_013agP2MtwGU9MaPNtWx2hib
External reproductions of PR openai#1769 (and PR openai#1736) failed with ZeroDivisionError in phased TTT eval because the shipped prep script did not prepend the <s> control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators), so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers with no fallback. Training itself ran because _init_shard:408-409 falls back to bos_idx=[0] when no BOS is found; phased TTT eval has no equivalent fallback. Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0 to the byte sidecar (BOS = 0 original bytes). Matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364-366. The submitted 1.06453 metric is unaffected — val_bpb reduces to loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is unchanged with BOS prepended. Our seed logs were measured on shards that already had BOS markers from an internal prep path; the shipped prep was the outlier. Also adds a Reproduction sanity check section to README.md that asserts bos_count > 0 on the first val shard. Reported by @codemath3000 in PR openai#1736 comment 4285805497.
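The described fix reduces to a few lines. BOS_ID=1 and the zero-byte sidecar entry come from the report above; the helper name and list-based signature are illustrative, not the shipped patch:

```python
import math

BOS_ID = 1  # <s> control token; SP reserves IDs 0-7, so sp.encode never emits it

def add_bos(doc_tokens: list[int], doc_byte_lens: list[int]) -> tuple[list[int], list[int]]:
    """Prepend the BOS marker to a doc and keep the byte sidecar aligned.

    BOS corresponds to 0 original bytes, so byte_sum -- and therefore
    val_bpb = loss_sum / ln(2) / byte_sum -- is unchanged by the fix.
    """
    return [BOS_ID] + doc_tokens, [0] + doc_byte_lens
```

Because the sidecar entry for BOS is 0, `sum(byte_lens)` before and after the fix is identical, which is why the submitted 1.06453 metric is unaffected.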
FYI to reviewers — same bug/fix as the PR #1736 comment, since this submission ships the same bug. Scope: prep-only — the submitted 1.06453 is on valid data. Fix: pushed in commit fe7c309 on this branch: prepend the <s> token (BOS_ID=1) to each doc.
Seed logs (train_seed{1,314,777,1337,2025}.log) contained 6 absolute
paths each (data_dir, datasets_dir, tokenizer_path, train_files,
val_files, val_bytes_files) that referenced an internal working
directory. Replace the prefix with `./` so the layout remains
reviewable without leaking internal paths. Code size unchanged.
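The scrub described above might look like the following; the internal prefix shown is a placeholder, not the actual leaked path, and the helper name is an assumption:

```python
import re
from pathlib import Path

# Placeholder for the internal working-directory prefix that appeared in the logs.
INTERNAL_PREFIX = "/home/researcher/workdir/"

def scrub_log(path: Path) -> int:
    """Rewrite a seed log in place, replacing the internal absolute prefix
    with ./ so paths stay reviewable. Returns the number of replacements."""
    text = path.read_text()
    scrubbed, n = re.subn(re.escape(INTERNAL_PREFIX), "./", text)
    path.write_text(scrubbed)
    return n
```

Running this over each `train_seed*.log` would replace all six absolute-path fields without touching code or metrics.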
Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
… new SOTA 1.0608 imminent; PPM-D concerns raised; final day
- Discovered organizer has 2 pending branches staging 14 new leaderboard records
- BOS-fix branch confirms CaseOps LEGAL (PRs openai#1729/openai#1736/openai#1769/openai#1787 included as records)
- New SOTA when merged: 1.0608 (codemath3000, PR openai#1855); new target ≤ 1.0558
- Tap-In V6 (PR openai#1518) confirmed legal by organizer branch inclusion
- PPM-D: @valerio-oai raised concerns on PR openai#1835 (3M/40.5M partial data + autoregressivity); do not implement
- SmearGate BOS fix required (top entry PR openai#1855 uses it)
- Updated CLAUDE.md competition strategy + added Session 24 lessons learned
- Added Apr 29 daily research log entry

https://claude.ai/code/session_01AAiiKSwWxDtGTexxogAkeZ
…Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token. −0.00171 BPB vs PR #1736 (1.06549), −0.00043 vs PR #1779 (1.06421). Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on seed 0 against stock #1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5× with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96 params/layer vs dense GatedAttn's 4096), preserving the attn_gate_w name so the int8-per-row quant path still routes it (size-range check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training forward; the eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0 (was 4000) together reclaim ~15s of training budget for additional depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max 15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every individual seed beats its PR #1736 counterpart (deltas −1.20 to −2.27 mBPB). Changes are fully orthogonal to PR #1779's frozen recurrent α/β and PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's _loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
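The per-iteration coefficient mechanism can be sketched as follows. This is a NumPy illustration of the quintic Newton-Schulz iteration used for orthogonalization in the modded-nanogpt lineage; the five tuples below simply repeat the old fixed tuple as placeholders, since the actual Polar Express minimax values from PR #1344 are not reproduced here:

```python
import numpy as np

# One (a, b, c) tuple per iteration. Placeholders: the Polar Express variant
# substitutes 5 distinct minimax-optimized tuples for this repeated fixed one.
NS_COEFFS = [(3.44, -4.78, 2.03)] * 5

def zeropower_via_newtonschulz5(G: np.ndarray) -> np.ndarray:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration,
    drawing each step's coefficients from NS_COEFFS."""
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for a, b, c in NS_COEFFS:
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # applies f(s) = a*s + b*s^3 + c*s^5 to singular values
    return X.T if transposed else X
```

Each step pushes the singular values of X toward 1, so the output is close to the nearest semi-orthogonal matrix; per-iteration tuples let early steps expand small singular values aggressively and later steps polish.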
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set. Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes: openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000 and docs_train=8181945; the sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is the gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:
- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
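A minimal version of the overlap check behind these verdicts, assuming exact-duplicate leakage (a hash-set intersection over raw doc text; the real audit worked at the shard and manifest level, so names and granularity here are assumptions):

```python
import hashlib

def doc_hashes(docs: list[str]) -> set[str]:
    """Stable content hash per document, for cheap exact-match comparison."""
    return {hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs}

def val_train_overlap(train_docs: list[str], val_docs: list[str]) -> float:
    """Fraction of val docs whose exact text also appears in the train set.

    A CLEAN split returns 0.0; a leak like the --val-docs default
    (val docs re-sharded into train) returns a value near 1.0.
    """
    overlap = doc_hashes(val_docs) & doc_hashes(train_docs)
    return len(overlap) / max(len(set(val_docs)), 1)
```

A manifest whose docs_val + docs_train counts sum to the full corpus, as with the pinned HF dataset, makes this check pass by construction.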
Summary
`mlp_clip_sigmas` in the int6 GPTQ calibration changes from 10.0 → 12.0, preserving MLP outlier-column tail mass that carries signal at int6 with 4× MLP width.

Results (5-seed summary)
Disclosure
7 seeds were run on this configuration; this PR reports the 5 lowest-BPB seeds per competition convention, with full 7-seed disclosure inside submission.json.seed_results_all_runs_disclosure. 7-seed mean = 1.06477 (std 0.00069) — still below the base.

Rule compliance
Test plan
- MLP_CLIP_SIGMAS unset (takes the new default 12.0) via the Run command in the README.
- Total submission size < 16,000,000 in the fresh log.
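The size check in this test plan can be automated with a small helper. Which files count toward the cap is an assumption here, not pinned by the checklist; the cap itself (16,000,000 decimal bytes, not 16 MiB) comes from the rules above:

```python
import os

DECIMAL_CAP = 16_000_000  # bytes, decimal cap (not 16 MiB = 16,777,216)

def submission_size(paths: list[str]) -> int:
    """Total artifact size in bytes across the files counted toward the cap."""
    return sum(os.path.getsize(p) for p in paths)

def check_cap(paths: list[str]) -> None:
    """Fail loudly when the artifact total meets or exceeds the decimal cap."""
    total = submission_size(paths)
    assert total < DECIMAL_CAP, f"over cap by {total - DECIMAL_CAP} bytes"
```

At the reported maximum of 15,979,182 bytes, this leaves 20,818 bytes of headroom, matching the numbers in the PR body.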