Record support: canonical top-stack reproduction - val_bpb 1.05985 by deborahnelson8788726 · Pull Request #2031 · openai/parameter-golf

deborahnelson8788726 · 2026-04-30T22:42:03Z

Adds a single-seed canonical reproduction/support record for the public
2026-04-27_SP8192_LQER_SparseGate_BOSSmearFix_9HpStack_1.0611 stack.

Result:

Seed 42 post-TTT val_bpb: 1.05985469
val_loss: 2.31935492
Train cap: 599.570s, 4935 steps on 8xH100 SXM
Eval time: 539.328s
Total submission size: 15,898,155 bytes
Serialized quantized model: 15,866,055 bytes
Artifact SHA256: 47d7339fa803c52559a3acfbe4c682332c3871560fe55bea1cb112923d43c298
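
The size and hash figures above can be re-checked locally. This is a minimal sketch, not part of the submission: the helper names `sha256_of_file` and `check_artifact` are hypothetical, and the 16,000,000-byte cap is an assumption inferred from the `track_10min_16mb` track name, not a number stated in this PR.

```python
import hashlib
from pathlib import Path

# Assumption: "track_10min_16mb" implies a 16,000,000-byte submission cap.
SIZE_CAP_BYTES = 16_000_000

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_artifact(path, expected_sha256, size_cap=SIZE_CAP_BYTES):
    """Return (size_ok, hash_ok) for a local artifact file."""
    size = Path(path).stat().st_size
    return size <= size_cap, sha256_of_file(path) == expected_sha256.lower()

# Figures reported in this PR.
TOTAL_SUBMISSION_BYTES = 15_898_155
SERIALIZED_MODEL_BYTES = 15_866_055

# Headroom under the assumed cap, and bytes that are not the model blob.
print(SIZE_CAP_BYTES - TOTAL_SUBMISSION_BYTES)          # 101845
print(TOTAL_SUBMISSION_BYTES - SERIALIZED_MODEL_BYTES)  # 32100
```

Calling `check_artifact(path, "47d7339f...")` with the full digest from this PR would confirm that a downloaded artifact matches the reported one.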

This is intentionally framed as reproduction/support evidence rather than a new
technique claim. It uses the source record's train_gpt.py, tokenizer, CaseOps
pipeline, compression path, and hparam stack. The operational fix versus our
earlier failed attempts was using the canonical romeerp/parameter-golf-caseops-v1
pretokenized shards instead of locally re-tokenized raw docs.
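
For readers who want to follow the same route, fetching the canonical shards could look roughly like the sketch below. It assumes `romeerp/parameter-golf-caseops-v1` is hosted as a Hugging Face dataset repo and that the `huggingface_hub` package is installed; the function name `fetch_canonical_shards` and the `local_dir` argument are illustrative, and the source record's own download script may do this differently.

```python
def fetch_canonical_shards(local_dir, repo_id="romeerp/parameter-golf-caseops-v1"):
    """Download the pretokenized shards instead of re-tokenizing raw docs."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    # repo_type="dataset" is an assumption about how the shards are hosted.
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=local_dir,
    )
```

The point of downloading rather than rebuilding is that the train/val split is fixed by the published shards, so it cannot drift with local preprocessing defaults.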

The artifact/log timestamp is 2026-05-01 00:34 +0200, i.e.
2026-04-30 22:34 UTC.
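
The local-to-UTC conversion above can be reproduced with the standard library. A short sketch, using only the timestamp quoted in this PR:

```python
from datetime import datetime, timezone

# Timestamp as reported in the artifact/log, with its +0200 offset.
local_stamp = datetime.strptime("2026-05-01 00:34 +0200", "%Y-%m-%d %H:%M %z")

# Convert to UTC; the calendar date rolls back to April 30.
utc_stamp = local_stamp.astimezone(timezone.utc)
print(utc_stamp.strftime("%Y-%m-%d %H:%M UTC"))  # 2026-04-30 22:34 UTC
```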

Validation run locally before PR:

python3 -m json.tool records/track_10min_16mb/2026-04-30_SP8192_LQER_SparseGate_Canonical_Repro_1.05985/submission.json
python3 -m py_compile records/track_10min_16mb/2026-04-30_SP8192_LQER_SparseGate_Canonical_Repro_1.05985/train_gpt.py
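
The same two checks can be scripted in one place. This is a sketch under assumptions: `validate_record` is a hypothetical helper, and it uses the built-in `compile()` for the syntax check rather than `py_compile`, so no `.pyc` file is written.

```python
import json
from pathlib import Path

def validate_record(record_dir):
    """Parse submission.json and syntax-check train_gpt.py in a record folder."""
    record = Path(record_dir)

    # Same requirement as `python3 -m json.tool`: the file must parse as JSON.
    with open(record / "submission.json", encoding="utf-8") as fh:
        submission = json.load(fh)

    # Syntax check comparable to py_compile; raises SyntaxError on failure.
    script = record / "train_gpt.py"
    compile(script.read_text(encoding="utf-8"), str(script), "exec")

    return submission
```

Both checks only confirm that the files are well-formed; neither runs the training script nor inspects the reported metrics.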

@dexhunter

Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier 1.04350), plus siblings.
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1 HF dataset whose manifest pins docs_val=50000, docs_train=8181945, sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19), first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27), switched to HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day), rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:
- caseops-memory-leakage/README.md: overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md: 34-row master table with evidence
- caseops-memory-leakage/family-tree.md: ASCII trees with [C]/[L] annotations
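
The arithmetic behind the leak can be sketched as follows. This assumes a prefix split, i.e. that the prep script holds out the first `--val-docs` documents as validation and writes everything after them to the train shards, in the same document order the 50k-doc evaluation set uses; that ordering assumption is not confirmed in this thread. The function name `leaked_val_fraction` is illustrative.

```python
def leaked_val_fraction(prep_val_docs, eval_val_docs):
    """Fraction of eval val docs that end up in training under a prefix split."""
    leaked = max(eval_val_docs - prep_val_docs, 0)
    return leaked / eval_val_docs

# Default prep (--val-docs=10000) evaluated on the 50k-doc val set.
print(leaked_val_fraction(10_000, 50_000))  # 0.8

# Prep with the val size pinned to 50000: no overlap.
print(leaked_val_fraction(50_000, 50_000))  # 0.0

# Gap between merged clean SOTA and the unmerged frontier quoted in the audit.
print(round(1.06128 - 1.04350, 3))  # 0.018
```

The 0.8 figure agrees with the 80% overlap named in the title of issue #2127.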

…ence

After user feedback that LEAK calls relied too heavily on lineage-inheritance and path heuristics, applied stricter criterion: a LEAK verdict requires at least one of
(a) explicit shell-script invocation of prepare_caseops_data.py without --val-docs=50000,
(b) README "Data setup" matching actual train log path,
(c) audit/submission.json admission text,
(d) train log path with `_caseops/datasets/datasets/<name>` triple-nesting OR single `<root>/datasets/<name>` (which only local prep produces; HF always gives double-nesting).

Records that previously got LEAK by lineage-inheritance alone are now AMBIGUOUS unless they meet at least one of those tests.

Changes:
- openai#1945 LEAK → CLEAN (finalize_v18.sh has snapshot_download from HF; actual run path matches HF target; README's prepare_caseops_data.py section is stale documentation)
- openai#1953 LEAK → AMBIGUOUS (PR ships only train_gpt.py + logs; no prep evidence; path matches HF target; parent openai#1945 confirmed CLEAN, so it leans CLEAN but has no direct PR evidence)
- openai#2041 LEAK → AMBIGUOUS (no prep invocation; double-nested path consistent with EITHER HF or local prep)
- openai#2075 LEAK → AMBIGUOUS (ships prep file but no explicit invocation; path matches HF target)

Updated tally: CLEAN 9, LEAK 21, AMBIGUOUS 3, INHERIT 1 (was 8/25/0/1).

Headline impact: realistic clean SOTA is at most ~0.012 bpb below the claimed frontier openai#2118 (1.04350).

Best clean BPB candidates in order:
- openai#2019 1.05847 (HF, confirmed)
- openai#1953 1.05855 (AMBIGUOUS, leans CLEAN)
- openai#1945 1.05943 (HF, confirmed via re-audit)
- openai#2031 1.05985 (HF, confirmed)
- openai#1908 1.06081 (HF, confirmed)
- openai#1851 1.06128 (HF, MERGED SOTA)
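
The stricter criterion and the updated tally can be expressed as a small sketch. Everything here is illustrative: the `Evidence` fields mirror tests (a) through (d) from the message above, while `confirmed_hf_download` and the precedence of LEAK over CLEAN are assumptions, since the audit does not spell out a formal CLEAN rule.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    unpinned_prep_invocation: bool = False  # test (a)
    readme_matches_log_path: bool = False   # test (b)
    admission_text: bool = False            # test (c)
    local_prep_path_nesting: bool = False   # test (d)
    confirmed_hf_download: bool = False     # hypothetical CLEAN signal

def verdict(ev):
    """LEAK needs at least one direct test; lineage alone is not enough."""
    if any([ev.unpinned_prep_invocation, ev.readme_matches_log_path,
            ev.admission_text, ev.local_prep_path_nesting]):
        return "LEAK"
    return "CLEAN" if ev.confirmed_hf_download else "AMBIGUOUS"

def apply_moves(tally, moves):
    """Move one record per (source, destination) pair and return a new tally."""
    updated = dict(tally)
    for src, dst in moves:
        updated[src] -= 1
        updated[dst] = updated.get(dst, 0) + 1
    return updated

# Tally before the re-audit, then the four reclassifications listed above.
old_tally = {"CLEAN": 8, "LEAK": 25, "AMBIGUOUS": 0, "INHERIT": 1}
moves = [("LEAK", "CLEAN")] + [("LEAK", "AMBIGUOUS")] * 3
new_tally = apply_moves(old_tally, moves)

print(new_tally)                # {'CLEAN': 9, 'LEAK': 21, 'AMBIGUOUS': 3, 'INHERIT': 1}
print(sum(new_tally.values()))  # 34
```

The total stays at 34, matching the working-set size, which is a quick consistency check on the reclassification.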

Add canonical top-stack reproduction result

5bbeca8

deborahnelson8788726 force-pushed the codex/pgolf-canonical-repro-105985 branch from 05b7b8f to 5bbeca8 Compare April 30, 2026 22:44

leon2k2k2k mentioned this pull request May 1, 2026

Train/val data leakage in CaseOps records — prepare_caseops_data.py default overlaps 80% of val docs with training data #2127

Open

deborahnelson8788726 wants to merge 1 commit into openai:main from deborahnelson8788726:codex/pgolf-canonical-repro-105985
