
Record support: canonical top-stack reproduction - val_bpb 1.05985 #2031

Open

deborahnelson8788726 wants to merge 1 commit into openai:main from
deborahnelson8788726:codex/pgolf-canonical-repro-105985

Conversation

@deborahnelson8788726

Adds a single-seed canonical reproduction/support record for the public
2026-04-27_SP8192_LQER_SparseGate_BOSSmearFix_9HpStack_1.0611 stack.

Result:

  • Seed 42 post-TTT val_bpb: 1.05985469
  • val_loss: 2.31935492
  • Train cap: 599.570s, 4935 steps on 8xH100 SXM
  • Eval time: 539.328s
  • Total submission size: 15,898,155 bytes
  • Serialized quantized model: 15,866,055 bytes
  • Artifact SHA256: 47d7339fa803c52559a3acfbe4c682332c3871560fe55bea1cb112923d43c298
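A downloaded artifact can be checked against the SHA256 published above. This is a minimal sketch; the filename passed in is illustrative, not the record's actual layout.

```python
# Verify an artifact file against the SHA256 reported in this record.
# The example path is hypothetical; substitute the real artifact location.
import hashlib

EXPECTED_SHA256 = "47d7339fa803c52559a3acfbe4c682332c3871560fe55bea1cb112923d43c298"

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large artifacts never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g.: assert sha256_of("serialized_quantized_model.bin") == EXPECTED_SHA256
```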

This is intentionally framed as reproduction/support evidence rather than a new
technique claim. It uses the source record's train_gpt.py, tokenizer, CaseOps
pipeline, compression path, and hparam stack. The operational fix versus our
earlier failed attempts was using the canonical romeerp/parameter-golf-caseops-v1
pretokenized shards instead of locally re-tokenized raw docs.

The artifact/log timestamp is 2026-05-01 00:34 +0200, i.e.
2026-04-30 22:34 UTC.

Validation commands run locally before opening the PR:

  • python3 -m json.tool records/track_10min_16mb/2026-04-30_SP8192_LQER_SparseGate_Canonical_Repro_1.05985/submission.json
  • python3 -m py_compile records/track_10min_16mb/2026-04-30_SP8192_LQER_SparseGate_Canonical_Repro_1.05985/train_gpt.py
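Alongside the JSON and syntax checks above, the submission-size figure can be re-derived by walking the record directory. A minimal sketch; reading "16mb" in the track name as 16 * 1024 * 1024 bytes is an assumption (the reported total of 15,898,155 bytes fits either reading of the cap).

```python
# Hypothetical pre-PR check: sum on-disk bytes under the record directory
# and compare against the track size cap.
import os

CAP_BYTES = 16 * 1024 * 1024  # assumed reading of "16mb"; 16,000,000 also fits

def submission_bytes(root):
    """Total size in bytes of every regular file under root, recursively."""
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

# e.g.: assert submission_bytes("records/track_10min_16mb/<record>") <= CAP_BYTES
```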

@deborahnelson8788726 force-pushed the codex/pgolf-canonical-repro-105985 branch from 05b7b8f to 5bbeca8 on April 30, 2026 22:44
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
…ence

After user feedback that LEAK calls relied too heavily on lineage-inheritance
and path heuristics, applied stricter criterion: a LEAK verdict requires at
least one of (a) explicit shell-script invocation of prepare_caseops_data.py
without --val-docs=50000, (b) README "Data setup" matching actual train log
path, (c) audit/submission.json admission text, (d) train log path with
`_caseops/datasets/datasets/<name>` triple-nesting OR single `<root>/datasets/<name>`
(which only local prep produces; HF always gives double-nesting).
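Criterion (d) can be sketched as a small path classifier. This is my own reading of the rule as stated above; the function name and the segment-counting heuristic are illustrative, not the auditor's actual code.

```python
# Classify a train-log data path per criterion (d): adjacent
# `datasets/datasets` nesting or a single `datasets` segment is a local-prep
# (LEAK) signature; two non-adjacent `datasets` segments match the HF
# snapshot layout. Heuristic is an approximation of the stated rule.
def path_verdict(log_path):
    segs = log_path.strip("/").split("/")
    n = segs.count("datasets")
    adjacent = any(a == b == "datasets" for a, b in zip(segs, segs[1:]))
    if adjacent or n == 1:
        return "LEAK-signal"    # only local prep produces these shapes
    if n == 2:
        return "HF-consistent"  # consistent with the HF snapshot download
    return "no-signal"
```

For example, `runs/_caseops/datasets/datasets/fineweb` would classify as a LEAK signal, while a doubly nested snapshot path would not.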

Records that previously got LEAK by lineage-inheritance alone are now AMBIGUOUS
unless they meet at least one of those tests.

Changes:
  - openai#1945 LEAK → CLEAN  (finalize_v18.sh has snapshot_download from HF;
    actual run path matches HF target; README's prepare_caseops_data.py
    section is stale documentation)
  - openai#1953 LEAK → AMBIGUOUS  (PR ships only train_gpt.py + logs; no prep
    evidence; path matches HF target; parent openai#1945 confirmed CLEAN —
    leans CLEAN but no direct PR evidence)
  - openai#2041 LEAK → AMBIGUOUS  (no prep invocation; double-nested path
    consistent with EITHER HF or local prep)
  - openai#2075 LEAK → AMBIGUOUS  (ships prep file but no explicit invocation;
    path matches HF target)

Updated tally: CLEAN 9, LEAK 21, AMBIGUOUS 3, INHERIT 1 (was 8/25/0/1).

Headline impact: the best confirmed clean val_bpb (openai#2019, 1.05847) sits
~0.015 bpb above the claimed frontier openai#2118 (1.04350). Best clean BPB
candidates in order:
  openai#2019 1.05847 (HF, confirmed)
  openai#1953 1.05855 (AMBIGUOUS, leans CLEAN)
  openai#1945 1.05943 (HF, confirmed via re-audit)
  openai#2031 1.05985 (HF, confirmed)
  openai#1908 1.06081 (HF, confirmed)
  openai#1851 1.06128 (HF, MERGED SOTA)