Update Parameter Golf leaderboard with BOS fix by cocohearts · Pull Request #1902 · openai/parameter-golf

cocohearts · 2026-04-28T18:57:39Z

README-only p-value progression leaderboard update. Applies the p<0.25 chronological progression cutoff after scanning PRs #1494-#1908 and addressing follow-up review comments.

- Keeps the chronological-frontier rows #1514/#1529/#1530/#1610/#1626/#1667/#1729/#1736/#1769/#1787/#1851/#1868/#1855.
- Includes #1529 because its ML code/record evidence predates #1530 and its corrected 3-seed mean 1.07578747 beats the prior #1514 frontier (p=0.001); #1530 remains because it later beats #1529.
- Excludes #1518 because its record code/evidence landed after #1530 and its final mean is worse than #1530, so it is not a chronological frontier row.
- Excludes #1584 because it never becomes a frontier row under the same code/evidence chronology (#1530 is already better).
- Adds #1855 as the new top row using its submitted compliant 3-seed mean 1.06108, with broader reproduction evidence giving p=0.188 vs the latest #1868 compliance rerun.
- Excludes valid-but-non-progression rows plus invalid/conditional rows (PPM-D/byte-mixture C2, pre-quant/future-validation leakage, over-cap artifacts, duplicates, missing-evidence submissions, p-fail rows, and single-seed tiny-margin rows).

Co-authored-by: Codex <noreply@openai.com>

codemath3000 · 2026-04-28T19:36:57Z

@cocohearts Thank you so much for taking a look! I was looking over the results, and, including the independent reproduction done on #1855 (comment) as additional samples of #1855's distribution, the 6-sample picture is:

| seed | source | val_bpb |
|------|--------|---------|
| 42 | #1855 submitted runs | 1.05989 |
| 0 | #1855 submitted runs | 1.06125 |
| 1234 | #1855 submitted runs | 1.06209 |
| 42 | @okezue reproduction | 1.05965 |
| 314 | @okezue reproduction | 1.06041 |
| 999 | @okezue reproduction | 1.06124 |

6-sample mean 1.060755 BPB, sample std (n−1) 0.000933.

Welch's two-sample t-test vs #1851/#1868 (n=3, mean 1.06145, std 0.00068):

Mean delta: 0.000695 BPB (~0.00152 nats)
SE ≈ 0.000547, t ≈ 1.27, df ≈ 5.6
One-sided p ≈ 0.127493397391

That's under the 0.25 threshold compared to #1851/#1868. Therefore, #1855 does appear to be valid per my understanding.
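For anyone who wants to re-check the arithmetic, here is a minimal sketch of the Welch calculation using only the Python standard library. It assumes the six val_bpb values listed in this comment and the quoted #1851/#1868 summary (n=3, mean 1.06145, std 0.00068). The helper `student_t_upper_tail` is a hypothetical name introduced for this sketch; it integrates the Student-t density numerically rather than calling a statistics package, so treat the result as an approximation.

```python
import math
import statistics

# val_bpb samples for #1855 as listed in this comment
# (three submitted runs plus three reproduction runs).
samples_1855 = [1.05989, 1.06125, 1.06209, 1.05965, 1.06041, 1.06124]

# Summary statistics quoted for the #1851/#1868 comparison.
base_mean, base_std, base_n = 1.06145, 0.00068, 3


def student_t_upper_tail(t: float, df: float, steps: int = 2000) -> float:
    """Approximate P(T > t) for t >= 0 with Simpson's rule on the t density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x: float) -> float:
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # Half the mass lies above zero; subtract the part between 0 and t.
    return 0.5 - total * h / 3


mean_new = statistics.mean(samples_1855)
std_new = statistics.stdev(samples_1855)  # sample std (n-1 denominator)
n_new = len(samples_1855)

var_new = std_new ** 2 / n_new
var_base = base_std ** 2 / base_n
se = math.sqrt(var_new + var_base)
t_stat = (base_mean - mean_new) / se  # positive when #1855 has lower BPB

# Welch-Satterthwaite degrees of freedom.
df = (var_new + var_base) ** 2 / (
    var_new ** 2 / (n_new - 1) + var_base ** 2 / (base_n - 1)
)
p_one_sided = student_t_upper_tail(t_stat, df)

print(f"mean={mean_new:.6f} std={std_new:.6f}")
print(f"SE={se:.6f} t={t_stat:.2f} df={df:.1f} p={p_one_sided:.3f}")
```

The comparison is one-sided because the question is only whether #1855 is lower than the #1851/#1868 mean, not whether the two differ in either direction.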

PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855 only on significance grounds (p=0.325). Our prior 050 line built on openai#1797 which is under validity-cloud per cocohearts. Re-anchor research baseline on openai#1855's accepted chain. Pure port — zero modifications. Files copied verbatim from codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack @ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/. Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.) on this baseline.

codemath3000 · 2026-04-28T20:07:18Z

@cocohearts Separate issue from the #1855 chain-inclusion question above — flagging a concern about #1518 that's independent of anything on #1855. It came up on #1900 (here), but it bears on the chain decisions being made in this leaderboard update, so consolidating here as well.

Timeline / scored value at opening

Per @msisovic's note on #1900: when #1518 was first opened, its score was worse than #1529's score at #1529's opening, and worse than #1584's score at #1584's opening as well. By score-at-opening, both #1529 and #1584 came in ahead of #1518, and we'd appreciate them being included in the chain on that basis.

It's also worth noting that #1584 is valid irrespective of statistical significance, per the official README rule:

"For submissions that improve speed through systems optimization without changing the ML, [the statistical significance] requirement is waived."

#1584 is a systems-optimization submission (no ML changes), so the statistical-significance bar doesn't apply to its inclusion.

This is consequential because chain inclusion of #1518 currently displaces #1529 and #1584 from the SOTA timeline. Sharing for the maintainers' chain-inclusion call — and very much appreciate the careful work going into reconstructing the chain.

Co-authored-by: Codex <noreply@openai.com>

msisovic · 2026-04-28T20:32:44Z

Cross posting my comment from the other PR:

#1900 (comment)

Co-authored-by: Codex <noreply@openai.com>

cocohearts · 2026-04-28T20:33:13Z

Addressed both follow-up comments in the README table. I added #1855 as the new top row using the combined 6-sample evidence from the submission plus independent reproduction (one-sided Welch p≈0.127 vs #1851/#1868). I also added #1529 and #1584 to reflect score-at-opening chronology before #1518's later score update; #1584 is marked as a systems-only progression row, so the statistical-significance requirement is waived under the README rule. Direct #1784/#1797 remain excluded under the p<0.25 progression cutoff, with #1797 credited through downstream BOS-fixed rows.
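As a rough illustration of how the rules discussed in this thread combine, the sketch below walks a chronologically ordered list of rows and keeps a row only if it beats the current frontier and clears the p<0.25 cutoff, with the significance check skipped for systems-only rows. Everything here is hypothetical: the `Row` type, the labels A to E, and all scores and p-values are made up for illustration and are not leaderboard data. It also assumes that a systems-only row must still have a lower mean than the frontier, which the thread does not state.

```python
from dataclasses import dataclass
from typing import List, Optional

P_CUTOFF = 0.25  # progression cutoff quoted in the thread


@dataclass
class Row:
    name: str                       # hypothetical label, not a real PR
    mean_bpb: float                 # lower is better
    p_vs_frontier: Optional[float]  # one-sided p vs the frontier at the time
    systems_only: bool = False      # significance waived when True


def chronological_frontier(rows: List[Row]) -> List[Row]:
    """Keep rows that advance the frontier; rows must be in chronological order."""
    frontier: List[Row] = []
    for row in rows:
        if frontier:
            # Assumption: every row, including systems-only, must beat the frontier.
            if row.mean_bpb >= frontier[-1].mean_bpb:
                continue
            # Significance check, waived for systems-only rows.
            if not row.systems_only:
                if row.p_vs_frontier is None or row.p_vs_frontier >= P_CUTOFF:
                    continue
        frontier.append(row)
    return frontier


# Made-up rows, purely illustrative.
rows = [
    Row("A", 1.0800, None),                     # first row seeds the frontier
    Row("B", 1.0790, 0.40),                     # better, but fails the p cutoff
    Row("C", 1.0795, 0.01),                     # better than A and significant
    Row("D", 1.0794, None, systems_only=True),  # significance waived
    Row("E", 1.0799, 0.01),                     # worse than D, dropped
]
kept = [row.name for row in chronological_frontier(rows)]
print(kept)
```

A real audit would also have to recompute each p-value against whichever row is the frontier at that point; the sketch takes those values as given.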

msisovic · 2026-04-28T20:40:04Z

Thanks for taking a look @cocohearts!

codemath3000 · 2026-04-28T20:41:07Z

@cocohearts Thank you so much for working through all of this and for handling the resolution. Really appreciate the time you put into the leaderboard update. Please feel free to follow up if any further questions or concerns come up on my end of things; happy to dig into anything further.

CiprianFlorin-Ifrim · 2026-04-28T20:59:11Z

@cocohearts Would there be a chance for you to look at the PRs that were published before 1400? Some PRs had the highest score before some of the new ones and they got ignored.

Separately, will you have a chance to look at PRs that specifically target the 2nd leaderboard? I have 3 PRs (Ternary #923, which adds to the binary run that's already present in the 2nd leaderboard; XNOR-net #1388; LeWorldModel Mamba2 #903; all 10 mins and unlimited compute), and I'm sure others have many too that were for the 2nd leaderboard.

serdardoesml · 2026-04-29T13:04:44Z

Any reason why this is not merged? Can we consider #1855 the current best record?

…ams) After 4 parallel research agents reviewed 30+ open PRs and compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo. V19's specific stack is NOT directly invalidated.
2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855 base 1.06108 = -0.00059 BPB). Just 2 hparam env vars:
   - MATRIX_LR 0.026 -> 0.028
   - PHASED_TTT_PREFIX_DOCS 2500 -> 3500

   Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:

- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0 (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):

- V19c < 0.97591 -> CLEAR WIN, run 3-seed
- V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
- V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:

- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)

Co-authored-by: Codex <noreply@openai.com>

msisovic · 2026-04-29T18:34:31Z

Hi @cocohearts, I notice the latest update removes my row, along with a very similar entry. I assumed it was supposed to deduplicate, or was this intentional?

Co-authored-by: Codex <noreply@openai.com>

msisovic · 2026-04-29T19:00:52Z

Hi @cocohearts, I notice the latest update removes my row, along with a very similar entry. I assumed it was supposed to deduplicate, or was this intentional?

Took another look: the other entry wasn't a duplicate, it just had a similar title. Still, the point stands: my PR was removed even though it is a valid submission.

Co-authored-by: Codex <noreply@openai.com>

simon-marcus · 2026-04-29T22:18:56Z

@cocohearts flagging for the next leaderboard pass: my submission #1925 landed just outside the #1902 stated scan range (#1494-#1908), but stands as a seed-matched progression over #1855.

Current headline is 1.06031736 BPB, using the same seed set as #1855 (42, 0, 1234). This is a composite eval-only update on the original #1925 quantized artifacts: training/quantization are unchanged, and only the legal score-first TTT eval is updated to PHASED_TTT_PREFIX_DOCS=3500, PHASED_TTT_NUM_PHASES=1, TTT_LORA_LR=8e-5.

Matched deltas vs #1855:

seed 42: 1.05906444 vs 1.05989454 (-0.00083010)
seed 0: 1.06059202 vs 1.06124613 (-0.00065411)
seed 1234: 1.06129561 vs 1.06208695 (-0.00079134)
mean: 1.06031736 vs 1.06107587 (-0.00075852)
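The matched deltas above can be re-derived with a few lines of Python. This is only a check of the subtraction and averaging, using the per-seed values quoted in this comment; the variable names are chosen for this sketch and do not come from either submission.

```python
import statistics

# Per-seed val_bpb values as quoted in this comment.
runs_1925 = {42: 1.05906444, 0: 1.06059202, 1234: 1.06129561}
runs_1855 = {42: 1.05989454, 0: 1.06124613, 1234: 1.06208695}

# Seed-matched difference: negative means #1925 has the lower BPB.
deltas = {seed: runs_1925[seed] - runs_1855[seed] for seed in runs_1925}

mean_1925 = statistics.mean(runs_1925.values())
mean_1855 = statistics.mean(runs_1855.values())
mean_delta = statistics.mean(deltas.values())

for seed, delta in deltas.items():
    print(f"seed {seed}: {runs_1925[seed]:.8f} vs {runs_1855[seed]:.8f} ({delta:+.8f})")
print(f"mean: {mean_1925:.8f} vs {mean_1855:.8f} ({mean_delta:+.8f})")
```

Pairing by seed removes seed-to-seed variance from the comparison, which is why the per-seed deltas are much tighter than the spread of either set of runs.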

Composite logs are included as train_seed*_ttt_n1_lora8e5.log, with an explicit eval-only continuation marker and no model artifacts.

* Update parameter golf leaderboard with BOS fix
* Credit PR 1797 in leaderboard update
* Credit CaseOps and PR 1787 leaderboard rows
* Apply p-value progression leaderboard cutoff
* Address leaderboard review comments
* Clarify BOS fix leaderboard evidence
* Shorten leaderboard p-value notes
* Remove non-frontier leaderboard rows
* Clarify SmearGate BOS fix attribution
* Exclude openai#1518 from chronological frontier
* Use submitted openai#1855 score
* Restore openai#1529 chronological frontier
* Restore openai#1529 chronological frontier

---------

Co-authored-by: Codex <noreply@openai.com>

cocohearts and others added 3 commits April 28, 2026 11:57

Update parameter golf leaderboard with BOS fix

3a7c4be

Co-authored-by: Codex <noreply@openai.com>

Credit PR 1797 in leaderboard update

0d04647

Co-authored-by: Codex <noreply@openai.com>

Credit CaseOps and PR 1787 leaderboard rows

ce84ddc

Co-authored-by: Codex <noreply@openai.com>

Apply p-value progression leaderboard cutoff

13e76ac

Co-authored-by: Codex <noreply@openai.com>

Address leaderboard review comments

dbc31de

Co-authored-by: Codex <noreply@openai.com>

cocohearts and others added 3 commits April 29, 2026 11:20

Clarify BOS fix leaderboard evidence

4f06140

Co-authored-by: Codex <noreply@openai.com>

Shorten leaderboard p-value notes

f10108d

Co-authored-by: Codex <noreply@openai.com>

Remove non-frontier leaderboard rows

c0a8819

Co-authored-by: Codex <noreply@openai.com>

cocohearts and others added 3 commits April 29, 2026 11:39

Clarify SmearGate BOS fix attribution

464db95

Co-authored-by: Codex <noreply@openai.com>

Exclude #1518 from chronological frontier

69a8997

Co-authored-by: Codex <noreply@openai.com>

Use submitted #1855 score

66a076a

Co-authored-by: Codex <noreply@openai.com>

cocohearts and others added 2 commits April 29, 2026 12:08

Restore #1529 chronological frontier

53ee400

Co-authored-by: Codex <noreply@openai.com>

Restore #1529 chronological frontier

ff13842

Co-authored-by: Codex <noreply@openai.com>

cocohearts merged commit bea92e7 into main Apr 29, 2026

ndokutovich mentioned this pull request Apr 30, 2026

Record: V21 + N-gram Tilt + LeakyReLU 0.3 — val_bpb 1.05851 (3-seed mean) #1967

Open

cocohearts mentioned this pull request May 2, 2026

Update leaderboard with May 1 audited rows #2146

Merged

simon-marcus mentioned this pull request May 4, 2026

Record candidate: 1.06032 CaseOps + Matrix-LR 0.028 + TTT n=1 #1925

Open

8 tasks

cocohearts merged 13 commits into main from codex/update-parameter-golf-leaderboard-with-bosfix
