Update leaderboard with May 1 audited rows #2146
Conversation
Thank you @cocohearts and all the participants. Until next time!
@cocohearts Thanks for the audit work. I'm on the same page regarding most of the exclusions. One pushback on the rationale for PR #2135.

Precedent. PR #1851 was filed without sufficient run logs/results, and PR #1868 supplied that evidence later, after PR #1855 had already beaten #1851/#1868. The combined submission was still accepted as part of the leaderboard record. That is the structural situation PR #2135 cites: code pre-cutoff, logs and results landing afterward.

Consistency. The rationale for excluding PR #2135 appears to be "all logs and results in by cutoff." That criterion is not stated in the README, and applying it consistently would retroactively invalidate PR #1851's leaderboard spot: PR #1851's scored 3-seed state was only complete once PR #1868 supplied the missing logs, by which point PR #1855 had already beaten the #1851/#1868 record. Under a "logs and results in by cutoff" rule, PR #1851/#1868 would have been beaten before its submission was complete, and could never have taken its leaderboard spot in the first place.

README rule. "Since all submissions are public, we're accepting record submissions chronologically depending on their PR creation time." That pins ordering on PR creation time, not on when logs and results reach completion.

Timing and code surface. PR #2135 was opened at 2026-05-01 23:48:57Z (4:48 PM PT, ~11 minutes before the 5 PM PT cutoff). The finalized 3-seed results were pushed afterward, but the PR itself was filed pre-cutoff. The pre-cutoff commit (be7420d) shipped the full code surface: train_gpt.py, lossless_caps.py, online_ngram_state.c, online_ngram_tilt.py, prepare_caseops_data.py, the tokenizer .model, requirements.txt, and scaffold README/submission.json. The two post-cutoff commits (f086a9f, ff90522) only added train_seed*.log run outputs and filled result numerics into README.md and submission.json. No methodology, architecture, training script, or tokenizer changed afterward.

Reproducibility. The seed log files added afterward are run outputs of the pre-cutoff code, not new submission content. Anyone with 8xH100 SXM access could clone be7420d and regenerate the runs within run-to-run noise. What was "missing" at cutoff wasn't the submission; it was the empirical measurement of a submission that was already public and reproducible.

Conclusion. The README's PR-creation-time rule and the #1851/#1868 precedent both place PR #2135 on the same footing as PR #1851/#1868 for leaderboard inclusion: PR opened pre-cutoff, full code surface in-tree pre-cutoff, logs and results landing afterward. PR #2135 is a valid record submission and should be included on the leaderboard. Thanks so much again for taking the time to review these submissions.
congrats everyone!!
Thanks for the thorough audit @cocohearts. One flag before this merges: PR #2130 has the same train/val document overlap as PR #2018, which you correctly excluded.

Evidence: SUBMISSION_FINAL/train_seed314.log line 1 reports train_shards: 1499. Per Issue #2127, train_shards: 1499 is the fingerprint of a local prepare_caseops_data.py run with the default --val-docs=10000. Train starts at document 10,000; the competition val set covers documents 0–49,999. Result: documents 10,000–49,999 (40k docs, 80% of the val set) appear in both the training data and the scored val split.

By contrast, the other three rows you included (#1945, #1953, #2014) all use snapshot_download from romeerp/parameter-golf-caseops-v1 and report train_shards: 80 with an explicit 50,000-doc val split — those are clean. PR #2130's README claims "Same dependencies and CaseOps tokenizer/shards as merged PR #1855", but the log contradicts this: PR #1855 uses the canonical HF dataset (80 shards), whereas #2130 generated data locally with the leaky default. The claimed 1.05670 should be excluded on the same grounds as #2018.
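A minimal sketch of that overlap arithmetic, assuming the document-index conventions described above (competition val split = documents 0–49,999; a local run with the default --val-docs=10000 holds out only documents 0–9,999, so training starts at document 10,000). The corpus size and the helper name here are placeholders, not values from the logs:

```python
# Sketch only: quantify the train/val leak implied by the default --val-docs=10000,
# given a scored competition val split over documents 0-49,999 (per the discussion
# above and Issue #2127). TOTAL_DOCS is a stand-in; any value >= 50,000 is equivalent.

def overlapping_docs(train_start, train_stop, val_start, val_stop):
    """Count document indices falling in both half-open ranges [start, stop)."""
    return max(0, min(train_stop, val_stop) - max(train_start, val_start))

TOTAL_DOCS = 1_000_000                 # placeholder corpus size, not from the logs
COMPETITION_VAL = (0, 50_000)          # scored val split: documents 0-49,999
LOCAL_TRAIN = (10_000, TOTAL_DOCS)     # default --val-docs=10000 -> train starts at 10,000

leak = overlapping_docs(*LOCAL_TRAIN, *COMPETITION_VAL)
print(leak)                                              # 40000 overlapping documents
print(leak / (COMPETITION_VAL[1] - COMPETITION_VAL[0]))  # 0.8 -> 80% of the scored val set
```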
@cocohearts @leon2k2k2k Acknowledged on PR #2130; the train_shards: 1499 + doubled datasets/datasets path is hard to read any other way. Worth noting for the leaderboard reconciliation: PR #2135 does not share the same validity issue as PR #2130 even though it's based on #2130. PR #2135 inherits PR #2130's
Therefore, #2135 is valid regardless of the validity of #2130.

Semi-unrelated, but also flagging that numerous people on the Discord have raised reproducibility questions about PR #2014. Relevant since the chain backs up to PR #2014 if PR #2130 is excluded.
Small update on #2140: I agree with the exclusion rationale for the originally submitted state, because it accidentally had the within-word / word-start / agreement n-gram channels active. I’ve pushed a corrective commit restoring the intended token-only posture I had in #2018. The corrected logs now report:

Corrected 3-seed mean is

I defer to maintainers on how to treat the timing/eligibility question for #2140 in this audit PR, but wanted to make the technical correction visible here. Thanks again to @cocohearts @0hq @valerio-oai and the other golfers for a great competition!
@cocohearts @0hq @valerio-oai Separate from any of the specific-submission discussions, just want to say thank you to all three of you for the work that went into making this contest possible. It's a genuinely interesting challenge, accessible to people without industrial-scale compute, with a records archive that's already a serious body of method ideas. The audit and review work on top of that, at this scale, is real labor. None of that exists without you. Appreciated.
Updated after applying the grace policy:
The README diff is now four rows: #1945 V21 v2, #1953, #2014, and #2135.
Thanks @cocohearts. One small note, broadly consistent with @codemath3000’s point about timing and precedent: like his approved #2135, my PR was opened before the cutoff. My #2140 is admittedly not identical to #2135, since my corrective commit did touch code, not just logs. But that change was narrowly compliance-restoring: it disabled the unintended target-token-gated channels and moved the score worse, from the originally submitted number to

I'd humbly hope maintainers decide there is room for a little grace here, for my compliance correction, which was clearly made in the spirit of the competition, especially since there was a little uncertainty about precisely when the cutoff would be and what adjustments might be permitted. If the rule is strictly “final scored state pushed before cutoff,” I understand the exclusion. Thanks for your consideration.
@cocohearts why did #2019 not make it?
Thanks @simon-marcus. Just to clarify the current state for context: under @cocohearts's most recent call, PR #2135 is also still on the excluded side, applying the "scored-state-pushed-before-cutoff" rule. @simon-marcus, I do agree with your broader point about leaving room for narrowly measurement-completing or compliance-restoring work that lands shortly after cutoff, especially when the PR itself was filed pre-cutoff. I'd similarly hope #2135 receives the same consideration.
Applying the maintainer grace policy here: if the original idea/code was submitted before cutoff, then later validation/results or narrowly compliance-restoring fixes can be considered. Changes made in this PR:
@cocohearts Genuinely grateful for the thoughtful work on this. Especially appreciate the willingness to take another look after the discussion played out. If anything else needs clarifying or more context from me later on, just let me know.
@cocohearts I want to ask a question about the scope of the grace policy. It says it's for compliance-restoring work that landed shortly after the cutoff, which fits what @simon-marcus said, but codemath's PR #2135 reads differently to me. It was one of six parallel PRs that he filed two minutes before the end of the competition, each with a distinct hyperparameter variant on two different bases. He closed five of them, and only #2135 remained because its post-cutoff completion was competitive. In my opinion, that's post-hoc selection from a parallel sweep, not compliance-restoring work on an intended submission.

Under that reading, filing N sweep PRs with no BPB results at the deadline becomes a viable strategy: you keep whichever one lands well, you close the rest, and then you claim the winner as the intended one. I respected the deadline strictly. I even killed an in-progress run at the cutoff, and I assumed others did too. So I think it's worth clarifying whether reviewer grace covers sweep selection. I'm not contesting the technical work, but I am raising this as a policy interpretation question.
@andrewbaggio1 To put the "parallel sweep" framing on accurate footing:

The six PRs were a 2x3 cross of two existing record candidates ({PR #2014, PR #2130}) against three explicit lever settings ({GATE_WINDOW=8, GPTQ_CALIBRATION_BATCHES=32, both stacked}). Both bases were publicly-filed record candidates in the repo at the time. Both levers were knobs already characterized in earlier public work, each known to help on at least one stack. The exploration was directed at one specific transfer question, not a hyperparameter scan.

The closure descriptions on the five scaffolds explain exactly why each variant didn't make it. PRs #2131–#2133 (the PR #2014-base variants) closed because PR #2014's published numbers haven't been independently reproducible (a concern multiple participants have separately raised on the contest Discord). That's a base-level issue that has nothing to do with our submissions; those PRs would have closed even if PR #2135 had landed worse than PR #2130. PRs #2134/#2136 (GW=8 on the PR #2130 base) closed because GW=8 was net-negative on that stack. That's a measured technical finding about the lever, not a "this didn't happen to win" framing. Both sets of closures rest on the substance of what the runs revealed, not on PR #2135's relative ranking.

The substantive distinction is between undirected hyperparameter search over a wide space and a small, pre-identified, structured exploration of known levers on known bases. PR #2135 wasn't post-hoc selected from a much larger space; it was one of six explicit variants, all defined in code shipped pre-cutoff. Two levers across two bases is also far too narrow a search space for sweep-style selection to be effective at all; that strategy requires many more degrees of freedom to plausibly produce a winner from noise. For reference, when I have actually run hyperparameter sweeps in this work, those have ranged upwards of 50 configurations.

Also worth re-anchoring on @cocohearts's stated policy: "if the original idea/code was submitted before cutoff, then later validation/results or narrowly compliance-restoring fixes can be considered." That enumerates two distinct cases. PR #2135 falls under the first, since the post-cutoff content was log files and result numerics for code shipped pre-cutoff, which is what "later validation/results" describes. PR #2140 is the second-case example. The policy as written explicitly covers this situation.

Finally, the README doesn't place a cap on simultaneous PRs, and multiple-PR workflows have been standard throughout this contest. Treating PR #2135 differently for a pattern no written rule covers would be inconsistent.
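As a concrete picture of that 2x3 cross, a minimal sketch enumerating the variants; the base and lever names come from this thread, while the data layout and the count check are illustrative assumptions, not the actual submission tooling:

```python
from itertools import product

# Illustrative enumeration of the 2x3 cross described above (not the real tooling).
# Base and lever names are taken from the discussion; everything else is assumed.
bases = ["PR #2014 stack", "PR #2130 stack"]
lever_settings = [
    {"GATE_WINDOW": 8},
    {"GPTQ_CALIBRATION_BATCHES": 32},
    {"GATE_WINDOW": 8, "GPTQ_CALIBRATION_BATCHES": 32},  # both levers stacked
]

variants = list(product(bases, lever_settings))
assert len(variants) == 6  # one record-candidate PR was filed per variant
for base, levers in variants:
    print(base, levers)
```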
Thank you @codemath3000 for clarifying your intent, but you said that the reason for closing #2131–#2133 was that #2014 wasn't reproducible, and that #2134/#2136 closed because GW=8 was net negative. That's still selecting on outcomes, regardless of whether the variance was small and predefined or large and undirected. Each of your six PRs was also titled "record candidate", not "ablation" or "non-record exploration", which suggests that at filing time each was competing for a slot, not answering a research question. If your goal was the transfer question, one PR with the cross-product results would have answered it. I'm just flagging this distinction for the policy reading and not insisting on any specific outcome; that's ultimately up to @cocohearts.
@andrewbaggio1 Concrete context I should have led with: PR #2130 landed at 23:22:40Z, ~37 minutes before the 00:00:00Z cutoff. The standard "run, observe results, submit if it beats baseline" pipeline requires >20 minutes per run on 8xH100, plus the additional time to compose and submit a PR after results land. Fitting even one sequential filter run into the 37-minute window would have been extremely tight; fitting two was impossible. And RunPod's per-account hourly spend limits prevented me from making effective use of a parallel approach either.

The two levers (GATE_WINDOW=8, GPTQ_CALIBRATION_BATCHES=32) weren't random hyperparameters; both had been tested on PR #2014's stack pre-cutoff and shown to improve it relative to baseline. PR #2014 itself turned out non-reproducible at its headline, but the levers' positive effect was a meaningful pre-cutoff signal. The open question was whether either or both would transfer to the PR #2130 stack with its additional knobs. PR #2130 had also been filed too recently for me to independently verify its compliance before the cutoff, so keeping PR #2014 as a parallel base was a fallback in case PR #2130 turned out non-viable. Hedging at filing time against the possibility of a non-compliant base isn't "selection on outcomes"; it's standard risk management under uncertainty.

That's three lever combinations across two bases, giving the 2x3. Filing each combination as a separate PR wasn't a strategy to maximize slots; it was the only way to keep all pre-cutoff candidates submittable, since I had no time to filter pre-cutoff and contest convention is one candidate per PR. With another hour pre-cutoff, I'd have tested sequentially and filed one PR. The alternative under a strict no-fan-out reading would have been filing zero candidates after PR #2130 landed, since none could have been pre-validated in the time available. Filing record-candidate PRs and closing the ones that don't pan out is also routine practice in this contest; dexhunter, the contributor with the most records, has done this multiple times.

Happy to provide further detail on any of this if useful.
@cocohearts some of the recent comments here have been edited substantially since they were posted. if you're using codex to review this thread, iirc github's comment endpoint only returns the current body. pls pull edit history too |
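If it helps whoever pulls the history: a rough sketch of one way to do that via GitHub's GraphQL API, which exposes per-comment edit metadata through the userContentEdits connection (the REST comments endpoint only returns the current body, as noted above). Owner/repo are placeholders and the token handling is minimal; this is a sketch, not a vetted script:

```python
import os

import requests

# Sketch: list which PR comments have been edited, and when, via GraphQL.
# OWNER/REPO are placeholders; export GITHUB_TOKEN before running.
QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      comments(first: 100) {
        nodes {
          author { login }
          createdAt
          userContentEdits(first: 20) {
            nodes { editedAt editor { login } }
          }
        }
      }
    }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": QUERY, "variables": {"owner": "OWNER", "name": "REPO", "number": 2146}},
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
    timeout=30,
)
resp.raise_for_status()
comments = resp.json()["data"]["repository"]["pullRequest"]["comments"]["nodes"]
for c in comments:
    edits = c["userContentEdits"]["nodes"]
    if edits:
        author = c["author"]["login"] if c["author"] else "ghost"
        print(author, c["createdAt"], "edited at:", [e["editedAt"] for e in edits])
```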
@andrewbaggio1 Both of us have edited comments in this thread, which is normal. My edits have been for clarity, quality, and supporting context, with no claim retracted or position shifted. Anyone can verify this directly from the publicly available edit history if needed.
@cocohearts @valerio-oai There's one submission that was accepted for the non-record leaderboard, the first one that properly looked at depth recurrence, from @evangelinehelsinki, that was never actually added to the Notable Non-record Leaderboard. Can that be fixed, given it was already accepted by @0hq and he probably forgot to then add it to the leaderboard?

Separately, I also have a submission, #923, for the same leaderboard, that does ternary (my previously accepted V1 Ternary to the main leaderboard) with unlimited compute, reaching 1.10, which would put it at the top of the non-record leaderboard. There's also a separate main leaderboard ternary submission, the v2 #923, building on the accepted one, that adds more params in the same 16MB package under ternary and gets a lower bpb. That one could either be accepted as a separate entry, added under the already existing v1 (combining them in one row), or just closed if rejected.
non-record leaderboard will be updated this week pls dont spam tysm
@cocohearts There has been no spam. My previous comment focused primarily on a submission that was accepted over a month ago and yet never added to the leaderboard, a submission that isn't mine to begin with. My two submissions that I tagged were already confirmed to me by Will on Discord and never merged/added, especially given he "retired". Valerio started accepting some submissions for the non-record leaderboard in the last 24h, ergo my comment here. Another message I left on a different thread references the many PRs up to 1400 that were ignored in the latest revisions of the main/1st leaderboard, which is unfair to the people who completed the work, especially given your own README states "we're accepting record submissions chronologically depending on their PR creation time". That's another thing that has nothing to do with me but had to be mentioned, as it should be clarified to the community. Given these threads are made to update the community on the latest, and to let others share anything here, as I've done with my two messages, I'd say we are far from spam, and I'd appreciate it if any comments made next time are accurate, tysm.

Summary
Adds the audited post-#1902 leaderboard progression rows using the maintainer grace policy: a PR opened before the May 1, 2026 5:00 PM Pacific cutoff can count when the original idea/code was public before cutoff and later commits only supplied validation/results or narrowly compliance-restoring fixes.
Rows added:
70067534: 1.05943, p=0.034 vs PR #1855 Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack — val_bpb 1.06108 (3-seed mean)

Audit Notes
I scanned 192 PRs (#1944-#2140) created after the #1902 leaderboard merge and before the cutoff, using parallel Codex shard graders plus a final global chronological reconciliation.
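For readers following the reconciliation step, a hedged sketch of the chronological pass the README rule implies (accept a row only if it is eligible and beats the best accepted val_bpb at its PR creation time); the Candidate fields and eligibility flag are placeholders, not the actual grader:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    pr: int
    created_at: str   # ISO-8601 PR creation timestamp (the README's ordering key)
    val_bpb: float    # 3-seed mean
    eligible: bool    # passed compliance / grace-policy checks

def reconcile(candidates):
    """Walk candidates in PR-creation order; keep each one that beats the running best."""
    best_bpb = float("inf")
    records = []
    for c in sorted(candidates, key=lambda c: c.created_at):
        if c.eligible and c.val_bpb < best_bpb:
            best_bpb = c.val_bpb
            records.append(c)
    return records
```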
Grace-policy handling:
train_shards: 80, no doubled local datasets/datasets path, val_tokens: 47851520).

Notable exclusions: