Update leaderboard with May 1 audited rows #2146
Conversation
Thank you @cocohearts and all the participants. Until next time!
@cocohearts Thanks for the audit work. I'm on the same page regarding most of the exclusions. One pushback on the rationale for PR #2135.

Precedent. PR #1851 was filed without sufficient run logs/results, and PR #1868 supplied that evidence later, after PR #1855 had already beaten #1851/#1868. The combined submission was still accepted as part of the leaderboard record. That is the structural situation PR #2135 cites: code pre-cutoff, logs and results landing afterward.

Consistency. The rationale for excluding PR #2135 appears to be "all logs and results in by cutoff." That criterion is not stated in the README, and applying it consistently would retroactively invalidate PR #1851's leaderboard spot: PR #1851's scored 3-seed state was only complete once PR #1868 supplied the missing logs, by which point PR #1855 had already beaten the #1851/#1868 record. Under a "logs and results in by cutoff" rule, PR #1851/#1868 would have been beaten before its submission was complete, and could never have taken its leaderboard spot in the first place.

README rule. "Since all submissions are public, we're accepting record submissions chronologically depending on their PR creation time." That pins ordering on PR creation time, not on when logs and results reach completion.

Timing and code surface. PR #2135 was opened at 2026-05-01 23:48:57Z (4:48 PM PT, ~11 minutes before the 5 PM PT cutoff). The finalized 3-seed results were pushed afterward, but the PR itself was filed pre-cutoff. The pre-cutoff commit (be7420d) shipped the full code surface: train_gpt.py, lossless_caps.py, online_ngram_state.c, online_ngram_tilt.py, prepare_caseops_data.py, the tokenizer .model, requirements.txt, and scaffold README/submission.json. The two post-cutoff commits (f086a9f, ff90522) only added train_seed*.log run outputs and filled result numerics into README.md and submission.json. No methodology, architecture, training script, or tokenizer changed afterward.

Reproducibility. The seed log files added afterward are run outputs of the pre-cutoff code, not new submission content. Anyone with 8xH100 SXM access could clone be7420d and regenerate the runs within run-to-run noise. What was "missing" at cutoff wasn't the submission; it was the empirical measurement of a submission that was already public and reproducible.

Conclusion. The README's PR-creation-time rule and the #1851/#1868 precedent both place PR #2135 on the same footing as PR #1851/#1868 for leaderboard inclusion: PR opened pre-cutoff, full code surface in-tree pre-cutoff, logs and results landing afterward. PR #2135 is a valid record submission and should be included on the leaderboard. Thanks so much again for taking the time to review these submissions.
congrats everyone!!
Thanks for the thorough audit @cocohearts. One flag before this merges: PR #2130 has the same train/val document overlap as PR #2018, which you correctly excluded.

Evidence: SUBMISSION_FINAL/train_seed314.log line 1 reports train_shards: 1499. Per Issue #2127, train_shards: 1499 is the fingerprint of a local prepare_caseops_data.py run with the default --val-docs=10000. Train starts at document 10,000; the competition val set covers documents 0–49,999. Result: documents 10,000–49,999 (40k docs, 80% of the val set) appear in both the training data and the scored val split.

By contrast, the other three rows you included (#1945, #1953, #2014) all use snapshot_download from romeerp/parameter-golf-caseops-v1 and report train_shards: 80 with an explicit 50,000-doc val split — those are clean. PR #2130's README claims "Same dependencies and CaseOps tokenizer/shards as merged PR #1855", but the log contradicts this: PR #1855 uses the canonical HF dataset (80 shards), whereas #2130 generated data locally with the leaky default. The claimed 1.05670 should be excluded on the same grounds as #2018.
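A minimal sketch of that overlap arithmetic, assuming the document-index conventions described above (competition val split = documents 0–49,999; a local run with the default --val-docs=10000 holds out only documents 0–9,999, so training starts at document 10,000). The corpus size and the helper name here are placeholders, not values from the logs:

```python
# Sketch only: quantify the train/val leak implied by the default --val-docs=10000,
# given a scored competition val split over documents 0-49,999 (per the discussion
# above and Issue #2127). TOTAL_DOCS is a stand-in; any value >= 50,000 is equivalent.

def overlapping_docs(train_start, train_stop, val_start, val_stop):
    """Count document indices falling in both half-open ranges [start, stop)."""
    return max(0, min(train_stop, val_stop) - max(train_start, val_start))

TOTAL_DOCS = 1_000_000                 # placeholder corpus size, not from the logs
COMPETITION_VAL = (0, 50_000)          # scored val split: documents 0-49,999
LOCAL_TRAIN = (10_000, TOTAL_DOCS)     # default --val-docs=10000 -> train starts at 10,000

leak = overlapping_docs(*LOCAL_TRAIN, *COMPETITION_VAL)
print(leak)                                              # 40000 overlapping documents
print(leak / (COMPETITION_VAL[1] - COMPETITION_VAL[0]))  # 0.8 -> 80% of the scored val set
```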
@cocohearts @leon2k2k2k Acknowledged on PR #2130; the train_shards: 1499 + doubled datasets/datasets path is hard to read any other way. Worth noting for the leaderboard reconciliation: PR #2135 does not share the same validity issue as PR #2130 even though it's based on #2130. PR #2135 inherits PR #2130's
Therefore, #2135 is valid regardless of the validity of #2130.

Semi-unrelated, but also flagging that numerous people on the Discord have raised reproducibility questions about PR #2014. Relevant since the chain backs up to PR #2014 if PR #2130 is excluded.
Small update on #2140: I agree with the exclusion rationale for the originally submitted state, because it accidentally had the within-word / word-start / agreement n-gram channels active. I’ve pushed a corrective commit restoring the intended token-only posture I had in #2018. The corrected logs now report:

Corrected 3-seed mean is

I defer to maintainers on how to treat the timing/eligibility question for #2140 in this audit PR, but wanted to make the technical correction visible here. Thanks again to @cocohearts @0hq @valerio-oai and the other golfers for a great competition!
@cocohearts @0hq @valerio-oai Separate from any of the specific-submission discussions, just want to say thank you to all three of you for the work that went into making this contest possible. It's a genuinely interesting challenge, accessible to people without industrial-scale compute, with a records archive that's already a serious body of method ideas. The audit and review work on top of that, at this scale, is real labor. None of that exists without you. Appreciated.
Updated after applying the grace policy:
The README diff is now four rows: #1945 V21 v2, #1953, #2014, and #2135.
Thanks @cocohearts. One small note, broadly consistent with @codemath3000’s point about timing and precedent: like his approved #2135, my PR was opened before the cutoff. My #2140 is admittedly not identical to #2135, since my corrective commit did touch code, not just logs. But that change was narrowly compliance-restoring: it disabled the unintended target-token-gated channels and moved the score worse, from the originally submitted number to

I'd humbly hope maintainers decide there is room for a little grace here, for my compliance correction, which was clearly made in the spirit of the competition, especially since there was a little uncertainty about precisely when the cutoff would be and what adjustments might be permitted. If the rule is strictly “final scored state pushed before cutoff,” I understand the exclusion. Thanks for your consideration.
@cocohearts why did #2019 not make it?
Thanks @simon-marcus. Just to clarify the current state for context: under @cocohearts's most recent call, PR #2135 is also still on the excluded side, applying the "scored-state-pushed-before-cutoff" rule. @simon-marcus, I do agree with your broader point about leaving room for narrowly measurement-completing or compliance-restoring work that lands shortly after cutoff, especially when the PR itself was filed pre-cutoff. I'd similarly hope #2135 receives the same consideration.
Applying the maintainer grace policy here: if the original idea/code was submitted before cutoff, then later validation/results or narrowly compliance-restoring fixes can be considered. Changes made in this PR:
@cocohearts Genuinely grateful for the thoughtful work on this. Especially appreciate the willingness to take another look after the discussion played out. If anything else needs clarifying or more context from me later on, just let me know.
@cocohearts I want to ask a question about the scope of the grace policy. It says it's for compliance-restoring work that landed shortly after the cutoff, which fits what @simon-marcus said, but codemath's PR #2135 reads differently to me. It was one of six parallel PRs that he filed two minutes before the end of the competition, each with a distinct hyperparameter variant on two different bases. He closed five of them, and only #2135 remained because its post-cutoff completion was competitive. In my opinion, that's post-hoc selection from a parallel sweep, not compliance-restoring work on an intended submission.

Under that reading, filing N sweep PRs with no BPB results at the deadline becomes a viable strategy: you keep whichever one lands well, you close the rest, and then you claim the winner as the intended one. I respected the deadline strictly. I even killed an in-progress run at the cutoff, and I assumed others did too. So I think it's worth clarifying whether reviewer grace covers sweep selection. I'm not contesting the technical work, but I am raising this as a policy interpretation question.
@andrewbaggio1 To put the "parallel sweep" framing on accurate footing:

The six PRs were a 2x3 cross of two existing record candidates ({PR #2014, PR #2130}) against three explicit lever settings ({GATE_WINDOW=8, GPTQ_CALIBRATION_BATCHES=32, both stacked}). Both bases were publicly-filed record candidates in the repo at the time. Both levers were knobs already characterized in earlier public work, each known to help on at least one stack. The exploration was directed at one specific transfer question, not a hyperparameter scan.

The closure descriptions on the five scaffolds explain exactly why each variant didn't make it. PRs #2131–#2133 (the PR #2014-base variants) closed because PR #2014's published numbers haven't been independently reproducible (a concern multiple participants have separately raised on the contest Discord). That's a base-level issue that has nothing to do with our submissions; those PRs would have closed even if PR #2135 had landed worse than PR #2130. PRs #2134/#2136 (GW=8 on the PR #2130 base) closed because GW=8 was net-negative on that stack. That's a measured technical finding about the lever, not a "this didn't happen to win" framing. Both sets of closures rest on the substance of what the runs revealed, not on PR #2135's relative ranking.

The substantive distinction is between undirected hyperparameter search over a wide space and a small, pre-identified, structured exploration of known levers on known bases. PR #2135 wasn't post-hoc selected from a much larger space; it was one of six explicit variants, all defined in code shipped pre-cutoff. Two levers across two bases is also far too narrow a search space for sweep-style selection to be effective at all; that strategy requires many more degrees of freedom to plausibly produce a winner from noise. For reference, when I have actually run hyperparameter sweeps in this work, those have ranged upwards of 50 configurations.

Also worth re-anchoring on @cocohearts's stated policy: "if the original idea/code was submitted before cutoff, then later validation/results or narrowly compliance-restoring fixes can be considered." That enumerates two distinct cases. PR #2135 falls under the first, since the post-cutoff content was log files and result numerics for code shipped pre-cutoff, which is what "later validation/results" describes. PR #2140 is the second-case example. The policy as written explicitly covers this situation.

Finally, the README doesn't place a cap on simultaneous PRs, and multiple-PR workflows have been standard throughout this contest. Treating PR #2135 differently for a pattern no written rule covers would be inconsistent.
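As a concrete picture of that 2x3 cross, a minimal sketch enumerating the variants; the base and lever names come from this thread, while the data layout and the count check are illustrative assumptions, not the actual submission tooling:

```python
from itertools import product

# Illustrative enumeration of the 2x3 cross described above (not the real tooling).
# Base and lever names are taken from the discussion; everything else is assumed.
bases = ["PR #2014 stack", "PR #2130 stack"]
lever_settings = [
    {"GATE_WINDOW": 8},
    {"GPTQ_CALIBRATION_BATCHES": 32},
    {"GATE_WINDOW": 8, "GPTQ_CALIBRATION_BATCHES": 32},  # both levers stacked
]

variants = list(product(bases, lever_settings))
assert len(variants) == 6  # one record-candidate PR was filed per variant
for base, levers in variants:
    print(base, levers)
```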
Thank you @codemath3000 for clarifying your intent, but you said that the reason for closing #2131–#2133 was that #2014 wasn't reproducible, and that #2134/#2136 closed because GW=8 was net negative. That's still selecting on outcomes, regardless of whether the variance was small and predefined or large and undirected. Each of your six PRs was also titled "record candidate", not "ablation" or "non-record exploration", which suggests that at filing time each was competing for a slot, not answering a research question. If your goal was the transfer question, one PR with the cross-product results would have answered it. I'm just flagging this distinction for the policy reading and not insisting on any specific outcome; that's ultimately up to @cocohearts.
@andrewbaggio1 Concrete context I should have led with: PR #2130 landed at 23:22:40Z, ~37 minutes before the 00:00:00Z cutoff. The standard "run, observe results, submit if it beats baseline" pipeline requires >20 minutes per run on 8xH100, plus the additional time to compose and submit a PR after results land. Fitting even one sequential filter run into the 37-minute window would have been extremely tight; fitting two was impossible. And RunPod's per-account hourly spend limits prevented me from making effective use of a parallel approach either.

The two levers (GATE_WINDOW=8, GPTQ_CALIBRATION_BATCHES=32) weren't random hyperparameters; both had been tested on PR #2014's stack pre-cutoff and shown to improve it relative to baseline. PR #2014 itself turned out non-reproducible at its headline, but the levers' positive effect was a meaningful pre-cutoff signal. The open question was whether either or both would transfer to the PR #2130 stack with its additional knobs. PR #2130 had also been filed too recently for me to independently verify its compliance before the cutoff, so keeping PR #2014 as a parallel base was a fallback in case PR #2130 turned out non-viable. Hedging at filing time against the possibility of a non-compliant base isn't "selection on outcomes"; it's standard risk management under uncertainty.

That's three lever combinations across two bases, giving the 2x3. Filing each combination as a separate PR wasn't a strategy to maximize slots; it was the only way to keep all pre-cutoff candidates submittable, since I had no time to filter pre-cutoff and contest convention is one candidate per PR. With another hour pre-cutoff, I'd have tested sequentially and filed one PR. The alternative under a strict no-fan-out reading would have been filing zero candidates after PR #2130 landed, since none could have been pre-validated in the time available. Filing record-candidate PRs and closing the ones that don't pan out is also routine practice in this contest; dexhunter, the contributor with the most records, has done this multiple times.

Happy to provide further detail on any of this if useful.
@cocohearts some of the recent comments here have been edited substantially since they were posted. if you're using codex to review this thread, iirc github's comment endpoint only returns the current body. pls pull edit history too |
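If it helps whoever pulls the history: a rough sketch of one way to do that via GitHub's GraphQL API, which exposes per-comment edit metadata through the userContentEdits connection (the REST comments endpoint only returns the current body, as noted above). Owner/repo are placeholders and the token handling is minimal; this is a sketch, not a vetted script:

```python
import os

import requests

# Sketch: list which PR comments have been edited, and when, via GraphQL.
# OWNER/REPO are placeholders; export GITHUB_TOKEN before running.
QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      comments(first: 100) {
        nodes {
          author { login }
          createdAt
          userContentEdits(first: 20) {
            nodes { editedAt editor { login } }
          }
        }
      }
    }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": QUERY, "variables": {"owner": "OWNER", "name": "REPO", "number": 2146}},
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
    timeout=30,
)
resp.raise_for_status()
comments = resp.json()["data"]["repository"]["pullRequest"]["comments"]["nodes"]
for c in comments:
    edits = c["userContentEdits"]["nodes"]
    if edits:
        author = c["author"]["login"] if c["author"] else "ghost"
        print(author, c["createdAt"], "edited at:", [e["editedAt"] for e in edits])
```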
@andrewbaggio1 Both of us have edited comments in this thread, which is normal. My edits have been for clarity, quality, and supporting context, with no claim retracted or position shifted. Anyone can verify this directly from the publicly available edit history if needed.
@cocohearts @valerio-oai There's one submission that was accepted for the non-record leaderboard, the first one that properly looked at depth recurrence, from @evangelinehelsinki, that was never actually added to the Notable Non-record Leaderboard. Can that be fixed, given it was already accepted by @0hq and he probably forgot to then add it to the leaderboard?

Separately, I also have a submission, #923, for the same leaderboard, that does ternary (my previously accepted V1 Ternary to the main leaderboard) with unlimited compute, reaching 1.10, which would put it at the top of the non-record leaderboard. There's also a separate main leaderboard ternary submission, the v2 #923, building on the accepted one, that adds more params in the same 16MB package under ternary and gets a lower bpb. That one could either be accepted as a separate entry, added under the already existing v1 (combining them in one row), or just closed if rejected.
non-record leaderboard will be updated this week pls dont spam tysm
@cocohearts There has been no spam. My previous comment focused primarily on a submission that was accepted over a month ago and yet never added to the leaderboard, a submission that isn't mine to begin with. My two submissions that I tagged were already confirmed to me by Will on Discord and never merged/added, especially given he "retired". Valerio started accepting some submissions for the non-record leaderboard in the last 24h, ergo my comment here. Another message I left on a different thread references the many PRs up to 1400 that were ignored in the latest revisions of the main/1st leaderboard, which is unfair to the people who completed the work, especially given your own README states "we're accepting record submissions chronologically depending on their PR creation time". That's another thing that has nothing to do with me but had to be mentioned, as it should be clarified to the community. Given these threads are made to update the community on the latest, and to let others share anything here, as I've done with my two messages, I'd say we are far from spam, and I'd appreciate it if any comments made next time are accurate, tysm.

Summary
Adds the audited post-#1902 leaderboard progression rows using the maintainer grace policy: a PR opened before the May 1, 2026 5:00 PM Pacific cutoff can count when the original idea/code was public before cutoff and later commits only supplied validation/results or narrowly compliance-restoring fixes.
Rows added:
70067534: 1.05943, p=0.034 vs PR #1855 Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack — val_bpb 1.06108 (3-seed mean)

Audit Notes
I scanned 192 PRs (#1944-#2140) created after the #1902 leaderboard merge and before the cutoff, using parallel Codex shard graders plus a final global chronological reconciliation.
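For readers following the reconciliation step, a hedged sketch of the chronological pass the README rule implies (accept a row only if it is eligible and beats the best accepted val_bpb at its PR creation time); the Candidate fields and eligibility flag are placeholders, not the actual grader:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    pr: int
    created_at: str   # ISO-8601 PR creation timestamp (the README's ordering key)
    val_bpb: float    # 3-seed mean
    eligible: bool    # passed compliance / grace-policy checks

def reconcile(candidates):
    """Walk candidates in PR-creation order; keep each one that beats the running best."""
    best_bpb = float("inf")
    records = []
    for c in sorted(candidates, key=lambda c: c.created_at):
        if c.eligible and c.val_bpb < best_bpb:
            best_bpb = c.val_bpb
            records.append(c)
    return records
```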
Grace-policy handling:
train_shards: 80, no doubled local datasets/datasets path, val_tokens: 47851520).

Notable exclusions: