Record: SP8192 + 3-Layer Recurrence + Hard Onset — val_bpb 1.08625 (3-seed mean)#1663
Open

pablinga19 wants to merge 5 commits into openai:main from
Conversation
- Match heading, table, and section format from openai#1218/openai#1394
- Add Post-quant BPB column, bold Sliding BPB values
- Add missing submission.json fields (hardware, bytes_total, bytes_code)
- Remove Deltas and Reproducibility sections
- Round val_bpb to 5 decimal places consistently

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move onset scheduling logic from inline training loop into update_recurrence_onset() with clear docstring documenting both modes
- Add structured comments on RECUR_HOMOTOPY / RECUR_START_STEP env vars
- Add "Where the change lives in code" section to README
- Update bytes_code in submission.json

Behavior is preserved: the function contains the exact same branching logic that was previously inline at the training-loop call site.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
… candidates

User shared a deep timeline of all recurrence experiments in the PG competition (openai#8 through openai#1739). Several of my previously-proposed experiments have ALREADY BEEN TESTED ON THIS STACK and shown to fail.

KILLED:
- Timing sweep earlier: openai#1726 showed 0.15 is +0.050 worse; openai#1739 showed step-0 catastrophic (1.3936 bpb)
- Progressive ramp: openai#1663 showed hard-onset = smooth, no difference
- Position shift: openai#1726 showed layer 2-7 +0.163 worse, layer 5-6 shift +0.006 worse — layer 3-5 IS the empirical sweet spot

Also corrected the baseline config: openai#1736 uses LOOP_START=3 LOOP_END=5 (three layers: 3, 4, 5 — "Loop345"), not Loop45 as the directory name suggests. 3 layers × 3 passes = 17 virtual layers.

VIABLE candidates:
- Recur-Alpha (openai#1714, Anakintano): learnable scalar per looped block, init 0 → identity. 6 params. Author's grant ran out before TTT eval, so composition with openai#1736's phased TTT is genuinely open. NEW TOP PICK (a sketch of the mechanism follows this note).
- Cross-pass XSA: still novel, untested in any PR
- Loop3-6 variant (openai#1678): tashapais is running it; might wait for that result

Recommendation updated: port Recur-Alpha onto openai#1736 as spec 015. ~$25, identity-at-init (safe), 30 LOC, direct recurrence question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
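For context, here is a minimal sketch of one plausible reading of the Recur-Alpha mechanism described above (a learnable scalar per looped block, initialized to zero so the extra passes start as an exact identity); this is a hedged reconstruction, not the actual openai#1714 code:

```python
import torch
import torch.nn as nn

class RecurAlpha(nn.Module):
    """Gates a looped block's contribution with one learnable scalar.

    alpha is initialized to 0, so at init the wrapped pass is an exact
    passthrough and the worst case is the ungated baseline. One scalar
    per looped block keeps the added parameter count tiny.
    """
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.alpha = nn.Parameter(torch.zeros(()))  # init 0 -> identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Interpolates between passthrough (alpha=0) and block output (alpha=1).
        return x + self.alpha * (self.block(x) - x)
```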
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Active research thread's first experiment. Pinned to commit a9aa141 on exp/recur-alpha.

Key decisions baked in:
- Screening mode first (~$6 total, skip TTT/GPTQ/EMA)
- TRAIN_LOG_EVERY=100 for diagnostic resolution
- p2p cosine diagnostic off by default (torch.compile concerns)
- Single seed 42; conditional 3-seed + full TTT only if Δ ≤ -0.001
- Identity-at-init safety: α=0 = passthrough, worst case no change

Three disproven recurrence-class experiments are explicitly NOT in this spec (earlier activation openai#1726, schedule smoothing openai#1663, position shift openai#1726). Those would be wasted spend per existing PG evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Record: SP8192 + 3-Layer Recurrence + Hard Onset — val_bpb 1.08625
val bpb: 1.08625 (3-seed mean, std=0.0023)
This submission is a direct derivative of #1394. I keep that stack largely fixed and test one isolated change: replacing smooth recurrence onset with a hard activation at step 3000. The goal is to preserve more non-recurrent training within the fixed 600-second budget before enabling the recurrent virtual-layer sequence later in the run.
Results
All three runs use the same script, changing only the `SEED` environment variable.
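For illustration, the sweep could be driven like this; the `torchrun` launcher, GPU count, and seed values are assumptions, and only the env vars named in this write-up are taken from the PR:

```python
# Hypothetical driver for the 3-seed sweep; only SEED changes between runs.
import os
import subprocess

for seed in (42, 43, 44):  # illustrative seed values
    env = dict(
        os.environ,
        SEED=str(seed),
        RECUR_HOMOTOPY="0",       # hard onset (the default)
        RECUR_START_STEP="3000",  # enable recurrence at step 3000
        VAL_LOSS_EVERY="99999",   # skip mid-training validation passes
    )
    # Launcher and GPU count below are assumptions, not from this PR.
    subprocess.run(["torchrun", "--nproc_per_node=8", "train_gpt.py"],
                   env=env, check=True)
```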
Why hard onset can help
Let $r_0$ denote training throughput (in steps per second) before recurrence is enabled, and let $r_1 < r_0$ denote throughput after recurrence is active. Under a fixed wall-clock budget $T$, a hard-onset schedule at step $s_0$ yields approximately

$$N_{\text{hard}} \approx s_0 + r_1\left(T - \frac{s_0}{r_0}\right)$$
because the run first spends $s_0/r_0$ seconds in the non-recurrent regime, then uses the remaining time in the recurrent regime.
For a smooth-onset schedule, throughput begins to decline earlier. If $r(t)$ denotes the time-varying throughput during the ramp, then the total number of realized steps is

$$N_{\text{smooth}} = \int_0^T r(t)\,dt$$
with $r(t) < r_0$ over part of the interval before the hard switch point. In that setting, delaying recurrence can increase the number of realized optimization steps by allocating a larger fraction of the fixed budget to the higher-throughput non-recurrent regime, while still enabling recurrence later in training.
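As a concrete illustration of the step-count argument, here is a small numeric sketch. The throughput values and ramp shape are made-up assumptions for illustration, not measurements from this submission:

```python
# Hypothetical throughputs: r0 steps/s without recurrence, r1 with it active.
r0, r1 = 10.0, 7.0   # assumed values, not measured
T = 600.0            # fixed wall-clock budget (seconds)
s0 = 3000            # hard-onset step

# Hard onset: s0 steps at r0 (taking s0/r0 seconds), remaining time at r1.
n_hard = s0 + r1 * (T - s0 / r0)

# Smooth onset: throughput declines from r0 to r1 starting at t = 0,
# modeled here as a linear ramp over the first 300 s (an assumption).
def r(t: float, ramp: float = 300.0) -> float:
    return r0 + (r1 - r0) * min(t / ramp, 1.0)

# Numerically integrate r(t) over the budget to get realized steps.
dt = 0.01
n_smooth = sum(r(i * dt) * dt for i in range(int(T / dt)))

print(f"hard onset:   {n_hard:.0f} steps")    # ~5100
print(f"smooth onset: {n_smooth:.0f} steps")  # ~4650
```

Under these assumed numbers, the hard schedule realizes roughly 450 more optimization steps from the same budget; the actual gap depends on the real $r_0$, $r_1$, and ramp shape.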
In this submission, enabling the 3-layer recurrence stack at step 3000 produced the reported 3-seed mean sliding val_bpb of 1.08625.
Where the change lives in code
This submission is a direct derivative of #1394 and isolates one scheduling change. The relevant code is in `train_gpt.py`:

- `Hyperparameters` class: `RECUR_HOMOTOPY`, `RECUR_START_STEP`, `RECUR_HOMOTOPY_TMID`, `RECUR_HOMOTOPY_TAU`
- `update_recurrence_onset()` — a single function containing both the hard and smooth paths
- `last_step` check

With `RECUR_HOMOTOPY=0` (default), the function reduces to a one-line step threshold at `RECUR_START_STEP=3000`. Everything else in the script is unchanged from #1394.
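For reference, here is a minimal sketch of the shape this scheduler can take, assuming it returns a recurrence gate in $[0, 1]$ and that the smooth path is a sigmoid ramp; the env-var names come from this PR, but the body below is a reconstruction, not the actual `train_gpt.py` code:

```python
import math

def update_recurrence_onset(step: int, cfg) -> float:
    """Return the recurrence gate for the current step.

    With RECUR_HOMOTOPY=0 (default) this reduces to a one-line step
    threshold at RECUR_START_STEP. With RECUR_HOMOTOPY=1 it becomes a
    sigmoid ramp centered at RECUR_HOMOTOPY_TMID with width
    RECUR_HOMOTOPY_TAU.
    """
    if not cfg.RECUR_HOMOTOPY:
        # Hard path: fully off before RECUR_START_STEP, fully on after.
        return float(step >= cfg.RECUR_START_STEP)
    # Smooth path: gate rises gradually from 0 toward 1.
    return 1.0 / (1.0 + math.exp(-(step - cfg.RECUR_HOMOTOPY_TMID) / cfg.RECUR_HOMOTOPY_TAU))
```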
Notes

`VAL_LOSS_EVERY=99999` removes mid-training validation passes, increasing realized train steps within the same 600s budget.

Note_V2: This submission is part of my application for additional compute. My independent resources are limited, so further progress depends on access to more credits. With that support, I'd like not only to continue refining this line, but also to investigate broader, more ambitious ideas around the same core problem.