
Non-record: Depth Recurrence Sweep — Systematic Layer Loop Ablation #1726

Open
krishs0404 wants to merge 1 commit into openai:main from krishs0404:depth-recurrence-sweep

Conversation

@krishs0404

Summary

Systematic ablation of the depth recurrence loop configuration in the current SOTA training stack. Five variants were tested against the baseline across three axes: where the loop starts (LOOP_START), where it ends (LOOP_END), and when it activates (ENABLE_LOOPING_AT). Every variant was worse than the SOTA config.

Best result: 1.4689 post-quant bpb (SOTA baseline, included as reference)
Hardware: RunPod 1×H100 SXM 80GB, 6 runs × 10-min wallclock cap ≈ 60 GPU-minutes total

Results

| Experiment | LOOP_START | LOOP_END | ENABLE_LOOPING_AT | Steps | val_bpb (post-quant) | Δ vs baseline |
|---|---|---|---|---|---|---|
| Baseline (SOTA) | 3 | 5 | 0.35 | 568 | 1.4689 | |
| A — minimal reuse | 5 | 6 | 0.35 | 565 | 1.4750 | +0.006 |
| D — early layers | 1 | 4 | 0.35 | 538 | 1.5072 | +0.038 |
| C — late layers | 7 | 10 | 0.35 | 541 | 1.5181 | +0.049 |
| E — early activation | 3 | 5 | 0.15 | 522 | 1.5190 | +0.050 |
| B — heavy reuse | 2 | 7 | 0.35 | 451 | 1.6321 | +0.163 |

Key Findings

  • The SOTA config is genuinely optimal, not arbitrary: moving the loop range by a few layers in either direction costs +0.04–0.05 bpb.
  • Minimal reuse (layers 5–6 only) is surprisingly competitive at +0.006 bpb — most of the recurrence benefit is concentrated in those two layers.
  • Heavy reuse (layers 2–7) is catastrophically worse (+0.163) due to throughput loss: only 451 steps vs 568 for baseline at the 10-min budget.
  • Early/late layer shifts are symmetrically bad (~+0.04–0.05 bpb). Middle layers (3–5) are the sweet spot.
  • Early loop activation (15%) hurts: 46 fewer steps and +0.050 bpb. Representations need to stabilize before recurrence is useful.

See README.md for full analysis, encoder/decoder index lists, and reproduce commands.
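
For readers skimming the table, here is a minimal sketch of what the three knobs control, written against a generic block stack. The class name, forward signature, and the whole-band re-entry with `num_passes=3` are illustrative assumptions, not the actual training stack's API:

```python
import torch.nn as nn

class LoopedBlockStack(nn.Module):
    """Illustrative sketch only: mirrors the described semantics of LOOP_START,
    LOOP_END, and ENABLE_LOOPING_AT; names and signatures are not the real stack's."""

    def __init__(self, blocks, loop_start=3, loop_end=5,
                 enable_looping_at=0.35, num_passes=3):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)         # physical transformer blocks
        self.loop_start = loop_start                # first looped layer index
        self.loop_end = loop_end                    # last looped layer index (inclusive)
        self.enable_looping_at = enable_looping_at  # training fraction at which looping turns on
        self.num_passes = num_passes                # visits of the loop band once active (assumed 3)

    def forward(self, x, training_frac):
        looping_active = training_frac >= self.enable_looping_at
        passes = self.num_passes if looping_active else 1
        for i in range(self.loop_start):                       # pre-band layers, single visit
            x = self.blocks[i](x)
        for _ in range(passes):                                # whole band re-entered, same weights
            for i in range(self.loop_start, self.loop_end + 1):
                x = self.blocks[i](x)
        for i in range(self.loop_end + 1, len(self.blocks)):   # post-band layers, single visit
            x = self.blocks[i](x)
        return x
```

Under this (assumed) 3-pass reading, experiment B's wider band runs six layers three times each instead of three, which is the step-count loss (451 vs 568 steps) behind its +0.163 bpb regression at the fixed 10-min budget.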

Systematic ablation of depth recurrence loop configuration (LOOP_START,
LOOP_END, ENABLE_LOOPING_AT) against the current SOTA stack. 6 experiments
on 1×H100 SXM, 10-min wallclock cap each. SOTA config (layers 3–5, ELA=0.35)
confirmed optimal. Key finding: minimal 2-layer variant (layers 5–6) is
surprisingly close (+0.006 bpb); heavy reuse (layers 2–7) is catastrophic
(+0.163 bpb) due to step-count loss from per-step compute overhead.
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
… candidates

User shared a deep timeline of all recurrence experiments in the
PG competition (openai#8 through openai#1739). Several of my previously-proposed
experiments have ALREADY BEEN TESTED ON THIS STACK and shown to fail:

KILLED:
- Timing sweep earlier: openai#1726 showed 0.15 is +0.050 worse; openai#1739
  showed step-0 catastrophic (1.3936 bpb)
- Progressive ramp: openai#1663 showed hard-onset = smooth, no difference
- Position shift: openai#1726 showed layer 2-7 +0.163 worse, layer 5-6 shift
  +0.006 worse — layer 3-5 IS the empirical sweet spot

Also corrected the baseline config: openai#1736 uses LOOP_START=3 LOOP_END=5
(three layers: 3, 4, 5 — "Loop345"), not Loop45 as the directory name
suggests. With 3 looped layers × 3 passes each, the stack totals 17 virtual layers.

VIABLE candidates:
- Recur-Alpha (openai#1714, Anakintano): learnable scalar per looped block,
  init 0 → identity. 6 params. Author's grant ran out before TTT eval,
  so composition with openai#1736's phased TTT is genuinely open. NEW TOP PICK.
- Cross-pass XSA: still novel, untested in any PR
- Loop3-6 variant (openai#1678): tashapais running it; might wait for result

Recommendation updated: port Recur-Alpha onto openai#1736 as spec 015.
~$25, identity-at-init (safe), 30 LOC, direct recurrence question.
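
For concreteness, a minimal sketch of the identity-at-init idea, assuming a simple residual scalar gate; openai#1714's actual formulation (and how it arrives at 6 params) may differ:

```python
import torch
import torch.nn as nn

class AlphaGatedExtraPass(nn.Module):
    """Hypothetical wrapper: one learnable scalar gates an extra pass through a block.
    alpha is initialized to 0, so at init the extra pass is an exact passthrough."""

    def __init__(self, block):
        super().__init__()
        self.block = block
        self.alpha = nn.Parameter(torch.zeros(()))   # single scalar, init 0

    def forward(self, x):
        return x + self.alpha * self.block(x)        # alpha == 0  ->  x unchanged
```

At α=0 the wrapped pass returns its input unchanged, so the worst case at init is the unmodified baseline forward pass, which is the safety property the recommendation leans on.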

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Active research thread's first experiment. Pinned to commit a9aa141 on
exp/recur-alpha.

Key decisions baked in:
- Screening mode first (~$6 total, skip TTT/GPTQ/EMA)
- TRAIN_LOG_EVERY=100 for diagnostic resolution
- p2p cosine diagnostic off by default (torch.compile concerns)
- Single seed 42; conditional 3-seed + full TTT only if Δ ≤ -0.001
- Identity-at-init safety: α=0 = passthrough, worst case no change
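
A hypothetical sketch of the conditional-promotion decision as code (constant and function names are invented for illustration; the spec's real wiring is not shown here):

```python
# Hypothetical constants mirroring the screening decisions above.
SCREEN_SEED = 42
TRAIN_LOG_EVERY = 100
PROMOTE_IF_DELTA_BPB_AT_MOST = -0.001   # Δ ≤ -0.001 -> run 3 seeds + full TTT

def should_promote(candidate_bpb: float, baseline_bpb: float) -> bool:
    """Promote the cheap single-seed screen to the full (3-seed, TTT) pipeline
    only if the candidate beats baseline by at least 0.001 bpb."""
    return (candidate_bpb - baseline_bpb) <= PROMOTE_IF_DELTA_BPB_AT_MOST
```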

Three disproven recurrence-class experiments are explicitly NOT in this
spec (earlier activation openai#1726, schedule smoothing openai#1663, position
shift openai#1726). Those would be wasted spend per existing PG evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
First smoke workflow (2026-04-21) was halted by execution after 15
consecutive α grad_norm=0.0 log entries were matched to the spec's
stop-early criterion. BUT α is architecturally out-of-circuit until
looping_active=True (at training_frac ≥ 0.35), so grad_norm=0 during
the pre-looping phase is EXPECTED, not a bug.

The smoke was actually clean — 500 iters with no NaN, identity-at-init
preserved, compile OK. The spec's wording was the problem.
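
A stripped-down, runnable illustration of that mechanic (the names `alpha`, `weight`, and `looping_active` are stand-ins, not the real stack's):

```python
import torch

# Before looping activates, the gated branch never executes, so alpha never
# enters the autograd graph and its gradient stays at zero/None -- exactly the
# "15 consecutive grad_norm=0.0" pattern that triggered the false halt.
alpha = torch.zeros((), requires_grad=True)
weight = torch.randn(8, 8, requires_grad=True)   # stand-in for ordinary trainable params
x = torch.randn(4, 8)

def forward(x, looping_active):
    h = x @ weight                       # base (non-looped) pass
    if looping_active:
        h = h + alpha * (h @ weight)     # gated extra pass; alpha only reaches the graph here
    return h.sum()

for active in (False, True):
    alpha.grad = None
    weight.grad = None
    forward(x, active).backward()
    grad_norm = 0.0 if alpha.grad is None else alpha.grad.norm().item()
    print(f"looping_active={active}: alpha grad_norm={grad_norm}")
# looping_active=False -> grad_norm=0.0 (expected, not a stalled run)
# looping_active=True  -> grad_norm is generally nonzero
```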

Fixes:
1. Add ⚠️ CRITICAL banner at top of spec, explicitly calling out the
   pre-looping-activation expectation. Includes a table mapping
   smoke/screen phase to correct grad_norm interpretation.
2. Rewrite stop-early criteria to explicitly condition on
   looping_active=True. Zero-grad pre-activation is expected.
3. Add smoke protocol requiring ENABLE_LOOPING_AT=0 OVERRIDE for the
   smoke (forces looping active, enables α plumbing check in 500 iters).
4. Explicit note: do NOT propagate smoke override to real screen.
   openai#1739 / openai#1726 evidence: step-0 activation is catastrophic.
5. Document the prior-incident failure mode so execution doesn't
   repeat the same false-positive halt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 29, 2026
…lier)

Tests a third axis of the (compute, shape, position) eval-time sweep grid:
BAND POSITION at fixed compute. Pattern '2,3,4,2,3,4,2,3,4,5,6,7' = 17
layer-passes; loop band shifted from canonical {3,4,5} to {2,3,4}.

Layer 2 — trained for single-visit feedforward — is now visited
3 times. Layer 5 drops from 3 visits to 1. Tests whether
recurrence transferability extends to earlier layers, or whether
the canonical band {3,4,5} is positionally tuned.

Memory's openai#1726 tested wider bands {2..7} (catastrophic +0.163)
but never tested {2,3,4} specifically with NL=2 contiguous —
this is a real empirical hole.
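
A minimal sketch of how such an eval-time LOOP_PATTERN might drive the forward pass; the parsing and helper name are assumptions, only the reuse-per-visit semantics come from the description above:

```python
import torch.nn as nn

def forward_with_loop_pattern(blocks: nn.ModuleList, x, loop_pattern: str):
    """Hypothetical eval-only helper: visit layers in the explicit order given by
    LOOP_PATTERN, with single visits for the layers outside the patterned span."""
    order = [int(i) for i in loop_pattern.split(",")]   # e.g. "2,3,4,2,3,4,2,3,4,5,6,7"
    lo, hi = min(order), max(order)
    for i in range(lo):                      # layers before the patterned span, single visit
        x = blocks[i](x)
    for i in order:                          # explicit visit order; same weights reused per visit
        x = blocks[i](x)
    for i in range(hi + 1, len(blocks)):     # layers after the patterned span, single visit
        x = blocks[i](x)
    return x
```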

Compile audit: same as 080-088 — eval-only, LOOP_PATTERN at
__init__, one new graph variant on first eval call (different
iteration order), no mid-run recompile. Bank weights load
layer-indexed identically; layer 2's bank reused on its 3 visits
without modification.

Cost ~$1, ~10 min on 4xH100. Eval-only, no TTT/GPTQ.

W10 brainstorm appended:
- MM cluster: TTT-LR sweep (config-only candidate for W11)
- NN cluster: sliding-window attention only on loop layers (code
  change, defer)
- OO cluster: lane-split during loop band (param-budget issues at
  full rank, reduced version is future work)
