Non-record: 4-Hour Progressive Depth — val_bpb 1.0889 by iverbovoy · Pull Request #895 · openai/parameter-golf

iverbovoy · 2026-03-26T20:25:58Z

Summary

Depth recurrence scaling study — first data point on how shared-weight recurrence scales with extended compute.

4 hours on 8xH100, 132K steps, 3 shared blocks with progressive depth (2→3→4→5 repeats, 15 effective layers)
38 SWA checkpoints, Hedge Mixer eval (adapted from #688 "Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745)" and #745 "Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0222, 3-seed mean)")
Builds on PRs #148 "Depth Recurrence + Cross-Repeat Skip + Sliding Window Eval", #784 "Non-record: Depth Recurrence + XSA + LeakyReLU² (val_bpb 1.2065)", #835 "Progressive Depth Training — val_bpb 1.1980", #856 "Progressive Depth + Hedge Mixer — val_bpb 1.1454"
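As a rough illustration of the progressive-depth idea, the sketch below reuses the same three blocks for a growing number of repeats. It is a minimal sketch, not the PR's training code: the phase boundaries in `PHASE_STARTS` are hypothetical (the PR does not state them; they are only chosen to agree with the scaling curve), and `repeats_at`, `forward`, and `effective_layers` are illustrative names.

```python
from bisect import bisect_right

# Hypothetical phase start steps; the PR does not give the real boundaries.
PHASE_STARTS = [0, 60_000, 90_000, 115_000]
PHASE_REPEATS = [2, 3, 4, 5]   # 2 -> 3 -> 4 -> 5 repeats, as in the PR
NUM_SHARED_BLOCKS = 3          # 3 physical blocks, as in the PR


def repeats_at(step: int) -> int:
    """Return the repeat count of the phase that contains `step`."""
    return PHASE_REPEATS[bisect_right(PHASE_STARTS, step) - 1]


def forward(x, blocks, repeats: int):
    """Apply the same list of blocks `repeats` times in sequence.

    Because the blocks are shared, raising `repeats` adds depth
    without adding parameters.
    """
    for _ in range(repeats):
        for block in blocks:
            x = block(x)
    return x


def effective_layers(step: int) -> int:
    """Number of block applications per forward pass at `step`."""
    return NUM_SHARED_BLOCKS * repeats_at(step)
```

With toy blocks such as `[lambda v: v + 1] * 3`, calling `forward(0, blocks, repeats_at(step))` counts one application per effective layer, which gives 15 in the final phase.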

Results

Eval	val_bpb
Roundtrip	1.1613
Sliding window	1.1271
Sliding + Hedge Mixer	1.0889

vs existing unlimited compute entries:

Will DePue 4-hour flat baseline: 1.2074 → ours is 0.119 bpb lower
Ciprian-Florin Ifrim 2-hour 1-bit quant: 1.1239 → ours is 0.035 bpb lower

Key Finding

Shared-weight recurrence scales differently from flat architectures. Counting 132K steps at 5 repeats, each of the 3 blocks received up to ~660K effective gradient passes (132K × 5); this is an upper bound, because the earlier phases ran only 2 to 4 repeats. Progressive depth reaches 5 repeats (15 effective layers) from 3 physical blocks, a depth ramp that unique-layer architectures cannot perform dynamically during training.

SWA matters a lot at this scale: averaging 38 checkpoints lowered val_bpb by 0.060 (1.218 → 1.158), more than any single architectural change.
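For readers unfamiliar with SWA (stochastic weight averaging), the core operation is a uniform average over saved checkpoints. The sketch below shows that averaging on plain Python lists. It assumes each checkpoint is a dict mapping a parameter name to a flat list of floats, which is a simplification for illustration and not the PR's checkpoint format; `average_checkpoints` is a hypothetical helper name.

```python
def average_checkpoints(checkpoints):
    """Uniformly average a list of {name: list-of-floats} checkpoints.

    Assumes every checkpoint has the same parameter names and sizes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {
        name: [0.0] * len(values) for name, values in checkpoints[0].items()
    }
    for count, ckpt in enumerate(checkpoints, start=1):
        for name, values in ckpt.items():
            running = averaged[name]
            for i, value in enumerate(values):
                # Incremental mean: after `count` checkpoints, running[i]
                # holds the average of the first `count` values.
                running[i] += (value - running[i]) / count
    return averaged
```

The incremental form keeps only one running copy of the weights in memory, which is convenient when checkpoints are collected one at a time during a long run.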

Scaling Curve

Steps	Phase	val_bpb
5K	2 rep	1.306
55K	2 rep	1.265
85K	3 rep	1.244
110K	4 rep	1.232
125K	5 rep	1.218
132K	5 rep + SWA	1.158

Test plan

Full 4-hour run on 8xH100 (14400s wallclock)
All 4 phase transitions completed without errors
Artifact 15.82MB < 16MB limit
Hedge Mixer eval 696s (score-first, no training data access)
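The "score-first" property of the Hedge Mixer eval can be illustrated with the textbook Hedge (exponential weights) update: each token is scored with weights learned from earlier tokens only, and the weights are updated afterwards. The following is a generic sketch of that algorithm, not the implementation in this PR or in #688/#745; the function name, the `eta` parameter, and the input layout are assumptions.

```python
import math


def hedge_mix_bits(expert_probs, eta=1.0):
    """Total bits for a sequence under a Hedge mixture of experts.

    expert_probs[t][i] is the probability expert i assigned to the token
    that actually occurred at position t (assumed strictly positive).
    """
    num_experts = len(expert_probs[0])
    weights = [1.0 / num_experts] * num_experts
    total_bits = 0.0
    for probs in expert_probs:
        # 1) Score first: the mixture uses weights fitted on past tokens only.
        mixed = sum(w * p for w, p in zip(weights, probs))
        total_bits += -math.log2(mixed)
        # 2) Then update: w_i *= exp(-eta * log_loss_i), i.e. w_i *= p_i ** eta.
        weights = [w * p ** eta for w, p in zip(weights, probs)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return total_bits
```

With `eta=1.0` this reduces to a Bayesian mixture, so the total cost is at most the best single expert's cost plus `log2(num_experts)` bits. Dividing the total bits by the number of bytes scored would give a bits-per-byte figure.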

iverbovoy · 2026-04-05T14:49:54Z

Superseded by #1384 — clean submission with 3-seed validation.

iverbovoy · 2026-04-05T17:01:35Z

Reopened — this is a separate unlimited compute track submission, not superseded by #1384.

iverbovoy · 2026-04-20T09:49:12Z

4-hour track non-record, depth-recurrence scaling

(Edit: cleaned up this PR — now contains only the 4-hour submission dir, 4 files, no accumulated cruft from sibling PRs or baseline-file changes. Earlier version had unrelated 10-min submissions and modified top-level files, which may have been confusing during review.)

Matches a direct "Request for PRs" from the README

The repo README lists under Requests for PRs:

Universal transformer — We have lots of depth recurrence submissions, but I'd love to see one 4 hour.

This PR is exactly that. Same 3-shared-block architecture as #1453 (10-min, val_bpb 1.1324), extended to 4 hours on 8×H100: val_bpb 1.0889 after 132K steps with a progressive-depth schedule 2→3→4→5 repeats (15 effective layers from 3 physical blocks), 38 SWA checkpoints, Hedge Mixer eval.

Current 4-hour non-record leaderboard

Author	Score	Approach	Date
Will DePue	1.2074	4-hour flat baseline (9×512)	2026-03-18
Ciprian-Florin Ifrim	1.1239	2-hour 1-bit quantization (106M params)	2026-03-24
This PR (#895)	1.0889	4-hour depth recurrence (3 shared blocks × 5 repeats)	2026-03-26 — open

This beats both existing entries:

−0.119 bpb vs Will DePue's 4-hour flat baseline
−0.035 bpb vs Ciprian-Florin Ifrim's 1-bit submission (note that this PR used twice the compute: 4 hours vs. 2)

Key observation from the scale-up

Counting 132K steps at 5 repeats, each of the 3 physical blocks received up to ~660K effective gradient passes (132K × 5); this is an upper bound, because the earlier phases ran only 2 to 4 repeats. Progressive depth (ramping 2 → 3 → 4 → 5 repeats during training) is only possible with shared-weight recurrence, since flat architectures can't ramp depth mid-training. SWA at this scale contributed −0.060 bpb across 38 checkpoints, larger than any single architectural change we tested.
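The ~660K figure corresponds to 132K steps × 5 repeats. Since the schedule ran fewer repeats in its earlier phases, the count accumulated over the whole run is lower. The sketch below works through that arithmetic. The phase boundaries are hypothetical (the PR does not state them; they are only chosen to agree with the scaling curve), so the printed total is illustrative rather than a measured value.

```python
TOTAL_STEPS = 132_000

# (start_step, repeats) per phase. The start steps are hypothetical.
PHASES = [(0, 2), (60_000, 3), (90_000, 4), (115_000, 5)]


def block_applications(total_steps, phases):
    """Sum, over all steps, how many times each shared block is applied."""
    total = 0
    for idx, (start, repeats) in enumerate(phases):
        # A phase ends where the next one starts, or at the end of training.
        end = phases[idx + 1][0] if idx + 1 < len(phases) else total_steps
        total += (end - start) * repeats
    return total


upper_bound = TOTAL_STEPS * 5  # every step counted at 5 repeats
print(upper_bound)                              # 660000
print(block_applications(TOTAL_STEPS, PHASES))  # 395000 under these assumed boundaries
```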

Scaling curve

Steps	Phase	val_bpb
5K	2 rep	1.306
55K	2 rep	1.265
85K	3 rep	1.244
110K	4 rep	1.232
125K	5 rep	1.218
132K	5 rep + SWA	1.158
+ Hedge Mixer eval		1.0889
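To read the curve as per-stage improvements, the differences between consecutive rows can be computed directly from the table. This is a small convenience sketch over the numbers above; `stage_deltas` is an illustrative helper, not part of the submission.

```python
# (label, val_bpb) rows copied from the scaling-curve table.
CURVE = [
    ("5K, 2 rep", 1.306),
    ("55K, 2 rep", 1.265),
    ("85K, 3 rep", 1.244),
    ("110K, 4 rep", 1.232),
    ("125K, 5 rep", 1.218),
    ("132K, 5 rep + SWA", 1.158),
    ("+ Hedge Mixer eval", 1.0889),
]


def stage_deltas(curve):
    """Return (label, change in val_bpb from the previous row) pairs."""
    return [
        (label, round(after - before, 4))
        for (_, before), (label, after) in zip(curve, curve[1:])
    ]


for label, delta in stage_deltas(CURVE):
    print(f"{label:<20} {delta:+.4f}")
```

Note that the final row also switches evaluation mode (the Results table lists sliding window at 1.1271 before the Hedge Mixer is applied), so its delta is not attributable to the mixer alone.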

Compliance

The closely-related 10-min PR #1453 uses the same architecture + Hedge Mixer implementation and received a community compliance review from @MatoTeziTanka on 2026-04-12 — LOOKS CLEAN, no illegal hash pattern, legal score-first eval.

Summary

Full retrospective, all 7 related PRs, experiments catalog, reproduction config: SUMMARY.md.

Would appreciate a look from @cocohearts @valerio-oai @0hq @willdepue to determine if this fits the non-record track criteria, and whether the "Universal transformer 4-hour" request is considered satisfied. Happy to address any feedback on the submission format.

3 shared blocks with progressive depth (2->3->4->5 repeats, 15 effective layers), 132K steps on 8xH100, 38 SWA checkpoints, Hedge Mixer eval.

Architecture is the same recurrent design as 10-min submission openai#1453 (val_bpb 1.1324). This PR is the 4-hour companion exploring how shared-weight recurrence scales with extended compute.

Beats existing non-record 4-hour entries:
- Will DePue 4-hour flat baseline (1.2074): -0.119 better
- Ciprian-Florin Ifrim 2-hour 1-bit (1.1239): -0.035 better

vimeto mentioned this pull request Mar 29, 2026

Record: Depth-Recurrent UT + Rank-1 LoRA Per-Iteration Adaptation — val_bpb 1.3342 #1096

Draft

iverbovoy closed this Apr 5, 2026

iverbovoy reopened this Apr 5, 2026

iverbovoy force-pushed the 4hour-run branch from 0b56a4d to 6a9f1d1 on April 20, 2026, 10:02

iverbovoy wants to merge 1 commit into openai:main from iverbovoy:4hour-run
