Non-record: 4-Hour Progressive Depth — val_bpb 1.0889#895
Non-record: 4-Hour Progressive Depth — val_bpb 1.0889#895iverbovoy wants to merge 1 commit intoopenai:mainfrom
Conversation
|
Superseded by #1384 — clean submission with 3-seed validation. |
|
Reopened — this is a separate unlimited compute track submission, not superseded by #1384. |
4-hour track non-record, depth-recurrence scaling(Edit: cleaned up this PR — now contains only the 4-hour submission dir, 4 files, no accumulated cruft from sibling PRs or baseline-file changes. Earlier version had unrelated 10-min submissions and modified top-level files, which may have been confusing during review.) Matches a direct "Request for PRs" from the READMEThe repo README lists under Requests for PRs:
This PR is exactly that. Same 3-shared-block architecture as #1453 (10-min, val_bpb 1.1324), extended to 4 hours on 8×H100: val_bpb 1.0889 after 132K steps with a progressive-depth schedule 2→3→4→5 repeats (15 effective layers from 3 physical blocks), 38 SWA checkpoints, Hedge Mixer eval. Current 4-hour non-record leaderboard
This beats both existing entries:
Key observation from the scale-upAt 132K steps with 5 repeats, each of the 3 physical blocks saw ~660K effective gradient passes. Progressive depth (ramping 2 → 3 → 4 → 5 repeats during training) is only possible with shared-weight recurrence — flat architectures can't ramp depth mid-training. SWA at this scale contributed −0.060 bpb across 38 checkpoints, larger than any single architectural change we tested. Scaling curve
ComplianceThe closely-related 10-min PR #1453 uses the same architecture + Hedge Mixer implementation and received a community compliance review from @MatoTeziTanka on 2026-04-12 — LOOKS CLEAN, no illegal hash pattern, legal score-first eval. SummaryFull retrospective, all 7 related PRs, experiments catalog, reproduction config: Would appreciate a look from @cocohearts @valerio-oai @0hq @willdepue to determine if this fits the non-record track criteria, and whether the "Universal transformer 4-hour" request is considered satisfied. Happy to address any feedback on the submission format. |
3 shared blocks with progressive depth (2->3->4->5 repeats, 15 effective layers), 132K steps on 8xH100, 38 SWA checkpoints, Hedge Mixer eval. Architecture is the same recurrent design as 10-min submission openai#1453 (val_bpb 1.1324). This PR is the 4-hour companion exploring how shared-weight recurrence scales with extended compute. Beats existing non-record 4-hour entries: - Will DePue 4-hour flat baseline (1.2074): -0.119 better - Ciprian-Florin Ifrim 2-hour 1-bit (1.1239): -0.035 better
Summary
Depth recurrence scaling study — first data point on how shared-weight recurrence scales with extended compute.
Results
vs existing unlimited compute entries:
Key Finding
Shared-weight recurrence scales differently than flat architectures. At 132K steps with 5 repeats, each of the 3 blocks saw ~660K effective gradient passes. Progressive depth enables 5 repeats (15 effective layers) from 3 physical blocks — impossible to ramp dynamically with unique-layer architectures.
SWA at scale is massive: 38 checkpoints gave -0.060 bpb — larger than any single architectural change.
Scaling Curve
Test plan