
Non-record: 4-Hour Progressive Depth — val_bpb 1.0889 #895

Open
iverbovoy wants to merge 1 commit into openai:main from iverbovoy:4hour-run

Conversation

@iverbovoy

Summary

Depth recurrence scaling study — first data point on how shared-weight recurrence scales with extended compute.

Results

Eval                   val_bpb
Roundtrip              1.1613
Sliding window         1.1271
Sliding + Hedge Mixer  1.0889

vs existing unlimited-compute entries:

  • Will DePue 4-hour flat baseline: 1.2074 (ours −0.119 bpb)
  • Ciprian-Florin Ifrim 2-hour 1-bit quant: 1.1239 (ours −0.035 bpb)
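The "Sliding window" row presumably refers to overlapping-window evaluation, where each stride of the window scores only its newest tokens so every scored token keeps long left context. A minimal sketch of that eval plus the bpb conversion, assuming a causal `model` that returns per-position logits; all names here are hypothetical, not the submission's actual eval code:

```python
import math
import torch
import torch.nn.functional as F

def sliding_window_bpb(model, tokens, bytes_per_token, window=1024, stride=256):
    # Hypothetical sketch: slide a window over the token stream and count
    # only the targets not yet scored, so each gets long left context.
    total_nats, total_targets, prev_end = 0.0, 0, 0
    n = len(tokens) - 1                          # last usable input index
    for begin in range(0, n, stride):
        end = min(begin + window, n)
        inputs = tokens[begin:end].unsqueeze(0)  # (1, T)
        targets = tokens[begin + 1 : end + 1]    # next-token targets
        logits = model(inputs).squeeze(0)        # (T, vocab)
        nll = F.cross_entropy(logits, targets, reduction="none")
        new = end - prev_end                     # targets new to this window
        total_nats += nll[-new:].sum().item()
        total_targets += new
        prev_end = end
        if end == n:
            break
    # nats/token -> bits/token -> bits/byte
    return total_nats / total_targets / math.log(2) / bytes_per_token
```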

Key Finding

Shared-weight recurrence scales differently than flat architectures. At 132K steps with 5 repeats, each of the 3 blocks saw ~660K effective gradient passes. Progressive depth enables 5 repeats (15 effective layers) from 3 physical blocks — impossible to ramp dynamically with unique-layer architectures.
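Concretely, the weight sharing can be as simple as looping the same blocks; a minimal sketch, assuming standard transformer blocks (`block_fn`, the class, and the argument names are hypothetical, not the submission's code). Because depth is just a loop count over fixed parameters, `n_repeat` can be raised mid-training without touching the optimizer state:

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    # Hypothetical sketch: 3 physical blocks reused n_repeat times per
    # pass, so 5 repeats give 15 effective layers from 3 blocks' weights.
    def __init__(self, block_fn, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([block_fn() for _ in range(n_blocks)])

    def forward(self, x, n_repeat):
        for _ in range(n_repeat):       # ramped 2 -> 3 -> 4 -> 5 during training
            for block in self.blocks:   # same weights on every pass
                x = block(x)
        return x
```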

SWA at this scale is a large win: averaging 38 checkpoints gave −0.060 bpb, a bigger improvement than any single architectural change.
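For the SWA number, a minimal sketch of uniform checkpoint averaging, assuming 38 saved `state_dict` files; the function and argument names are hypothetical:

```python
import torch

def average_checkpoints(paths):
    # Hypothetical sketch: uniform average of floating-point tensors across
    # checkpoints (38 in this run); non-float entries (e.g. step counters)
    # are taken from the first checkpoint as-is.
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: (v.clone().float() if v.is_floating_point() else v.clone())
                   for k, v in state.items()}
        else:
            for k, v in state.items():
                if v.is_floating_point():
                    avg[k] += v.float()
    return {k: (v / len(paths) if v.is_floating_point() else v)
            for k, v in avg.items()}
```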

Scaling Curve

Steps  Phase        val_bpb
5K     2 rep        1.306
55K    2 rep        1.265
85K    3 rep        1.244
110K   4 rep        1.232
125K   5 rep        1.218
132K   5 rep + SWA  1.158
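As a step-to-repeats lookup, a ramp consistent with the table above looks like the sketch below; the PR logs sampled points, not exact transition steps, so the boundaries here are illustrative assumptions that merely bracket the logged values:

```python
# Hypothetical schedule: (start_step, n_repeat) phases. Exact transition
# steps are not reported; these are assumed boundaries between logged points.
PHASES = [(0, 2), (60_000, 3), (90_000, 4), (115_000, 5)]

def repeats_for_step(step):
    n_repeat = PHASES[0][1]
    for start, reps in PHASES:
        if step >= start:
            n_repeat = reps
    return n_repeat

assert repeats_for_step(55_000) == 2 and repeats_for_step(125_000) == 5
```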

Test plan

  • Full 4-hour run on 8×H100 (14,400 s wallclock)
  • All 4 phase transitions completed without errors
  • Artifact 15.82 MB, under the 16 MB limit
  • Hedge Mixer eval: 696 s (score-first, no training-data access)

@iverbovoy
Author

Superseded by #1384 — clean submission with 3-seed validation.

iverbovoy closed this Apr 5, 2026
@iverbovoy
Author

Reopened — this is a separate unlimited compute track submission, not superseded by #1384.

iverbovoy reopened this Apr 5, 2026
@iverbovoy
Author

iverbovoy commented Apr 20, 2026

4-hour track non-record, depth-recurrence scaling

(Edit: cleaned up this PR — now contains only the 4-hour submission dir, 4 files, no accumulated cruft from sibling PRs or baseline-file changes. Earlier version had unrelated 10-min submissions and modified top-level files, which may have been confusing during review.)

Matches a direct "Request for PRs" from the README

The repo README lists under Requests for PRs:

Universal transformer: We have lots of depth recurrence submissions, but I'd love to see one 4 hour.

This PR is exactly that. Same 3-shared-block architecture as #1453 (10-min, val_bpb 1.1324), extended to 4 hours on 8×H100: val_bpb 1.0889 after 132K steps with a progressive-depth schedule 2→3→4→5 repeats (15 effective layers from 3 physical blocks), 38 SWA checkpoints, Hedge Mixer eval.

Current 4-hour non-record leaderboard

Author                Score   Approach                                            Date
Will DePue            1.2074  4-hour flat baseline (9×512)                        2026-03-18
Ciprian-Florin Ifrim  1.1239  2-hour 1-bit quantization (106M params)             2026-03-24
This PR (#895)        1.0889  4-hour depth recurrence (3 shared blocks × 5 reps)  2026-03-26 (open)

This beats both existing entries:

  • −0.119 bpb vs Will DePue's 4-hour flat baseline
  • −0.035 bpb vs Ciprian-Florin Ifrim's 1-bit submission (on twice the compute of that entry)

Key observation from the scale-up

At 132K steps with 5 repeats, each of the 3 physical blocks saw ~660K effective gradient passes. Progressive depth (ramping 2 → 3 → 4 → 5 repeats during training) is only possible with shared-weight recurrence — flat architectures can't ramp depth mid-training. SWA at this scale contributed −0.060 bpb across 38 checkpoints, larger than any single architectural change we tested.
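The effective-pass figure is plain arithmetic at the final-phase rate (a rough estimate, since earlier phases used fewer repeats):

```python
steps, repeats, blocks = 132_000, 5, 3
passes_per_block = steps * repeats    # each block runs once per repeat per step
effective_layers = blocks * repeats
print(passes_per_block, effective_layers)  # 660000 15
```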

Scaling curve

Steps  Phase               val_bpb
5K     2 rep               1.306
55K    2 rep               1.265
85K    3 rep               1.244
110K   4 rep               1.232
125K   5 rep               1.218
132K   5 rep + SWA         1.158
       + Hedge Mixer eval  1.0889

Compliance

The closely related 10-min PR #1453 uses the same architecture and Hedge Mixer implementation and received a community compliance review from @MatoTeziTanka on 2026-04-12: "LOOKS CLEAN, no illegal hash pattern, legal score-first eval."
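The PR doesn't inline the Hedge Mixer itself; one reading of "score-first" is a Hedge-style (multiplicative-weights) mixture over per-token expert predictions, where each position is scored under the current weights and only afterwards do those weights see that position's losses, so the score never uses information from the token being scored or anything later. A minimal sketch under that assumption (the function name, `eta`, and the expert setup are not from the submission):

```python
import torch

def hedge_mix_nll(expert_nll, eta=0.1):
    # Hypothetical sketch of a score-first Hedge mixer. expert_nll is an
    # (n_experts, T) tensor of per-token NLLs in nats from each expert
    # (e.g. different checkpoints) over the eval stream.
    n_experts, T = expert_nll.shape
    log_w = torch.zeros(n_experts)   # uniform prior over experts
    total = 0.0
    for t in range(T):
        w = torch.softmax(log_w, dim=0)
        # Score first: mixture NLL under the *current* weights.
        total += -torch.logsumexp(torch.log(w) - expert_nll[:, t], dim=0).item()
        # Update second: multiplicative-weights step on this token's losses.
        log_w = log_w - eta * expert_nll[:, t]
    return total / T  # mean nats/token; divide by ln(2) * bytes/token for bpb
```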

Summary

Full retrospective, all 7 related PRs, experiments catalog, reproduction config: SUMMARY.md.

Would appreciate a look from @cocohearts @valerio-oai @0hq @willdepue to determine if this fits the non-record track criteria, and whether the "Universal transformer 4-hour" request is considered satisfied. Happy to address any feedback on the submission format.

3 shared blocks with progressive depth (2->3->4->5 repeats, 15 effective layers),
132K steps on 8xH100, 38 SWA checkpoints, Hedge Mixer eval.

Architecture is the same recurrent design as 10-min submission openai#1453 (val_bpb 1.1324).
This PR is the 4-hour companion exploring how shared-weight recurrence scales with
extended compute.

Beats existing non-record 4-hour entries:
- Will DePue 4-hour flat baseline (1.2074): 0.119 bpb lower
- Ciprian-Florin Ifrim 2-hour 1-bit (1.1239): 0.035 bpb lower
