
adaLN_recurrence [val_bpb=1.255 on 4 x H100] #1944

Open

dmitriymyan1 wants to merge 1 commit into openai:main from dmitriymyan1:adaLN_recurrence

Conversation

@dmitriymyan1

Summary

  • Adds adaLN (adaptive layer norm) conditioned on recurrence iteration to the Parallel Residuals + Mini Depth Recurrence baseline
  • Allows weight-tied recurrent layers (4, 5) to distinguish their first vs second pass via lightweight per-channel affine modulation (~6.6K extra parameters, ~zero compute overhead)
  • Zero-initialized projection ensures training starts identically to the baseline (see the sketch below)
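
A minimal PyTorch sketch of the mechanism, with some assumptions: the module and parameter names are illustrative, and mapping a one-hot step indicator through a zero-initialized projection to per-channel (scale, shift) is one plausible reading of the summary. The train_gpt.py in this PR is the authoritative implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepAdaLN(nn.Module):
    """Per-channel affine modulation conditioned on the recurrence step.

    Illustrative sketch: the projection is zero-initialized, so at
    initialization the module reduces to a plain (affine-free) layer
    norm and training starts identically to the baseline.
    """

    def __init__(self, dim: int, num_steps: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # 2 * dim parameters per recurrence step: a scale and a shift
        # for every channel
        self.proj = nn.Linear(num_steps, 2 * dim, bias=False)
        nn.init.zeros_(self.proj.weight)  # exact no-op at init

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        # x: (batch, seq, dim); step is the recurrence iteration (0 or 1)
        onehot = F.one_hot(
            torch.tensor(step, device=x.device), self.proj.in_features
        ).to(self.proj.weight.dtype)
        scale, shift = self.proj(onehot).chunk(2, dim=-1)
        return self.norm(x) * (1.0 + scale) + shift
```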

Early Result

Smoke test on 4×H100 / 600 s (50% of submission compute): val_bpb 1.2551 (val_loss 2.1193 nats), 15.26 MB quantized artifact. Only ~400 recurrent training steps ran before the wallclock cap; the loss curve was still descending cleanly at cutoff. The full 8×H100 run is pending.

Files

  • train_gpt.py — training script with adaLN support (enabled via FILM_ENABLED=1; see the recurrence-loop sketch after this list)
  • README.md — approach description and reproducibility instructions
  • requirements.txt — dependencies (adds brotli)
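
Hypothetical sketch of how the recurrence loop might consult the modulation, gated by the FILM_ENABLED flag. The loop structure, two-pass count, and function names are assumptions drawn from the baseline description, not the submission's exact code:

```python
import os

FILM_ENABLED = os.environ.get("FILM_ENABLED", "0") == "1"

def recurrent_forward(x, shared_blocks, adalns, num_steps=2):
    """Run the weight-tied blocks num_steps times (mini depth recurrence).

    shared_blocks: the tied transformer blocks (layers 4 and 5 here);
    adalns: one StepAdaLN per shared block. All names are illustrative.
    """
    for step in range(num_steps):
        for block, adaln in zip(shared_blocks, adalns):
            h = adaln(x, step) if FILM_ENABLED else x
            x = x + block(h)  # residual update on step-conditioned input
    return x
```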

Add adaLN (adaptive layer norm) conditioned on recurrence iteration to the
Parallel Residuals + Mini Depth Recurrence baseline. Allows weight-tied
recurrent layers to distinguish first vs second pass with ~zero compute
overhead (~6.6K extra parameters).

Early result: val_bpb 1.2551 on 4xH100/600s (half compute, only ~400
recurrent steps before wallclock cap).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>