Non-record: Olmo Hybrid (GDN + Attention) for long-context training — 8k/16k/32k crossover study #1371
Merged
valerio-oai merged 5 commits into openai:main on May 3, 2026
Conversation
train_gpt_gdn.py: 9-layer hybrid with 7 GDN (Gated Delta Net) recurrent layers and 2 full-attention layers (indices 3, 7), width 448, ~17.4M params. train_gpt.py: full-attention baseline, width 512, ~17.1M params. requirements.txt: adds flash-linear-attention (fla) dependency for chunk_gated_delta_rule kernel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scripts for launching hybrid (single context length, all three lengths) and baseline runs on a single H100 with 600s wall-clock budget across 8k, 16k, and 32k sequence lengths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
submission.json: val_bpb 1.4709 (2-seed mean) at 32k context. logs_experiment/: training logs for all baseline and hybrid runs across 8k/16k/32k sequence lengths. README.md: full writeup with results table, analysis, and reproduction steps. compute_experiment.md: step-time probe notes motivating the crossover analysis. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
valerio-oai approved these changes on May 3, 2026
valerio-oai (Contributor) left a comment:
Selected for the notable non-record submissions section.
Summary
Tests whether replacing most attention layers with Gated DeltaNet (GDN) linear-recurrent layers — the Olmo Hybrid design (Merrill et al., 2026) — benefits a 17M-parameter model at long context. Key finding: there is a clear crossover between 8k and 16k context where the hybrid switches from losing to winning against a full-attention baseline of equal size, within an identical 600-second wall-clock budget on a single H100.
All scores are `final_int8_zlib_roundtrip` val_bpb. The hybrid avoids the baseline's blow-up at 32k (1.7059 → 1.4709) by completing 62% more optimizer steps in the same wall-clock budget, and it achieves better per-step loss at every context length tested.

Methodological approach
Architecture design
The hybrid architecture adapts the Olmo Hybrid design (Merrill et al., 2026) to the parameter-golf regime:
- GDN layers use the `chunk_gated_delta_rule` Triton kernel from the flash-linear-attention (FLA) library, plus FLA's `causal_conv1d` for the short depthwise convolutions on Q/K/V (see the sketch after this list).
- `head_dim_ratio=0.75` and `expand_v=2.0` (from the Olmo Hybrid configuration), so the smaller head dim offsets the doubled value dimension.
- Attention layers are placed at indices 3 and 7 (approximately every 4th layer in a 9-layer model).
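To make the GDN side concrete, here is a minimal sketch of the layer-placement rule and a thin wrapper around the FLA kernel. The import path, argument names, and tensor shapes are assumptions about the flash-linear-attention API (they vary across versions), and the helper names are hypothetical rather than taken from `train_gpt_gdn.py`:

```python
import torch
# Assumed import path: flash-linear-attention ships the chunked Gated DeltaNet
# kernel, but the module layout and exact signature should be checked against
# the installed fla version.
from fla.ops.gated_delta_rule import chunk_gated_delta_rule

N_LAYER = 9
ATTN_LAYERS = {3, 7}  # the two full-attention layers; all other layers are GDN


def is_attention_layer(idx: int) -> bool:
    """Layer-placement rule described above: attention at indices 3 and 7."""
    return idx in ATTN_LAYERS


def gdn_mix(q, k, v, g, beta):
    """Hypothetical wrapper around FLA's chunked gated delta rule.

    Illustrative shapes: q and k are (batch, seq, heads, head_dim) with head_dim
    shrunk by head_dim_ratio=0.75, while v is (batch, seq, heads, 2 * head_dim)
    because expand_v=2.0 doubles the value dimension. g is the per-step decay
    gate and beta the delta-rule write strength.
    """
    out = chunk_gated_delta_rule(q, k, v, g, beta)
    # Depending on the fla version this returns (output, final_state).
    if isinstance(out, tuple):
        out = out[0]
    return out
```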
Baseline

The full-attention baseline (`train_gpt.py`) is the standard parameter-golf architecture: 9-layer U-Net transformer, width 512, 8Q/4KV GQA heads, 2× MLP, RoPE, RMSNorm, FlashAttention, tied embeddings.
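Restated as a small config sketch (field names are illustrative and hypothetical, not taken from `train_gpt.py`):

```python
from dataclasses import dataclass


@dataclass
class BaselineConfig:
    # Illustrative restatement of the baseline described above.
    n_layer: int = 9            # U-Net transformer depth
    width: int = 512            # model width (~17.1M params total)
    n_q_heads: int = 8          # query heads
    n_kv_heads: int = 4         # KV heads (grouped-query attention)
    mlp_ratio: float = 2.0      # 2x MLP hidden width
    tied_embeddings: bool = True
    # Blocks use RoPE, RMSNorm, and FlashAttention as noted above.
```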
Optimizer recipe

Both models use the same Muon optimizer recipe, adapted from the TrainingOptSeq4096 record submission (PR #52).
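For readers unfamiliar with the recipe: the core of Muon is momentum SGD whose 2D weight updates are approximately orthogonalized with a few Newton-Schulz iterations. A minimal sketch follows; the coefficients are the commonly used NS5 values, and the simplified step below omits details such as Nesterov momentum, shape-dependent LR scaling, and the Muon/Adam parameter split, which live in the training scripts:

```python
import torch


def newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update via quintic Newton-Schulz iterations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X


def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95):
    """One simplified Muon update for a 2D weight: momentum, orthogonalize, step."""
    momentum_buf.mul_(momentum).add_(grad)
    update = newtonschulz5(momentum_buf)
    param.data.add_(update.to(param.dtype), alpha=-lr)
```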
Experimental procedure
- Step-time probe (`compute_experiment.md`): short 100-second runs at 4k, 8k, 16k, and 32k measured step times for both architectures, identifying the step-time crossover between 8k and 16k.
- Main runs: every configuration trained under `MAX_WALLCLOCK_SECONDS=600`, with optimizer schedules (warmup steps, warmdown iters) recalculated per config to preserve training dynamics.
- Scoring: `final_int8_zlib_roundtrip` (int8 quantization + zlib compression + round-trip reload), evaluated on the full FineWeb validation split; see the sketch after this list.
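A rough sketch of what the `final_int8_zlib_roundtrip` idea involves, assuming per-tensor symmetric int8 quantization; the repo's actual scoring (per-channel scales, which tensors count toward the byte budget, zlib level) may differ:

```python
import zlib
import torch


def int8_zlib_roundtrip(model):
    """Sketch: quantize weights to int8, measure zlib-compressed size, then
    dequantize and reload ("round-trip") before evaluating val_bpb."""
    compressed_bytes = 0
    restored = {}
    for name, p in model.state_dict().items():
        if not p.is_floating_point():
            restored[name] = p  # leave integer buffers untouched
            continue
        w = p.detach().float().cpu()
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        q = (w / scale).round().clamp(-127, 127).to(torch.int8)
        compressed_bytes += len(zlib.compress(q.numpy().tobytes()))
        restored[name] = (q.float() * scale).to(p.dtype)  # round-trip reload
    model.load_state_dict(restored)
    # val_bpb is then computed with the reloaded (quantization-degraded) weights.
    return compressed_bytes
```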
What's included

- `README.md`: full writeup with motivation, architecture, results, analysis, and reproduction instructions
- `submission.json`: metadata with 2-seed results for the 32k headline number
- `train_gpt.py`: baseline model (full attention)
- `train_gpt_gdn.py`: hybrid GDN model
- `run_hybrid_long_context_single_h100.sh` / `run_hybrid_long_context_all_single_h100.sh`: hybrid reproduction scripts
- `run_baseline_long_context_single_h100.sh`: baseline reproduction script
- `requirements.txt`: extra dependency (flash-linear-attention)
- `compute_experiment.md`: early probe runs that motivated the full experiment
- `logs_experiment/`: 8 log files (3 baseline × {8k, 16k, 32k} + 5 hybrid × {8k, 16k × 2 seeds, 32k × 2 seeds})

References
- flash-linear-attention (FLA) library, used for `chunk_gated_delta_rule` and `causal_conv1d`

Submission checklist
- Track: non-record (`track_non_record_16mb`)
- `submission.json` with val_bpb, val_loss, seed results