Non-record: Olmo Hybrid (GDN + Attention) for long-context training — 8k/16k/32k crossover study #1371
Merged
valerio-oai merged 5 commits into openai:main on May 3, 2026
Conversation
train_gpt_gdn.py: 9-layer hybrid with 7 GDN (Gated Delta Net) recurrent layers and 2 full-attention layers (indices 3, 7), width 448, ~17.4M params. train_gpt.py: full-attention baseline, width 512, ~17.1M params. requirements.txt: adds flash-linear-attention (fla) dependency for chunk_gated_delta_rule kernel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scripts for launching hybrid (single context length, all three lengths) and baseline runs on a single H100 with 600s wall-clock budget across 8k, 16k, and 32k sequence lengths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
submission.json: val_bpb 1.4709 (2-seed mean) at 32k context. logs_experiment/: training logs for all baseline and hybrid runs across 8k/16k/32k sequence lengths. README.md: full writeup with results table, analysis, and reproduction steps. compute_experiment.md: step-time probe notes motivating the crossover analysis. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
valerio-oai approved these changes on May 3, 2026
valerio-oai (Contributor) left a comment:
Selected for the notable non-record submissions section.
Summary
Tests whether replacing most attention layers with Gated DeltaNet (GDN) linear-recurrent layers — the Olmo Hybrid design (Merrill et al., 2026) — benefits a 17M-parameter model at long context. Key finding: there is a clear crossover between 8k and 16k context where the hybrid switches from losing to winning against a full-attention baseline of equal size, within an identical 600-second wall-clock budget on a single H100.
All scores are `final_int8_zlib_roundtrip` val_bpb. The hybrid avoids the baseline's blow-up at 32k (1.7059 → 1.4709) by completing 62% more optimizer steps in the same wall-clock budget, and it achieves better per-step loss at every context length tested.

Methodological approach
Architecture design
The hybrid architecture adapts the Olmo Hybrid design (Merrill et al., 2026) to the parameter-golf regime:
- GDN layers use the `chunk_gated_delta_rule` Triton kernel from the flash-linear-attention (FLA) library, plus FLA's `causal_conv1d` for the short depthwise convolutions on Q/K/V (see the sketch after this list).
- `head_dim_ratio=0.75` and `expand_v=2.0` (from the Olmo Hybrid configuration), so the smaller head dim offsets the doubled value dimension.
- Attention layers are placed at indices 3 and 7 (approximately every 4th layer in a 9-layer model).
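To make the GDN side concrete, here is a minimal sketch of the layer-placement rule and a thin wrapper around the FLA kernel. The import path, argument names, and tensor shapes are assumptions about the flash-linear-attention API (they vary across versions), and the helper names are hypothetical rather than taken from `train_gpt_gdn.py`:

```python
import torch
# Assumed import path: flash-linear-attention ships the chunked Gated DeltaNet
# kernel, but the module layout and exact signature should be checked against
# the installed fla version.
from fla.ops.gated_delta_rule import chunk_gated_delta_rule

N_LAYER = 9
ATTN_LAYERS = {3, 7}  # the two full-attention layers; all other layers are GDN


def is_attention_layer(idx: int) -> bool:
    """Layer-placement rule described above: attention at indices 3 and 7."""
    return idx in ATTN_LAYERS


def gdn_mix(q, k, v, g, beta):
    """Hypothetical wrapper around FLA's chunked gated delta rule.

    Illustrative shapes: q and k are (batch, seq, heads, head_dim) with head_dim
    shrunk by head_dim_ratio=0.75, while v is (batch, seq, heads, 2 * head_dim)
    because expand_v=2.0 doubles the value dimension. g is the per-step decay
    gate and beta the delta-rule write strength.
    """
    out = chunk_gated_delta_rule(q, k, v, g, beta)
    # Depending on the fla version this returns (output, final_state).
    if isinstance(out, tuple):
        out = out[0]
    return out
```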
Baseline

The full-attention baseline (`train_gpt.py`) is the standard parameter-golf architecture: 9-layer U-Net transformer, width 512, 8Q/4KV GQA heads, 2× MLP, RoPE, RMSNorm, FlashAttention, tied embeddings.
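Restated as a small config sketch (field names are illustrative and hypothetical, not taken from `train_gpt.py`):

```python
from dataclasses import dataclass


@dataclass
class BaselineConfig:
    # Illustrative restatement of the baseline described above.
    n_layer: int = 9            # U-Net transformer depth
    width: int = 512            # model width (~17.1M params total)
    n_q_heads: int = 8          # query heads
    n_kv_heads: int = 4         # KV heads (grouped-query attention)
    mlp_ratio: float = 2.0      # 2x MLP hidden width
    tied_embeddings: bool = True
    # Blocks use RoPE, RMSNorm, and FlashAttention as noted above.
```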
Optimizer recipe

Both models use the same Muon optimizer recipe, adapted from the TrainingOptSeq4096 record submission (PR #52).
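For readers unfamiliar with the recipe: the core of Muon is momentum SGD whose 2D weight updates are approximately orthogonalized with a few Newton-Schulz iterations. A minimal sketch follows; the coefficients are the commonly used NS5 values, and the simplified step below omits details such as Nesterov momentum, shape-dependent LR scaling, and the Muon/Adam parameter split, which live in the training scripts:

```python
import torch


def newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update via quintic Newton-Schulz iterations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X


def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95):
    """One simplified Muon update for a 2D weight: momentum, orthogonalize, step."""
    momentum_buf.mul_(momentum).add_(grad)
    update = newtonschulz5(momentum_buf)
    param.data.add_(update.to(param.dtype), alpha=-lr)
```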
Experimental procedure
- Step-time probe (`compute_experiment.md`): short 100-second runs at 4k, 8k, 16k, and 32k measured step times for both architectures, identifying the step-time crossover between 8k and 16k.
- Main runs: every configuration trained under `MAX_WALLCLOCK_SECONDS=600`, with optimizer schedules (warmup steps, warmdown iters) recalculated per config to preserve training dynamics.
- Scoring: `final_int8_zlib_roundtrip` (int8 quantization + zlib compression + round-trip reload), evaluated on the full FineWeb validation split; see the sketch after this list.
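A rough sketch of what the `final_int8_zlib_roundtrip` idea involves, assuming per-tensor symmetric int8 quantization; the repo's actual scoring (per-channel scales, which tensors count toward the byte budget, zlib level) may differ:

```python
import zlib
import torch


def int8_zlib_roundtrip(model):
    """Sketch: quantize weights to int8, measure zlib-compressed size, then
    dequantize and reload ("round-trip") before evaluating val_bpb."""
    compressed_bytes = 0
    restored = {}
    for name, p in model.state_dict().items():
        if not p.is_floating_point():
            restored[name] = p  # leave integer buffers untouched
            continue
        w = p.detach().float().cpu()
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        q = (w / scale).round().clamp(-127, 127).to(torch.int8)
        compressed_bytes += len(zlib.compress(q.numpy().tobytes()))
        restored[name] = (q.float() * scale).to(p.dtype)  # round-trip reload
    model.load_state_dict(restored)
    # val_bpb is then computed with the reloaded (quantization-degraded) weights.
    return compressed_bytes
```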
What's included

- `README.md`: full writeup with motivation, architecture, results, analysis, and reproduction instructions
- `submission.json`: metadata with 2-seed results for the 32k headline number
- `train_gpt.py`: baseline model (full attention)
- `train_gpt_gdn.py`: hybrid GDN model
- `run_hybrid_long_context_single_h100.sh` / `run_hybrid_long_context_all_single_h100.sh`: hybrid reproduction scripts
- `run_baseline_long_context_single_h100.sh`: baseline reproduction script
- `requirements.txt`: extra dependency (flash-linear-attention)
- `compute_experiment.md`: early probe runs that motivated the full experiment
- `logs_experiment/`: 8 log files (3 baseline × {8k, 16k, 32k} + 5 hybrid × {8k, 16k × 2 seeds, 32k × 2 seeds})

References
- flash-linear-attention (FLA) library, used for `chunk_gated_delta_rule` and `causal_conv1d`

Submission checklist
- Track: non-record (`track_non_record_16mb`)
- `submission.json` with val_bpb, val_loss, seed results