
Non-record: Olmo Hybrid (GDN + Attention) for long-context training — 8k/16k/32k crossover study#1371

Merged
valerio-oai merged 5 commits into openai:main from aarjunsrinivasan:gdn_long_context
May 3, 2026

Conversation

@aarjunsrinivasan
Contributor

Summary

Tests whether replacing most attention layers with Gated DeltaNet (GDN) linear-recurrent layers — the Olmo Hybrid design (Merrill et al., 2026) — benefits a 17M-parameter model at long context. Key finding: there is a clear crossover between 8k and 16k context where the hybrid switches from losing to winning against a full-attention baseline of equal size, within an identical 600-second wall-clock budget on a single H100.

Seq Len   Baseline val_bpb   Hybrid val_bpb     Δ        Winner
8192      1.3507             1.3810             +0.030   Baseline
16384     1.4353             1.3999 (2-seed)    −0.035   Hybrid
32768     1.7059             1.4709 (2-seed)    −0.235   Hybrid

All scores are final_int8_zlib_roundtrip val_bpb. The hybrid avoids the baseline's blow-up at 32k (1.4709 vs. the baseline's 1.7059) by completing 62% more optimizer steps within the same wall-clock budget, and shows better per-step loss at every context length tested.

Methodological approach

Architecture design

The hybrid architecture adapts the Olmo Hybrid design (Merrill et al., 2026) to the parameter-golf regime:

  • Core idea from Olmo Hybrid: interleave Gated DeltaNet (GDN) linear-recurrent layers with full-attention layers at a 3:1 ratio. In the original paper, this replaces the sliding-window attention layers from Olmo 3, motivated by both greater expressivity (hybrid models can express tasks beyond transformers or RNNs alone) and subquadratic inference cost.
  • Implementation: uses the chunk_gated_delta_rule Triton kernel from the flash-linear-attention (FLA) library, plus FLA's causal_conv1d for the short depthwise convolutions on Q/K/V; a naive per-token reference of the recurrence the kernel computes is sketched after this list.
  • Adaptation to 17M params: Width reduced from 512 → 448 to keep parameter count comparable (~17.42M vs ~17.06M, +2%). GDN uses head_dim_ratio=0.75 and expand_v=2.0 (from the Olmo Hybrid configuration) so the smaller head dim offsets the doubled value dimension. Attention layers placed at indices 3 and 7 (approximately every 4th layer in a 9-layer model).
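
For orientation, the following is a minimal, unoptimized per-token sketch of the gated delta rule recurrence that the chunked Triton kernel parallelizes. It is written from the published Gated DeltaNet formulation; the function name, shapes, and the single-head/no-batch simplification are illustrative choices, not the FLA API or this submission's code.

```python
import torch
import torch.nn.functional as F

def gated_delta_rule_reference(q, k, v, alpha, beta):
    """Naive per-token gated delta rule (illustrative, single head, no batch).

    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in (0, 1].
    The chunked Triton kernel computes the same recurrence in parallel chunks.
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)                      # fast-weight state
    eye = torch.eye(d_k)
    outs = []
    for t in range(T):
        k_t = F.normalize(k[t], dim=-1)
        # decay the state and apply the rank-1 delta-rule erase for key k_t ...
        S = alpha[t] * S @ (eye - beta[t] * torch.outer(k_t, k_t))
        # ... then write the new key/value association
        S = S + beta[t] * torch.outer(v[t], k_t)
        outs.append(S @ q[t])                      # read out with the query
    return torch.stack(outs)                       # (T, d_v)
```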

Baseline

The full-attention baseline (train_gpt.py) is the standard parameter-golf architecture: 9-layer U-Net transformer, width 512, 8Q/4KV GQA heads, 2× MLP, RoPE, RMSNorm, FlashAttention, tied embeddings.
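
For side-by-side comparison with the hybrid (width 448, GDN layers everywhere except indices 3 and 7), the baseline's headline hyperparameters can be summarized as below; field names are illustrative and do not necessarily match the actual train_gpt.py variable names.

```python
from dataclasses import dataclass

@dataclass
class BaselineConfig:
    """Rough shape of the full-attention baseline (illustrative only)."""
    n_layers: int = 9          # U-Net-style transformer stack
    d_model: int = 512
    n_q_heads: int = 8         # GQA: 8 query heads ...
    n_kv_heads: int = 4        # ... sharing 4 key/value heads
    mlp_ratio: float = 2.0     # 2x MLP expansion
    rope: bool = True
    rmsnorm: bool = True
    tied_embeddings: bool = True
```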

Optimizer recipe

Both models use the same Muon optimizer recipe, adapted from the TrainingOptSeq4096 record submission (PR #52); a schematic sketch of the recipe follows the list:

  • Muon momentum 0.99, warmup from 0.92
  • LR: tied_embed=0.030, matrix=0.020, scalar=0.020
  • Warmup/warmdown ratios preserved from the 4k record and recalculated per context length based on measured step times
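
The schematic below shows how those numbers fit together: per-group learning rates plus a warmup of the Muon momentum from 0.92 to 0.99. The dictionary keys, the linear ramp, and the helper name are assumptions for illustration, not the record submission's code.

```python
# Illustrative only: per-group LRs and Muon momentum warmup as described above.
LEARNING_RATES = {"tied_embed": 0.030, "matrix": 0.020, "scalar": 0.020}

def muon_momentum(step: int, warmup_steps: int,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Warm Muon momentum from 0.92 to its final value of 0.99.
    (Assumes a linear ramp; the actual schedule shape may differ.)"""
    frac = min(step / max(warmup_steps, 1), 1.0)
    return start + frac * (end - start)
```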

Experimental procedure

  1. Probe runs (documented in compute_experiment.md): Short 100-second runs at 4k, 8k, 16k, and 32k measured step times for both architectures, identifying the step-time crossover between 8k and 16k.
  2. Full 600s experiments: Both architectures trained at 8k, 16k, and 32k on 1×H100 SXM with MAX_WALLCLOCK_SECONDS=600. Optimizer schedules (warmup steps, warmdown iters) recalculated per config to preserve training dynamics (see the sketch after this list).
  3. Multi-seed runs: Hybrid 16k and 32k each run with 2 seeds to quantify variance. Inter-seed spread: ±0.003 bpb at 16k, ±0.001 bpb at 32k.
  4. Evaluation: Standard final_int8_zlib_roundtrip — int8 quantization + zlib compression + round-trip reload, evaluated on the full FineWeb validation split.
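
A hypothetical illustration of how the per-config schedule recalculation can work: keep the 4k record's warmup/warmdown fractions and re-derive step counts from the measured step time and the 600-second budget. The fractions, helper name, and example step time below are assumptions, not values from the submission.

```python
def schedule_for(step_seconds: float, budget_seconds: float = 600.0,
                 warmup_frac: float = 0.02, warmdown_frac: float = 0.25):
    """Preserve warmup/warmdown *ratios* while re-deriving step counts from
    how many optimizer steps fit into the wall-clock budget."""
    total_steps = int(budget_seconds / step_seconds)
    return {
        "total_steps": total_steps,
        "warmup_steps": max(1, round(warmup_frac * total_steps)),
        "warmdown_steps": round(warmdown_frac * total_steps),
    }

# e.g. a config measured at ~0.5 s/step (hypothetical) fits ~1200 steps in 600 s
print(schedule_for(step_seconds=0.5))
```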

What's included

  • README.md — full writeup with motivation, architecture, results, analysis, and reproduction instructions
  • submission.json — metadata with 2-seed results for the 32k headline number
  • train_gpt.py — baseline model (full-attention)
  • train_gpt_gdn.py — hybrid GDN model
  • run_hybrid_long_context_single_h100.sh / run_hybrid_long_context_all_single_h100.sh — hybrid reproduction scripts
  • run_baseline_long_context_single_h100.sh — baseline reproduction script
  • requirements.txt — extra dependency (flash-linear-attention)
  • compute_experiment.md — early probe runs that motivated the full experiment
  • logs_experiment/ — 8 log files (3 baseline: 8k, 16k, 32k; 5 hybrid: 8k, 16k × 2 seeds, 32k × 2 seeds)

References

  • Olmo Hybrid: Merrill et al., "Olmo Hybrid: From Theory to Practice," AI2, 2026. Architecture design, 3:1 GDN:attention ratio, interleaved placement strategy.
  • flash-linear-attention: fla-org/flash-linear-attention. Triton kernels for chunk_gated_delta_rule and causal_conv1d.
  • TrainingOptSeq4096: Spokane Way, parameter-golf record submission, 2026-03-19. Optimizer recipe (Muon momentum schedule, LR, warmup/warmdown ratios) adapted for this work.

Submission checklist

  • Non-record track (track_non_record_16mb)
  • README with detailed justification
  • submission.json with val_bpb, val_loss, seed results
  • Training scripts compile and run from repo root
  • Training logs for all reported configurations
  • Artifact fits in 16MB (14.06MB compressed)

aarjunsrinivasan and others added 5 commits April 3, 2026 23:52
train_gpt_gdn.py: 9-layer hybrid with 7 GDN (Gated Delta Net) recurrent
layers and 2 full-attention layers (indices 3, 7), width 448, ~17.4M params.
train_gpt.py: full-attention baseline, width 512, ~17.1M params.
requirements.txt: adds flash-linear-attention (fla) dependency for
chunk_gated_delta_rule kernel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scripts for launching hybrid (single context length, all three lengths)
and baseline runs on a single H100 with 600s wall-clock budget across
8k, 16k, and 32k sequence lengths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
submission.json: val_bpb 1.4709 (2-seed mean) at 32k context.
logs_experiment/: training logs for all baseline and hybrid runs across
8k/16k/32k sequence lengths.
README.md: full writeup with results table, analysis, and reproduction steps.
compute_experiment.md: step-time probe notes motivating the crossover analysis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@valerio-oai valerio-oai left a comment


Selected for the notable non-record submissions section.

@valerio-oai valerio-oai merged commit d05378d into openai:main May 3, 2026
