Commit d4e43e3
Add 6 proven SOTA techniques to base (iterations 1-4)
Architecture (from PR openai#1394/openai#1412/openai#1493):
- 11 layers (was 9), 4x MLP (was 2x), seq_len 2048 (was 1024)
- LeakyReLU(0.5)^2 activation (was ReLU^2)
- 3-layer depth recurrence (L3-5 looped 2 extra times, 17 virtual layers)
- Parallel residuals GPT-J style on layers >= 7
- XSA (exclusive self-attention) on last 11 layers
- Skip gates (learned sigmoid gating on skip connections)
- LN scale factor (1/sqrt(layer_idx+1) per-layer normalization)
- Partial RoPE (rope_dims=16, rest pass-through)
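A minimal sketch of three of the architecture changes above: the squared LeakyReLU(0.5) activation, the learned sigmoid skip gate, and the 1/sqrt(layer_idx+1) LN scale. Function names and the exact wiring of the gate are assumptions for illustration; the real block layout is in the diff.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) followed by squaring (replaces ReLU^2).
    return np.where(x > 0, x, slope * x) ** 2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layernorm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gated_skip(x, sublayer_out, gate_logit):
    # Skip gate: scale the skip path by a learned sigmoid scalar.
    # (One reading of "learned sigmoid gating on skip connections" --
    # the gate could equally sit on the sublayer path; assumption.)
    return sigmoid(gate_logit) * x + sublayer_out

def scaled_ln(x, layer_idx):
    # Per-layer LN scale factor 1/sqrt(layer_idx+1).
    return layernorm(x) / np.sqrt(layer_idx + 1)
```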
Training (from PR openai#1493):
- QK-Gain 5.25 (was 1.5)
- EMA weight averaging (decay=0.9965)
- Tuned hyperparams: WD=0.095, MLR=0.022, EMA=0.9965, warmdown_frac=0.72
- Grad clip norm 0.3 (was 0, i.e. clipping disabled)
- Muon momentum warmup 0.92->0.99 over 1500 steps
- Loop warmup (second warmup phase with looping active)
- Orthogonal weight init for large matrices
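The EMA average and the momentum warmup above can be sketched as follows. The decay and endpoint values come from the list; linear interpolation over the 1500 warmup steps is an assumption, since the commit does not state the schedule's shape.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Muon momentum warmup: 0.92 -> 0.99 over 1500 steps,
    # then held at the final value. Linear ramp is an assumption.
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac

def ema_update(ema_w, w, decay=0.9965):
    # Standard exponential moving average of the weights,
    # applied per tensor after each optimizer step.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_w, w)]
```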
Still using base int8+zlib quantization (GPTQ SDClip upgrade next).
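A sketch of the kind of int8+zlib pipeline the base scheme describes: absmax-scale to int8, then zlib-compress the raw bytes. Symmetric absmax scaling is an assumption about the base quantizer, not taken from the diff.

```python
import zlib
import numpy as np

def pack_int8_zlib(w: np.ndarray):
    # Absmax quantization to int8, then zlib compression of the bytes.
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale, w.shape

def unpack_int8_zlib(blob, scale, shape):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.astype(np.float32).reshape(shape) * scale
```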
Still using SP1024 data (SP8192 blocked on data availability).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>