Commit d4e43e3
Add 6 proven SOTA techniques to base (iterations 1-4)
Architecture (from PR openai#1394/openai#1412/openai#1493):
- 11 layers (was 9), 4x MLP (was 2x), seq_len 2048 (was 1024)
- LeakyReLU(0.5)^2 activation (was ReLU^2)
- 3-layer depth recurrence (L3-5 looped 2 extra times, 17 virtual layers)
- Parallel residuals GPT-J style on layers >= 7
- XSA (exclusive self-attention) on last 11 layers
- Skip gates (learned sigmoid gating on skip connections)
- LN scale factor (1/sqrt(layer_idx+1) per-layer normalization)
- Partial RoPE (rope_dims=16, rest pass-through)
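A minimal sketch of three of the architecture changes above: the squared LeakyReLU(0.5) activation, the learned sigmoid skip gate, and the 1/sqrt(layer_idx+1) LN scale. Function names and the exact wiring of the gate are assumptions for illustration; the real block layout is in the diff.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) followed by squaring (replaces ReLU^2).
    return np.where(x > 0, x, slope * x) ** 2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layernorm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gated_skip(x, sublayer_out, gate_logit):
    # Skip gate: scale the skip path by a learned sigmoid scalar.
    # (One reading of "learned sigmoid gating on skip connections" --
    # the gate could equally sit on the sublayer path; assumption.)
    return sigmoid(gate_logit) * x + sublayer_out

def scaled_ln(x, layer_idx):
    # Per-layer LN scale factor 1/sqrt(layer_idx+1).
    return layernorm(x) / np.sqrt(layer_idx + 1)
```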
Training (from PR openai#1493):
- QK-Gain 5.25 (was 1.5)
- EMA weight averaging (decay=0.9965)
- Tuned hyperparams: WD=0.095, MLR=0.022, EMA=0.9965, warmdown_frac=0.72
- Grad clip norm 0.3 (was 0, i.e. clipping disabled)
- Muon momentum warmup 0.92->0.99 over 1500 steps
- Loop warmup (second warmup phase with looping active)
- Orthogonal weight init for large matrices
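The EMA average and the momentum warmup above can be sketched as follows. The decay and endpoint values come from the list; linear interpolation over the 1500 warmup steps is an assumption, since the commit does not state the schedule's shape.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Muon momentum warmup: 0.92 -> 0.99 over 1500 steps,
    # then held at the final value. Linear ramp is an assumption.
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac

def ema_update(ema_w, w, decay=0.9965):
    # Standard exponential moving average of the weights,
    # applied per tensor after each optimizer step.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_w, w)]
```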
Still using base int8+zlib quantization (GPTQ SDClip upgrade next).
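A sketch of the kind of int8+zlib pipeline the base scheme describes: absmax-scale to int8, then zlib-compress the raw bytes. Symmetric absmax scaling is an assumption about the base quantizer, not taken from the diff.

```python
import zlib
import numpy as np

def pack_int8_zlib(w: np.ndarray):
    # Absmax quantization to int8, then zlib compression of the bytes.
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale, w.shape

def unpack_int8_zlib(blob, scale, shape):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.astype(np.float32).reshape(shape) * scale
```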
Still using SP1024 data (SP8192 blocked on data availability).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>