
Commit 73de3a6

abaybektursun and claude authored and committed
Record: LeakyReLU² + Legal TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)
LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).

3-seed results:
Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB
Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB
Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
Mean: 1.1194 (std 0.0006)

All artifacts under 16MB. All eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0d1fe77 commit 73de3a6

6 files changed

Lines changed: 2857 additions & 0 deletions

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
# LeakyReLU² + Legal Score-First TTT + Parallel Muon

**val_bpb: 1.1194** (3-seed mean, std 0.0006) | **~15.95 MB** | 8×H100 SXM

## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact (bytes) |
|------|----------|-------|-------------|-----------------|----------|----------|----------|
| 1337 | 83.3ms | 7,179 | 1.1218 | **1.1192** | -0.0026 | 410s | 15,977,386 |
| 42 | 83.4ms | 7,182 | 1.1224 | **1.1200** | -0.0024 | 408s | 15,876,510 |
| 2025 | 83.4ms | 7,188 | 1.1218 | **1.1189** | -0.0029 | 408s | 15,990,006 |
| **Mean** | **83.4ms** | **7,183** | **1.1220** | **1.1194 (std 0.0006)** | **-0.0026** | **~409s** | |
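
The mean and spread in the last row can be re-derived from the three per-seed Post-TTT values. The check below is a small sketch that assumes the reported std is the sample (n-1) standard deviation; that convention is not stated in the table, but it is the one that rounds to 0.0006.

```python
import statistics

# Post-TTT bpb per seed, copied from the results table above
post_ttt_bpb = {1337: 1.1192, 42: 1.1200, 2025: 1.1189}

values = list(post_ttt_bpb.values())
mean_bpb = statistics.mean(values)
sample_std = statistics.stdev(values)       # n-1 denominator (assumed convention)
population_std = statistics.pstdev(values)  # n denominator, shown for comparison

print(f"mean={mean_bpb:.4f}  sample_std={sample_std:.4f}  population_std={population_std:.4f}")
```

With only three seeds the two conventions differ visibly, so it is worth stating which one a reported std uses.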

## Key Innovation: LeakyReLU(0.5)²

A one-line activation change worth about -0.003 BPB relative to relu² (the figure ablated in PR #493):

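```python
import torch
import torch.nn.functional as F

# Both lines sit inside the MLP forward pass, where self.fc is the up-projection.

# Standard (relu²)
x = torch.relu(self.fc(x)).square()

# This submission (leaky relu²)
x = F.leaky_relu(self.fc(x), negative_slope=0.5).square()
```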

LeakyReLU with slope 0.5 keeps a nonzero gradient for negative pre-activations, so the MLP can learn from both positive and negative pre-activations. The squaring step still produces non-negative outputs, maintaining the relu² inductive bias while eliminating dead neurons.
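
To make the gradient argument concrete, here is a scalar sketch in plain Python (no PyTorch) of both activations and their derivatives. The function names are illustrative and do not come from the submission's code.

```python
def relu_sq(x: float) -> float:
    """relu(x) ** 2"""
    return max(x, 0.0) ** 2


def relu_sq_grad(x: float) -> float:
    """d/dx relu(x)**2: zero for every non-positive input."""
    return 2.0 * x if x > 0.0 else 0.0


def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """leaky_relu(x, slope) ** 2"""
    y = x if x >= 0.0 else slope * x
    return y * y


def leaky_relu_sq_grad(x: float, slope: float = 0.5) -> float:
    """d/dx leaky_relu(x)**2: 2x for x >= 0, 2 * slope**2 * x for x < 0."""
    return 2.0 * x if x >= 0.0 else 2.0 * slope * slope * x


for x in (-2.0, 3.0):
    print(x, relu_sq(x), relu_sq_grad(x), leaky_relu_sq(x), leaky_relu_sq_grad(x))
```

For a negative input the leaky variant outputs `slope**2 * x**2` (a quarter of the positive-side value at slope 0.5) and passes a nonzero gradient, while relu² outputs zero with zero gradient.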

This activation is used in PR #493 (ablated at -0.003 BPB) and PR #518 (part of their 1.0622 record submission).

## Legal TTT Protocol

Backward-looking, score-first TTT following PR #461's framework:

1. Val tokens split into 1,893 non-overlapping 32K-token chunks
2. **For each chunk**:
   - **SCORE**: Sliding window eval under `torch.inference_mode()`: no autograd graph is recorded, so the scoring pass cannot produce gradients or a weight update
   - **TRAIN**: SGD(lr=0.002, momentum=0.9) on the already-scored chunk. 3 epochs, all blocks unfrozen, cosine LR decay, grad clip 1.0
3. Last chunk scored but never trained on
4. Chunk N scored by model adapted only on chunks 0..N-1
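
The ordering in the steps above can be sketched with a toy stand-in. This is not the submission's code: the "model" is a single scalar weight, the loss is mean squared error, and `score_first_ttt`, `mse`, and `sgd_epochs` are hypothetical names. Sliding-window scoring, momentum, cosine decay, and gradient clipping are omitted; only the score-then-train order and the untouched last chunk are illustrated.

```python
from typing import Callable, List, Sequence, Tuple


def score_first_ttt(
    chunks: Sequence[Sequence[float]],
    score_fn: Callable[[float, Sequence[float]], float],
    train_fn: Callable[[float, Sequence[float]], float],
    weight: float,
) -> Tuple[List[float], float]:
    """Score each chunk before adapting on it; never train on the last chunk."""
    losses: List[float] = []
    for i, chunk in enumerate(chunks):
        # SCORE: the weight has only seen chunks 0..i-1 at this point.
        losses.append(score_fn(weight, chunk))
        # TRAIN: adapt on the chunk that was just scored, except the final one.
        if i < len(chunks) - 1:
            weight = train_fn(weight, chunk)
    return losses, weight


def mse(w: float, chunk: Sequence[float]) -> float:
    return sum((w - t) ** 2 for t in chunk) / len(chunk)


def sgd_epochs(w: float, chunk: Sequence[float], lr: float = 0.1, epochs: int = 3) -> float:
    # Plain gradient descent for a fixed number of epochs (toy hyperparameters).
    for _ in range(epochs):
        grad = sum(2.0 * (w - t) for t in chunk) / len(chunk)
        w -= lr * grad
    return w


chunks = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
losses, final_w = score_first_ttt(chunks, mse, sgd_epochs, weight=0.0)
print(losses, final_w)
```

Because every loss is recorded before the corresponding update, the reported score for chunk N never depends on tokens from chunk N or later.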

`torch.inference_mode()` is a PyTorch context manager that disables gradient tracking, so no autograd graph is recorded while a chunk is scored and no gradient-based update can come out of the scoring pass. It does not by itself block explicit in-place writes to parameters; the score-first guarantee also depends on the optimizer step running only after the chunk has been scored.

### TTT Hyperparameters

| Parameter | Value |
|-----------|-------|
| Chunk size | 32,768 tokens |
| Optimizer | SGD + momentum(0.9) |
| Learning rate | 0.002 (cosine decay across chunks) |
| Epochs per chunk | 3 |
| Frozen blocks | None (all blocks adapt) |
| Gradient clip | 1.0 |
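
The table says only that the learning rate follows a cosine decay across chunks; the exact formula is not given here. One common form, assumed purely for illustration, starts at the base rate on the first chunk and reaches zero on the last. `cosine_chunk_lr` is a hypothetical helper, not a function from `train_gpt.py`.

```python
import math


def cosine_chunk_lr(chunk_idx: int, num_chunks: int, base_lr: float = 0.002) -> float:
    """Half-cosine decay from base_lr (first chunk) to 0 (last chunk). Assumed form."""
    progress = chunk_idx / max(num_chunks - 1, 1)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


num_chunks = 1893  # chunk count from the protocol description
for idx in (0, 946, 1892):
    print(idx, cosine_chunk_lr(idx, num_chunks))
```

Under this assumed schedule, early chunks adapt at nearly the full 0.002 while late chunks barely move the weights.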

### Timing Budget

| Phase | Time |
|-------|------|
| Training | 600s (≤10 min) |
| Standard eval (int6 roundtrip + sliding window) | ~120s |
| Legal TTT (score-first sliding + adaptation) | ~410s |
| **Total eval** | **~530s (< 10 min)** |

## Training Architecture

PR #414 stack with Parameter Banking + Parallel Muon (PR #399):

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 4KV) |
| MLP | 3× with **LeakyReLU(0.5)²** |
| BigramHash | 1536 |
| XSA | Last 4 layers |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/√(layer+1) |
| VE128 | Layers 9-10 |
| Weight avg | EMA(0.997) + Tight SWA(every 50) |
| Quantization | GPTQ-lite int6 + lzma |
| Optimizer | Parameter Banking + Parallel Muon |

### Parameter Banking + Parallel Muon

First introduced in [PR #399](https://github.com/openai/parameter-golf/pull/399):

- 4 contiguous 3D `nn.Parameter` banks replace 66 separate `nn.Linear` weights
- Batched Newton-Schulz orthogonalization via `torch.bmm`
- DDP removed for banks; async reduce-scatter → local NS → async all-gather
- 83.3ms/step vs ~85ms baseline
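
As a rough illustration of the batched Newton-Schulz idea in the list above, the sketch below orthogonalizes every matrix in a small "bank" with the same iteration. It is a pure-Python toy under stated assumptions: it uses the classic cubic iteration X ← 1.5·X − 0.5·(X Xᵀ) X rather than whatever polynomial and step count the submission uses (this README does not state them), and a list comprehension stands in for `torch.bmm`. All helper names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(r) for r in zip(*a)]


def newton_schulz(g: Matrix, steps: int = 20) -> Matrix:
    """Approximate the orthogonal polar factor of g with the cubic iteration."""
    # Scale by the Frobenius norm so every singular value starts at or below 1.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x


def orthogonalize_bank(bank: List[Matrix], steps: int = 20) -> List[Matrix]:
    """Apply the same iteration to each matrix in a bank (a 3D stack of matrices)."""
    return [newton_schulz(g, steps) for g in bank]


bank = [[[2.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 1.0]]]
ortho = orthogonalize_bank(bank)
gram = [matmul(q, transpose(q)) for q in ortho]  # each should be close to identity
print(gram)
```

Stacking same-shaped weights into one 3D tensor lets a single batched matmul replace one call per layer, which is where the banked layout would save time.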

## Run Command

```bash
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Ablation

Incremental contribution of each technique (all seed 1337):

| Change | Pre-TTT bpb | Post-TTT bpb | Delta |
|--------|-------------|-------------|-------|
| PR #414 base (relu², BIGRAM=2048) | 1.1234 | n/a | n/a |
| + Parameter Banking + Parallel Muon | 1.1234 | n/a | ±0.0000 |
| + Legal TTT (3ep, freeze=2) | n/a | 1.1217 | -0.0017 |
| + TTT freeze=0 (all blocks) | n/a | 1.1213 | -0.0004 |
| + BigramHash 2048→3072 | n/a | 1.1204 | -0.0009 |
| + **LeakyReLU(0.5)²** | 1.1213 | **1.1183** | **-0.0021** |
120+
## Credits
121+
122+
- **LeakyReLU² activation**: PR #493 by @jxnl, PR #518
123+
- **Optimizer (Parameter Banking + Parallel Muon)**: [PR #399](https://github.com/openai/parameter-golf/pull/399) by @abaybektursun
124+
- **TTT recipe**: [PR #461](https://github.com/openai/parameter-golf/pull/461) by @anantdgoel
125+
- **Base model**: [PR #414](https://github.com/openai/parameter-golf/pull/414) by @signalrush
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
{
  "name": "LeakyReLU² + Legal Score-First TTT + Parallel Muon",
  "val_bpb": 1.1194,
  "bytes_total": 15990006,
  "blurb": "LeakyReLU(0.5)² activation (-0.003 BPB vs relu²) + legal score-first TTT (PR #461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) + Parameter Banking + Parallel Muon (PR #399). Built on PR #414 stack. 3-seed mean: 1.1194 (std 0.0006). All artifacts under 16MB, all eval under 10 min.",
  "author": "abaybektursun",
  "github_id": "abaybektursun",
  "date": "2026-03-23"
}
