| 1 | +# LeakyReLU² + Legal Score-First TTT + Parallel Muon |
| 2 | + |
| 3 | +**val_bpb: 1.1194** (3-seed mean, std 0.0006) | **~15.95 MB** | 8×H100 SXM |
| 4 | + |
| 5 | +## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128) |
| 6 | + |
| 7 | +| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact (bytes) | |
| 8 | +|------|----------|-------|-------------|-----------------|----------|----------|----------| |
| 9 | +| 1337 | 83.3ms | 7,179 | 1.1218 | **1.1192** | -0.0026 | 410s | 15,977,386 | |
| 10 | +| 42 | 83.4ms | 7,182 | 1.1224 | **1.1200** | -0.0024 | 408s | 15,876,510 | |
| 11 | +| 2025 | 83.4ms | 7,188 | 1.1218 | **1.1189** | -0.0029 | 408s | 15,990,006 | |
| 12 | +| **Mean** | **83.4ms** | **7,183** | **1.1220** | **1.1194 (std 0.0006)** | **-0.0026** | **~409s** | | |
| 13 | + |
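As a quick consistency check, the summary row can be recomputed from the three per-seed rows with the standard library alone. The reported std of 0.0006 matches the sample standard deviation (n-1 denominator); that reading is inferred from the numbers, not stated by the submission.

```python
import statistics

# Per-seed values copied from the results table above.
post_ttt_bpb = [1.1192, 1.1200, 1.1189]            # seeds 1337, 42, 2025
artifact_bytes = [15_977_386, 15_876_510, 15_990_006]

mean_bpb = statistics.mean(post_ttt_bpb)
std_bpb = statistics.stdev(post_ttt_bpb)           # sample std (n-1 denominator)
mean_mb = statistics.mean(artifact_bytes) / 1e6    # decimal megabytes

print(f"mean={mean_bpb:.4f} std={std_bpb:.4f} artifact={mean_mb:.2f} MB")
```

This reproduces the headline figures: 1.1194 mean bpb, 0.0006 std, and roughly 15.95 MB.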
| 14 | +## Key Innovation: LeakyReLU(0.5)² |
| 15 | + |
| 16 | +One-line activation change that delivers about -0.003 BPB in PR #493's ablation (-0.0021 post-TTT in the ablation below): |
| 17 | + |
| 18 | +```python |
| 19 | +# Standard (relu²) |
| 20 | +x = torch.relu(self.fc(x)).square() |
| 21 | + |
| 22 | +# This submission (leaky relu²) |
| 23 | +x = F.leaky_relu(self.fc(x), negative_slope=0.5).square() |
| 24 | +``` |
| 25 | + |
| 26 | +LeakyReLU with slope 0.5 keeps a nonzero gradient for negative pre-activations, so the MLP learns from both signs of its input (a negative pre-activation x maps to 0.25x² instead of 0). Squaring still yields non-negative outputs, which keeps the relu² inductive bias while eliminating dead neurons. |
| 27 | + |
| 28 | +This activation is used in PR #493 (ablated at -0.003 BPB) and PR #518 (part of their 1.0622 record submission). |
| 29 | + |
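To make the gradient argument concrete, here is a dependency-free scalar sketch of both activations and their derivatives. The function names are illustrative and do not come from the submission; the model itself applies the `F.leaky_relu(...).square()` line shown above to tensors.

```python
def relu_sq(x: float) -> float:
    """relu(x)**2: zero for every negative input."""
    return max(x, 0.0) ** 2

def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """leaky_relu(x, slope)**2: negative inputs map to (slope * x)**2."""
    y = x if x >= 0 else slope * x
    return y * y

def d_relu_sq(x: float) -> float:
    """Derivative of relu_sq: 2x for x > 0, otherwise 0 (no learning signal)."""
    return 2.0 * x if x > 0 else 0.0

def d_leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Derivative of leaky_relu_sq: 2x for x >= 0, otherwise 2 * slope**2 * x."""
    return 2.0 * x if x >= 0 else 2.0 * slope * slope * x

for x in (-2.0, -0.5, 1.0):
    print(x, relu_sq(x), d_relu_sq(x), leaky_relu_sq(x), d_leaky_relu_sq(x))
```

For x = -2, relu² gives output 0 with gradient 0, while LeakyReLU(0.5)² gives output 1.0 with gradient -1.0, so the unit still receives a learning signal. For positive inputs the two activations coincide.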
| 30 | +## Legal TTT Protocol |
| 31 | + |
| 32 | +Backward-looking, score-first test-time training (TTT), following PR #461's framework: |
| 33 | + |
| 34 | +1. Validation tokens are split into 1,893 non-overlapping 32K-token (32,768) chunks |
| 35 | +2. **For each chunk**: |
| 36 | + - **SCORE**: Sliding-window eval under `torch.inference_mode()`: no autograd graph is recorded and the scoring pass performs no weight update |
| 37 | + - **TRAIN**: SGD(lr=0.002, momentum=0.9) on the already-scored chunk: 3 epochs, all blocks unfrozen, cosine LR decay, grad clip 1.0 |
| 38 | +3. The last chunk is scored but never trained on |
| 39 | +4. Chunk N is scored by a model adapted only on chunks 0..N-1 |
| 40 | + |
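| 41 | +`torch.inference_mode()` is a PyTorch context manager that disables gradient tracking, so the scoring pass cannot produce gradients for an optimizer step. It does not by itself block in-place writes to existing parameters; scoring stays stateless because that pass contains no optimizer step or other weight write. |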
| 42 | + |
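The ordering guarantee in steps 1-4 can be sketched without any ML framework. In the toy below, `split_chunks`, `UnigramModel`, and `score_first_ttt` are hypothetical stand-ins, not the submission's code: a count-based unigram model plays the role of the network, "training" means updating counts, and the metric is bits per token rather than bits per byte. It illustrates only the control flow: each chunk is scored before the model ever sees it, and the last chunk is never trained on.

```python
import math
from collections import Counter

def split_chunks(tokens, chunk_size):
    """Non-overlapping chunks; the final chunk may be shorter."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

class UnigramModel:
    """Toy stand-in for the network: add-one smoothed unigram counts."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def score(self, chunk):
        """Total negative log2-likelihood of the chunk; leaves the model unchanged."""
        denom = self.total + self.vocab_size
        return sum(-math.log2((self.counts[t] + 1) / denom) for t in chunk)

    def train(self, chunk):
        """'Adaptation' step: fold the already-scored chunk into the counts."""
        self.counts.update(chunk)
        self.total += len(chunk)

def score_first_ttt(tokens, chunk_size, model):
    chunks = split_chunks(tokens, chunk_size)
    total_bits, trained_on = 0.0, []
    for i, chunk in enumerate(chunks):
        total_bits += model.score(chunk)   # SCORE first, using state from chunks 0..i-1
        if i < len(chunks) - 1:            # the last chunk is scored but never trained on
            model.train(chunk)             # TRAIN only on the chunk just scored
            trained_on.append(i)
    return total_bits / len(tokens), trained_on

tokens = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
bits_per_token, trained_on = score_first_ttt(tokens, chunk_size=4, model=UnigramModel(vocab_size=2))
print(bits_per_token, trained_on)
```

With ten tokens and a chunk size of 4, there are three chunks and only chunks 0 and 1 are trained on, mirroring step 3 of the protocol.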
| 43 | +### TTT Hyperparameters |
| 44 | + |
| 45 | +| Parameter | Value | |
| 46 | +|-----------|-------| |
| 47 | +| Chunk size | 32,768 tokens | |
| 48 | +| Optimizer | SGD + momentum(0.9) | |
| 49 | +| Learning rate | 0.002 (cosine decay across chunks) | |
| 50 | +| Epochs per chunk | 3 | |
| 51 | +| Frozen blocks | None (all blocks adapt) | |
| 52 | +| Gradient clip | 1.0 | |
| 53 | + |
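The table says the learning rate follows a cosine decay across chunks but does not spell out the schedule. One common form, assumed here, scales the base rate by 0.5 * (1 + cos(pi * i / N)) for chunk index i out of N chunks; the submission's exact endpoint handling may differ, and `ttt_lr` is an illustrative name.

```python
import math

def ttt_lr(chunk_idx: int, num_chunks: int, base_lr: float = 0.002) -> float:
    """Assumed schedule: cosine decay of the TTT learning rate over chunk index."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * chunk_idx / num_chunks))

num_chunks = 1893  # chunk count quoted in the protocol above
schedule = [ttt_lr(i, num_chunks) for i in range(num_chunks)]
print(f"first={schedule[0]:.6f} mid={schedule[num_chunks // 2]:.6f} last={schedule[-1]:.2e}")
```

Under this assumption, early chunks adapt at close to the full 0.002 rate and the final chunks at a rate near zero.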
| 54 | +### Timing Budget |
| 55 | + |
| 56 | +| Phase | Time | |
| 57 | +|-------|------| |
| 58 | +| Training | 600s (≤10 min) | |
| 59 | +| Standard eval (int6 roundtrip + sliding window) | ~120s | |
| 60 | +| Legal TTT (score-first sliding + adaptation) | ~410s | |
| 61 | +| **Total eval** | **~530s (< 10 min)** | |
| 62 | + |
| 63 | +## Training Architecture |
| 64 | + |
| 65 | +PR #414 stack with Parameter Banking + Parallel Muon (PR #399): |
| 66 | + |
| 67 | +| Component | Setting | |
| 68 | +|-----------|---------| |
| 69 | +| Layers | 11 (512d, 8H, 4KV) | |
| 70 | +| MLP | 3× with **LeakyReLU(0.5)²** | |
| 71 | +| BigramHash | 1536 | |
| 72 | +| XSA | Last 4 layers | |
| 73 | +| RoPE | Partial (16/64 dims) | |
| 74 | +| LN Scale | 1/√(layer+1) | |
| 75 | +| VE128 | Layers 9-10 | |
| 76 | +| Weight avg | EMA(0.997) + Tight SWA(every 50) | |
| 77 | +| Quantization | GPTQ-lite int6 + lzma | |
| 78 | +| Optimizer | Parameter Banking + Parallel Muon | |
| 79 | + |
| 80 | +### Parameter Banking + Parallel Muon |
| 81 | + |
| 82 | +First introduced in [PR #399](https://github.com/openai/parameter-golf/pull/399): |
| 83 | + |
| 84 | +- 4 contiguous 3D `nn.Parameter` banks replace 66 separate `nn.Linear` weights |
| 85 | +- Batched Newton-Schulz (NS) orthogonalization via `torch.bmm` |
| 86 | +- DDP removed for banks; async reduce-scatter → local NS → async all-gather |
| 87 | +- 83.3ms/step vs ~85ms baseline |
| 88 | + |
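To illustrate what a batched Newton-Schulz orthogonalization computes, the sketch below orthogonalizes every matrix in a small bank with the textbook cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X. It is a simplified stand-in under stated assumptions: it is pure Python, loops over the bank instead of calling `torch.bmm`, and uses the cubic iteration rather than whatever coefficients and step count PR #399 actually uses. All function names are illustrative.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=20):
    """Drive all singular values of a full-rank matrix g toward 1 (cubic iteration)."""
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-7
    x = [[v / norm for v in row] for row in g]   # Frobenius scaling: singular values <= 1
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x

def orthogonalize_bank(bank, steps=20):
    """Apply the same iteration to every matrix in a 3D bank (batch, rows, cols)."""
    return [newton_schulz(g, steps) for g in bank]

bank = [[[2.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]]
ortho = orthogonalize_bank(bank)
```

Because every matrix in a bank has the same shape, the two matrix products per step can be issued as single batched calls over the whole bank, which is the motivation for storing the weights as contiguous 3D parameters.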
| 89 | +## Run Command |
| 90 | + |
| 91 | +```bash |
| 92 | +NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \ |
| 93 | +EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \ |
| 94 | +ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \ |
| 95 | +VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \ |
| 96 | +TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \ |
| 97 | +TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \ |
| 98 | +MUON_WD=0.04 ADAM_WD=0.04 \ |
| 99 | +MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \ |
| 100 | +MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \ |
| 101 | +MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \ |
| 102 | +ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \ |
| 103 | +SEED=1337 \ |
| 104 | +torchrun --standalone --nproc_per_node=8 train_gpt.py |
| 105 | +``` |
| 106 | + |
| 107 | +## Ablation |
| 108 | + |
| 109 | +Incremental contribution of each technique (all seed 1337): |
| 110 | + |
| 111 | +| Change | Pre-TTT bpb | Post-TTT bpb | Delta | |
| 112 | +|--------|-------------|-------------|-------| |
| 113 | +| PR #414 base (relu², BIGRAM=2048) | 1.1234 | — | — | |
| 114 | +| + Parameter Banking + Parallel Muon | 1.1234 | — | ±0.0000 | |
| 115 | +| + Legal TTT (3ep, freeze=2) | — | 1.1217 | -0.0017 | |
| 116 | +| + TTT freeze=0 (all blocks) | — | 1.1213 | -0.0004 | |
| 117 | +| + BigramHash 2048→3072 | — | 1.1204 | -0.0009 | |
| 118 | +| + **LeakyReLU(0.5)²** | 1.1213 | **1.1183** | **-0.0021** | |
| 119 | + |
| 120 | +## Credits |
| 121 | + |
| 122 | +- **LeakyReLU² activation**: PR #493 by @jxnl, PR #518 |
| 123 | +- **Optimizer (Parameter Banking + Parallel Muon)**: [PR #399](https://github.com/openai/parameter-golf/pull/399) by @abaybektursun |
| 124 | +- **TTT recipe**: [PR #461](https://github.com/openai/parameter-golf/pull/461) by @anantdgoel |
| 125 | +- **Base model**: [PR #414](https://github.com/openai/parameter-golf/pull/414) by @signalrush |