# Non-Record: 11L Parallel Muon + LN Scale + LeakyReLU² MLP3x + Legal TTT

**3-seed mean val_bpb: 1.1215** (std=0.0002) | **~15.85 MB** | 8xH100 SXM

## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | EMA bpb | Quantized bpb | **TTT bpb** |
|------|----------|-------|---------|---------------|-------------|
| 1337 | 88.8ms | 6,759 | 1.1161 | 1.1238 | **1.1217** |
| 42 | 88.8ms | 6,757 | 1.1158 | 1.1234 | **1.1213** |
| 2024 | 88.9ms | 6,752 | 1.1160 | 1.1234 | **1.1215** |
| **Mean** | **88.8ms** | **6,756** | **1.1160** | **1.1235** | **1.1215** |

## Architecture (26.8M parameters)

- 11 transformer layers, dim=512, 8 heads / 4 KV heads (GQA)
- **Parallel Muon** with parameter banking (4 contiguous 3D banks) + batched Newton-Schulz
- MLP 3x expansion (hidden=1536) with **LeakyReLU(0.5)²** activation
- **LN Scale** — depth-dependent normalization: 1/sqrt(layer_idx+1)
- **SmearGate** + **BigramHash(1536, dim=128)**
- **Value Residual (ResFormer)** — cache V from layer 0, blend via learned lambda
- **Gated Attention** — per-head sigmoid gate (nn.Linear, bias init 4.0)
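The activation and depth scaling can be sketched in a few lines. This is a minimal reading, not the PR implementation: it assumes LeakyReLU(0.5)² means the elementwise leaky-ReLU output squared (the actual PR may preserve the sign of the negative branch), and that LN Scale multiplies each layer's normalized output by 1/sqrt(layer_idx+1).

```python
import math

def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    # LeakyReLU(0.5) followed by squaring, elementwise.
    # Assumption: squaring makes the negative branch positive; the PR
    # may instead use a sign-preserving variant such as y * |y|.
    y = x if x > 0 else slope * x
    return y * y

def ln_scale(layer_idx: int) -> float:
    # Depth-dependent normalization scale: layer 0 -> 1.0, layer 3 -> 0.5.
    return 1.0 / math.sqrt(layer_idx + 1)
```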

## Training

- **Parallel Muon optimizer**: 3-phase async reduce-scatter -> Adam -> NS5+all-gather
  - lr=0.025, momentum 0.92->0.99/1500 steps, WD=0.04
  - No DDP -- manual gradient sync for non-bank params
- Adam for embeddings (lr=0.035) and scalars (lr=0.025)
- Batch 786,432 tokens, seq_len 2048
- EMA (decay=0.997) + SWA (every 50 steps when scale < 0.2)
- Warmdown 3500 iterations (wallclock-based)
- Late QAT via STE (final 15% of wallclock)
- Gradient clipping 0.3
- torch.compile(fullgraph=True)
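The NS5 step in the optimizer pipeline is the quintic Newton-Schulz orthogonalization from the public Muon implementation. A single-matrix sketch follows; the coefficients are the published Muon defaults, not values read from this PR, and the record's version additionally batches the iteration over the four 3D parameter banks:

```python
import numpy as np

def newton_schulz5(G: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
    # Quintic Newton-Schulz iteration: drives the singular values of G
    # toward 1 (approximate orthogonalization) using only matmuls.
    a, b, c = 3.4445, -4.7750, 2.0315  # published Muon coefficients
    X = G / (np.linalg.norm(G) + eps)  # Frobenius rescale => spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # keep X wide so A = X X^T is the small Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

The batched variant simply runs the same polynomial on a stacked `(banks, m, n)` tensor, which is what makes banking the parameters into contiguous 3D tensors pay off.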

## Quantization

- Int6 uniform per-row with GPTQ-lite (5-percentile clip search per row)
- FP16 passthrough for tied embeddings
- zstd-22 compression
- Unbank -> quantize -> rebank for compatibility with parameter banking
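A baseline sketch of symmetric per-row int6 quantization (levels -31..31). This deliberately omits the GPTQ-lite 5-percentile clip search and just scales by the row maximum, so it is the naive starting point the clip search improves on:

```python
import numpy as np

def quantize_int6_per_row(W: np.ndarray):
    # Symmetric per-row quantization to the int6 range [-31, 31].
    # Baseline: scale = row max; GPTQ-lite would instead search clip
    # thresholds per row to trade clipping error against rounding error.
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(W / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```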

## Legal Score-First TTT (PR #461 / #549 recipe)

Every token is scored BEFORE any weight update:

```
for each 32K-token chunk:
  Phase 1 -- SCORE: sliding window eval (inference_mode, stride=64)
  Phase 2 -- TRAIN: SGD(lr=0.002, momentum=0.9), 3 epochs, all blocks unfrozen, cosine LR
```

TTT improves quantized BPB by ~0.002 (1.1235 -> 1.1215).
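The legality constraint is purely an ordering one: a chunk's loss is recorded with the weights as they were before that chunk contributed any gradient. A minimal sketch of the loop, with hypothetical `score_fn`/`train_fn` callbacks standing in for the sliding-window eval and the SGD phase (the real recipe lives in PR #461):

```python
def ttt_eval(chunks, score_fn, train_fn):
    # Score-first test-time training: each chunk is scored with frozen
    # weights (Phase 1) before the model is adapted on it (Phase 2).
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = score_fn(chunk)   # Phase 1: evaluate, no weight update
        total_loss += loss * n
        total_tokens += n
        train_fn(chunk)             # Phase 2: only now update on the chunk
    return total_loss / total_tokens
```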
58 | 59 |
|
59 | 60 | ## Credits |
60 | 61 |
|
61 | 62 | - Parallel Muon / Parameter Banking: PR #399 by @abaybektursun |
62 | 63 | - LeakyReLU²: PR #493 by @parinzee, PR #518 by @sofiabod |
| 64 | +- LN Scale: PR #315/374 by @jfprincz |
63 | 65 | - TTT recipe: PR #461 by @Christopher-Lee-McClendon (adapted: freeze=0) |
64 | 66 | - Base model stack: PR #414 by @signalrush |