## Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ (val_bpb: 1.0912)

**val_bpb = 1.0912** (3-seed mean, std 0.0009) | **2.5106 nats** | **~15.96 MB** | 8xH100 SXM, 590s train + ~83s eval | No TTT

Built on [PR #1218](https://github.com/openai/parameter-golf/pull/1218) by @clarkkev (4096-Vocab + 4.0-MLP-mult).

Previous: [PR #1019](https://github.com/openai/parameter-golf/pull/1019) (1.1147) -> [PR #1218](https://github.com/openai/parameter-golf/pull/1218) (1.0979) -> [PR #1260](https://github.com/openai/parameter-golf/pull/1260) (1.0929) -> this (1.0912)

### Changes from PR #1218

| | PR #1218 | This |
|---|---|---|
| val_bpb | 1.09785 | **1.09124** |
| Optimizer | Muon | **MuonEq-R** (row-norm before NS5) |
| Depth recurrence | None | **Layers 4,5 repeated** |
| Weight decay | 0.085 | **0.090** |
| Mixed quantization | No | **All int6** (66/66 layers) |
| Everything else | Same | Same |

### Key Innovation: WD-Quantization Synergy

The critical insight: **higher weight decay (0.090 vs 0.085) produces smaller weights that compress ~5% better under brotli-11**, creating enough artifact headroom to keep **ALL 66 layers at int6 precision** (vs 60-61 int6 layers in previous PRs). The extra quantization precision more than recovers the BPB cost of the higher weight decay:

| Config | WD | N_INT6 | Artifact | BPB (seed 42) |
|--------|-----|--------|----------|---------------|
| PR #1260 | 0.085 | 60 | 15,981K | 1.09217 |
| PR #1279 | 0.085 | 61 | 15,997K | 1.09170 |
| **This** | **0.090** | **66** | **15,967K** | **1.09057** |
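
For context on how headroom numbers like these might be checked, below is a minimal sketch of measuring the brotli-11 size of int6-quantized tensors against the 16,000,000-byte cap. It assumes a simple symmetric per-tensor scheme with clip_range=31 and omits GPTQ error compensation and the byte shuffle; the function names are illustrative, not taken from `train_gpt.py`.

```python
# Illustrative only -- not the train_gpt.py implementation.
import brotli
import numpy as np

def int6_quantize(w: np.ndarray, clip_range: int = 31) -> np.ndarray:
    # Symmetric per-tensor quantization to the int6 range [-31, 31].
    scale = np.abs(w).max() / clip_range
    return np.clip(np.round(w / scale), -clip_range, clip_range).astype(np.int8)

def compressed_bytes(tensors) -> int:
    # Concatenate the quantized layers and measure the brotli-11 payload,
    # i.e. the quantity compared against the 16,000,000-byte artifact cap.
    payload = b"".join(int6_quantize(w).tobytes() for w in tensors)
    return len(brotli.compress(payload, quality=11))

rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.02, size=(2048, 512)).astype(np.float32) for _ in range(4)]
print(f"compressed size: {compressed_bytes(layers):,} bytes")
```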

### What's New

1. **WD=0.090** — Increased from 0.085. Higher WD reduces weight magnitudes, improving brotli-11 compression by ~5%. This creates ~280K bytes of artifact headroom (vs a 3K margin at WD=0.085/N61).

2. **All-Int6 GPTQ** — With the compression headroom from WD=0.090, ALL 66 weight layers stay at int6 precision (clip_range=31); no layer needs to be demoted to int5. This is the highest quantization precision the mixed-quant scheme allows for this architecture.

3. **MuonEq-R** — Row-normalizes gradient matrices before Newton-Schulz orthogonalization (see the sketch after this list). Zero-byte cost, ~0.001 BPB improvement.

4. **Depth Recurrence** — Layers 4,5 repeated with fully shared MLP (zero extra params); a forward-pass sketch appears under Architecture below. ~0.003 BPB improvement.
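
A minimal sketch of the row-normalize-then-orthogonalize step named in item 3, assuming the quintic Newton-Schulz coefficients from the public Muon reference; momentum, learning rate, and weight decay handling are omitted, and the names are illustrative rather than taken from `train_gpt.py`.

```python
import torch

def ns5_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G
    # (coefficients from the public Muon reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_eq_r_step(grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # "Eq-R": rescale each row of the gradient to unit norm so every row enters
    # the orthogonalization with equal weight, then apply NS5 as in Muon.
    grad = grad / (grad.norm(dim=1, keepdim=True) + eps)
    return ns5_orthogonalize(grad)
```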

### Carried from PR #1218

- 4096 SentencePiece BPE vocabulary
- 4.0x MLP multiplier with sigmoid-gated activation
- Full Hessian GPTQ quantization
- XSA-all-11 attention
- BigramHash embedding (2816x160)
- Sigmoid-gated skip connections + soft-round QAT
- Split-LR training
- Brotli-11 compression with byte shuffle
- EMA (decay 0.997; sketched below)
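
For reference, a minimal sketch of the carried-over EMA update with decay 0.997, applied to a shadow copy of the model; this is the standard formulation, not code lifted from `train_gpt.py`.

```python
import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.997) -> None:
    # Shadow weights track the training weights: ema <- decay * ema + (1 - decay) * current.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)
```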

### Configuration

```bash
NCCL_NET=Socket \
DATA_DIR=./data \
SEED=42 \
MIXED_QUANT=1 \
N_INT6_LAYERS=66 \
MUON_WD=0.090 \
EMBED_WD=0.090 \
RECUR_LAYERS=4,5 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, no TTT)

### Core Results

| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|-------------|-----------------|------------------|
| 42 | 5,540 | 106.5 | 1.0990 | 1.0906 | 2.50910 | 15,967,483 |
| 0 | 5,536 | 106.6 | 1.0992 | 1.0908 | 2.50973 | 15,962,242 |
| 1337 | 5,538 | 106.6 | 1.0998 | 1.0923 | 2.51309 | 15,959,253 |
| **Mean** | **5,538** | **106.6** | **1.0993** | **1.0912** | **2.51064** | **15,962,993** |

### Supplemental Diagnostics

| Seed | Post-EMA BPB | Roundtrip BPB | Sliding BPB | val_loss (nats) | Code size (bytes) | Total submission (bytes) | Train time | Eval time |
|------|--------------|---------------|-------------|-----------------|-------------------|--------------------------|------------|-----------|
| 42 | 1.0990 | 1.1081 | 1.0906 | 2.50910 | 21,396 | 15,967,483 | 590s | 83s |
| 0 | 1.0992 | 1.1082 | 1.0908 | 2.50973 | 21,396 | 15,962,242 | 590s | 83s |
| 1337 | 1.0998 | 1.1101 | 1.0923 | 2.51309 | 21,396 | 15,959,253 | 590s | 83s |
| **Mean** | **1.0993** | **1.1088** | **1.0912** | **2.51064** | **21,396** | **15,962,993** | **590s** | **83s** |

### Rule Compliance

- No TTT (no test-time training or adaptation)
- No SLOT (no scored-position lookup table)
- No validation data during training
- No training data during evaluation
- Artifact < 16,000,000 bytes for ALL seeds (max: 15,967,483; min margin: 32,517)
- Train < 600s on 8xH100 SXM (590s)
- Eval < 600s on 8xH100 SXM (~83s)

### Architecture

- 11 layers + 2 virtual (depth recurrence on layers 4,5; see the sketch after this list)
- d_model = 512, MLP 4x (2048), 8 heads, 4 KV heads
- 4096 SentencePiece BPE vocabulary
- BigramHash(2816x160) token embedding
- Sigmoid-gated skip connections with soft-round QAT
- MuonEq-R optimizer with row normalization
- Full Hessian GPTQ — all 66 layers at int6 precision
- Weight decay 0.090 (muon + embed)
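
A minimal sketch of one plausible reading of the depth recurrence: the blocks at the indices given by `RECUR_LAYERS` are applied a second time with the same weights, so the 11-block stack yields 13 layer applications with zero extra parameters. Whether the repeat is per-block or over the 4-5 pair is an implementation detail of `train_gpt.py` not specified here; the class and argument names are illustrative.

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    # 11 parameter-carrying blocks; the blocks listed in recur_layers run twice,
    # giving the "+2 virtual" layers mentioned above at zero extra parameters.
    def __init__(self, blocks: nn.ModuleList, recur_layers=(4, 5)):
        super().__init__()
        self.blocks = blocks
        self.recur_layers = set(recur_layers)

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i in self.recur_layers:
                x = block(x)  # second pass reuses the same weights
        return x
```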

### Run Command (3-seed loop)

```bash
for SEED in 42 0 1337; do
  NCCL_NET=Socket \
  DATA_DIR=./data \
  SEED=$SEED \
  MIXED_QUANT=1 \
  N_INT6_LAYERS=66 \
  MUON_WD=0.090 \
  EMBED_WD=0.090 \
  RECUR_LAYERS=4,5 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py \
  2>&1 | tee train_seed${SEED}.log
done
```

### Lineage

PR #1019 (1.1147) -> PR #1218 (1.0979) -> PR #1260 (1.0929) -> this (1.0912)

### Credits

- @clarkkev for PR #1218 (4096-Vocab + high-WD architecture — the WD insight)
- @abaybektursun for PR #1019 (GPTQ + XSA + BigramHash baseline)
- @msisovic for PR #1204 (depth recurrence concept)
- @dexhunter for PR #1260 (MuonEq-R + recurrence + mixed quant)

### Included Files

- `train_gpt.py` — full training + quantization + evaluation script (21,396 bytes, self-extracting)
- `train_seed42.log`, `train_seed0.log`, `train_seed1337.log` — all seed logs
- `submission.json` — leaderboard metadata