Commit c2a100b

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 — val_bpb 1.0912 (3-seed mean)
WD-quantization synergy: higher weight decay (0.090 vs 0.085) yields weights that compress ~5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPB cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337). All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
## Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ (val_bpb: 1.0912)

**val_bpb = 1.0912** (3-seed mean, std 0.0009) | **2.5106 nats** | **~15.96 MB** | 8xH100 SXM, 590s train + ~76s eval | No TTT

Built on [PR #1218](https://github.com/openai/parameter-golf/pull/1218) by @clarkkev (4096-Vocab + 4.0-MLP-mult).

Previous: [PR #1019](https://github.com/openai/parameter-golf/pull/1019) (1.1147) -> [PR #1218](https://github.com/openai/parameter-golf/pull/1218) (1.0979) -> [PR #1260](https://github.com/openai/parameter-golf/pull/1260) (1.0929) -> this (1.0912)

### Changes from PR #1218

| | PR #1218 | This |
|---|---|---|
| val_bpb | 1.09785 | **1.09124** |
| Optimizer | Muon | **MuonEq-R** (row-norm before NS5) |
| Depth recurrence | None | **Layers 4,5 repeated** |
| Weight decay | 0.085 | **0.090** |
| Mixed quantization | No | **All int6** (66/66 layers) |
| Everything else | Same | Same |

### Key Innovation: WD-Quantization Synergy

The critical insight: **higher weight decay (0.090 vs 0.085) produces smaller weights that compress 5% better under brotli-11**, creating enough artifact headroom to keep **ALL 66 layers at int6 precision** (vs 60-61 int6 in previous PRs). The extra quantization precision more than recovers the BPB cost of higher weight decay (a toy sketch of the compression effect follows the table):

| Config | WD | N_INT6 | Artifact | BPB (seed 42) |
|--------|-----|--------|----------|---------------|
| PR #1260 | 0.085 | 60 | 15,981K | 1.09217 |
| PR #1279 | 0.085 | 61 | 15,997K | 1.09170 |
| **This** | **0.090** | **66** | **15,967K** | **1.09057** |
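
A minimal, self-contained sketch of this mechanism (not the code path in `train_gpt.py`): with the int6 grid and clip_range=31 held fixed, a weight tensor with slightly smaller magnitudes quantizes to a lower-entropy set of integers, and brotli-11 packs it into fewer bytes. The shape, the standard deviations standing in for the two WD settings, and the fixed quantization step are illustrative assumptions.

```python
# Toy illustration of the WD-compression effect (requires `pip install brotli numpy`).
import brotli
import numpy as np

rng = np.random.default_rng(0)
SHAPE = (2048, 512)          # roughly one MLP matrix at d_model=512, 4.0x mult
SCALE = 1.0 / 31             # assumed fixed int6 step; clip_range = 31

def int6_brotli_bytes(std: float) -> int:
    """Draw Gaussian stand-in weights, quantize to int6, return brotli-11 size."""
    w = rng.normal(0.0, std, SHAPE)
    q = np.clip(np.round(w / SCALE), -31, 31).astype(np.int8)
    return len(brotli.compress(q.tobytes(), quality=11))

# Stand-ins: higher weight decay -> slightly smaller weight magnitudes.
print("WD=0.085-like:", int6_brotli_bytes(std=0.40))
print("WD=0.090-like:", int6_brotli_bytes(std=0.38))   # compresses to fewer bytes
```

Only the direction of the effect is meant to carry over; the ~5% figure above comes from the real checkpoints, where the freed bytes pay for promoting the remaining layers from int5 to int6.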

### What's New

1. **WD=0.090** — Increased from 0.085. Higher WD reduces weight magnitudes, improving brotli-11 compression by ~5%. This creates ~280K bytes of artifact headroom (vs a 3K margin at WD=0.085/N61).

2. **All-Int6 GPTQ** — With the compression headroom from WD=0.090, we can keep ALL 66 weight layers at int6 precision (clip_range=31). No layers need to be demoted to int5. This is the maximum quantization quality available under the mixed int5/int6 scheme.

3. **MuonEq-R** — Row-normalizes gradient matrices before Newton-Schulz orthogonalization (sketched after this list). Zero-byte cost, ~0.001 BPB improvement.

4. **Depth Recurrence** — Layers 4,5 repeated with a fully shared MLP (zero extra params). ~0.003 BPB improvement.
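
A minimal sketch of the MuonEq-R step from item 3 above, assuming the publicly documented quintic Newton-Schulz coefficients from Muon; the exact scaling and placement of the row normalization in this PR may differ.

```python
# Sketch of "row-norm before NS5" (assumed details; not the PR's exact optimizer).
import torch

def newton_schulz5(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz iteration (public Muon coefficients) that
    approximately orthogonalizes a 2-D gradient matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x

def muon_eq_r_direction(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Equalize gradient rows to unit L2 norm ("Eq-R"), then orthogonalize."""
    g = grad / (grad.norm(dim=1, keepdim=True) + eps)   # row normalization
    return newton_schulz5(g)

# Example: update direction for one 2048x512 weight's gradient.
direction = muon_eq_r_direction(torch.randn(2048, 512))
```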

### Carried from PR #1218

- 4096 SentencePiece BPE vocabulary
- 4.0x MLP multiplier with sigmoid-gated activation
- Full Hessian GPTQ quantization
- XSA-all-11 attention
- BigramHash embedding (2816x160)
- Sigmoid-gated skip connections + soft-round QAT
- Split-LR training
- Brotli-11 compression with byte shuffle (sketched after this list)
- EMA (decay 0.997)
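
The byte-shuffle item above admits a compact sketch. A blosc-style shuffle is assumed here (the artifact's exact transform is not shown in this diff): bytes are regrouped by byte position so the highly repetitive high-order bytes sit next to each other before brotli-11, which typically helps on payloads such as quantization scales.

```python
# Generic blosc-style byte shuffle ahead of brotli (an assumed reading of
# "byte shuffle"; the submission's actual transform may differ).
import brotli
import numpy as np

def byte_shuffle(buf: bytes, itemsize: int) -> bytes:
    """Regroup byte k of every item together (item-major -> byte-position-major)."""
    return np.frombuffer(buf, dtype=np.uint8).reshape(-1, itemsize).T.tobytes()

def byte_unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Inverse of byte_shuffle."""
    return np.frombuffer(buf, dtype=np.uint8).reshape(itemsize, -1).T.tobytes()

# Illustrative payload: float16 quantization scales clustered around one value.
scales = np.random.default_rng(0).normal(0.01, 0.001, 65536).astype(np.float16)
raw = scales.tobytes()
assert byte_unshuffle(byte_shuffle(raw, 2), 2) == raw   # lossless round-trip
print("plain   :", len(brotli.compress(raw, quality=11)))
print("shuffled:", len(brotli.compress(byte_shuffle(raw, 2), quality=11)))
```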

### Configuration

```bash
NCCL_NET=Socket \
DATA_DIR=./data \
SEED=42 \
MIXED_QUANT=1 \
N_INT6_LAYERS=66 \
MUON_WD=0.090 \
EMBED_WD=0.090 \
RECUR_LAYERS=4,5 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, no TTT)

### Core Results

| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|-------------|-----------------|------------------|
| 42 | 5,540 | 106.5 | 1.0990 | 1.0906 | 2.50910 | 15,967,483 |
| 0 | 5,536 | 106.6 | 1.0992 | 1.0908 | 2.50973 | 15,962,242 |
| 1337 | 5,538 | 106.6 | 1.0998 | 1.0923 | 2.51309 | 15,959,253 |
| **Mean** | **5,538** | **106.6** | **1.0993** | **1.0912** | **2.51064** | **15,962,993** |
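
For reference, the headline figures follow directly from the table above: the scored metric is the mean of the Sliding BPB column, and the size rule is checked against the Artifact column.

```python
# Recompute the headline figures from the per-seed rows above.
sliding_bpb = {42: 1.0906, 0: 1.0908, 1337: 1.0923}
artifact_bytes = {42: 15_967_483, 0: 15_962_242, 1337: 15_959_253}

mean_bpb = sum(sliding_bpb.values()) / len(sliding_bpb)
print(f"3-seed mean: {mean_bpb:.4f} BPB")                                        # 1.0912
print(f"min 16MB margin: {16_000_000 - max(artifact_bytes.values()):,} bytes")   # 32,517
```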

### Supplemental Diagnostics

| Seed | Post-EMA BPB | Roundtrip BPB | Sliding BPB | val_loss (nats) | Code size (bytes) | Total submission (bytes) | Train time | Eval time |
|------|--------------|---------------|-------------|-----------------|-------------------|--------------------------|------------|-----------|
| 42 | 1.0990 | 1.1081 | 1.0906 | 2.50910 | 21,396 | 15,967,483 | 590s | 83s |
| 0 | 1.0992 | 1.1082 | 1.0908 | 2.50973 | 21,396 | 15,962,242 | 590s | 83s |
| 1337 | 1.0998 | 1.1101 | 1.0923 | 2.51309 | 21,396 | 15,959,253 | 590s | 83s |
| **Mean** | **1.0993** | **1.1088** | **1.0912** | **2.51064** | **21,396** | **15,962,993** | **590s** | **83s** |

### Rule Compliance

- No TTT (no test-time training or adaptation)
- No SLOT (no scored-position lookup table)
- No validation data during training
- No training data during evaluation
- Artifact < 16,000,000 bytes for ALL seeds (max: 15,967,483, min margin: 32,517)
- Train < 600s on 8xH100 SXM (590s)
- Eval < 600s on 8xH100 SXM (~83s)

### Architecture

- 11 layers + 2 virtual (depth recurrence on layers 4,5; sketched after this list)
- d_model = 512, MLP 4x (2048), 8 heads, 4 KV heads
- 4096 SentencePiece BPE vocabulary
- BigramHash(2816x160) token embedding
- Sigmoid-gated skip connections with soft-round QAT
- MuonEq-R optimizer with row normalization
- Full Hessian GPTQ — all 66 layers at int6 precision
- Weight decay 0.090 (muon + embed)
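
A minimal sketch of the depth recurrence entry above, assuming blocks 4 and 5 are run a second time immediately after their first pass (the PR states only that layers 4 and 5 are repeated with shared weights); the blocks themselves are stand-ins for the real attention/MLP blocks.

```python
# Depth recurrence sketch: 11 physical blocks, 13 executed, zero extra parameters.
# The repetition schedule and the stand-in blocks are assumptions for illustration.
import torch
import torch.nn as nn

class RecurrentDepthStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList, recur_layers=(4, 5)):
        super().__init__()
        self.blocks = blocks
        order = list(range(len(blocks)))
        insert_at = max(recur_layers) + 1
        # Execution order 0,1,2,3,4,5,4,5,6,...,10: blocks 4 and 5 reuse the same weights.
        self.schedule = order[:insert_at] + list(recur_layers) + order[insert_at:]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for idx in self.schedule:
            x = self.blocks[idx](x)
        return x

# 11 stand-in blocks at d_model=512; the parameter count stays that of 11 layers.
stack = RecurrentDepthStack(nn.ModuleList(nn.Linear(512, 512) for _ in range(11)))
out = stack(torch.randn(8, 512))
```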

### Run Command (3-seed loop)

```bash
for SEED in 42 0 1337; do
NCCL_NET=Socket \
DATA_DIR=./data \
SEED=$SEED \
MIXED_QUANT=1 \
N_INT6_LAYERS=66 \
MUON_WD=0.090 \
EMBED_WD=0.090 \
RECUR_LAYERS=4,5 \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
2>&1 | tee train_seed${SEED}.log
done
```

### Lineage

PR #1019 (1.1147) -> PR #1218 (1.0979) -> PR #1260 (1.0929) -> this (1.0912)

### Credits

- @clarkkev for PR #1218 (4096-Vocab + high-WD architecture — the WD insight)
- @abaybektursun for PR #1019 (GPTQ + XSA + BigramHash baseline)
- @msisovic for PR #1204 (depth recurrence concept)
- @dexhunter for PR #1260 (MuonEq-R + recurrence + mixed quant)

### Included Files

- `train_gpt.py` — full training + quantization + evaluation script (21,396 bytes, self-extracting)
- `train_seed42.log`, `train_seed0.log`, `train_seed1337.log` — all seed logs
- `submission.json` — leaderboard metadata

`submission.json`:

{
  "name": "Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ",
  "val_bpb": 1.0912,
  "bytes_total": 15967483,
  "blurb": "WD-quantization synergy: higher weight decay (0.090) improves compression enough to keep ALL 66 layers at int6. Combined with MuonEq-R and depth recurrence. 3-seed mean 1.0912 BPB / 2.5106 nats. No TTT, no SLOT.",
  "author": "dexhunter",
  "github_id": "dexhunter",
  "date": "2026-04-03",
  "pre_quant_val_bpb": 1.0993,
  "bytes_model_compressed": 15946087,
  "bytes_code": 21396,
  "base_pr": 1218,
  "seeds": [42, 0, 1337],
  "seed_scores": [1.09057, 1.09084, 1.09230],
  "eval_time_seconds": [83, 83, 83]
}
