
Commit 597924c

Add Combined Int6 + QAT + Sliding Window submission
Combines best techniques from WarmdownQuantization (openai#1) and SlidingWindow (openai#2):

- Int6 quant, FP16 tied embeddings, Late-K passthrough
- Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix
- Muon decoupled weight decay, AdamW for embeddings/scalars
- Novel: QAT with STE in last 30% of training for near-zero quant penalty
- Cosine warmdown schedule, higher Muon momentum warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 45bbccf commit 597924c

3 files changed

Lines changed: 1322 additions & 0 deletions


Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# Combined Int6 + QAT + Sliding Window

## Strategy

Combine the best techniques from the top two submissions, plus a novel quantization-aware training (QAT) stage.
## Techniques from WarmdownQuantization (#1, 1.1574 bpb)

- **Int6 quantization** ([-31, 31] range; compresses better under zlib than int8 does, see the sketch after this list)
- **FP16 tied embeddings** (avoids quantization error compounding on the shared input/output matrix)
- **Late-K passthrough** (last 2 layers' key weights kept in fp16)
- **Aggressive warmdown** (WARMDOWN_ITERS=20000, so the entire run is LR decay)
- **Higher LRs** (MATRIX_LR=0.06, SCALAR_LR=0.06)
- **Grad clipping** (GRAD_CLIP_NORM=1.0)
- **9 layers, 3x MLP** (hidden=1536)
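A minimal sketch of the int6 scheme described above, assuming symmetric per-tensor absmax scaling; the function names and the per-tensor granularity are illustrative, not code from train_gpt.py:

```python
# Hypothetical sketch of symmetric int6 quantization. Per-tensor absmax
# scaling is an assumption; the actual script may quantize differently.
import torch

def quantize_int6(w: torch.Tensor) -> tuple[torch.Tensor, float]:
    """Round weights to integers in [-31, 31] plus one fp scale."""
    scale = w.abs().max().clamp_min(1e-8).item() / 31.0
    q = torch.clamp(torch.round(w / scale), -31, 31).to(torch.int8)
    return q, scale

def dequantize_int6(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.float() * scale
```

Since only 63 of the 256 possible byte values ever occur in the stored tensors, the serialized weights have lower byte entropy than a full int8 range, which is presumably why they compress better under zlib.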
## Techniques from SlidingWindow (#2, 1.1748 bpb)

- **Batched sliding window eval** (stride=64, compiled forward_logits, batch_size=256; see the first sketch after this list)
- **Overtone spectral embedding init** (SVD-based power-law spectrum shaping; second sketch below)
- **Phase-transition resid_mix init** (sigmoid-scheduled)
- **Muon decoupled weight decay** (0.02 * lr applied after each step; third sketch below)
- **AdamW** for embeddings and scalar params (weight_decay=0.01)
- **Higher tied embed LR** (0.10 vs 0.07)
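A minimal sketch of the batched sliding-window evaluation, assuming byte-level tokens (so bits per token equals bits per byte), a `forward_logits(x)` method returning `[B, T, V]` logits, and that the tail shorter than one window is dropped; the helper name and loop shape are illustrative:

```python
# Sketch of batched sliding-window eval: every token is scored with up to
# EVAL_SEQ_LEN context, advancing EVAL_STRIDE tokens per window. Assumes
# `tokens` is a 1-D LongTensor and model.forward_logits exists as above.
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens, seq_len=1024, stride=64, batch_size=256):
    device = next(model.parameters()).device
    starts = list(range(0, len(tokens) - seq_len, stride))
    total_nll, total_tok = 0.0, 0
    for i in range(0, len(starts), batch_size):
        batch = starts[i:i + batch_size]
        x = torch.stack([tokens[s:s + seq_len] for s in batch]).to(device)
        y = torch.stack([tokens[s + 1:s + seq_len + 1] for s in batch]).to(device)
        logp = torch.log_softmax(model.forward_logits(x).float(), dim=-1)
        nll = -logp.gather(-1, y.unsqueeze(-1)).squeeze(-1)   # [B, seq_len]
        for row, s in zip(nll, batch):
            keep = 0 if s == 0 else seq_len - stride  # only fresh tokens count
            total_nll += row[keep:].sum().item()
            total_tok += row[keep:].numel()
    return total_nll / total_tok / math.log(2)  # nats per token -> bits per byte
```

Each window rescores `seq_len` positions, but only the final `stride` of them are new; batching the windows is what keeps such a small stride affordable.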
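The overtone init is described only as SVD power-law spectrum shaping; the sketch below is one plausible reading, with the exponent `alpha`, the `s[0]` scale anchor, and the function name all assumptions:

```python
# Speculative sketch of the overtone embedding init: take the SVD of a
# Gaussian init and reshape its singular values to a power law.
import torch

def overtone_init(vocab_size: int, dim: int, alpha: float = 1.0) -> torch.Tensor:
    w = torch.randn(vocab_size, dim)
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    ranks = torch.arange(1, s.numel() + 1, dtype=s.dtype)
    s_shaped = s[0] * ranks.pow(-alpha)      # enforce sigma_k ~ k^(-alpha)
    return (u * s_shaped) @ vh               # same as u @ diag(s_shaped) @ vh
```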
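Muon's decoupled weight decay is the simplest piece: after each optimizer step, every Muon-managed matrix is shrunk multiplicatively, outside the gradient path. The function below is a sketch; the real hook presumably lives inside the training loop:

```python
# Sketch of decoupled weight decay after each Muon step: an AdamW-style
# multiplicative shrink of wd * lr, applied independently of the gradient.
import torch

def muon_decoupled_wd(params, lr: float, wd: float = 0.02) -> None:
    with torch.no_grad():
        for p in params:
            p.mul_(1.0 - wd * lr)
```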
## Novel Contributions

1. **Quantization-Aware Training (QAT)** with a straight-through estimator (STE); see the first sketch after this list:
   - In the last 30% of training, inject an int6 quantization simulation into the forward pass
   - The model learns to be robust to int6 rounding, reducing the post-quantization penalty to near zero
   - Applied only to large weight matrices (>65K params), not to small control tensors
2. **Cosine warmdown** instead of linear (smoother LR decay, better final weights; second sketch below)
3. **Higher Muon momentum warmup** (700 steps vs 500) for stability with the higher LRs
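A minimal sketch of the QAT mechanism under stated assumptions: `fake_quant_int6`, `maybe_qat`, and the 65,536-element threshold are illustrative names and values (the text above says only ">65K params"):

```python
# Sketch of QAT with a straight-through estimator: the forward pass sees
# int6-rounded weights, the backward pass treats rounding as identity.
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    scale = w.detach().abs().max().clamp_min(1e-8) / 31.0
    w_q = torch.clamp(torch.round(w / scale), -31, 31) * scale
    return w + (w_q - w).detach()   # STE: value of w_q, gradient of w

def maybe_qat(w: torch.Tensor, step: int, total_steps: int,
              start_frac: float = 0.70, min_numel: int = 65_536) -> torch.Tensor:
    # Only in the last (1 - start_frac) of training, only for big matrices.
    if step >= int(start_frac * total_steps) and w.numel() > min_numel:
        return fake_quant_int6(w)
    return w
```

Because the STE makes rounding invisible to the backward pass, the optimizer keeps full-precision master weights while the loss is computed on weights the final int6 checkpoint can actually represent.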
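And a sketch of the cosine warmdown, assuming it spans the same window as the linear schedule it replaces (WARMDOWN_ITERS=20000, i.e. effectively the whole run); the function shape is an assumption:

```python
# Sketch of cosine warmdown: LR falls from base_lr to 0 over the run,
# following half a cosine instead of a straight line.
import math

def cosine_warmdown_lr(step: int, base_lr: float = 0.06,
                       warmdown_iters: int = 20_000) -> float:
    frac = min(step / warmdown_iters, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```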
## Reproduction

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are set as defaults in the script. Override them via environment variables if needed.
## Expected Results

Target: ~1.150-1.155 bpb (an improvement over the 1.1574 baseline)
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
1+
{
2+
"name": "Combined_Int6_QAT_SlidingWindow",
3+
"author": "DizSi",
4+
"date": "2026-03-20",
5+
"track": "10min_16mb",
6+
"description": "Combines Int6 quantization + FP16 tied embeddings + Late-K passthrough from WarmdownQuantization with batched sliding window eval + overtone init + phase-transition resid_mix + Muon weight decay + AdamW from SlidingWindow. Novel addition: QAT (quantization-aware training) with STE in last 30% of steps + cosine warmdown schedule.",
7+
"base_submissions": [
8+
"2026-03-19_WarmdownQuantization",
9+
"2026-03-19_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit"
10+
],
11+
"env": {
12+
"WARMDOWN_ITERS": 20000,
13+
"MATRIX_LR": 0.06,
14+
"TIED_EMBED_LR": 0.10,
15+
"SCALAR_LR": 0.06,
16+
"GRAD_CLIP_NORM": 1.0,
17+
"MUON_BACKEND_STEPS": 5,
18+
"MUON_MOMENTUM_WARMUP_STEPS": 700,
19+
"NUM_LAYERS": 9,
20+
"MLP_HIDDEN": 1536,
21+
"EVAL_SEQ_LEN": 1024,
22+
"EVAL_STRIDE": 64,
23+
"QAT_START_FRAC": 0.70
24+
}
25+
}
