
Commit 597924c

Add Combined Int6 + QAT + Sliding Window submission
Combines best techniques from WarmdownQuantization (openai#1) and SlidingWindow (openai#2):

- Int6 quant, FP16 tied embeddings, Late-K passthrough
- Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix
- Muon decoupled weight decay, AdamW for embeddings/scalars
- Novel: QAT with STE in last 30% of training for near-zero quant penalty
- Cosine warmdown schedule, higher Muon momentum warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 45bbccf commit 597924c

3 files changed

Lines changed: 1322 additions & 0 deletions


Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# Combined Int6 + QAT + Sliding Window

## Strategy

Combine the best techniques from the top two submissions, plus a novel quantization-aware training (QAT) stage.
## Techniques from WarmdownQuantization (#1, 1.1574 bpb)

- **Int6 quantization** ([-31, 31] range; compresses better under zlib than int8 does, see the sketch after this list)
- **FP16 tied embeddings** (avoids quantization error compounding on the shared input/output matrix)
- **Late-K passthrough** (last 2 layers' key weights kept in fp16)
- **Aggressive warmdown** (WARMDOWN_ITERS=20000, so the entire run is LR decay)
- **Higher LRs** (MATRIX_LR=0.06, SCALAR_LR=0.06)
- **Grad clipping** (GRAD_CLIP_NORM=1.0)
- **9 layers, 3x MLP** (hidden=1536)
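A minimal sketch of the int6 scheme described above, assuming symmetric per-tensor absmax scaling; the function names and the per-tensor granularity are illustrative, not code from train_gpt.py:

```python
# Hypothetical sketch of symmetric int6 quantization. Per-tensor absmax
# scaling is an assumption; the actual script may quantize differently.
import torch

def quantize_int6(w: torch.Tensor) -> tuple[torch.Tensor, float]:
    """Round weights to integers in [-31, 31] plus one fp scale."""
    scale = w.abs().max().clamp_min(1e-8).item() / 31.0
    q = torch.clamp(torch.round(w / scale), -31, 31).to(torch.int8)
    return q, scale

def dequantize_int6(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.float() * scale
```

Since only 63 of the 256 possible byte values ever occur in the stored tensors, the serialized weights have lower byte entropy than a full int8 range, which is presumably why they compress better under zlib.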
## Techniques from SlidingWindow (#2, 1.1748 bpb)

- **Batched sliding window eval** (stride=64, compiled forward_logits, batch_size=256; see the first sketch after this list)
- **Overtone spectral embedding init** (SVD-based power-law spectrum shaping; second sketch below)
- **Phase-transition resid_mix init** (sigmoid-scheduled)
- **Muon decoupled weight decay** (0.02 * lr applied after each step; third sketch below)
- **AdamW** for embeddings and scalar params (weight_decay=0.01)
- **Higher tied embed LR** (0.10 vs 0.07)
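A minimal sketch of the batched sliding-window evaluation, assuming byte-level tokens (so bits per token equals bits per byte), a `forward_logits(x)` method returning `[B, T, V]` logits, and that the tail shorter than one window is dropped; the helper name and loop shape are illustrative:

```python
# Sketch of batched sliding-window eval: every token is scored with up to
# EVAL_SEQ_LEN context, advancing EVAL_STRIDE tokens per window. Assumes
# `tokens` is a 1-D LongTensor and model.forward_logits exists as above.
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens, seq_len=1024, stride=64, batch_size=256):
    device = next(model.parameters()).device
    starts = list(range(0, len(tokens) - seq_len, stride))
    total_nll, total_tok = 0.0, 0
    for i in range(0, len(starts), batch_size):
        batch = starts[i:i + batch_size]
        x = torch.stack([tokens[s:s + seq_len] for s in batch]).to(device)
        y = torch.stack([tokens[s + 1:s + seq_len + 1] for s in batch]).to(device)
        logp = torch.log_softmax(model.forward_logits(x).float(), dim=-1)
        nll = -logp.gather(-1, y.unsqueeze(-1)).squeeze(-1)   # [B, seq_len]
        for row, s in zip(nll, batch):
            keep = 0 if s == 0 else seq_len - stride  # only fresh tokens count
            total_nll += row[keep:].sum().item()
            total_tok += row[keep:].numel()
    return total_nll / total_tok / math.log(2)  # nats per token -> bits per byte
```

Each window rescores `seq_len` positions, but only the final `stride` of them are new; batching the windows is what keeps such a small stride affordable.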
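The overtone init is described only as SVD power-law spectrum shaping; the sketch below is one plausible reading, with the exponent `alpha`, the `s[0]` scale anchor, and the function name all assumptions:

```python
# Speculative sketch of the overtone embedding init: take the SVD of a
# Gaussian init and reshape its singular values to a power law.
import torch

def overtone_init(vocab_size: int, dim: int, alpha: float = 1.0) -> torch.Tensor:
    w = torch.randn(vocab_size, dim)
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    ranks = torch.arange(1, s.numel() + 1, dtype=s.dtype)
    s_shaped = s[0] * ranks.pow(-alpha)      # enforce sigma_k ~ k^(-alpha)
    return (u * s_shaped) @ vh               # same as u @ diag(s_shaped) @ vh
```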
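Muon's decoupled weight decay is the simplest piece: after each optimizer step, every Muon-managed matrix is shrunk multiplicatively, outside the gradient path. The function below is a sketch; the real hook presumably lives inside the training loop:

```python
# Sketch of decoupled weight decay after each Muon step: an AdamW-style
# multiplicative shrink of wd * lr, applied independently of the gradient.
import torch

def muon_decoupled_wd(params, lr: float, wd: float = 0.02) -> None:
    with torch.no_grad():
        for p in params:
            p.mul_(1.0 - wd * lr)
```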
## Novel Contributions

1. **Quantization-Aware Training (QAT)** with a straight-through estimator (STE); see the first sketch after this list:
   - In the last 30% of training, inject an int6 quantization simulation into the forward pass
   - The model learns to be robust to int6 rounding, reducing the post-quantization penalty to near zero
   - Applied only to large weight matrices (>65K params), not to small control tensors
2. **Cosine warmdown** instead of linear (smoother LR decay, better final weights; second sketch below)
3. **Higher Muon momentum warmup** (700 steps vs 500) for stability with the higher LRs
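A minimal sketch of the QAT mechanism under stated assumptions: `fake_quant_int6`, `maybe_qat`, and the 65,536-element threshold are illustrative names and values (the text above says only ">65K params"):

```python
# Sketch of QAT with a straight-through estimator: the forward pass sees
# int6-rounded weights, the backward pass treats rounding as identity.
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    scale = w.detach().abs().max().clamp_min(1e-8) / 31.0
    w_q = torch.clamp(torch.round(w / scale), -31, 31) * scale
    return w + (w_q - w).detach()   # STE: value of w_q, gradient of w

def maybe_qat(w: torch.Tensor, step: int, total_steps: int,
              start_frac: float = 0.70, min_numel: int = 65_536) -> torch.Tensor:
    # Only in the last (1 - start_frac) of training, only for big matrices.
    if step >= int(start_frac * total_steps) and w.numel() > min_numel:
        return fake_quant_int6(w)
    return w
```

Because the STE makes rounding invisible to the backward pass, the optimizer keeps full-precision master weights while the loss is computed on weights the final int6 checkpoint can actually represent.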
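And a sketch of the cosine warmdown, assuming it spans the same window as the linear schedule it replaces (WARMDOWN_ITERS=20000, i.e. effectively the whole run); the function shape is an assumption:

```python
# Sketch of cosine warmdown: LR falls from base_lr to 0 over the run,
# following half a cosine instead of a straight line.
import math

def cosine_warmdown_lr(step: int, base_lr: float = 0.06,
                       warmdown_iters: int = 20_000) -> float:
    frac = min(step / warmdown_iters, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```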
## Reproduction

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are set as defaults in the script. Override them via environment variables if needed.
## Expected Results

Target: ~1.150-1.155 bpb (an improvement over the 1.1574 baseline)
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
1+
{
2+
"name": "Combined_Int6_QAT_SlidingWindow",
3+
"author": "DizSi",
4+
"date": "2026-03-20",
5+
"track": "10min_16mb",
6+
"description": "Combines Int6 quantization + FP16 tied embeddings + Late-K passthrough from WarmdownQuantization with batched sliding window eval + overtone init + phase-transition resid_mix + Muon weight decay + AdamW from SlidingWindow. Novel addition: QAT (quantization-aware training) with STE in last 30% of steps + cosine warmdown schedule.",
7+
"base_submissions": [
8+
"2026-03-19_WarmdownQuantization",
9+
"2026-03-19_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit"
10+
],
11+
"env": {
12+
"WARMDOWN_ITERS": 20000,
13+
"MATRIX_LR": 0.06,
14+
"TIED_EMBED_LR": 0.10,
15+
"SCALAR_LR": 0.06,
16+
"GRAD_CLIP_NORM": 1.0,
17+
"MUON_BACKEND_STEPS": 5,
18+
"MUON_MOMENTUM_WARMUP_STEPS": 700,
19+
"NUM_LAYERS": 9,
20+
"MLP_HIDDEN": 1536,
21+
"EVAL_SEQ_LEN": 1024,
22+
"EVAL_STRIDE": 64,
23+
"QAT_START_FRAC": 0.70
24+
}
25+
}
