
Commit a4a7f76

exp_074_prequant_ttt: Pre-quant AdamW TTT (ready, untested)
Runs AdamW TTT on the full-precision EMA model BEFORE GPTQ quantization.
Based on PR openai#1364, which reports -0.027 BPB from this technique alone.

Flow: Train -> EMA -> AdamW TTT (3 epochs, freeze 2 blocks) -> GPTQ -> eval

Key fix: destroy_process_group + reinit pattern to avoid NCCL watchdog
timeout during the ~13-min single-rank TTT phase. Standard dist.barrier()
is insufficient because NCCL's heartbeat thread times out independently.

Env: PREQUANT_TTT_ENABLED=1 PREQUANT_TTT_EPOCHS=3 PREQUANT_TTT_LR=3e-4
PREQUANT_TTT_FREEZE_BLOCKS=2
1 parent a8412e3 commit a4a7f76

3 files changed

Lines changed: 2555 additions & 0 deletions


Lines changed: 54 additions & 0 deletions
# exp_074_prequant_ttt — Pre-quant AdamW TTT (READY, untested)

**Hypothesis**: Running AdamW TTT on the **full-precision EMA model before GPTQ** should give a much larger BPB improvement than post-quant SGD TTT.

**Source**: [PR #1364](https://github.com/openai/parameter-golf/pull/1364) reports −0.027 BPB from this technique alone (1.1025 BPB 3-seed mean).

## Why this works

Post-quant SGD TTT on int6 weights is unstable: we observed a +0.030 BPB
penalty with naive SGD TTT on a GPTQ-quantized model (see PR #756's
25 failed attempts). Running TTT **before** quantization:

1. Avoids optimizer instability on the quantized weight manifold
2. Lets GPTQ see the TTT-adapted Hessians during calibration
3. Uses AdamW (not SGD) for better adaptation dynamics
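
To make the pre-quant phase concrete, here is a minimal sketch of an AdamW TTT loop with the earliest blocks frozen. Names are assumptions for illustration (`ema_model` with a `.blocks` list, `ttt_batches` yielding (input, target) token batches, a forward pass that returns the loss); this is not the train_gpt.py implementation.

```python
import torch

def prequant_ttt(ema_model, ttt_batches, epochs=3, lr=3e-4, freeze_blocks=2):
    # Freeze the first `freeze_blocks` transformer blocks; adapt the rest.
    for i, block in enumerate(ema_model.blocks):        # assumed attribute
        if i < freeze_blocks:
            for p in block.parameters():
                p.requires_grad_(False)
    params = [p for p in ema_model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    ema_model.train()
    for _ in range(epochs):
        for x, y in ttt_batches:
            loss = ema_model(x, targets=y)              # assumed: forward returns CE loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    return ema_model  # hand the adapted, still full-precision model to GPTQ
```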

## Flow

```
Train 600s → EMA model (bf16)
→ AdamW TTT on full-precision model (3 epochs)
→ GPTQ quantize the adapted model
→ Sliding window eval (no further TTT)
```
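
The payoff of quantizing after TTT (point 2 in the list above) is that GPTQ's calibration runs on the adapted model: GPTQ accumulates a per-layer Hessian proxy from the activations each linear layer sees, so those statistics now match the weights being quantized. A simplified sketch of the statistic, following the standard GPTQ formulation rather than this repo's code:

```python
import torch

def accumulate_hessian(H, layer_input):
    # GPTQ's per-layer Hessian proxy is H = sum of 2 * x x^T over calibration
    # tokens, where x is the input activation of the linear layer. With
    # pre-quant TTT, these activations come from the TTT-adapted model.
    X = layer_input.float()         # (n_tokens, d_in)
    return H + 2.0 * (X.t() @ X)    # (d_in, d_in) running sum
```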

## NCCL Timeout Fix

Pre-quant TTT runs for ~13 minutes on rank 0 only, exceeding NCCL's default
watchdog timeout (600s). A plain `dist.barrier()` is insufficient because
NCCL's heartbeat thread times out independently, so we tear down the process
group before the TTT phase and re-create it afterwards:

```python
import torch.distributed as dist

if distributed:
    dist.barrier()                  # sync all ranks before teardown
    dist.destroy_process_group()    # stops NCCL's watchdog/heartbeat threads

# ... rank 0 runs TTT (~13 min, no collectives) ...

if distributed:
    dist.init_process_group(backend="nccl", device_id=device)
    for p in base_model.parameters():
        dist.broadcast(p.data, src=0)   # push TTT-adapted weights to all ranks
```
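
One detail the snippet elides is what ranks 1-7 do while rank 0 runs TTT: with the process group destroyed they cannot wait on a collective. A filesystem sentinel is one way to bridge that gap; this is our assumption for illustration, not something the commit shows.

```python
import os
import time

DONE_FLAG = "/tmp/exp074_ttt_done"          # hypothetical sentinel path

def wait_out_ttt_phase(rank, run_ttt):
    """run_ttt: callable performing the single-rank TTT phase (rank 0 only)."""
    if rank == 0:
        run_ttt()
        open(DONE_FLAG, "w").close()        # signal completion to other ranks
    else:
        while not os.path.exists(DONE_FLAG):
            time.sleep(10)                  # plain polling: no NCCL, no watchdog
```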

## Running

```bash
PREQUANT_TTT_ENABLED=1 PREQUANT_TTT_EPOCHS=3 PREQUANT_TTT_LR=3e-4 \
PREQUANT_TTT_FREEZE_BLOCKS=2 GPTQ_ENABLED=1 GPTQ_N_BATCHES=64 \
TTT_ENABLED=0 EVAL_STRIDE=64 SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
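
The flags map onto the flow above: the `PREQUANT_TTT_*` variables configure the new pre-quant phase (epochs, learning rate, frozen blocks), `GPTQ_ENABLED=1` and `GPTQ_N_BATCHES=64` control quantization and (presumably) its calibration batch count, and `TTT_ENABLED=0` disables the old post-quant TTT so only the pre-quant variant runs.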

## Expected Result

Targeting ~1.115-1.13 BPB, i.e. a −0.01 to −0.027 BPB gain from the post-EMA
1.142 (1.142 − 0.027 = 1.115). Full PR #1364 reports 1.1025 BPB at 6 epochs;
we use 3 epochs to halve the TTT time.
Lines changed: 42 additions & 0 deletions
#!/bin/bash
# Parameter Golf - exp_074: Pre-quant AdamW TTT
# Requirements: 8xH100 SXM, PyTorch 2.x, CUDA 12.x
# Expected time: ~35 min total (12 min FA3 build + 10 min train + 13 min TTT + 5 min GPTQ/eval)
set -e

echo "=== Step 1: Install dependencies ==="
pip install tiktoken blobfile tqdm lm_eval sentencepiece 2>/dev/null

echo "=== Step 2: Build Flash Attention 3 (Hopper kernels) ==="
echo "This takes ~12 minutes. DO NOT SKIP - FA3 gives ~86ms/step vs ~100ms with FA2."
pip install flash-attn --no-build-isolation 2>&1 | tail -5
# Verify FA3
python -c "from flash_attn_interface import flash_attn_func; print('FA3 OK')" 2>/dev/null \
    || python -c "from flash_attn import flash_attn_func; print('FA2 fallback (slower)')"

echo "=== Step 3: Download training data ==="
# The script auto-downloads data, but we can pre-fetch for speed
python -c "
import subprocess, os
os.makedirs('data', exist_ok=True)
if not os.path.exists('data/cached_challenge_fineweb.py'):
    subprocess.run(['wget', '-q', '-O', 'data/cached_challenge_fineweb.py',
                    'https://raw.githubusercontent.com/openai/parameter-golf/main/data/cached_challenge_fineweb.py'])
" 2>/dev/null
python data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80 2>&1 | tail -3

echo "=== Step 4: Run experiment ==="
echo "Training 600s → EMA → Pre-quant AdamW TTT (3 epochs) → GPTQ → Sliding eval"
PREQUANT_TTT_ENABLED=1 \
PREQUANT_TTT_EPOCHS=3 \
PREQUANT_TTT_LR=3e-4 \
PREQUANT_TTT_FREEZE_BLOCKS=2 \
GPTQ_ENABLED=1 \
GPTQ_N_BATCHES=64 \
TTT_ENABLED=0 \
EVAL_STRIDE=64 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee exp074_results.log

echo "=== Done! Check exp074_results.log for val_bpb ==="
grep -E "val_bpb|DIAGNOSTIC|prequant_ttt|sliding" exp074_results.log | tail -20
