Commit 6c53583
GPTQ Hessian all-reduce on PR1851 base
PR openai#1851's collect_hessians (lines 2037-2150 of _top_ref/train_gpt.py) computes
each rank's Hessian on its own data-shard subset (ShuffledSequenceLoader splits
files by rank) and divides only by n_calibration_batches. Without an all-reduce,
only rank 0's Hessian is effectively used, since only rank 0 writes the
quantized blob; 7/8 of the calibration compute is wasted.
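The per-rank accumulation described above can be sketched as follows. This is a minimal illustration, not the actual collect_hessians code: the function and variable names here are assumptions, and the real implementation differs in scaling and layout.

```python
import torch

def collect_hessian_per_rank(activation_batches):
    """Accumulate a GPTQ-style Hessian H ~ E[X^T X] from one rank's
    calibration batches. Hypothetical sketch of the upstream behavior."""
    H = None
    for X in activation_batches:  # X: (n_tokens, d_in), drawn from this rank's shard only
        H = X.T @ X if H is None else H + X.T @ X
    # Upstream divides by n_calibration_batches alone, so every rank ends
    # up with a shard-local average; no rank sees the others' data.
    return H / len(activation_batches)
```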
Fix: dist.all_reduce(SUM) each Hessian (iterating keys in sorted order to avoid
deadlock if key order ever drifts between ranks), then divide by
n_calibration_batches * world_size. Smoking-gun log lines:
"gptq:all-rank Hessian averaging across N ranks (denom=...)" when on,
"gptq:per-rank Hessian (no all-reduce, denom=...)" when off.
Gated by the GPTQ_ALL_REDUCE env var (default 1, the bugfix behavior). The off
path preserves the original upstream semantics for a clean A/B if needed.
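A sketch of the gated all-reduce path. The helper name and overall structure are assumptions; the log strings, env-var gate, and denominators come from the commit message above.

```python
import os
import torch
import torch.distributed as dist

def average_hessians(hessians, n_calibration_batches):
    """Average per-rank Hessians (hypothetical helper mirroring the fix).

    hessians: dict[str, Tensor] of per-layer Hessians, modified in place.
    """
    all_reduce = os.environ.get("GPTQ_ALL_REDUCE", "1") == "1"
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if all_reduce and world_size > 1:
        # Iterate keys in sorted order so every rank issues the same
        # sequence of collectives, avoiding deadlock if dict key order
        # ever drifts between ranks.
        for name in sorted(hessians):
            dist.all_reduce(hessians[name], op=dist.ReduceOp.SUM)
        denom = n_calibration_batches * world_size
        print(f"gptq:all-rank Hessian averaging across {world_size} ranks (denom={denom})")
    else:
        denom = n_calibration_batches
        print(f"gptq:per-rank Hessian (no all-reduce, denom={denom})")
    for name in hessians:
        hessians[name] /= denom
    return hessians
```

In a single-process run (world_size == 1) both settings reduce to the per-rank path, so the gate only changes behavior under multi-rank training.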
PR1493 evidence at gptq_calibration_batches=16 (PR openai#1851's default):
16-shard no-AR: q_ttt = 1.08060
16-shard AR : q_ttt = 1.07977 (delta -0.00083)
At 128 calibration batches the AR delta saturates to noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3f661b3 commit 6c53583
1 file changed: 14 additions & 1 deletion
(diff body not captured in this extract; hunks touch lines ~309-315 and ~2145-2165 of train_gpt.py: one line added near line 312, and 13 lines added plus 1 removed around lines 2148-2162)