Commit 6c53583
GPTQ Hessian all-reduce on PR1851 base
PR openai#1851's collect_hessians (lines 2037-2150 of _top_ref/train_gpt.py) computes
each rank's Hessian on its own data-shard subset (ShuffledSequenceLoader splits
files by rank) and divides only by n_calibration_batches. Without an all-reduce,
only rank 0's Hessian is effectively used, since only rank 0 writes the
quantized blob; 7/8 of the calibration compute is wasted.
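The per-rank accumulation described above can be sketched as follows. This is a minimal illustration, not the actual collect_hessians code: the function and variable names here are assumptions, and the real implementation differs in scaling and layout.

```python
import torch

def collect_hessian_per_rank(activation_batches):
    """Accumulate a GPTQ-style Hessian H ~ E[X^T X] from one rank's
    calibration batches. Hypothetical sketch of the upstream behavior."""
    H = None
    for X in activation_batches:  # X: (n_tokens, d_in), drawn from this rank's shard only
        H = X.T @ X if H is None else H + X.T @ X
    # Upstream divides by n_calibration_batches alone, so every rank ends
    # up with a shard-local average; no rank sees the others' data.
    return H / len(activation_batches)
```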
Fix: dist.all_reduce(SUM) each Hessian (iterating keys in sorted order to avoid
deadlock if key order ever drifts between ranks), then divide by
n_calibration_batches * world_size. Smoking-gun log lines:
"gptq:all-rank Hessian averaging across N ranks (denom=...)" when on,
"gptq:per-rank Hessian (no all-reduce, denom=...)" when off.
Gated by the GPTQ_ALL_REDUCE env var (default 1, the bugfix behavior). The off
path preserves the original upstream semantics for a clean A/B if needed.
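A sketch of the gated all-reduce path. The helper name and overall structure are assumptions; the log strings, env-var gate, and denominators come from the commit message above.

```python
import os
import torch
import torch.distributed as dist

def average_hessians(hessians, n_calibration_batches):
    """Average per-rank Hessians (hypothetical helper mirroring the fix).

    hessians: dict[str, Tensor] of per-layer Hessians, modified in place.
    """
    all_reduce = os.environ.get("GPTQ_ALL_REDUCE", "1") == "1"
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if all_reduce and world_size > 1:
        # Iterate keys in sorted order so every rank issues the same
        # sequence of collectives, avoiding deadlock if dict key order
        # ever drifts between ranks.
        for name in sorted(hessians):
            dist.all_reduce(hessians[name], op=dist.ReduceOp.SUM)
        denom = n_calibration_batches * world_size
        print(f"gptq:all-rank Hessian averaging across {world_size} ranks (denom={denom})")
    else:
        denom = n_calibration_batches
        print(f"gptq:per-rank Hessian (no all-reduce, denom={denom})")
    for name in hessians:
        hessians[name] /= denom
    return hessians
```

In a single-process run (world_size == 1) both settings reduce to the per-rank path, so the gate only changes behavior under multi-rank training.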
PR1493 evidence at gptq_calibration_batches=16 (PR openai#1851's default):
16-shard no-AR: q_ttt = 1.08060
16-shard AR : q_ttt = 1.07977 (delta -0.00083)
At 128 calibration batches the AR delta saturates to noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3f661b3 commit 6c53583
1 file changed: 14 additions & 1 deletion
(diff body not captured in this extract; hunks touch lines ~309-315 and ~2145-2165 of train_gpt.py: one line added near line 312, and 13 lines added plus 1 removed around lines 2148-2162)