Three improvements to the post-training quantization pipeline on PR openai#1218:
1. Sequential cross-layer GPTQ: quantize layers one at a time, injecting
quantized weights back before collecting later layers' Hessians. This
propagates quantization error forward so later Hessians are accurate.
2. Groupwise int6 scales (group_size=128): per-group fp16 scales instead
of per-row, giving finer control over weight variance within rows.
3. Hessian-weighted scale selection: minimize H_diag-weighted error instead
of MSE when selecting per-row clip percentiles.
Zero training-time cost. Expected -0.004 to -0.008 dBPB.
Made-with: Cursor
**Status:** Implementation complete — requesting compute credits for 3-seed validation
## Approach
This submission improves the post-training quantization pipeline while keeping the training procedure identical to PR #1218. The core insight: #1218 loses ~0.012 BPB from quantization (pre-quant 1.1047 → post-quant 1.1162 non-sliding). Recovering even a fraction of that loss is free — zero training-time cost, zero throughput tax.
### Changes from #1218
#### 1. Sequential cross-layer GPTQ
Instead of collecting all Hessians in a single pass and quantizing each layer independently, we process layers one at a time: collect the Hessian for layer *i*, quantize it with GPTQ, **inject the quantized weights back into the model**, then collect the Hessian for layer *i+1*. This means later layers' Hessians reflect the actual quantized activations they'll see at eval time, capturing cross-layer error accumulation that per-layer GPTQ misses.
Controlled by `GPTQ_SEQUENTIAL=1` (default on).
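
A minimal sketch of the sequential loop under stated assumptions: the model exposes its blocks as `model.layers`, each block holds one linear module (called `proj` here), and `rtn_quantize` is a round-to-nearest stand-in for the actual GPTQ solve so the example stays short and runnable. None of these names are taken from the pipeline itself.

```python
import torch

@torch.no_grad()
def collect_hessian(model, linear, calib_batches):
    """Accumulate H = sum(x x^T) over the calibration inputs reaching `linear`."""
    H = torch.zeros(linear.in_features, linear.in_features, device=linear.weight.device)
    def hook(_module, inputs, _output):
        x = inputs[0].reshape(-1, linear.in_features).float()
        H.add_(x.T @ x)
    handle = linear.register_forward_hook(hook)
    for batch in calib_batches:
        model(batch)
    handle.remove()
    return H

@torch.no_grad()
def rtn_quantize(W, bits=6):
    """Per-row symmetric round-to-nearest; stands in for the GPTQ solve."""
    qmax = 2 ** (bits - 1) - 1
    scale = (W.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-8)
    return (W / scale).round().clamp(-qmax, qmax) * scale

@torch.no_grad()
def quantize_sequential(model, calib_batches):
    for block in model.layers:
        linear = block.proj                            # hypothetical layer handle
        # The Hessian is estimated with the model as it currently stands, so
        # earlier (already quantized) blocks shape the activations seen here.
        H = collect_hessian(model, linear, calib_batches)
        W_q = rtn_quantize(linear.weight.data)         # real pipeline: GPTQ solve using H
        linear.weight.data.copy_(W_q)                  # inject before the next Hessian
    return model
```

The only structural difference from vanilla per-layer GPTQ is the `copy_` inside the loop; dropping it (the `GPTQ_SEQUENTIAL=0` path) recovers the independent per-layer behaviour.
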
#### 2. Groupwise int6 scales (`group_size=128`)
Replace per-row scales with per-group scales (128 columns per group). Each group of weights gets its own fp16 scale factor, giving the quantizer finer control over heterogeneous weight distributions within each row. The scale storage overhead is small (~2% of weight bytes) but the reconstruction error reduction is significant for layers with high weight variance.
Controlled by `GPTQ_GROUP_SIZE=128` (default).
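
For illustration, a round-to-nearest sketch of the per-group scale layout; the real pipeline pairs this layout with the GPTQ solve, the function name is hypothetical, and `in_features` is assumed divisible by the group size.

```python
import torch

def quantize_int6_groupwise(W: torch.Tensor, group_size: int = 128, bits: int = 6):
    """One fp16 scale per `group_size` consecutive columns of each row."""
    qmax = 2 ** (bits - 1) - 1                            # 31 for int6
    out_f, in_f = W.shape
    Wg = W.reshape(out_f, in_f // group_size, group_size)
    scales = (Wg.abs().amax(dim=-1, keepdim=True) / qmax).clamp_min(1e-8)
    Q = (Wg / scales).round().clamp(-qmax, qmax)          # int6 codes
    W_hat = (Q * scales).reshape(out_f, in_f)             # dequantized weights
    return W_hat, Q.to(torch.int8), scales.to(torch.float16)
```

One 16-bit scale per 128 six-bit codes is 16 / (128 × 6) ≈ 2% extra storage, which is where the overhead figure above comes from.
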
#### 3. Hessian-weighted scale selection
For per-row mode, instead of selecting scales by minimizing MSE `(W - Q)^2`, we minimize the Hessian-weighted error `sum(H_diag * (W - Q)^2)`, which directly optimizes for output reconstruction quality. Columns with high Hessian diagonal (high activation variance) get proportionally more weight in the error metric.
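
A sketch of the per-row clip search under this metric; the candidate clip fractions and the function name are illustrative, not the pipeline's actual values.

```python
import torch

def select_rowwise_scales(W, H_diag, bits=6, clip_fracs=(1.0, 0.995, 0.99, 0.98, 0.95)):
    """For each row, pick the clip fraction whose quantized reconstruction
    minimizes sum_j H_diag[j] * (W[i, j] - Q[i, j])**2."""
    qmax = 2 ** (bits - 1) - 1
    row_max = W.abs().amax(dim=1)
    best_err = torch.full((W.shape[0],), float("inf"), device=W.device)
    best_scale = torch.zeros_like(row_max)
    for frac in clip_fracs:
        scale = (row_max * frac / qmax).clamp_min(1e-8)
        Q = (W / scale[:, None]).round().clamp(-qmax, qmax) * scale[:, None]
        err = ((W - Q) ** 2 * H_diag[None, :]).sum(dim=1)   # Hessian-weighted error
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_scale = torch.where(better, scale, best_scale)
    return best_scale
```

Setting `H_diag` to all ones recovers the plain MSE search, so the change is strictly a reweighting of the existing candidate sweep.
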
### Why this is distinct
- **Not in PR #1218**: #1218 uses independent per-layer GPTQ with per-row scales and MSE-based selection.
- **Not in PR #1019**: #1019 focuses on self-generated calibration data; this improves the quantization algorithm itself.
- **Not in PR #1204 / #1209**: Those PRs focus on architecture changes (parallel residuals, depth recurrence, TTT).
- **No SLOT / TTT**: This is a pure post-training compression improvement — clean causal submission.