Skip to content

Commit d6f723d

Browse files
committed
Implement sequential GPTQ + groupwise int6 + Hessian-weighted scales
Three improvements to the post-training quantization pipeline on PR openai#1218: 1. Sequential cross-layer GPTQ: quantize layers one at a time, injecting quantized weights back before collecting later layers' Hessians. This propagates quantization error forward so later Hessians are accurate. 2. Groupwise int6 scales (group_size=128): per-group fp16 scales instead of per-row, giving finer control over weight variance within rows. 3. Hessian-weighted scale selection: minimize H_diag-weighted error instead of MSE when selecting per-row clip percentiles. Zero training-time cost. Expected -0.004 to -0.008 dBPB. Made-with: Cursor
1 parent 1e88b09 commit d6f723d

2 files changed

Lines changed: 280 additions & 51 deletions

File tree

records/track_10min_16mb/2026-04-16_SP4096_SequentialGPTQ_GroupwiseInt6/README.md

Lines changed: 40 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,27 +1,37 @@
11
# WIP: Sequential GPTQ with Groupwise Int6 Quantization
22

3-
**Track:** 10min / 8×H100 / 16MB artifact
4-
**Base:** PR #1218 (clarkkev — SP4096, MLP 4×, WD 0.085, brotli, XSA-all) — 1.098 BPB baseline
5-
**Status:** Work in progress — requesting compute credits for 3-seed validation
3+
**Track:** 10min / 8×H100 / 16MB artifact
4+
**Base:** PR #1218 (clarkkev — SP4096, MLP 4×, WD 0.085, brotli, XSA-all) — 1.098 BPB baseline
5+
**Status:** Implementation complete — requesting compute credits for 3-seed validation
66

77
## Approach
88

9-
This submission improves the post-training quantization pipeline while keeping the training procedure identical to PR #1218. The core insight: PR #1218 loses ~0.012 BPB from quantization (pre-quant 1.1047 → post-quant 1.1162 non-sliding). Recovering even a fraction of that loss through better quantization is free — zero training-time cost, zero throughput tax.
9+
This submission improves the post-training quantization pipeline while keeping the training procedure identical to PR #1218. The core insight: #1218 loses ~0.012 BPB from quantization (pre-quant 1.1047 → post-quant 1.1162 non-sliding). Recovering even a fraction of that loss is free — zero training-time cost, zero throughput tax.
1010

1111
### Changes from #1218
1212

13-
1. **Sequential layer-wise GPTQ propagation**: Instead of quantizing each layer independently, propagate the reconstruction error of earlier layers through to later layers' Hessian estimates. This captures cross-layer error accumulation that vanilla per-layer GPTQ misses.
13+
#### 1. Sequential cross-layer GPTQ propagation (`collect_hessians_sequential`)
1414

15-
2. **Groupwise int6 scales** (group_size=128): Replace per-row scales with per-group scales, giving the quantizer finer control over weight distributions within each row. The scale overhead is small (~2% of weight size) but the MSE reduction is significant for layers with heterogeneous weight magnitudes.
15+
Instead of collecting all Hessians in a single pass and quantizing each layer independently, we process layers one at a time: collect Hessian for layer *i*, quantize it with GPTQ, **inject the quantized weights back into the model**, then collect the Hessian for layer *i+1*. This means later layers' Hessians reflect the actual quantized activations they'll see at eval time, capturing cross-layer error accumulation that per-layer GPTQ misses.
1616

17-
3. **Hessian-weighted scale selection**: Instead of searching over percentile-based clip candidates using MSE, select scales that minimize the Hessian-weighted quantization error `(W - Q)^T H (W - Q)`, which directly optimizes for output reconstruction quality.
17+
Controlled by `GPTQ_SEQUENTIAL=1` (default on).
18+
19+
#### 2. Groupwise int6 scales (`group_size=128`)
20+
21+
Replace per-row scales with per-group scales (128 columns per group). Each group of weights gets its own fp16 scale factor, giving the quantizer finer control over heterogeneous weight distributions within each row. The scale storage overhead is small (~2% of weight bytes) but the reconstruction error reduction is significant for layers with high weight variance.
22+
23+
Controlled by `GPTQ_GROUP_SIZE=128` (default).
24+
25+
#### 3. Hessian-weighted scale selection
26+
27+
For per-row mode, instead of selecting scales by minimizing MSE `(W - Q)^2`, we minimize the Hessian-weighted error `sum(H_diag * (W - Q)^2)`, which directly optimizes for output reconstruction quality. Columns with high Hessian diagonal (high activation variance) get proportionally more weight in the error metric.
1828

1929
### Why this is distinct
2030

21-
- **Not in PR #1218**: #1218 uses independent per-layer GPTQ with MSE-based scale search.
22-
- **Not in PR #1019**: #1019 focuses on self-generated calibration data; this improves the GPTQ algorithm itself.
31+
- **Not in PR #1218**: #1218 uses independent per-layer GPTQ with per-row scales and MSE-based selection.
32+
- **Not in PR #1019**: #1019 focuses on self-generated calibration data; this improves the quantization algorithm itself.
2333
- **Not in PR #1204 / #1209**: Those PRs focus on architecture changes (parallel residuals, depth recurrence, TTT).
24-
- **No SLOT / TTT**: This is a pure post-training compression improvement — clean causal submission.
34+
- **No SLOT / TTT**: Pure post-training compression improvement — clean causal submission.
2535

2636
### Expected improvement
2737

@@ -33,13 +43,29 @@ This submission improves the post-training quantization pipeline while keeping t
3343
| int6 compatible | Yes (this IS the int6 path) |
3444
| torch.compile compatible | Yes (post-training only) |
3545

46+
## Implementation details
47+
48+
New/modified functions:
49+
- `_compute_groupwise_scales()` — per-group fp16 scale computation
50+
- `_quantize_with_groupwise_scales()` — apply groupwise quantization
51+
- `_dequant_groupwise()` — reconstruct from groupwise int6
52+
- `collect_hessians_sequential()` — layer-by-layer Hessian collection with error propagation
53+
- `gptq_quantize_weight()` — extended with `group_size` param and Hessian-weighted error metric
54+
- `gptq_mixed_quantize_int6()` — passes `group_size`, stores group metadata
55+
- `dequantize_mixed_int6()` — handles 2D groupwise scale tensors
56+
57+
New hyperparameters:
58+
- `GPTQ_GROUP_SIZE=128` — columns per quantization group (0 = per-row fallback)
59+
- `GPTQ_SEQUENTIAL=1` — enable sequential cross-layer propagation
60+
- `GPTQ_RESERVE_SECONDS=12` — increased from 10 to account for sequential overhead
61+
3662
## Ablation Plan
3763

3864
1. **Baseline**: Reproduce #1218 at 3 seeds (1337, 42, 2025)
39-
2. **+Sequential propagation**: Add cross-layer error propagation only
40-
3. **+Groupwise scales**: Add group_size=128 scales on top
41-
4. **+Hessian-weighted selection**: Replace MSE scale search with H-weighted
42-
5. **Full stack**: All three combined
65+
2. **+Sequential propagation only**: `GPTQ_SEQUENTIAL=1 GPTQ_GROUP_SIZE=0`
66+
3. **+Groupwise scales only**: `GPTQ_SEQUENTIAL=0 GPTQ_GROUP_SIZE=128`
67+
4. **+Hessian-weighted only**: per-row mode with H-weighted error
68+
5. **Full stack**: All three combined (default config)
4369

4470
Acceptance criteria: paired t-test across 3 seeds, p < 0.01, dBPB > 0.003.
4571

0 commit comments

Comments
 (0)