Conversation

@saraswatmks commented Dec 21, 2025

SUMMARY:
- Refactored the GPT-OSS calibration scheme to use the MoECalibrationModule interface; this enables all-expert calibration (ensuring all experts see the calibration data).
- Added test scripts to demonstrate use of the updated interface.

TEST PLAN:
Changes were tested by successfully running the following commands:

python gpt_oss_quantization_example.py --algorithm gptq
python gpt_oss_quantization_example.py --algorithm awq
python gpt_oss_quantization_example.py --algorithm w4a8

Below are the logs of the successful runs (all three quantization algorithms):


[6/6] Testing generation with quantized model...
      Prompt: Hello, my name is
      Generated: Hello, my name is [Your Name], and I represent Sora. I am delighted to share with you how we empower businesses like yours, with a proven track record of delivering superior outcomes through innovative, automated solutions.

Sora's solution is built on AWS, a powerful
Generation test passed

Saving quantized model to: gpt-oss-20b-gptq
2025-12-20T21:34:49.744242+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 2400it [00:19, 123.44it/s]
Model saved successfully

======================================================================
Quantization Complete!
======================================================================
Quantized model saved to: gpt-oss-20b-gptq

To run inference with vLLM:
----------------------------------------------------------------------
from vllm import LLM, SamplingParams

model = LLM(model="gpt-oss-20b-gptq", trust_remote_code=True)
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
======================================================================
[6/6] Testing generation with quantized model...
      Prompt: Hello, my name is
      Generated: Hello, my name is {name_with_index} {name_with_index}. My name is my"

In other words, we have repeated parts, perhaps some mistakes.

Now focusing on the context, likely in "2.1 Maturity of the product", the "The world
Generation test passed

Saving quantized model to: gpt-oss-20b-awq
2025-12-21T06:10:48.437118+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 2400it [00:19, 125.81it/s]
Model saved successfully

======================================================================
Quantization Complete!
======================================================================
Quantized model saved to: gpt-oss-20b-awq

To run inference with vLLM:
----------------------------------------------------------------------
from vllm import LLM, SamplingParams

model = LLM(model="gpt-oss-20b-awq", trust_remote_code=True)
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
======================================================================
[6/6] Testing generation with quantized model...
      Prompt: Hello, my name is
      Generated: Hello, my name is " + name + " and I am " + age + " years old.";
    }
}

Thus answer. And the answer probably also expects in the test case, you can get the same by using pattern matching. So answer: use pattern
Generation test passed

Saving quantized model to: gpt-oss-20b-w4a8
2025-12-21T06:15:49.378501+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 2400it [00:16, 147.06it/s]
Model saved successfully

======================================================================
Quantization Complete!
======================================================================
Quantized model saved to: gpt-oss-20b-w4a8

To run inference with vLLM:
----------------------------------------------------------------------
from vllm import LLM, SamplingParams

model = LLM(model="gpt-oss-20b-w4a8", trust_remote_code=True)
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
======================================================================

Fixes #2159

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @saraswatmks, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the quantization process for GPT-OSS Mixture-of-Experts (MoE) models by refactoring the calibration scheme. It introduces a dedicated MoECalibrationModule interface and a new CalibrationLinearExperts class, which together enable a comprehensive "all-expert calibration." This ensures that during quantization, all experts within the MoE architecture are properly calibrated, leading to more accurate and robust quantized models. The changes are validated with a new example script demonstrating the improved workflow across various quantization algorithms.

Highlights

  • MoE Calibration Refactor: The GPT-OSS quantization scheme now uses the MoECalibrationModule interface, providing a standardized approach for calibrating Mixture-of-Experts (MoE) models.
  • All-Expert Calibration Enabled: This refactoring introduces an 'all-expert calibration' mode, ensuring that all experts within an MoE model are exposed to calibration data, which is crucial for more robust and accurate quantization.
  • New CalibrationLinearExperts Class: A new class, CalibrationLinearExperts, has been introduced. It wraps existing LinearExperts modules to facilitate the all-expert calibration process by controlling token flow during calibration.
  • Updated Model Conversion Function: The convert_model_for_quantization_gptoss function has been updated to accept a calibrate_all_experts flag, allowing users to explicitly enable or disable the all-expert calibration behavior during model transformation (see the usage sketch after this list).
  • New Example Script: A new example script (gpt_oss_quantization_example.py) has been added to demonstrate the updated quantization workflow for W4A8, AWQ, and GPTQ algorithms, showcasing the use of the new calibration features.
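
For orientation, here is a minimal usage sketch of the flow described above. The model id, import paths, recipe, and oneshot() arguments are assumptions modeled on typical llm-compressor examples, not the exact contents of gpt_oss_quantization_example.py; only convert_model_for_quantization_gptoss and its calibrate_all_experts flag come from this PR.

    # Hypothetical sketch: enable all-expert calibration before one-shot quantization.
    # Import paths and most arguments below are assumed; see the example script in
    # this PR for the authoritative version.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modeling.gpt_oss import convert_model_for_quantization_gptoss

    model_id = "openai/gpt-oss-20b"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Wrap each experts module so every expert sees the calibration data,
    # not just the experts the router selects for each token.
    # (Assumed to modify the model in place; check the function's return value.)
    convert_model_for_quantization_gptoss(model, calibrate_all_experts=True)

    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

    oneshot(
        model=model,
        dataset="open_platypus",  # any small calibration dataset
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=256,
    )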


@saraswatmks (Author)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the GPT-OSS quantization calibration to use the MoECalibrationModule interface, enabling all-expert calibration, which is a great improvement for MoE models. The changes are well-structured, and the new example script is very helpful for demonstrating the functionality.

My review includes a few suggestions to improve code clarity and maintainability:

  • In src/llmcompressor/modeling/gpt_oss.py, I've pointed out an unused parameter and significant code duplication that could be refactored.
  • In the new example script examples/quantization_w4a8/gpt_oss_quantization_example.py, I've identified a bug related to the output directory and some redundant code around the oneshot call and model saving.

Addressing these points will make the code cleaner and more robust. Overall, this is a solid contribution.

Comment on lines +378 to +364
calibration_experts = CalibrationLinearExperts(
original=linear_experts,
config=model.config,
calibrate_all_experts=calibrate_all_experts,
)
@gemini-code-assist (Contributor)

Severity: medium

The config parameter is passed to CalibrationLinearExperts but is not used in its __init__ method. To improve code clarity, this parameter should be removed from both the CalibrationLinearExperts.__init__ signature and this call site.

        calibration_experts = CalibrationLinearExperts(
            original=linear_experts,
            calibrate_all_experts=calibrate_all_experts,
        )

@saraswatmks (Author)
The abstract class MoECalibrationModule requires subclasses to follow the format below (a minimal skeleton is sketched after the quoted docstring):

    Subclasses must:
    1. Implement `__init__()` with signature:
       (self, original, config, calibrate_all_experts=True)
    2. Set `is_permanent` to indicate if module should stay in calibration form
    3. Optionally implement `restore()` if is_permanent=False
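
For illustration only, a minimal skeleton that satisfies this contract might look like the following. The import path, is_permanent value, base-class constructor call, and forward body are placeholders, not the actual implementation in src/llmcompressor/modeling/gpt_oss.py; the point is that config must stay in the signature even when unused.

    # Hypothetical skeleton: the machinery that swaps in MoECalibrationModule
    # subclasses is expected to construct them uniformly as
    # cls(original, config, calibrate_all_experts), so `config` stays in the signature.
    from llmcompressor.modeling.moe_context import MoECalibrationModule  # import path assumed

    class CalibrationLinearExperts(MoECalibrationModule):
        is_permanent = False  # placeholder value

        def __init__(self, original, config, calibrate_all_experts=True):
            super().__init__()  # base-class __init__ arguments assumed
            self.original = original
            self.calibrate_all_experts = calibrate_all_experts
            # `config` is accepted (and unused here) to satisfy the required
            # __init__ signature of MoECalibrationModule subclasses.

        def forward(self, *args, **kwargs):
            # Placeholder: during calibration, run every expert so each one
            # sees data; outside calibration, defer to the wrapped module.
            return self.original(*args, **kwargs)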

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the GPT-OSS quantization to use an all-expert calibration scheme via the MoECalibrationModule interface, which is a solid improvement. The changes are well-structured and include a helpful example script. My feedback focuses on improving the new example script for better clarity and robustness, such as refining how command-line arguments are handled and simplifying the model saving logic. Additionally, I've identified an area of code duplication in the core modeling file that could be refactored to enhance maintainability.

@saraswatmks force-pushed the saraswatmks/gpt-refactor-quantization branch from ff2c7ce to fd81a2a on December 21, 2025 09:48
@saraswatmks (Author)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the GPT-OSS quantization to use the MoECalibrationModule interface, enabling all-expert calibration. This is a solid architectural improvement. The core logic changes in src/llmcompressor/modeling/gpt_oss.py are well-implemented and robust. I've reviewed the new example script gpt_oss_quantization_example.py and found a high-severity issue where the model isn't being saved, which would make the example fail for users. I've provided a suggestion to fix this. I also included a minor suggestion to improve code consistency in the example script. Overall, great work on the refactoring.

tokenizer=use_tokenizer,
max_seq_length=max_seq_length,
num_calibration_samples=num_samples,
save_compressed=False,
@gemini-code-assist (Contributor)

Severity: high

The oneshot function is called with save_compressed=False, which prevents the quantized model from being saved to disk. However, subsequent print statements suggest that the model is saved and provide instructions on how to load it. This is contradictory and will cause the example to fail for users.

Please set save_compressed=True to ensure the quantized model is saved as intended by the example's instructions.

Suggested change
save_compressed=False,
save_compressed=True,

@saraswatmks (Author)

The problem I've seen is that when oneshot() is used with output_dir (instead of saving externally with model.save_pretrained()) and the default save_compressed=True, the compressed tensors are written to disk, but the in-memory model representation may be modified in a way that is incompatible with direct .generate() calls. That is probably why enabling save_compressed=True throws the error below (a sketch of the resulting save pattern follows the traceback):

"/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 313, in forward
    query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.10/site-packages/compressed_tensors/quantization/lifecycle/forward.py", line 387, in wrapped_forward
    output = forward_func_orig.__get__(module, module.__class__)(
  File "/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 134, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Char
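
For context, this trade-off suggests the calibrate-first, save-later pattern sketched below. It assumes model, tokenizer, dataset, recipe, max_seq_length, num_samples, and output_dir are already defined as in the example script, and it is not necessarily the exact code in gpt_oss_quantization_example.py.

    # Sketch: keep the in-memory model usable for .generate() during the smoke
    # test, then compress and save afterwards. Variable names are assumed to be
    # defined earlier in the script.
    from llmcompressor import oneshot

    oneshot(
        model=model,
        dataset=dataset,
        recipe=recipe,
        max_seq_length=max_seq_length,
        num_calibration_samples=num_samples,
        save_compressed=False,  # avoid compressing weights that generate() still needs
    )

    # Quick generation sanity check on the calibrated (uncompressed) model.
    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))

    # Compress and save once generation has been verified.
    model.save_pretrained(output_dir, save_compressed=True)
    tokenizer.save_pretrained(output_dir)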

@saraswatmks force-pushed the saraswatmks/gpt-refactor-quantization branch from 1d97568 to 891ea25 on December 21, 2025 13:36
@kylesayrs (Collaborator)

Looks great! Would you mind adding some basic tests for your added source code to make sure that everything works as expected?

Successfully merging this pull request may close: [GPT-OSS] Expanded Support for Activation Quantization