Conversation

@saraswatmks commented Dec 21, 2025

SUMMARY:
- Refactored the GPT-OSS calibration scheme to use the MoECalibrationModule interface; this enables all-expert calibration (ensuring all experts see the calibration data).
- Added test scripts to demonstrate use of the updated interface.

TEST PLAN:
Changes were tested by successfully running the following commands:

python gpt_oss_quantization_example.py --algorithm gptq
python gpt_oss_quantization_example.py --algorithm awq
python gpt_oss_quantization_example.py --algorithm w4a8

Below are the logs of the successful runs (all three quantization algorithms):


[6/6] Testing generation with quantized model...
      Prompt: Hello, my name is
      Generated: Hello, my name is [Your Name], and I represent Sora. I am delighted to share with you how we empower businesses like yours, with a proven track record of delivering superior outcomes through innovative, automated solutions.

Sora's solution is built on AWS, a powerful
Generation test passed

Saving quantized model to: gpt-oss-20b-gptq
2025-12-20T21:34:49.744242+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 2400it [00:19, 123.44it/s]
Model saved successfully

======================================================================
Quantization Complete!
======================================================================
Quantized model saved to: gpt-oss-20b-gptq

To run inference with vLLM:
----------------------------------------------------------------------
from vllm import LLM, SamplingParams

model = LLM(model="gpt-oss-20b-gptq", trust_remote_code=True)
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
======================================================================
[6/6] Testing generation with quantized model...
      Prompt: Hello, my name is
      Generated: Hello, my name is {name_with_index} {name_with_index}. My name is my"

In other words, we have repeated parts, perhaps some mistakes.

Now focusing on the context, likely in "2.1 Maturity of the product", the "The world
Generation test passed

Saving quantized model to: gpt-oss-20b-awq
2025-12-21T06:10:48.437118+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 2400it [00:19, 125.81it/s]
Model saved successfully

======================================================================
Quantization Complete!
======================================================================
Quantized model saved to: gpt-oss-20b-awq

To run inference with vLLM:
----------------------------------------------------------------------
from vllm import LLM, SamplingParams

model = LLM(model="gpt-oss-20b-awq", trust_remote_code=True)
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
======================================================================
[6/6] Testing generation with quantized model...
      Prompt: Hello, my name is
      Generated: Hello, my name is " + name + " and I am " + age + " years old.";
    }
}

Thus answer. And the answer probably also expects in the test case, you can get the same by using pattern matching. So answer: use pattern
Generation test passed

Saving quantized model to: gpt-oss-20b-w4a8
2025-12-21T06:15:49.378501+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 2400it [00:16, 147.06it/s]
Model saved successfully

======================================================================
Quantization Complete!
======================================================================
Quantized model saved to: gpt-oss-20b-w4a8

To run inference with vLLM:
----------------------------------------------------------------------
from vllm import LLM, SamplingParams

model = LLM(model="gpt-oss-20b-w4a8", trust_remote_code=True)
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = model.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
======================================================================

Fixes #2159

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @saraswatmks, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the quantization process for GPT-OSS Mixture-of-Experts (MoE) models by refactoring the calibration scheme. It introduces a dedicated MoECalibrationModule interface and a new CalibrationLinearExperts class, which together enable a comprehensive "all-expert calibration." This ensures that during quantization, all experts within the MoE architecture are properly calibrated, leading to more accurate and robust quantized models. The changes are validated with a new example script demonstrating the improved workflow across various quantization algorithms.

Highlights

  • MoE Calibration Refactor: The GPT-OSS quantization scheme now uses the MoECalibrationModule interface, providing a standardized approach for calibrating Mixture-of-Experts (MoE) models.
  • All-Expert Calibration Enabled: This refactoring introduces an 'all-expert calibration' mode, ensuring that all experts within an MoE model are exposed to calibration data, which is crucial for more robust and accurate quantization.
  • New CalibrationLinearExperts Class: A new class, CalibrationLinearExperts, has been introduced. It wraps existing LinearExperts modules to facilitate the all-expert calibration process by controlling token flow during calibration.
  • Updated Model Conversion Function: The convert_model_for_quantization_gptoss function has been updated to accept a calibrate_all_experts flag, allowing users to explicitly enable or disable the all-expert calibration behavior during model transformation (see the usage sketch after this list).
  • New Example Script: A new example script (gpt_oss_quantization_example.py) has been added to demonstrate the updated quantization workflow for W4A8, AWQ, and GPTQ algorithms, showcasing the use of the new calibration features.
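
For orientation, here is a minimal usage sketch of the flow described above. The model id, import paths, recipe, and oneshot() arguments are assumptions modeled on typical llm-compressor examples, not the exact contents of gpt_oss_quantization_example.py; only convert_model_for_quantization_gptoss and its calibrate_all_experts flag come from this PR.

    # Hypothetical sketch: enable all-expert calibration before one-shot quantization.
    # Import paths and most arguments below are assumed; see the example script in
    # this PR for the authoritative version.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modeling.gpt_oss import convert_model_for_quantization_gptoss

    model_id = "openai/gpt-oss-20b"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Wrap each experts module so every expert sees the calibration data,
    # not just the experts the router selects for each token.
    # (Assumed to modify the model in place; check the function's return value.)
    convert_model_for_quantization_gptoss(model, calibrate_all_experts=True)

    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

    oneshot(
        model=model,
        dataset="open_platypus",  # any small calibration dataset
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=256,
    )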


@saraswatmks (Author)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the GPT-OSS quantization calibration to use the MoECalibrationModule interface, enabling all-expert calibration, which is a great improvement for MoE models. The changes are well-structured, and the new example script is very helpful for demonstrating the functionality.

My review includes a few suggestions to improve code clarity and maintainability:

  • In src/llmcompressor/modeling/gpt_oss.py, I've pointed out an unused parameter and significant code duplication that could be refactored.
  • In the new example script examples/quantization_w4a8/gpt_oss_quantization_example.py, I've identified a bug related to the output directory and some redundant code around the oneshot call and model saving.

Addressing these points will make the code cleaner and more robust. Overall, this is a solid contribution.

Comment on lines +378 to +364
calibration_experts = CalibrationLinearExperts(
original=linear_experts,
config=model.config,
calibrate_all_experts=calibrate_all_experts,
)
@gemini-code-assist (Contributor)

Severity: medium

The config parameter is passed to CalibrationLinearExperts but is not used in its __init__ method. To improve code clarity, this parameter should be removed from both the CalibrationLinearExperts.__init__ signature and this call site.

        calibration_experts = CalibrationLinearExperts(
            original=linear_experts,
            calibrate_all_experts=calibrate_all_experts,
        )

@saraswatmks (Author)
The abstract class MoECalibrationModule requires subclasses to follow the format below (a minimal skeleton is sketched after the quoted docstring):

    Subclasses must:
    1. Implement `__init__()` with signature:
       (self, original, config, calibrate_all_experts=True)
    2. Set `is_permanent` to indicate if module should stay in calibration form
    3. Optionally implement `restore()` if is_permanent=False
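
For illustration only, a minimal skeleton that satisfies this contract might look like the following. The import path, is_permanent value, base-class constructor call, and forward body are placeholders, not the actual implementation in src/llmcompressor/modeling/gpt_oss.py; the point is that config must stay in the signature even when unused.

    # Hypothetical skeleton: the machinery that swaps in MoECalibrationModule
    # subclasses is expected to construct them uniformly as
    # cls(original, config, calibrate_all_experts), so `config` stays in the signature.
    from llmcompressor.modeling.moe_context import MoECalibrationModule  # import path assumed

    class CalibrationLinearExperts(MoECalibrationModule):
        is_permanent = False  # placeholder value

        def __init__(self, original, config, calibrate_all_experts=True):
            super().__init__()  # base-class __init__ arguments assumed
            self.original = original
            self.calibrate_all_experts = calibrate_all_experts
            # `config` is accepted (and unused here) to satisfy the required
            # __init__ signature of MoECalibrationModule subclasses.

        def forward(self, *args, **kwargs):
            # Placeholder: during calibration, run every expert so each one
            # sees data; outside calibration, defer to the wrapped module.
            return self.original(*args, **kwargs)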

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the GPT-OSS quantization to use an all-expert calibration scheme via the MoECalibrationModule interface, which is a solid improvement. The changes are well-structured and include a helpful example script. My feedback focuses on improving the new example script for better clarity and robustness, such as refining how command-line arguments are handled and simplifying the model saving logic. Additionally, I've identified an area of code duplication in the core modeling file that could be refactored to enhance maintainability.

@saraswatmks force-pushed the saraswatmks/gpt-refactor-quantization branch from ff2c7ce to fd81a2a on December 21, 2025 09:48
@saraswatmks (Author)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the GPT-OSS quantization to use the MoECalibrationModule interface, enabling all-expert calibration. This is a solid architectural improvement. The core logic changes in src/llmcompressor/modeling/gpt_oss.py are well-implemented and robust. I've reviewed the new example script gpt_oss_quantization_example.py and found a high-severity issue where the model isn't being saved, which would make the example fail for users. I've provided a suggestion to fix this. I also included a minor suggestion to improve code consistency in the example script. Overall, great work on the refactoring.

tokenizer=use_tokenizer,
max_seq_length=max_seq_length,
num_calibration_samples=num_samples,
save_compressed=False,
@gemini-code-assist (Contributor)

Severity: high

The oneshot function is called with save_compressed=False, which prevents the quantized model from being saved to disk. However, subsequent print statements suggest that the model is saved and provide instructions on how to load it. This is contradictory and will cause the example to fail for users.

Please set save_compressed=True to ensure the quantized model is saved as intended by the example's instructions.

Suggested change
save_compressed=False,
save_compressed=True,

@saraswatmks (Author)

The problem I've seen is that when oneshot() is used with output_dir (instead of saving externally with model.save_pretrained()) and the default save_compressed=True, the compressed tensors are written to disk, but the in-memory model representation may be modified in a way that is incompatible with direct .generate() calls. That is probably why enabling save_compressed=True throws the error below (a sketch of the resulting save pattern follows the traceback):

"/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 313, in forward
    query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.10/site-packages/compressed_tensors/quantization/lifecycle/forward.py", line 387, in wrapped_forward
    output = forward_func_orig.__get__(module, module.__class__)(
  File "/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 134, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Char
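
For context, this trade-off suggests the calibrate-first, save-later pattern sketched below. It assumes model, tokenizer, dataset, recipe, max_seq_length, num_samples, and output_dir are already defined as in the example script, and it is not necessarily the exact code in gpt_oss_quantization_example.py.

    # Sketch: keep the in-memory model usable for .generate() during the smoke
    # test, then compress and save afterwards. Variable names are assumed to be
    # defined earlier in the script.
    from llmcompressor import oneshot

    oneshot(
        model=model,
        dataset=dataset,
        recipe=recipe,
        max_seq_length=max_seq_length,
        num_calibration_samples=num_samples,
        save_compressed=False,  # avoid compressing weights that generate() still needs
    )

    # Quick generation sanity check on the calibrated (uncompressed) model.
    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))

    # Compress and save once generation has been verified.
    model.save_pretrained(output_dir, save_compressed=True)
    tokenizer.save_pretrained(output_dir)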

@saraswatmks force-pushed the saraswatmks/gpt-refactor-quantization branch from 1d97568 to 891ea25 on December 21, 2025 13:36
@kylesayrs (Collaborator)

Looks great! Would you mind adding some basic tests for your added source code to make sure that everything works as expected?

Successfully merging this pull request may close: [GPT-OSS] Expanded Support for Activation Quantization