Refactor gpt oss quantization use all expert quantization #2164
base: main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello @saraswatmks, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the quantization process for GPT-OSS Mixture-of-Experts (MoE) models by refactoring the calibration scheme. It introduces a dedicated

Highlights
/gemini review
Code Review
This pull request refactors the GPT-OSS quantization calibration to use the MoECalibrationModule interface, enabling all-expert calibration, which is a great improvement for MoE models. The changes are well-structured, and the new example script is very helpful for demonstrating the functionality.
My review includes a few suggestions to improve code clarity and maintainability:
- In src/llmcompressor/modeling/gpt_oss.py, I've pointed out an unused parameter and significant code duplication that could be refactored.
- In the new example script examples/quantization_w4a8/gpt_oss_quantization_example.py, I've identified a bug related to the output directory and some redundant code around the oneshot call and model saving.
Addressing these points will make the code cleaner and more robust. Overall, this is a solid contribution.
calibration_experts = CalibrationLinearExperts(
    original=linear_experts,
    config=model.config,
    calibrate_all_experts=calibrate_all_experts,
)
The config parameter is passed to CalibrationLinearExperts but is not used in its __init__ method. To improve code clarity, this parameter should be removed from both the CalibrationLinearExperts.__init__ signature and this call site.
calibration_experts = CalibrationLinearExperts(
    original=linear_experts,
    calibrate_all_experts=calibrate_all_experts,
)
The abstract class MoECalibrationModule requires the subclass to adhere to the following format:
Subclasses must:
1. Implement `__init__()` with signature:
(self, original, config, calibrate_all_experts=True)
2. Set `is_permanent` to indicate if module should stay in calibration form
3. Optionally implement `restore()` if is_permanent=False
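For illustration, here is a minimal sketch of a subclass that satisfies this contract. It is not the PR's implementation: torch.nn.Module stands in for MoECalibrationModule so the snippet is self-contained, and the expert loop and forward signature are assumptions.

```python
import torch


class CalibrationLinearExpertsSketch(torch.nn.Module):
    """Sketch of a MoE calibration wrapper following the contract above."""

    is_permanent = False  # swapped back to the original module after calibration

    def __init__(self, original, config, calibrate_all_experts=True):
        # Required signature: (self, original, config, calibrate_all_experts=True)
        super().__init__()
        self.original = original
        self.calibrate_all_experts = calibrate_all_experts

    def forward(self, hidden_states, *args, **kwargs):
        if self.calibrate_all_experts:
            # Assumed structure: run every expert on all tokens so each expert's
            # quantization observers see calibration data, regardless of routing.
            for expert in getattr(self.original, "experts", []):
                expert(hidden_states)
        # The routed output still comes from the original module, so model
        # outputs are unchanged during calibration.
        return self.original(hidden_states, *args, **kwargs)

    def restore(self):
        # Optional because is_permanent=False: hand back the wrapped original.
        return self.original
```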
Code Review
This pull request refactors the GPT-OSS quantization to use an all-expert calibration scheme via the MoECalibrationModule interface, which is a solid improvement. The changes are well-structured and include a helpful example script. My feedback focuses on improving the new example script for better clarity and robustness, such as refining how command-line arguments are handled and simplifying the model saving logic. Additionally, I've identified an area of code duplication in the core modeling file that could be refactored to enhance maintainability.
Force-pushed from ff2c7ce to fd81a2a.
/gemini review
Code Review
This pull request refactors the GPT-OSS quantization to use the MoECalibrationModule interface, enabling all-expert calibration. This is a solid architectural improvement. The core logic changes in src/llmcompressor/modeling/gpt_oss.py are well-implemented and robust. I've reviewed the new example script gpt_oss_quantization_example.py and found a high-severity issue where the model isn't being saved, which would make the example fail for users. I've provided a suggestion to fix this. I also included a minor suggestion to improve code consistency in the example script. Overall, great work on the refactoring.
    tokenizer=use_tokenizer,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_samples,
    save_compressed=False,
The oneshot function is called with save_compressed=False, which prevents the quantized model from being saved to disk. However, subsequent print statements suggest that the model is saved and provide instructions on how to load it. This is contradictory and will cause the example to fail for users.
Please set save_compressed=True to ensure the quantized model is saved as intended by the example's instructions.
- save_compressed=False,
+ save_compressed=True,
The problem I've seen is that when using oneshot() with output_dir (instead of saving externally with model.save_pretrained()) and the default save_compressed=True, the compressed tensors are saved to disk, but the in-memory model representation may be modified in a way that's incompatible with direct .generate() calls. That is probably why enabling save_compressed=True throws this error:
"/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 313, in forward
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/lib/python3.10/site-packages/compressed_tensors/quantization/lifecycle/forward.py", line 387, in wrapped_forward
output = forward_func_orig.__get__(module, module.__class__)(
File "/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 134, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Char
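For reference, the pattern in llm-compressor's documented examples that avoids relying on output_dir is to calibrate in memory, sanity-check generation, and only then save the compressed checkpoint explicitly. The sketch below follows that pattern under assumptions: the model id, save directory, dataset, and GPTQ recipe are placeholders rather than the example script's actual values, and GPT-OSS may need additional ignore patterns.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "openai/gpt-oss-20b"   # assumed checkpoint, for illustration only
SAVE_DIR = "gpt-oss-20b-W4A16"    # hypothetical output directory

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Placeholder recipe; the example script builds GPTQ/AWQ/W4A8 recipes instead.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Calibrate/quantize in memory; no output_dir, so nothing is written yet.
oneshot(
    model=model,
    dataset="open_platypus",      # assumed small calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

# Sanity-check generation on the in-memory model before compressing to disk.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))

# Save the compressed checkpoint explicitly afterwards.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```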
Signed-off-by: Manish <[email protected]>
Force-pushed from 1d97568 to 891ea25.
Looks great! Would you mind adding some basic tests for your added source code to make sure that everything works as expected?
Signed-off-by: Manish <[email protected]>
SUMMARY:
- Refactored the GPT-OSS calibration scheme to use the MoECalibrationModule interface; this enables all-expert calibration (ensures all experts see the calibration data).
- Added test scripts to demonstrate use of the updated interface.
TEST PLAN:
Changes were tested by successfully running the following commands:
python gpt_oss_quantization_example.py --algorithm gptq
python gpt_oss_quantization_example.py --algorithm awq
python gpt_oss_quantization_example.py --algorithm w4a8

Below are the successful logs of the runs (all three quantizations):
Fixes #2159
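As a side note, here is a minimal, hypothetical sketch of how an --algorithm flag like the one exercised in the test plan above could dispatch to a per-algorithm recipe; only the flag values come from the commands above, everything else is illustrative.

```python
import argparse

# Hypothetical dispatch table; the real example script likely builds full
# llm-compressor recipes for each algorithm rather than plain strings.
RECIPES = {
    "gptq": "GPTQ W4A16 recipe",
    "awq": "AWQ W4A16 recipe",
    "w4a8": "W4A8 recipe",
}

parser = argparse.ArgumentParser(description="GPT-OSS quantization example (sketch)")
parser.add_argument("--algorithm", choices=sorted(RECIPES), default="gptq")
args = parser.parse_args()
print(f"Quantizing with: {args.algorithm} -> {RECIPES[args.algorithm]}")
```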