Enhance Autoround to support multiple cards tuning #2157
base: main
Conversation
Signed-off-by: yiliu30 <[email protected]>
gemini-code-assist
left a comment
Code Review
This pull request enhances AutoRoundModifier to support multi-GPU tuning by integrating auto_round's device_map functionality. This is primarily achieved by adding a device_map parameter to the modifier and introducing a new context manager, suspend_accelerate_hooks, to correctly handle models with Hugging Face Accelerate hooks. The changes are well-supported by a new example for a large model and a new test case for multi-GPU execution. The implementation is solid, but I've identified a potential edge case in the new suspend_accelerate_hooks function that could lead to a crash if a model has no parameters, for which I've provided a suggestion.
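The edge case called out above usually comes from inferring an execution device via next(model.parameters()), which raises StopIteration on a parameter-free module. Below is a minimal illustration of the failure mode and a guarded alternative; whether the PR infers the device this way is an assumption, the review only hints at it:

```python
import torch
import torch.nn as nn

model_without_params = nn.Identity()  # a module that owns no parameters

# next() on an empty iterator raises StopIteration:
# next(model_without_params.parameters())  # would crash here

# A guarded lookup with a default avoids the crash and falls back to CPU:
first_param = next(model_without_params.parameters(), None)
execution_device = first_param.device if first_param is not None else torch.device("cpu")
print(execution_device)  # cpu
```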
brian-dellabetta
left a comment
Hi AutoRound team, I think these changes make sense, though we are refactoring some things that overlap with these changes. Please see comments.
Can you point me to the logic in the auto round repo that handles the multi-gpu parallelization work? I'd like to see how you're handling it
examples/autoround/qwen3_example.py
Outdated
```python
    ],
    iters=ITERS,
    enable_torch_compile=False,
    device_map="0,1,2,3",  # Use 4 A100 GPUs
```
I also think the name here is confusing. Isn't device_map usually a different format, like a dict that maps layer names to device ids? If device_map="0,1,2,3" is valid in transformers, we can leave it as is; otherwise device_ids may be a better name.
Yes, the device_map is used in Transformers, and we follow a similar approach. Please refer to:
https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling#accelerate
I've only ever seen device_map be a string like "auto" or "sequential", or a dictionary mapping each module name to a device id, like

```python
device_map = {"block1": 0, "block2.linear1": 0, "block2.linear2": 1, "block2.linear3": 1}
```

What does it mean if device_map="0,1,2,3"? Is that like auto but only with the first 4 devices?
Reference: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference#designing-a-device-map
Oh, you're right. Updated to device_ids!
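For reference, a minimal illustration of the two formats discussed above; the module names and the split across devices are hypothetical:

```python
# Explicit device_map as documented by transformers/accelerate: a dict that
# maps each module (or submodule) name to a device index, "cpu", or "disk".
device_map = {
    "block1": 0,
    "block2.linear1": 0,
    "block2.linear2": 1,
    "block2.linear3": 1,
}

# The option added in this PR only selects which GPUs are available for
# tuning, so a plain list of device ids is less ambiguous than a
# comma-separated string such as "0,1,2,3".
device_ids = [0, 1, 2, 3]
```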
```python
@contextmanager
def suspend_accelerate_hooks(model: nn.Module):
```
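For context, here is a minimal sketch of what a context manager like this could look like using accelerate's public hook helpers; it is an illustration under that assumption, not the PR's actual implementation:

```python
from contextlib import contextmanager

import torch.nn as nn
from accelerate.hooks import add_hook_to_module, remove_hook_from_module


@contextmanager
def suspend_accelerate_hooks_sketch(model: nn.Module):
    """Temporarily detach accelerate hooks so the model can be moved freely."""
    saved_hooks = {}
    # Record and strip any accelerate hook attached to each submodule.
    for name, module in model.named_modules():
        hook = getattr(module, "_hf_hook", None)
        if hook is not None:
            saved_hooks[name] = hook
            remove_hook_from_module(module)
    try:
        yield
    finally:
        # Re-attach the original hooks on exit, even if an error occurred.
        modules = dict(model.named_modules())
        for name, hook in saved_hooks.items():
            add_hook_to_module(modules[name], hook)
```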
just fyi, we are refactoring our usage of accelerate hooks for offloading. You can follow some of that in
Thanks! I noticed that change as well. We can adapt to it once it’s ready.
And could I ask what motivated you to implement that functionality yourself instead of using the accelerator hooks? I imagine it requires quite a bit of engineering effort.
I don't think we have anything posted on the decision to move away from accelerate hooks, outside of it being a pain point and our usage of it being limited in scope. cc @kylesayrs in case there is any other information we can provide.
```python
    iters=self.iters,
    enable_torch_compile=self.enable_torch_compile,
    batch_size=self.batch_size,
    device_map=self.device_map,
```
just fyi, we are looking into better parallelization support and will create an RFC in the new year to gather feedback on best approaches. See PR and comment here
Thanks! I’ve looked through some of the multi‑card discussions in LLMC, and they’re quite insightful.
In AutoRound, we currently use the accelerate hooks because they're general enough to work across most models without requiring explicit cross-card communication ops or modeling changes. The downside, of course, is some communication overhead and limited overlap, which can affect performance.
We’re also exploring more efficient ways to fully squeeze out GPU performance. Looking forward to the RFC from you all, hope it covers the tuning case if possible!
we will be sure to share it out in the new year.
Can you elaborate on what you mean by the tuning case? Is this specific to the tuning stage mentioned in the SignRoundv2 paper?
Thanks! The tuning here refers to fine-tuning the quantization parameters by minimizing the block-wise reconstruction error. In this process, we compute the loss between the outputs of the original floating-point block and its Q-DQ counterpart, then run a backward pass to compute gradients and update the quantization parameters accordingly. This approach was introduced in SignRound v1. cc @wenhuach
For implementation details, please refer to the code here. https://github.com/intel/auto-round/blob/440288fd6b92509e84da337437a30997ac544735/auto_round/compressors/base.py#L2984
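To make that loop concrete, here is a heavily simplified sketch of the idea; the optimizer, the tensor-returning blocks, and the helper names are illustrative stand-ins rather than AutoRound's actual code (SignRound uses its own signed-gradient update):

```python
import torch
import torch.nn.functional as F

def tune_block(fp_block, qdq_block, cached_inputs, quant_params, iters=200, lr=5e-3):
    """Fine-tune quantization parameters of one block via reconstruction error.

    fp_block / qdq_block: the floating-point block and its quantize-dequantize
    counterpart (assumed to return plain tensors); quant_params: the trainable
    quantization parameters (e.g. rounding offsets) attached to qdq_block.
    """
    optimizer = torch.optim.Adam(quant_params, lr=lr)
    for _ in range(iters):
        for x in cached_inputs:
            with torch.no_grad():
                target = fp_block(x)   # reference output of the FP block
            output = qdq_block(x)      # output with Q-DQ weights applied
            loss = F.mse_loss(output, target)
            optimizer.zero_grad()
            loss.backward()            # gradients flow into quant_params
            optimizer.step()
    return qdq_block
```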
Hi @brian-dellabetta, here is the logic for multi-GPU devices: https://github.com/intel/auto-round/blob/b53ead7d77746385d700152c7f00960f18fb9d85/auto_round/compressors/base.py#L1560-L1562. We take a block, its input, and the list of available devices, then assign each submodule to one of those devices. Inside set_auto_device_map_for_block_with_tuning, we estimate the block's memory requirements based on its parameters, input, batch size, and a few heuristic factors. Using this estimate, we assign devices to the submodules so that memory usage stays as balanced as possible across all GPUs. The final mapping is then attached to each module.
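A rough sketch of that balancing idea as described above; the function name, the overhead factor, and the greedy strategy are illustrative assumptions, not auto_round's actual heuristics:

```python
import torch.nn as nn

def assign_devices_for_block(block: nn.Module, device_ids, overhead_factor=3.0):
    """Greedily place each direct submodule on the currently least-loaded device.

    The per-module cost is a crude estimate of parameter memory scaled by an
    overhead factor standing in for activations and gradients; a real heuristic
    would also account for the input size and batch size.
    """
    load = {d: 0.0 for d in device_ids}
    mapping = {}
    for name, module in block.named_children():
        cost = sum(p.numel() * p.element_size() for p in module.parameters())
        cost *= overhead_factor
        device = min(load, key=load.get)  # pick the least-loaded device so far
        mapping[name] = device
        load[device] += cost
    return mapping
```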
AutoRound uses a block-level reconstruction loss to fine-tune quantization parameters, which requires running backward passes on each block. For large models, like Qwen3-235B, a single GPU often doesn't have enough memory to hold an entire block during the backward computation. To address this, we use the Hugging Face accelerate library to dispatch the module across multiple devices.
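As an illustration of that dispatch step, here is a minimal sketch using accelerate's public helpers; the memory limits are placeholders, and a model of this size would in practice also need sharded or low-memory loading rather than a plain from_pretrained call:

```python
from accelerate import dispatch_model, infer_auto_device_map
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-235B-A22B", torch_dtype="auto")

# Derive a per-module placement that fits the available GPUs, then attach
# accelerate hooks that move inputs/outputs between devices at runtime.
device_map = infer_auto_device_map(
    model, max_memory={0: "70GiB", 1: "70GiB", 2: "70GiB", 3: "70GiB"}
)
model = dispatch_model(model, device_map=device_map)
```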
In this PR, we enable this feature on the LLMC side:
- `device_ids` for tuning with multiple cards
- `ignore` to AutoRound for skipping layers
- `Qwen/Qwen3-235B-A22B` as example for multiple cards

Test plan
Example results
cc @hshen14 @thuang6 @wenhuach21