
Conversation

@yiliu30
Contributor

@yiliu30 yiliu30 commented Dec 19, 2025

AutoRound uses a block-level reconstruction loss to fine-tune quantization parameters, which requires running backward passes on each block. For large models such as Qwen3-235B, a single GPU often doesn't have enough memory to hold an entire block during the backward computation. To address this, we use HF Accelerate to dispatch the block's submodules across multiple devices.
In this PR, we enable this feature on the LLMC side:

  • Add device_ids for tuning with multiple cards (see the usage sketch after this list)
  • Map ignore to AutoRound's skipped (unquantized) layers
  • Add Qwen/Qwen3-235B-A22B as an example for multiple cards
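
A minimal usage sketch of the multi-card flow is shown below. The argument names follow this PR's diff excerpts (iters, enable_torch_compile, device_ids) plus common llm-compressor modifier fields (targets, scheme, ignore); the AutoRoundModifier import path and the calibration settings are illustrative assumptions, not the exact example script added in this PR.

```python
# Hedged sketch, not the shipped qwen3_example.py: the import path, dataset, and
# calibration settings here are assumptions for illustration.
from llmcompressor import oneshot
from llmcompressor.modifiers.autoround import AutoRoundModifier  # assumed path

recipe = AutoRoundModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],            # forwarded to AutoRound as layers to keep unquantized
    iters=200,
    enable_torch_compile=False,
    device_ids="0,1,2,3",          # dispatch each block across these GPUs while tuning
)

oneshot(
    model="Qwen/Qwen3-235B-A22B",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)
```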

Test plan

pytest -svv ./llmcompressor/transformers/autoround/test_autoround_oneshot.py -k test_oneshot_with_device_map

Example results

# vllm (pretrained=INC4AI/Qwen3-235B-A22B-W4A16-G128-AutoRound-ITERS1-LLMC-TEST-ONLY,tensor_parallel_size=2,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.928|±  |0.0082|
# |     |       |strict-match    |     5|exact_match|↑  |0.930|±  |0.0081|
  

# vllm (pretrained=INC4AI/Qwen3-235B-A22B-W4A16-G128-AutoRound-ITERS200-LLMC-TEST-ONLY,tensor_parallel_size=2,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.934|±  |0.0079|
# |     |       |strict-match    |     5|exact_match|↑  |0.915|±  |0.0088|

cc @hshen14 @thuang6 @wenhuach21

Signed-off-by: yiliu30 <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AutoRoundModifier by introducing robust support for multi-GPU tuning, allowing the quantization process to efficiently distribute model layers across multiple devices. This is achieved through the addition of a device_map parameter and a mechanism to temporarily suspend Accelerate's device management hooks, ensuring compatibility and optimized resource utilization during quantization. The changes are validated with a new test case and demonstrated with a practical example for the Qwen3 model.

Highlights

  • Multi-GPU Support for AutoRound: The AutoRoundModifier now includes a device_map parameter, enabling the AutoRound quantization process to distribute model layers and leverage multiple GPUs for more efficient tuning.
  • Accelerate Hook Management: A new suspend_accelerate_hooks context manager has been introduced. This temporarily detaches Accelerate's device offloading hooks during AutoRound's tuning phase, preventing conflicts and ensuring proper device management when using multiple GPUs.
  • Improved Unquantized Layer Handling: A get_unquantized_layer_names method was added, and the fp_layers parameter is now passed to the AutoRound constructor. This provides more precise control over which specific layers are excluded from the quantization process (a rough sketch of this mapping follows the list).
  • Qwen3 Example Added: A new example script (qwen3_example.py) has been added, demonstrating how to apply AutoRound quantization to the Qwen3-235B model using multiple A100 GPUs, showcasing the new multi-card tuning capability.
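
For the unquantized-layer handling above, a rough sketch of how ignore patterns might be turned into skipped-layer names is shown below. This is a hypothetical illustration: both the PR's get_unquantized_layer_names and the exact format auto_round expects for fp_layers may differ.

```python
# Hypothetical illustration only; the matching rules and the value handed to
# auto_round's fp_layers are assumptions, not the PR's actual implementation.
import fnmatch

import torch.nn as nn


def get_unquantized_layer_names_sketch(model: nn.Module, ignore: list[str]) -> list[str]:
    """Collect Linear layer names matching any glob-style ignore pattern (e.g. 'lm_head')."""
    return [
        name
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
        and any(fnmatch.fnmatch(name, pattern) for pattern in ignore)
    ]
```

The resulting names would then be passed through to the AutoRound constructor so those layers stay in floating point.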


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances AutoRoundModifier to support multi-GPU tuning by integrating auto_round's device_map functionality. This is primarily achieved by adding a device_map parameter to the modifier and introducing a new context manager, suspend_accelerate_hooks, to correctly handle models with Hugging Face Accelerate hooks. The changes are well-supported by a new example for a large model and a new test case for multi-GPU execution. The implementation is solid, but I've identified a potential edge case in the new suspend_accelerate_hooks function that could lead to a crash if a model has no parameters, for which I've provided a suggestion.

@yiliu30 yiliu30 marked this pull request as ready for review December 19, 2025 06:20
@dsikka dsikka added the autoround For any PR / issue related to autoround support label Dec 19, 2025
Collaborator

@brian-dellabetta brian-dellabetta left a comment


Hi AutoRound team, I think these changes make sense, though we are refactoring some things that overlap with these changes. Please see comments.

Can you point me to the logic in the auto round repo that handles the multi-gpu parallelization work? I'd like to see how you're handling it

],
iters=ITERS,
enable_torch_compile=False,
device_map="0,1,2,3", # Use 4 A100 GPUs
Collaborator


I also think the name here is confusing, isn't device_map usually a different format like a dict that maps layer name to device id? If device_map="0,1,2,3" is valid in transformers, we can leave as is, otherwise device_ids may be a better name

Contributor Author

@yiliu30 yiliu30 Dec 20, 2025


Yes, the device_map is used in Transformers, and we follow a similar approach. Please refer to:
https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling#accelerate

Collaborator


I've only ever seen device_map be a string like "auto" or "sequential", or a dictionary mapping each module name to each device_id, like

device_map = {"block1": 0, "block2.linear1": 0, "block2.linear2": 1, "block2.linear3": 1}

What does it mean if device_map="0,1,2,3"? Is that like auto but only with the first 4 devices?

Reference: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference#designing-a-device-map

Contributor Author


Oh, you're right. Updated to device_ids!



@contextmanager
def suspend_accelerate_hooks(model: nn.Module):
Collaborator


just fyi, we are refactoring our usage of accelerate hooks for offloading. You can follow some of that in

Contributor Author


Thanks! I noticed that change as well. We can adapt to it once it’s ready.
And could I ask what motivated you to implement that functionality yourself instead of using the Accelerate hooks? I imagine it requires quite a bit of engineering effort.

Collaborator


I don't think we have anything posted on the decision to move away from accelerate hooks, outside of it being a pain point and our usage of it being limited in scope. cc @kylesayrs in case there is any other information we can provide.
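
For reference, a minimal sketch of what a hook-suspending context manager can look like, using only accelerate's public add_hook_to_module / remove_hook_from_module helpers; this is an illustration and not the code added in this PR.

```python
# Illustrative sketch, not the PR's implementation: detach accelerate's hooks from
# every submodule, run the tuning code, then re-attach the saved hooks on exit.
from contextlib import contextmanager

import torch.nn as nn
from accelerate.hooks import add_hook_to_module, remove_hook_from_module


@contextmanager
def suspend_accelerate_hooks_sketch(model: nn.Module):
    saved = {}
    for name, module in model.named_modules():
        hook = getattr(module, "_hf_hook", None)
        if hook is not None:
            saved[name] = hook                 # remember the hook so it can be restored
            remove_hook_from_module(module)
    try:
        yield model
    finally:
        modules = dict(model.named_modules())
        for name, hook in saved.items():
            add_hook_to_module(modules[name], hook)
```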

iters=self.iters,
enable_torch_compile=self.enable_torch_compile,
batch_size=self.batch_size,
device_map=self.device_map,
Collaborator


just fyi, we are looking into better parallelization support and will create an RFC in the new year to gather feedback on best approaches. See PR and comment here

Contributor Author

@yiliu30 yiliu30 Dec 20, 2025


Thanks! I’ve looked through some of the multi‑card discussions in LLMC, and they’re quite insightful.
In AutoRound, we currently use the Accelerate hooks because they're general enough to work across most models without requiring explicit cross-card communication ops or modeling changes. The downside, of course, is some communication overhead and limited overlap, which can affect performance.

We’re also exploring more efficient ways to fully squeeze out GPU performance. Looking forward to the RFC from you all, hope it covers the tuning case if possible!

Collaborator

@brian-dellabetta brian-dellabetta Dec 23, 2025


we will be sure to share it out in the new year.

Can you elaborate on what you mean by the tuning case? Is this specific to the tuning stage mentioned in the SignRoundv2 paper?

Contributor Author

@yiliu30 yiliu30 Dec 24, 2025


Thanks! The tuning here refers to fine‑tuning the quantization parameters by evaluating the block‑wise reconstruction error. In this process, we compute the loss between the original floating‑point model and the Q‑DQ model, and then run a backward pass to update the gradients of the quantization parameters accordingly. This approach was introduced in SignRound v1. cc @wenhuach


For implementation details, please refer to the code here: https://github.com/intel/auto-round/blob/440288fd6b92509e84da337437a30997ac544735/auto_round/compressors/base.py#L2984
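
As a conceptual illustration of that loop (not the auto_round code linked above; the optimizer, loss, and trainable parameters are simplified assumptions):

```python
# Conceptual sketch of block-wise reconstruction tuning; auto_round's real loop
# (linked above) uses its own optimizer, scheduling, and parameterization.
import torch
import torch.nn.functional as F


def tune_block(qdq_block, block_input, fp_output, quant_params, iters=200, lr=5e-3):
    optimizer = torch.optim.AdamW(quant_params, lr=lr)
    for _ in range(iters):
        qdq_output = qdq_block(block_input)        # forward through the quant-dequant block
        loss = F.mse_loss(qdq_output, fp_output)   # reconstruction error vs. the FP block
        loss.backward()                            # gradients flow to the quantization params
        optimizer.step()
        optimizer.zero_grad()
    return qdq_block
```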

@yiliu30
Contributor Author

yiliu30 commented Dec 20, 2025

Hi AutoRound team, I think these changes make sense, though we are refactoring some things that overlap with these changes. Please see comments.

Can you point me to the logic in the auto round repo that handles the multi-gpu parallelization work? I'd like to see how you're handling it

Hi @brian-dellabetta, here is the logic for the multi-GPU dispatch: https://github.com/intel/auto-round/blob/b53ead7d77746385d700152c7f00960f18fb9d85/auto_round/compressors/base.py#L1560-L1562

We take a block, its input, and the list of available devices, then assign each submodule to one of those devices. Accelerate's AlignDevicesHook is later used to dispatch the submodules accordingly.

Inside set_auto_device_map_for_block_with_tuning, we estimate the block's memory requirements based on its parameters, input, batch size, and a few heuristic factors. Using this estimate, we assign devices to the submodules so that memory usage stays as balanced as possible across all GPUs. The final mapping is then attached to each module as its tuning_device.
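
A stripped-down sketch of that balancing idea is below; the real set_auto_device_map_for_block_with_tuning (linked above) also folds in activation size, batch size, and other heuristics omitted here.

```python
# Simplified illustration only: assign each parameterized submodule of a block to the
# currently least-loaded GPU, using parameter bytes as a stand-in for the fuller
# memory estimate used by auto_round.
import torch.nn as nn


def assign_devices_for_block(block: nn.Module, device_ids: list[int]) -> dict[str, int]:
    load = {d: 0 for d in device_ids}              # bytes assigned to each device so far
    mapping = {}
    for name, module in block.named_modules():
        size = sum(p.numel() * p.element_size() for p in module.parameters(recurse=False))
        if size == 0:
            continue
        device = min(load, key=load.get)           # pick the least-loaded device
        load[device] += size
        mapping[name] = device
        module.tuning_device = device              # later consumed when attaching hooks
    return mapping
```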
