
Conversation

@gante
Contributor

@gante gante commented Apr 23, 2025

What does this PR do?

torch.compile combined with model CPU offload is resulting in crashes. It should work in theory, but it's not working at the moment.

from transformers import AutoModelForCausalLM, AutoTokenizer

device_map = {"model.embed_tokens": 0, "model.layers.0": 0, "model.layers.1": "cpu", "model.norm": "cpu", "lm_head": 0}
model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-MistralForCausalLM", device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-MistralForCausalLM")
tokenized_inputs = tokenizer(["Hello world"], return_tensors="pt")
input_ids = tokenized_inputs.input_ids.to(0)

# Uses a compilable cache -> compilation happens under the hood
output = model.generate(input_ids, max_new_tokens=20, cache_implementation="static")

This PR:

  1. Moves the logic to trigger "auto compile" into its own function
  2. Disables "auto compile" when there is CPU offload (and disk offload too, which is not expected to support torch.compile)
  3. Adds a test to prevent regressions

@github-actions github-actions bot marked this pull request as draft April 23, 2025 14:03
@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante gante marked this pull request as ready for review April 23, 2025 14:19
@gante gante requested a review from zucchini-nlp April 23, 2025 14:30
@gante gante requested a review from SunMarc April 23, 2025 15:38
Member

@SunMarc SunMarc left a comment


Thanks!

Comment on lines 2115 to 2124
# Exception 1: Some quantization methods do not support compilation
if getattr(self, "hf_quantizer", None) is not None:
    can_compile &= self.hf_quantizer.is_compileable

# Exception 2: Never compile if the model is using CPU offload (as of April 2025, this results in a crash)
if hasattr(self, "hf_device_map"):
    all_model_devices = set(self.hf_device_map.values())
    has_cpu_offload = "cpu" in all_model_devices and len(all_model_devices) > 1
    can_compile &= not has_cpu_offload

Member


Thanks for adding these cases. Maybe we should also check for disk offload?

Contributor Author


Good point! Adding too
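The extension could amount to widening the cpu-only condition into a check against both offload targets, for example via a set intersection (an assumed sketch, not the committed code; variable names follow the diff above):

```python
# Variable names follow the diff above; the disk branch is an assumed sketch.
all_model_devices = {0, "cpu"}  # example: layers split between GPU 0 and CPU

# Before: cpu only
has_cpu_offload = "cpu" in all_model_devices and len(all_model_devices) > 1

# After: cpu or disk, via set intersection
has_offload = bool(all_model_devices & {"cpu", "disk"}) and len(all_model_devices) > 1

print(has_cpu_offload, has_offload)  # True True
```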

Member

@zucchini-nlp zucchini-nlp left a comment


Nice, thanks for fixing!

One question: do we expose "auto-compile" in generate to users through the config, or is it still entirely under the hood? We might raise a small warning in case users forced "auto-compile" but the model doesn't meet all the criteria.

@gante
Contributor Author

gante commented Apr 24, 2025

One question: do we expose "auto-compile" in generate to users through the config, or is it still entirely under the hood?

There is generation_config.compile_config. Yes, agreed, we should throw a warning if it is set but we don't meet the conditions for compilation to happen -- adding a commit with it 👍
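A hedged sketch of what that warning could look like (`compile_config` comes from the discussion above; the function name, the `can_compile` flag, and the dummy config class are hypothetical illustrations, not the committed code):

```python
import logging

logger = logging.getLogger(__name__)

def warn_if_compile_skipped(generation_config, can_compile: bool) -> bool:
    # Hypothetical check: if the user explicitly set `compile_config` but the
    # model fails the compilation criteria (e.g. CPU/disk offload), warn
    # instead of silently skipping torch.compile. Returns True iff it warned.
    if getattr(generation_config, "compile_config", None) is not None and not can_compile:
        logger.warning(
            "`compile_config` is set, but compilation will be skipped because "
            "the model does not meet the criteria (e.g. it uses CPU or disk offload)."
        )
        return True
    return False


class DummyGenerationConfig:  # stand-in for transformers.GenerationConfig
    compile_config = object()

print(warn_if_compile_skipped(DummyGenerationConfig(), can_compile=False))  # True
```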

@gante gante merged commit 8bdd4f2 into huggingface:main Apr 24, 2025
20 checks passed
@gante gante deleted the do_not_compile_cpu_offload branch April 24, 2025 13:08
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* skip compilation on cpu offload

* add test

* better logic

* docstring

* boolean logic

* add disk offload check

* warn users if compilation options are set but compilation doesn't happen

* fix test

---------

Co-authored-by: Marc Sun <[email protected]>