
Conversation

@gante
Contributor

@gante gante commented Apr 23, 2025

What does this PR do?

torch.compile combined with model CPU offload is resulting in crashes. It should work in theory, but it's not working at the moment.

from transformers import AutoModelForCausalLM, AutoTokenizer

device_map = {"model.embed_tokens": 0, "model.layers.0": 0, "model.layers.1": "cpu", "model.norm": "cpu", "lm_head": 0}
model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-MistralForCausalLM", device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-MistralForCausalLM")
tokenized_inputs = tokenizer(["Hello world"], return_tensors="pt")
input_ids = tokenized_inputs.input_ids.to(0)

# Uses a compilable cache -> compilation happens under the hood
output = model.generate(input_ids, max_new_tokens=20, cache_implementation="static")

This PR:

  1. Moves the logic to trigger "auto compile" into its own function
  2. Disables "auto compile" when there is CPU offload (and disk offload too, which is not expected to support torch.compile)
  3. Adds a test to prevent regressions

@github-actions github-actions bot marked this pull request as draft April 23, 2025 14:03
@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante gante marked this pull request as ready for review April 23, 2025 14:19
@gante gante requested a review from zucchini-nlp April 23, 2025 14:30
@gante gante requested a review from SunMarc April 23, 2025 15:38
Member

@SunMarc SunMarc left a comment


Thanks!

Comment on lines 2115 to 2124
# Exception 1: Some quantization methods do not support compilation
if getattr(self, "hf_quantizer", None) is not None:
    can_compile &= self.hf_quantizer.is_compileable

# Exception 2: Never compile if the model is using CPU offload (as of April 2025, this results in a crash)
if hasattr(self, "hf_device_map"):
    all_model_devices = set(self.hf_device_map.values())
    has_cpu_offload = "cpu" in all_model_devices and len(all_model_devices) > 1
    can_compile &= not has_cpu_offload

Member


Thanks for adding these cases. Maybe we should also check for disk offload?

Contributor Author


Good point! Adding too
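The extension could amount to widening the cpu-only condition into a check against both offload targets, for example via a set intersection (an assumed sketch, not the committed code; variable names follow the diff above):

```python
# Variable names follow the diff above; the disk branch is an assumed sketch.
all_model_devices = {0, "cpu"}  # example: layers split between GPU 0 and CPU

# Before: cpu only
has_cpu_offload = "cpu" in all_model_devices and len(all_model_devices) > 1

# After: cpu or disk, via set intersection
has_offload = bool(all_model_devices & {"cpu", "disk"}) and len(all_model_devices) > 1

print(has_cpu_offload, has_offload)  # True True
```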

Member

@zucchini-nlp zucchini-nlp left a comment


Nice, thanks for fixing!

One question: do we expose "auto-compile" in generate to users through the config, or is it still entirely under the hood? We might raise a small warning in case users forced "auto-compile" but the model doesn't meet all the criteria.

@gante
Contributor Author

gante commented Apr 24, 2025

One question: do we expose "auto-compile" in generate to users through the config, or is it still entirely under the hood?

There is generation_config.compile_config. Yes, agreed, we should throw a warning if it is set but we don't meet the conditions for compilation to happen -- adding a commit with it 👍
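A hedged sketch of what that warning could look like (`compile_config` comes from the discussion above; the function name, the `can_compile` flag, and the dummy config class are hypothetical illustrations, not the committed code):

```python
import logging

logger = logging.getLogger(__name__)

def warn_if_compile_skipped(generation_config, can_compile: bool) -> bool:
    # Hypothetical check: if the user explicitly set `compile_config` but the
    # model fails the compilation criteria (e.g. CPU/disk offload), warn
    # instead of silently skipping torch.compile. Returns True iff it warned.
    if getattr(generation_config, "compile_config", None) is not None and not can_compile:
        logger.warning(
            "`compile_config` is set, but compilation will be skipped because "
            "the model does not meet the criteria (e.g. it uses CPU or disk offload)."
        )
        return True
    return False


class DummyGenerationConfig:  # stand-in for transformers.GenerationConfig
    compile_config = object()

print(warn_if_compile_skipped(DummyGenerationConfig(), can_compile=False))  # True
```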

@gante gante merged commit 8bdd4f2 into huggingface:main Apr 24, 2025
20 checks passed
@gante gante deleted the do_not_compile_cpu_offload branch April 24, 2025 13:08
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* skip compilation on cpu offload

* add test

* better logic

* docstring

* boolean logic

* add disk offload check

* warn users if compilation options are set but compilation doesn't happen

* fix test

---------

Co-authored-by: Marc Sun <[email protected]>