
Conversation

@yao-matrix
Contributor

@yao-matrix yao-matrix commented Mar 12, 2025

XPU is aligning its PyTorch feature set with CUDA. Since PyTorch 2.6, a device-agnostic torch.Stream is supported, and XPU supports this API. So, I enabled OffloadedCache on XPU.

Why start from 2.7? OffloadedCache needs StreamContext, but the PR adding the `__enter__` attribute to StreamContext did not land in 2.6; it will land in 2.7.

Tested with the PyTorch 2.7 dev package (`pip install --pre torch==2.7.0.dev20250306 --index-url https://download.pytorch.org/whl/nightly/xpu`).
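For context, here is a minimal sketch of the device-agnostic stream pattern this PR relies on. The helper name `make_prefetch_stream` is hypothetical (this is not code from the PR or from transformers); the point is that torch.Stream(device=...) works across backends, so cache-offloading code no longer needs per-backend torch.cuda.Stream / torch.xpu.Stream branches:

```python
# Hedged sketch, not the transformers implementation: illustrates the
# device-agnostic torch.Stream API available since PyTorch 2.6.
try:
    import torch
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False

def make_prefetch_stream(device: str):
    """Return a side stream for async cache transfers, or None if unavailable."""
    if not HAS_TORCH or not hasattr(torch, "Stream"):
        # Older PyTorch without the device-agnostic Stream API.
        return None
    try:
        return torch.Stream(device=device)
    except Exception:
        # Unknown or uninitialized backend (e.g. "xpu" on a CUDA-only box).
        return None

# On an XPU machine the stream would then be used roughly like:
#   with torch.xpu.stream(stream):          # needs the 2.7 StreamContext fix
#       key_cache.copy_(host_key, non_blocking=True)
```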

@github-actions github-actions bot marked this pull request as draft March 12, 2025 04:11
@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@yao-matrix yao-matrix marked this pull request as ready for review March 12, 2025 04:31
Signed-off-by: root <[email protected]>
@yao-matrix
Contributor Author

The failed CI cases seem unrelated to my changes.

@ydshieh
Collaborator

ydshieh commented Mar 13, 2025

Hi @yao-matrix, thank you for making this supported.

Hi @n17s, would you like to take a first look? cc @gante

Member

@SunMarc SunMarc left a comment


Looks fine to me overall !

@SunMarc SunMarc requested a review from gante March 14, 2025 13:25
Contributor

@n17s n17s left a comment


Looks good to me

Contributor

@gante gante left a comment


LGTM, thank you for adding support! 🤗

Added a minor nit with a more recent import guard practice, happy to merge when it's sorted

Signed-off-by: root <[email protected]>
Member

@SunMarc SunMarc left a comment


Thanks ! LGTM !

@SunMarc SunMarc requested a review from gante March 18, 2025 17:35
@SunMarc SunMarc merged commit b11050d into huggingface:main Mar 19, 2025
21 checks passed
@loadams
Contributor

loadams commented Mar 19, 2025

Hi @yao-matrix and @SunMarc - it looks like running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3) I see the following error:

ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)

Perhaps the guards are on the wrong version of pytorch?

@yao-matrix
Contributor Author

> Hi @yao-matrix and @SunMarc - it looks like running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3) I see the following error:
>
> ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)
>
> Perhaps the guards are on the wrong version of pytorch?

It's weird: my check is against 2.7, meaning if version >= 2.7 it takes the new API path, otherwise the old one. But I can see that in your PR you changed PyTorch from 2.5 to 2.6, and both of those versions take the old path.

@loadams
Contributor

loadams commented Mar 19, 2025

> > Hi @yao-matrix and @SunMarc - it looks like running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3) I see the following error:
> >
> > ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)
> >
> > Perhaps the guards are on the wrong version of pytorch?
>
> it's weird, my version checks on 2.7, which means if version >= 2.7, goes the new API; else the old. But I can see in your PR you changed the pytorch from 2.5 to 2.6, both versions go the old path.

@yao-matrix - yes, that is quite odd, but I was able to bisect the failure to this PR, so perhaps I'm hitting it from another code path that this PR enables? It does seem to be resolved by updating the torch version, though.

github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this pull request Mar 19, 2025
Changes from huggingface/transformers#36654 in
transformers cause issues with the torch 2.5 version we were using. This
just updated us to use a newer version.

---------

Signed-off-by: Logan Adams <[email protected]>
@yao-matrix yao-matrix deleted the offloadedcache branch March 19, 2025 23:49
@gante
Contributor

gante commented Mar 20, 2025

@yao-matrix I'm going to revert part of the changes in `is_torch_greater_or_equal`, as it is breaking other parts of the library. In a nutshell, we can't confirm that all dev versions for 2.X contain the features that will be released in 2.X, which is the error @loadams is seeing (2.5.0a0+b465a5843b.nv24.9 is a dev version of 2.5.0).

@yao-matrix to enable your use case, I'm going to add an `accept_dev` flag to `is_torch_greater_or_equal`.
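To see why the dev-version distinction matters, here is a simplified stand-in for the kind of check involved. The function names `parse` and `is_greater_or_equal` are hypothetical illustrations, not the transformers implementation; under PEP 440 ordering, a pre-release like `2.5.0a0` sorts *below* the final `2.5.0`, so a plain ">= 2.5" check rejects it unless an accept_dev-style flag ignores the pre-release suffix:

```python
# Hypothetical sketch of version gating with an accept_dev flag.
import re

def parse(version: str):
    # Drop the local segment ("+b465a5843b.nv24.9"), then split the base
    # release number from any pre-release tag ("a0", ".dev20250306", ...).
    base = version.split("+")[0]
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(.*)", base)
    release = (int(m.group(1)), int(m.group(2)), int(m.group(3) or 0))
    return release, m.group(4)

def is_greater_or_equal(version: str, target: str, accept_dev: bool = False):
    rel, pre = parse(version)
    trel, _ = parse(target)
    if rel != trel:
        return rel > trel
    # Same release number: a pre-release sorts below the final release
    # unless accept_dev treats it as good enough.
    return accept_dev or not pre

print(is_greater_or_equal("2.5.0a0+b465a5843b.nv24.9", "2.5"))  # False: dev build
print(is_greater_or_equal("2.5.0a0+b465a5843b.nv24.9", "2.5", accept_dev=True))  # True
```

This is why the nightly `2.7.0.dev20250306` build used for testing fails a strict ">= 2.7" check but passes once `accept_dev` opts in.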

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Mar 20, 2025
loadams added a commit to deepspeedai/DeepSpeed that referenced this pull request Mar 25, 2025
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* fix "Cannot copy out of meta tensor; no data!" issue for BartForConditionalGeneration model

* follow Marc's suggestion to use _tie_weights to fix

Signed-off-by: Yao, Matrix <[email protected]>

* enable OffloadedCache on XPU since PyTorch 2.7

Signed-off-by: Yao, Matrix <[email protected]>

* fix style

Signed-off-by: Yao, Matrix <[email protected]>

* don't change bart

Signed-off-by: root <[email protected]>

* make code more concise per review comments

Signed-off-by: N <[email protected]>

* fix review comments

Signed-off-by: root <[email protected]>

* Revert "fix review comments"

This reverts commit acf1484.

* fix review comments

Signed-off-by: root <[email protected]>

* fix style

Signed-off-by: root <[email protected]>

---------

Signed-off-by: Yao, Matrix <[email protected]>
Signed-off-by: root <[email protected]>
Signed-off-by: N <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
ys950902 pushed a commit to ys950902/DeepSpeed that referenced this pull request May 21, 2025