
Conversation

@yao-matrix
Contributor

@yao-matrix yao-matrix commented Mar 12, 2025

XPU is aligning its PyTorch feature set with CUDA. Since PyTorch 2.6, a device-agnostic torch.Stream is supported, and XPU supports this API. So, I enabled OffloadedCache on XPU.

Why start from 2.7? OffloadedCache needs StreamContext, but the PR adding the `__enter__` attribute to StreamContext did not land in 2.6; it will land in 2.7.

Tested with the PyTorch 2.7 dev package (`pip install --pre torch==2.7.0.dev20250306 --index-url https://download.pytorch.org/whl/nightly/xpu`).
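For context, here is a minimal sketch of the device-agnostic stream pattern this PR relies on. The helper name `make_prefetch_stream` is hypothetical (this is not code from the PR or from transformers); the point is that torch.Stream(device=...) works across backends, so cache-offloading code no longer needs per-backend torch.cuda.Stream / torch.xpu.Stream branches:

```python
# Hedged sketch, not the transformers implementation: illustrates the
# device-agnostic torch.Stream API available since PyTorch 2.6.
try:
    import torch
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False

def make_prefetch_stream(device: str):
    """Return a side stream for async cache transfers, or None if unavailable."""
    if not HAS_TORCH or not hasattr(torch, "Stream"):
        # Older PyTorch without the device-agnostic Stream API.
        return None
    try:
        return torch.Stream(device=device)
    except Exception:
        # Unknown or uninitialized backend (e.g. "xpu" on a CUDA-only box).
        return None

# On an XPU machine the stream would then be used roughly like:
#   with torch.xpu.stream(stream):          # needs the 2.7 StreamContext fix
#       key_cache.copy_(host_key, non_blocking=True)
```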

@github-actions github-actions bot marked this pull request as draft March 12, 2025 04:11
@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@yao-matrix yao-matrix marked this pull request as ready for review March 12, 2025 04:31
Signed-off-by: root <[email protected]>
@yao-matrix
Contributor Author

The failed CI cases seem unrelated to my changes.

@ydshieh
Collaborator

ydshieh commented Mar 13, 2025

Hi @yao-matrix, thank you for making this supported.

Hi @n17s, would you like to take a first look? cc @gante

Member

@SunMarc SunMarc left a comment


Looks fine to me overall !

@SunMarc SunMarc requested a review from gante March 14, 2025 13:25
Contributor

@n17s n17s left a comment


Looks good to me

Contributor

@gante gante left a comment


LGTM, thank you for adding support! 🤗

Added a minor nit with a more recent import guard practice, happy to merge when it's sorted

Signed-off-by: root <[email protected]>
Member

@SunMarc SunMarc left a comment


Thanks ! LGTM !

@SunMarc SunMarc requested a review from gante March 18, 2025 17:35
@SunMarc SunMarc merged commit b11050d into huggingface:main Mar 19, 2025
21 checks passed
@loadams
Contributor

loadams commented Mar 19, 2025

Hi @yao-matrix and @SunMarc - it looks like running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3) I see the following error:

ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)

Perhaps the guards are on the wrong version of pytorch?

@yao-matrix
Contributor Author

> Hi @yao-matrix and @SunMarc - it looks like running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3) I see the following error:
>
> ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)
>
> Perhaps the guards are on the wrong version of pytorch?

It's weird: my check is against 2.7, meaning if version >= 2.7 it takes the new API path, otherwise the old one. But I can see that in your PR you changed PyTorch from 2.5 to 2.6, and both of those versions take the old path.

@loadams
Contributor

loadams commented Mar 19, 2025

> > Hi @yao-matrix and @SunMarc - it looks like running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3) I see the following error:
> >
> > ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)
> >
> > Perhaps the guards are on the wrong version of pytorch?
>
> it's weird, my version checks on 2.7, which means if version >= 2.7, goes the new API; else the old. But I can see in your PR you changed the pytorch from 2.5 to 2.6, both versions go the old path.

@yao-matrix - yes, that is quite odd, but I was able to bisect the failure to this PR, so perhaps I'm hitting it from another code path that this PR enables? It does seem to be resolved by updating the torch version, though.

github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this pull request Mar 19, 2025
Changes from huggingface/transformers#36654 in
transformers cause issues with the torch 2.5 version we were using. This
just updated us to use a newer version.

---------

Signed-off-by: Logan Adams <[email protected]>
@yao-matrix yao-matrix deleted the offloadedcache branch March 19, 2025 23:49
@gante
Contributor

gante commented Mar 20, 2025

@yao-matrix I'm going to revert part of the changes in `is_torch_greater_or_equal`, as it is breaking other parts of the library. In a nutshell, we can't confirm that all dev versions for 2.X contain the features that will be released in 2.X, which is the error @loadams is seeing (2.5.0a0+b465a5843b.nv24.9 is a dev version of 2.5.0).

@yao-matrix to enable your use case, I'm going to add an `accept_dev` flag to `is_torch_greater_or_equal`.
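To see why the dev-version distinction matters, here is a simplified stand-in for the kind of check involved. The function names `parse` and `is_greater_or_equal` are hypothetical illustrations, not the transformers implementation; under PEP 440 ordering, a pre-release like `2.5.0a0` sorts *below* the final `2.5.0`, so a plain ">= 2.5" check rejects it unless an accept_dev-style flag ignores the pre-release suffix:

```python
# Hypothetical sketch of version gating with an accept_dev flag.
import re

def parse(version: str):
    # Drop the local segment ("+b465a5843b.nv24.9"), then split the base
    # release number from any pre-release tag ("a0", ".dev20250306", ...).
    base = version.split("+")[0]
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(.*)", base)
    release = (int(m.group(1)), int(m.group(2)), int(m.group(3) or 0))
    return release, m.group(4)

def is_greater_or_equal(version: str, target: str, accept_dev: bool = False):
    rel, pre = parse(version)
    trel, _ = parse(target)
    if rel != trel:
        return rel > trel
    # Same release number: a pre-release sorts below the final release
    # unless accept_dev treats it as good enough.
    return accept_dev or not pre

print(is_greater_or_equal("2.5.0a0+b465a5843b.nv24.9", "2.5"))  # False: dev build
print(is_greater_or_equal("2.5.0a0+b465a5843b.nv24.9", "2.5", accept_dev=True))  # True
```

This is why the nightly `2.7.0.dev20250306` build used for testing fails a strict ">= 2.7" check but passes once `accept_dev` opts in.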

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Mar 20, 2025
loadams added a commit to deepspeedai/DeepSpeed that referenced this pull request Mar 25, 2025
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* fix "Cannot copy out of meta tensor; no data!" issue for BartForConditionalGeneration model

* follow Marc's suggestion to use _tie_weights to fix

Signed-off-by: Yao, Matrix <[email protected]>

* enable OffloadedCache on XPU since PyTorch 2.7

Signed-off-by: Yao, Matrix <[email protected]>

* fix style

Signed-off-by: Yao, Matrix <[email protected]>

* don't change bart

Signed-off-by: root <[email protected]>

* make code more concise per review comments

Signed-off-by: N <[email protected]>

* fix review comments

Signed-off-by: root <[email protected]>

* Revert "fix review comments"

This reverts commit acf1484.

* fix review comments

Signed-off-by: root <[email protected]>

* fix style

Signed-off-by: root <[email protected]>

---------

Signed-off-by: Yao, Matrix <[email protected]>
Signed-off-by: root <[email protected]>
Signed-off-by: N <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
ys950902 pushed a commit to ys950902/DeepSpeed that referenced this pull request May 21, 2025