
fix: CPU RAM efficient loading for nd or HSDP parallelisms#3740

Merged
S1ro1 merged 1 commit into huggingface:main from kmehant:fix-cpu-ram-hsdp
Aug 21, 2025

Conversation


@kmehant kmehant commented Aug 20, 2025

What does this PR do?

This fixes a CPU RAM efficient loading bug for nd parallelism, and even plain HSDP (replicate + shard), which requires broadcasting each tensor from global rank 0 to all world ranks: in the RAM-efficient loading case, `from_pretrained` in transformers is designed to load the weights only on global rank 0.

Fixes the following error:

    dist.broadcast(full_tensor, src=0, group=device_mesh.get_group())
                                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lib/python3.12/site-packages/torch/distributed/device_mesh.py", line 752, in get_group
    raise RuntimeError(
RuntimeError: ('Found the DeviceMesh have 2 dimensions', 'Optional kwarg `mesh_dim` needs to be specified when device_mesh.ndim > 1.', 'If you want to get the list of all the ProcessGroups in the DeviceMesh,please use `get_all_groups()` instead.')

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@S1ro1 @SunMarc @zach-huggingface

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant commented Aug 20, 2025

@S1ro1 @SunMarc can we attend to this pressing bug please? Thank you.

@kmehant kmehant changed the title fix: cpu ram efficient loading for nd or hsdp parallelisms fix: CPU RAM efficient loading for nd or HSDP parallelisms Aug 20, 2025

@S1ro1 S1ro1 left a comment

LGTM!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


kmehant commented Aug 21, 2025

@S1ro1 Thank you, can we merge this?


S1ro1 commented Aug 21, 2025

Yep, sorry forgot to come back to this after checks passed haha!

@S1ro1 S1ro1 merged commit 979d81e into huggingface:main Aug 21, 2025
25 checks passed

kmehant commented Oct 8, 2025

Hi @S1ro1 @SunMarc When are we making a release on accelerate and including these patches? Would be helpful having these released.


SunMarc commented Oct 8, 2025

Hi @S1ro1 @SunMarc When are we making a release on accelerate and including these patches? Would be helpful having these released.

I will try to do a patch release by the end of the week if I manage to fix some minor bugs!


kmehant commented Oct 9, 2025

Thank you @SunMarc
