
Conversation

@kmehant
Contributor

@kmehant kmehant commented Mar 27, 2025

What does this PR do?

Discussed at huggingface/accelerate#3457

  1. Introduce tp_size to allow the TP sharding degree to differ from the world size (see the sketch after this list).
  2. Make tp_size an attribute of the model that is only set after TP sharding has completed, so it can serve as an indicator in accelerate that the model has already been TP-sharded (discussed with @SunMarc).
  3. Remove tp_size from the training arguments, since from now on TP training is performed only if the model has already undergone TP sharding.
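
For reference, a minimal sketch of how the interface described above might be used. The tp_size value, the launch command in the comment, and the attribute lookup at the end are illustrative assumptions, not necessarily the final implementation:

from transformers import AutoModelForCausalLM

# Run with multiple processes, e.g. `torchrun --nproc_per_node=4 example.py`.
# tp_plan="auto" requests tensor-parallel sharding; tp_size lets the TP degree
# differ from the world size (here 2), leaving the remaining ranks available
# for FSDP/DDP-style data parallelism.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tp_plan="auto",
    tp_size=2,
)

# After sharding completes, the TP degree is recorded on the model so that
# downstream code (e.g. accelerate) can tell the model is already TP-sharded.
# The exact attribute name here is an assumption.
print(getattr(model, "tp_size", None))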

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft March 27, 2025 19:48
@kmehant kmehant marked this pull request as ready for review March 27, 2025 19:49
Member

@SunMarc SunMarc left a comment


Thanks, left a couple of comments

Comment on lines +3963 to +3852
tp_size (`str`, *optional*):
The torch tensor parallel degree. If not provided, it defaults to the world size.
Member


Not needed for this specific PR. I don't know if we want to add this option yet. cc @ArthurZucker

Contributor Author


We can have it in a separate PR as well; however, it's needed to support TP + FSDP/DDP.

I don't know if we want to add this option yet

Sure, @ArthurZucker Let me know your thoughts.

Contributor


@SunMarc Would appreciate your input here; I've been looking at enabling TP + FSDP and this is exactly what I used myself.
cc @ArthurZucker

@SunMarc SunMarc requested a review from ArthurZucker March 28, 2025 16:31
@SunMarc
Member

SunMarc commented Apr 7, 2025

Please fix the conflicts and I will merge this PR!

@kmehant kmehant force-pushed the tp-size branch 2 times, most recently from a33e9ef to b7abb2a Compare April 7, 2025 14:49
@kmehant
Contributor Author

kmehant commented Apr 7, 2025

#37054 (comment)

@SunMarc Fixed the conflicts; the failing test seems to be unrelated. Thanks

@kmehant
Contributor Author

kmehant commented Apr 7, 2025

@SunMarc looks like even the recently merged commit is failing on this test case, so it's totally unrelated to this PR.

Member

@SunMarc SunMarc left a comment


A few minor nits, thanks!

Comment on lines 1 to 10
import torch

from transformers import AutoModelForCausalLM


# Load the same checkpoint twice: once unsharded (tp_plan=None) and once with
# tensor-parallel sharding (tp_plan="auto"), then check that gathering the
# sharded lm_head weight reproduces the unsharded weight exactly.
m2 = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", tp_plan=None)
m = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", tp_plan="auto")

ft = m.lm_head.weight.full_tensor().to("cpu")
assert torch.equal(ft, m2.lm_head.weight.to("cpu"))
Member


Let's add this to the tensor_parallel test file instead of having it here. Please also add a description of what you are trying to do.

Contributor Author


Apologies, this file was not intended for this PR, so I have removed it. Thanks.
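
For illustration, a hedged sketch of how the check above might look once moved into the tensor_parallel test file as the review suggests; the test name and structure are assumptions, not the PR's code:

import torch

from transformers import AutoModelForCausalLM


# Hypothetical test: meant to run under torch.distributed (e.g. via torchrun)
# so that tp_plan="auto" actually shards the weights across ranks.
def test_tp_sharded_weights_match_unsharded():
    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    unsharded = AutoModelForCausalLM.from_pretrained(model_id, tp_plan=None)
    sharded = AutoModelForCausalLM.from_pretrained(model_id, tp_plan="auto")
    # Gathering the sharded lm_head weight should reproduce the unsharded weight.
    full = sharded.lm_head.weight.full_tensor().to("cpu")
    assert torch.equal(full, unsharded.lm_head.weight.to("cpu"))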

generation_config = kwargs.pop("generation_config", None)
gguf_file = kwargs.pop("gguf_file", None)
tp_plan = kwargs.pop("tp_plan", None)
tp_size = kwargs.pop("tp_size", None)
Member


Let's raise an error if tp_size was set but tp_plan was not.

Contributor Author


@SunMarc Addressed this comment, thank you.
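
For reference, a minimal sketch of the kind of guard the review above asks for, assuming the kwargs are popped as in the snippet; the exact message and placement are assumptions:

tp_plan = kwargs.pop("tp_plan", None)
tp_size = kwargs.pop("tp_size", None)

# tp_size is only meaningful when a TP plan is requested, so fail fast otherwise.
if tp_size is not None and tp_plan is None:
    raise ValueError("tp_size was provided but tp_plan is None; pass tp_plan='auto' to use tp_size.")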

@kmehant kmehant force-pushed the tp-size branch 3 times, most recently from 307fc4e to 33af129 Compare April 8, 2025 14:07
Contributor

@S1ro1 S1ro1 left a comment


LGTM!

Member

@SunMarc SunMarc left a comment


Thanks!

@kmehant kmehant force-pushed the tp-size branch 2 times, most recently from ccf1889 to 43bb071 Compare April 8, 2025 16:17
@kmehant
Contributor Author

kmehant commented Apr 9, 2025

@SunMarc rebased the branch; are we waiting on something?

@SunMarc
Member

SunMarc commented Apr 9, 2025

Waiting for the tests to pass ;) I will merge it as soon as the CI is green!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc merged commit 7d76876 into huggingface:main Apr 10, 2025
19 checks passed
Collaborator

@ArthurZucker ArthurZucker left a comment


Nice!

cyr0930 pushed a commit to cyr0930/transformers that referenced this pull request Apr 18, 2025
…face#37054)

* feat: custom tp_size, new transformers tp interface

Signed-off-by: Mehant Kammakomati <[email protected]>

* fix: review cmt - error when tp_plan not set for tp_size

Signed-off-by: Mehant Kammakomati <[email protected]>

* fix: nit in docs

Signed-off-by: Mehant Kammakomati <[email protected]>

---------

Signed-off-by: Mehant Kammakomati <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Matej Sirovatka <[email protected]>
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025