
Conversation

@kmehant
Contributor

@kmehant kmehant commented Mar 27, 2025

What does this PR do?

Discussed at huggingface/accelerate#3457

  1. Introduce tp_size to allow the TP sharding degree to differ from the world size (see the sketch after this list).
  2. Make tp_size an attribute of the model that is only set after TP sharding has completed, so it can serve as an indicator in accelerate that the model has already been TP-sharded (discussed with @SunMarc).
  3. Remove tp_size from the training arguments, since from now on TP training is performed only if the model has already undergone TP sharding.
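
For reference, a minimal sketch of how the interface described above might be used. The tp_size value, the launch command in the comment, and the attribute lookup at the end are illustrative assumptions, not necessarily the final implementation:

from transformers import AutoModelForCausalLM

# Run with multiple processes, e.g. `torchrun --nproc_per_node=4 example.py`.
# tp_plan="auto" requests tensor-parallel sharding; tp_size lets the TP degree
# differ from the world size (here 2), leaving the remaining ranks available
# for FSDP/DDP-style data parallelism.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tp_plan="auto",
    tp_size=2,
)

# After sharding completes, the TP degree is recorded on the model so that
# downstream code (e.g. accelerate) can tell the model is already TP-sharded.
# The exact attribute name here is an assumption.
print(getattr(model, "tp_size", None))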

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft March 27, 2025 19:48
@kmehant kmehant marked this pull request as ready for review March 27, 2025 19:49
Member

@SunMarc SunMarc left a comment


Thanks, left a couple of comments

Comment on lines +3963 to +3852
tp_size (`str`, *optional*):
The torch tensor parallel degree. If not provided, it defaults to the world size.
Member


Not needed for this specific PR. I don't know if we want to add this option yet. cc @ArthurZucker

Contributor Author


We can have it in a separate PR as well; however, it's needed to support TP + FSDP/DDP.

I don't know if we want to add this option yet

Sure, @ArthurZucker Let me know your thoughts.

Contributor


@SunMarc Would appreciate your input here; I've been looking at enabling TP + FSDP and this is exactly what I used myself.
cc @ArthurZucker

@SunMarc SunMarc requested a review from ArthurZucker March 28, 2025 16:31
@SunMarc
Member

SunMarc commented Apr 7, 2025

Please fix the conflicts and I will merge this PR!

@kmehant kmehant force-pushed the tp-size branch 2 times, most recently from a33e9ef to b7abb2a Compare April 7, 2025 14:49
@kmehant
Contributor Author

kmehant commented Apr 7, 2025

#37054 (comment)

@SunMarc Fixed the conflicts; the failing test seems to be unrelated. Thanks

@kmehant
Contributor Author

kmehant commented Apr 7, 2025

@SunMarc looks like even the recently merged commit is failing on this test case, so it's totally unrelated to this PR.

Member

@SunMarc SunMarc left a comment


A few minor nits, thanks!

Comment on lines 1 to 10
import torch

from transformers import AutoModelForCausalLM


# Load the same checkpoint twice: once unsharded (tp_plan=None) and once with
# tensor-parallel sharding (tp_plan="auto"), then check that gathering the
# sharded lm_head weight reproduces the unsharded weight exactly.
m2 = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", tp_plan=None)
m = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", tp_plan="auto")

ft = m.lm_head.weight.full_tensor().to("cpu")
assert torch.equal(ft, m2.lm_head.weight.to("cpu"))
Member


Let's add this to the tensor_parallel test file instead of having it here. Please also add a description of what you are trying to do.

Contributor Author


Apologies, this file was not intended for this PR, so I have removed it. Thanks.
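
For illustration, a hedged sketch of how the check above might look once moved into the tensor_parallel test file as the review suggests; the test name and structure are assumptions, not the PR's code:

import torch

from transformers import AutoModelForCausalLM


# Hypothetical test: meant to run under torch.distributed (e.g. via torchrun)
# so that tp_plan="auto" actually shards the weights across ranks.
def test_tp_sharded_weights_match_unsharded():
    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    unsharded = AutoModelForCausalLM.from_pretrained(model_id, tp_plan=None)
    sharded = AutoModelForCausalLM.from_pretrained(model_id, tp_plan="auto")
    # Gathering the sharded lm_head weight should reproduce the unsharded weight.
    full = sharded.lm_head.weight.full_tensor().to("cpu")
    assert torch.equal(full, unsharded.lm_head.weight.to("cpu"))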

generation_config = kwargs.pop("generation_config", None)
gguf_file = kwargs.pop("gguf_file", None)
tp_plan = kwargs.pop("tp_plan", None)
tp_size = kwargs.pop("tp_size", None)
Member


Let's raise an error if tp_size was set but tp_plan was not.

Contributor Author


@SunMarc Addressed this comment, thank you.
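
For reference, a minimal sketch of the kind of guard the review above asks for, assuming the kwargs are popped as in the snippet; the exact message and placement are assumptions:

tp_plan = kwargs.pop("tp_plan", None)
tp_size = kwargs.pop("tp_size", None)

# tp_size is only meaningful when a TP plan is requested, so fail fast otherwise.
if tp_size is not None and tp_plan is None:
    raise ValueError("tp_size was provided but tp_plan is None; pass tp_plan='auto' to use tp_size.")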

@kmehant kmehant force-pushed the tp-size branch 3 times, most recently from 307fc4e to 33af129 Compare April 8, 2025 14:07
Contributor

@S1ro1 S1ro1 left a comment


LGTM!

Member

@SunMarc SunMarc left a comment


Thanks!

@kmehant kmehant force-pushed the tp-size branch 2 times, most recently from ccf1889 to 43bb071 Compare April 8, 2025 16:17
@kmehant
Contributor Author

kmehant commented Apr 9, 2025

@SunMarc rebased the branch; are we waiting on something?

@SunMarc
Member

SunMarc commented Apr 9, 2025

Waiting for the tests to pass ;) I will merge it as soon as the CI is green!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc merged commit 7d76876 into huggingface:main Apr 10, 2025
19 checks passed
Collaborator

@ArthurZucker ArthurZucker left a comment


Nice!

cyr0930 pushed a commit to cyr0930/transformers that referenced this pull request Apr 18, 2025
…face#37054)

* feat: custom tp_size, new transformers tp interface

Signed-off-by: Mehant Kammakomati <[email protected]>

* fix: review cmt - error when tp_plan not set for tp_size

Signed-off-by: Mehant Kammakomati <[email protected]>

* fix: nit in docs

Signed-off-by: Mehant Kammakomati <[email protected]>

---------

Signed-off-by: Mehant Kammakomati <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Matej Sirovatka <[email protected]>
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025