Conversation

@Cyrilvallez (Member) commented on Jan 31, 2025

What does this PR do?

As per the title! At loading time, the parallelization is now applied module-by-module, so no memory overhead is required beyond what the final weight distribution will already use.
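
For context, a minimal, illustrative sketch of applying a TP plan module-by-module while loading. Everything here (the load_with_module_level_tp helper, the colwise/rowwise handling, reading from an in-memory state_dict) is an assumption for the sketch, not the actual transformers implementation:

import re
import torch
import torch.nn as nn
import torch.distributed as dist

def load_with_module_level_tp(model, state_dict, tp_plan, device):
    # Illustrative only: shard each module's weights as the module is processed,
    # so a rank only keeps its own slice of every parameter on `device`.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    for name, module in model.named_modules():
        # Find the TP style for this module, if any ("*" stands for a layer index)
        style = next(
            (plan for key, plan in tp_plan.items() if re.search(key.replace("*", "[0-9]+"), name)),
            None,
        )
        for param_name, _ in list(module.named_parameters(recurse=False)):
            key = f"{name}.{param_name}" if name else param_name
            full = state_dict[key]
            if style == "colwise":                      # keep only this rank's slice along dim 0
                shard = full.chunk(world_size, dim=0)[rank]
            elif style == "rowwise" and full.dim() > 1:  # keep only this rank's slice along the last dim
                shard = full.chunk(world_size, dim=-1)[rank]
            else:                                       # no plan entry: keep the tensor replicated
                shard = full
            setattr(module, param_name, nn.Parameter(shard.to(device)))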

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

Nice nice

Comment on lines +798 to +799
for submodule in model.modules():
    full_tp_plan.update(getattr(submodule, "_tp_plan", {}))

ArthurZucker (Collaborator)

IMO we should only do this for PreTrainedModels no?

Cyrilvallez (Member Author)

I figured it would be a bit more future-proof to iterate over all modules (it's not costly) -- but it can be changed for sure!

ArthurZucker (Collaborator)

Let's only do PreTrained for now!
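
A minimal sketch of that restriction, assuming the `model` and `full_tp_plan` names from the diff above (PreTrainedModel is the standard transformers base class):

from transformers import PreTrainedModel

full_tp_plan = {}
for submodule in model.modules():
    # Only collect sub-plans from PreTrainedModel instances, as suggested above
    if isinstance(submodule, PreTrainedModel):
        full_tp_plan.update(getattr(submodule, "_tp_plan", {}))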

Comment on lines +915 to +921
for param, plan in full_tp_plan.items():
    # "*" is a placeholder for layer indices, so we replace it with "[0-9]+" in the regex pattern
    pattern = param.replace("*", "[0-9]+")
    if re.search(pattern, parent_module_name):
        current_module_plan = plan
        break

ArthurZucker (Collaborator)

I don't think we need to iterate over the full tp_plan, but we should be re-creating the key instead

@Cyrilvallez (Member Author) commented on Feb 3, 2025

The tp_plan does not contain the full module names (usually they start with "layers"), so to stay general it's much easier to iterate over the keys than to start from the module name and try to reconstruct the tp_plan key (because the prefixes of the tp_plan keys may change). Once again, it's not costly at all since the tp_plan is very small.
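
For illustration, the matching described above works roughly like this; the plan entries below are just examples in the style of a base_model_tp_plan, not an exhaustive plan:

import re

full_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}
parent_module_name = "model.layers.12.self_attn.q_proj"

current_module_plan = None
for param, plan in full_tp_plan.items():
    # "*" stands for a layer index, so it becomes "[0-9]+" in the regex
    pattern = param.replace("*", "[0-9]+")
    if re.search(pattern, parent_module_name):
        current_module_plan = plan
        break

print(current_module_plan)  # -> "colwise"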

ArthurZucker (Collaborator)

mmm agreed cost-wise, it's a tad of a waste! but no worries

Comment on lines +924 to +927
process_device = list(device_map.values())[0]
all_module_parameters_initialized = all(
    m.device == process_device for m in parent_module.parameters(recurse=False)
) and all(m.device == process_device for m in parent_module.buffers(recurse=False))

ArthurZucker (Collaborator)

Similarly, this might be a tad costly for MoE, for example, and is not necessarily needed.
We can either:

  • maybe load for the previous layer? (so layer 1 loads layer 0; this way it's always after all biases are loaded)
  • check is_hf_initialized, as I think it should hold info about everything being initialized

TL;DR: let's avoid loops.

Cyrilvallez (Member Author)

Unfortunately, the shards that are loaded are not necessarily in order, so we cannot rely on that in general... And we check it only for the leaves in the state dict (i.e. the Linear/Embedding/Norm layers), so they have at most 2 or 3 parameters(); not much of an overhead, I think. It does not look like we can use is_hf_initialized here (from what I understand, it checks that the weights were created, not that the correct state_dict was loaded and then dispatched to the correct device).
In any case, if we did not specify tp_plan="auto", all of this is completely skipped.
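
As a quick standalone illustration of the "2 or 3 parameters" point (a toy check, not part of the PR), the leaf modules in question only expose a handful of direct tensors:

import torch.nn as nn

# The device check above only loops over a module's direct parameters/buffers,
# and the leaves in the state dict hold very few of them.
for leaf in (nn.Linear(8, 8), nn.Embedding(10, 8), nn.LayerNorm(8)):
    n_params = len(list(leaf.parameters(recurse=False)))
    n_buffers = len(list(leaf.buffers(recurse=False)))
    print(type(leaf).__name__, n_params, n_buffers)  # Linear 2 0, Embedding 1 0, LayerNorm 2 0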

ArthurZucker (Collaborator)

Normally it checks that weights were properly loaded! Because otherwise it goes through the init loop.

Comment on lines +4402 to +4403
if buffer.device != tp_device:
    buffer.data = buffer.to(tp_device)

ArthurZucker (Collaborator)

Interesting. Remember that now we pass the cos and sin as inputs to all layers, so those are passed too.
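
For context, the kind of buffer this concerns is something like a rotary embedding's inv_freq, which is registered as a buffer rather than a parameter. A toy standalone version of the move above (ToyRotary and tp_device are invented stand-ins for this sketch):

import torch
import torch.nn as nn

class ToyRotary(nn.Module):
    # Stand-in for a rotary-embedding-style module: inv_freq is a buffer, not a
    # parameter, so only the buffer loop above will ever move it.
    def __init__(self, dim=8):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

tp_device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
module = ToyRotary()
for buffer in module.buffers():
    if buffer.device != tp_device:
        buffer.data = buffer.to(tp_device)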

# The expected full model memory footprint
expected_model_memory = 16
# The expected model memory footprint. We add 1 as not all the modules are split (e.g. the embeddings)

ArthurZucker (Collaborator)

let's add something related to this in the test

Cyrilvallez (Member Author)

Yes, it currently checks that we do not use more than the expected memory divided by the world size, i.e. no more than 5 GiB per GPU in my tests on a DGX for Llama 8B (expected memory per device = a bit more than 4 GiB).
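
A rough sketch of that kind of assertion (illustrative only; the numbers, the margin, and the way memory is measured are assumptions for the sketch, not the actual test in the repo):

import torch
import torch.distributed as dist

expected_model_memory = 16  # GiB, full (unsharded) model footprint
margin = 1                  # GiB, for modules that are not split (e.g. the embeddings)

world_size = dist.get_world_size()
max_allocated_gib = torch.cuda.max_memory_allocated() / 1024**3
# Each rank should stay within its share of the full footprint, plus the margin
assert max_allocated_gib <= expected_model_memory / world_size + margin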

@ArthurZucker (Collaborator) left a comment

Temporary solution IMO, but much needed. Thanks, let's merge!
