Skip to content

Conversation

@Rocketknight1
Copy link
Member

@Rocketknight1 Rocketknight1 commented Mar 12, 2025

All credit goes to @gante and @ArthurZucker for figuring this one out and I'm just swooping in and stealing credit because I want this to be merged quickly!

If no tp_plan is provided, our code uses self.config.base_model_tp_plan as the default. The problem is that this is a mutable, instance-level dict, and we do in fact mutate it, which causes the instance dict to get very large and weird over time. We resolve the issue by correctly copying the base dict as an instance attribute instead of mutating it in-place.

@Rocketknight1 Rocketknight1 marked this pull request as ready for review March 12, 2025 15:43
@github-actions github-actions bot requested a review from ArthurZucker March 12, 2025 15:44
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1
Copy link
Member Author

Lots of failing checks, and I think these are either Hub timeouts or disguised Hub timeouts

@Rocketknight1
Copy link
Member Author

cc @SunMarc @muellerzr test_gradient_accumulation_loss_alignment_with_model_loss is failing on main. Marking it as slow (it takes ~30s on my machine) just so I can start clearing the CI, but we should address that one at some point!

Copy link
Contributor

@gante gante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@SunMarc
Copy link
Member

SunMarc commented Mar 12, 2025

cc @SunMarc @muellerzr test_gradient_accumulation_loss_alignment_with_model_loss is failing on main. Marking it as slow (it takes ~30s on my machine) just so I can start clearing the CI, but we should address that one at some point!

@IlyasMoutawwakil is looking at it !

Copy link
Member

@Cyrilvallez Cyrilvallez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment for robustness!

Comment on lines 1900 to 1904
if not self._tp_plan:
if isinstance(self.config.base_model_tp_plan, dict):
self._tp_plan = self.config.base_model_tp_plan.copy()
else:
self._tp_plan = {}
Copy link
Member

@Cyrilvallez Cyrilvallez Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a heads-up that if a model which is above the base model in the class hierarchy does not have a tp_plan, we are going to wrongly add again the base_model_plan.
It's not the case as of now in the library though. But maybe

if self.base_model is self:
    self._pp_plan = self.config.base_model_pp_plan.copy() if self.config.base_model_pp_plan is not None else None
    self._tp_plan = self.config.base_model_tp_plan.copy() if self.config.base_model_tp_plan is not None else {}
else:
    self._tp_plan = self._tp_plan or {}
    for name, module in self.named_children():
        if plan := getattr(module, "_tp_plan", None):
            self._tp_plan.update({f"{name}.{k}": v for k, v in plan.items()})

is more robust and should still work with composite models

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i just edited haha - the part with pp is less important as we don't modify the dict later

@Rocketknight1 Rocketknight1 merged commit c7eb955 into main Mar 12, 2025
24 checks passed
@Rocketknight1 Rocketknight1 deleted the fix_base_tp_plan_blowup branch March 12, 2025 18:59
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants