KEP-2401: Complement torch plugin to support torchtune config mutation #2587

Merged
google-oss-prow[bot] merged 26 commits into kubeflow:master from Electronic-Waste:feat/torchtune-plugin
Apr 29, 2025

@Electronic-Waste Electronic-Waste commented Apr 8, 2025

Member

What this PR does / why we need it:

This PR adds the torchtune config mutation implementation to the torch plugin.

As discussed before, we'll implement config mutation and validation on the server side, which avoids frequent SDK changes and gives users better backward compatibility.

In detail, this PR:

  • Adds config mutation/mapping for the torchtune configs passed in .spec.trainer.args
  • Adds validations for the torchtune config
  • Adds more unit tests (UTs) for the torch plugin's EnforceMLPolicy and Validate functions

REF: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2401-llm-trainer-v2#complement-torch-plugin

/cc @kubeflow/wg-training-leads @astefanutti @franciscojavierarceo @saileshd1402 @deepanker13 @akshaychitneni

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2507 #2508

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: saileshd1402.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste Electronic-Waste marked this pull request as draft April 8, 2025 14:16
@Electronic-Waste Electronic-Waste changed the title KEP-2401: Complement torch plugin to support torchtune config mutation [WIP] KEP-2401: Complement torch plugin to support torchtune config mutation Apr 8, 2025
@coveralls

coveralls commented Apr 9, 2025

Pull Request Test Coverage Report for Build 14692575862

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 59 of 61 (96.72%) changed or added relevant lines in 1 file are covered.
  • 62 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.7%) to 67.18%

Changes Missing Coverage

| File | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/runtime/framework/plugins/torch/torch.go | 59 | 61 | 96.72% |

Files with Coverage Reduction

| File | New Missed Lines | % |
| --- | --- | --- |
| pkg/runtime/framework/plugins/coscheduling/coscheduling.go | 62 | 0.0% |

Totals Coverage Status
Change from base Build 14341999020: 0.7%
Covered Lines: 1789
Relevant Lines: 2663

💛 - Coveralls

@google-oss-prow google-oss-prow Bot added size/XL and removed size/L labels Apr 9, 2025
@Electronic-Waste Electronic-Waste marked this pull request as ready for review April 9, 2025 09:39
@Electronic-Waste Electronic-Waste changed the title [WIP] KEP-2401: Complement torch plugin to support torchtune config mutation KEP-2401: Complement torch plugin to support torchtune config mutation Apr 9, 2025
@Electronic-Waste

Member Author

PTAL if you have time, thanks:)

/assign @kubeflow/wg-training-leads @astefanutti @akshaychitneni @franciscojavierarceo @deepanker13

Comment thread pkg/runtime/framework/plugins/torch/torch.go Outdated
Comment thread sdk/kubeflow/trainer/utils/utils.py Outdated
@google-oss-prow

@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

mostly lgtm, just a few comments.
/assign @tenzen-y @saileshd1402 @astefanutti

@google-oss-prow google-oss-prow Bot added size/L and removed size/XL labels Apr 22, 2025
Comment thread pkg/runtime/framework/plugins/torch/torch.go Outdated
@google-oss-prow google-oss-prow Bot added size/XL and removed size/L labels Apr 27, 2025
@Electronic-Waste

Member Author

@andreyvelich Thanks for your detailed review! I've addressed your comments. PTAL if you have time:)

@andreyvelich andreyvelich left a comment

Member

Overall lgtm.
Just a small comment.
/lgtm
/assign @tenzen-y @kubeflow/wg-training-leads @saileshd1402 @astefanutti @franciscojavierarceo for the review.

Comment thread pkg/runtime/framework/plugins/torch/torch.go

@andreyvelich andreyvelich left a comment

Member

I think we can move forward.
Let's address any additional changes in follow-up PRs.
Thank you for this @Electronic-Waste!
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit 040b34e into kubeflow:master Apr 29, 2025
1 check passed
@google-oss-prow google-oss-prow Bot added this to the v2.0 milestone Apr 29, 2025
@Electronic-Waste Electronic-Waste deleted the feat/torchtune-plugin branch April 29, 2025 13:26
@Electronic-Waste

Member Author

@andreyvelich @astefanutti Thanks for your detailed review! I'll create another issue to discuss #2587 (comment).

akagami-harsh pushed a commit to akagami-harsh/training-operator that referenced this pull request Jul 17, 2025
kubeflow#2587)

* chore(plugin): Add torchtune-related constants & update current torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(plugin): Add EnforceMLPolicy for torchtune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(plugin): Add UTs in torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(test): fix error in torch plugin UTs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(plugin): Choose recipe according to numNodes & numProcPerNode & Args.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(sdk): Add PretrainedModel enum type.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(plugin): Add torchtune config arg.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(test): add UT for single-device full fine-tuning with torchtune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(test): Add test for multi-nodes full fine-tuning with torchtune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(test): Update torch validate UTs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(lint): fix lint error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove pretrained model enum type in sdk.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(plugin): retrieve model name from runtimeRef.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(lint): fix typo.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(plugin): make some adjustments according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): remove runtime in get_trainer_crd_from_builtin_trainer.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(plugin): pass PET_ env variables in torch plugin for torchtune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(plugin): add env validation for torchtune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(plugin): update comments.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(plugins): fix the implementation according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(plugins): fix UT error in torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: fix UT and e2e tests error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: remove debug info.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(test): add args in UTs related to torchtune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(test): update torchtune related args.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(test): Add a UT for multi-node mode check in torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
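
One of the commits above chooses a torchtune recipe according to numNodes, numProcPerNode, and Args. The sketch below illustrates that kind of selection under stated assumptions: `selectRecipe` is a hypothetical helper, the rule (one node with one process means single-device, anything larger means distributed) is a guess rather than the merged logic, both counts are simplified to integers, and the influence of Args is left out entirely. Only the two recipe names are taken from torchtune.

```go
package main

import "fmt"

// Recipe names as published by torchtune; the selection rule below is an assumption.
const (
	recipeSingleDevice = "full_finetune_single_device"
	recipeDistributed  = "full_finetune_distributed"
)

// selectRecipe is a hypothetical helper: one node running one process uses the
// single-device recipe, any larger topology uses the distributed recipe.
func selectRecipe(numNodes, numProcPerNode int) string {
	if numNodes <= 1 && numProcPerNode <= 1 {
		return recipeSingleDevice
	}
	return recipeDistributed
}

func main() {
	fmt.Println(selectRecipe(1, 1)) // single node, single process
	fmt.Println(selectRecipe(2, 4)) // multi-node job
}
```

Deriving the recipe from the job topology keeps the user-facing spec small: the user states how many nodes and processes they want, and the server picks a matching entry point.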
6 participants