KEP-2401: Kubeflow LLM Trainer V2 #2410

Merged
google-oss-prow[bot] merged 70 commits into kubeflow:master from Electronic-Waste:doc/KEP-2401 on Mar 11, 2025

Conversation

@Electronic-Waste commented Feb 1, 2025

Member

This is the Kubeflow Enhancement Proposal for Kubeflow LLM Trainer V2: http://bit.ly/4gp8JGd
Related: #2401 #2170

We are collecting the final community feedback and any suggestions are welcome!

Open Questions

  • We need to pass arguments to the tune run CLI to enable distributed training, instead of setting the distributed parameters that begin with PET_ as environment variables. Do you prefer reusing the torch runtime plugin or creating a new one?
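
To make the question above concrete, here is a minimal sketch of the translation it implies. It is an illustration under stated assumptions, not the plugin design from the KEP: it assumes each torchrun-style `PET_<NAME>` environment variable corresponds to a `--<name>` flag placed between `tune run` and the recipe name. The helper `pet_env_to_tune_args` is hypothetical, and the recipe and config values are sample inputs only.

```python
from typing import List, Mapping

PET_PREFIX = "PET_"


def pet_env_to_tune_args(env: Mapping[str, str], recipe: str, config: str) -> List[str]:
    """Build a `tune run` command line from PET_* environment variables.

    Hypothetical helper: assumes PET_<NAME> maps to the flag --<name>.
    """
    flags: List[str] = []
    # Sort for a deterministic flag order; ignore unrelated variables.
    for name in sorted(env):
        if name.startswith(PET_PREFIX):
            flag = "--" + name[len(PET_PREFIX):].lower()
            flags += [flag, env[name]]
    # Distributed flags go before the recipe; recipe options go after it.
    return ["tune", "run", *flags, recipe, "--config", config]


# Sample inputs for illustration only.
example_env = {"PET_NNODES": "2", "PET_NPROC_PER_NODE": "4", "HOME": "/root"}
command = pet_env_to_tune_args(
    example_env, "full_finetune_distributed", "llama3_2/1B_full"
)
print(" ".join(command))
```

Whether this translation lives in the existing torch runtime plugin or in a new one is exactly the open question; the sketch only shows what has to be produced either way.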

/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0

@google-oss-prow

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: saileshd1402, varshaprasad96, truc0, astefanutti, seanlaii.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

This is the Kubeflow Enhancement Proposal for Kubeflow LLM Trainer V2: http://bit.ly/4gp8JGd
Related: #2401 #2170

We will collect the final community feedback by 2.12 and start the implementation after that.

Open Questions

  1. Since we adopt torchrun as the launcher for LLM Trainer, do we need to support more launchers like torchtune and accelerate in the future?
  2. Do we want to support Adapter Prompt Tuning and Prefix Tuning?

/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@coveralls

coveralls commented Feb 1, 2025

Pull Request Test Coverage Report for Build 13089269276

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on doc/KEP-2401 at 100.0%

Totals Coverage Status
Change from base Build 13016586638: 100.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

Comment thread docs/proposals/2401-llm-trainer-v2/README.md
Comment thread docs/proposals/2401-llm-trainer-v2/README.md Outdated
@juliusvonkohout

Member

Should security (that is, hard multi-tenancy, Istio support, and the restricted Pod Security Standards profile) be part of the KEP?

@Electronic-Waste

Electronic-Waste commented Feb 12, 2025

Member Author

@juliusvonkohout We haven't considered it yet. Our initial goal is to introduce simple approaches to see how users will use this feature, and make it as easy as possible to use. Maybe we could add them as the tasks for the next stage.

WDYT @franciscojavierarceo @kubeflow/wg-training-leads

@franciscojavierarceo

Contributor

> @juliusvonkohout We haven't considered it yet. Our initial goal is to introduce simple approaches to see how users will use this feature, and make it as easy as possible to use. Maybe we could add them as the tasks for the next stage.

I would probably leave that out of scope. Not to say that it's not important of course.

@andreyvelich

Member

Hi Folks, just a friendly reminder that this Wednesday at 5pm UTC, we will discuss the torchtune usage for Kubeflow Trainer LLM blueprints. Please join if you are available.

cc @kubeflow/wg-training-leads @Electronic-Waste @franciscojavierarceo @joecummings @astefanutti @akshaychitneni @shravan-achar @janeyx99 @bigsur0

Signed-off-by: Electronic-Waste <2690692950@qq.com>

@andreyvelich andreyvelich left a comment

Member

Thanks for the update @tenzen-y!
/lgtm
/assign @tenzen-y

@tenzen-y tenzen-y left a comment

Member

Thanks, mostly lgtm
Let's add a dedicated CRD API after the first iteration (the current design), as we discussed in https://cloud-native.slack.com/archives/C0742LDFZ4K/p1741274269321069?thread_ts=1741263570.091899&cid=C0742LDFZ4K.

Note: If you find any blockers caused by not having a dedicated CRD API before you start the first-iteration implementation, it would be better to open a KEP update PR.

Comment thread docs/proposals/2401-llm-trainer-v2/README.md Outdated
Signed-off-by: Electronic-Waste <2690692950@qq.com>
@google-oss-prow google-oss-prow Bot removed the lgtm label Mar 11, 2025
@Electronic-Waste

Member Author

@tenzen-y Thanks for your kind reviews! I've addressed all of your comments. Please don't hesitate to let me know if you have any other suggestions!

@andreyvelich

Member

Thank you @Electronic-Waste!
/lgtm

@google-oss-prow google-oss-prow Bot added the lgtm label Mar 11, 2025
@tenzen-y

Member

Thank you
/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit b89ce84 into kubeflow:master Mar 11, 2025
@Electronic-Waste Electronic-Waste deleted the doc/KEP-2401 branch March 11, 2025 11:32
@Electronic-Waste

Member Author

Thanks to everyone who reviewed this PR! We'll start the initial implementation of this KEP soon. I'm super excited to see the first version of Kubeflow LLM Trainer.

mahdikhashan pushed a commit to mahdikhashan/trainer that referenced this pull request Mar 16, 2025
* doc: add initial doc for KEP-2401.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update motivation.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add llm lifecycle picture.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add goals and non-goals.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add proposal chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add multiple frameworks support section in design chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add data preprocess section in design chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add fine-tuning config section in design details chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remote all trailing whitespaces.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update llm-trainer-v2-workflow img.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update goals and non-goals.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: remove torchrun proposal to alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: move torchrun design to alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: move some fine-tuning config not support by torchtune to alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: move torchtune sections to proposal and design chapters.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update proposal & move FSDP config to alternatives.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update fine-tuning config & unify lora/qlora/dora.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update fine-tuning config & fix doc according to comments.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add model & dataset initialization / model exporting.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add dataset preprocess/tokenizer chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: fix some errors in doc.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update chapter name.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add type in the diagrams.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add optimizer and scheduler config.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix some errors in doc.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add initial parameter override.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: update config override.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: fix some errors in doc.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): add CustomTrainingConfig dataclass.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): integrate torchtune mutation logic into torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): split torchtune config chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): add two options for SDK & seperate LoRA chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add an example to show parameters mutation.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add detailed design on mutation in torch plugin.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add dir structure for option 1.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add dir structure for option 2.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: add Test Plans chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove device parameter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix typo error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix code line format.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update error in proposal example & add num_nodes and resources_per_node to TorchtuneConfig.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update manifests dir in option 1.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): split complement torch plugin chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): move option 1 (reserving recipe and config) to alternatives & reorganize structures.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update goals & add description in propagate torchtune settings in SDK.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): complete map section in SDK.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): add maintaining ClusterTrainingRuntime chapter.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update recipe selection.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove some CTRs & only reserve llama family.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): rename TorchtuneConfig to TorchTuneConfig.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove name prefix in CTRs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update TrainJob and CTR example.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix some typos & address comments.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update complement torch plugin section.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): add gemma2 mistral qwen2_5 back.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update implementation history.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove the name prefix in CTRs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update typo according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): add webhook section.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(doc): add webhook func description.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update item format.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): add the lifecyle of LLM fine-tuning with torchtune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove diagram description.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): reorg and update the doc according to the review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix some typos.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): fix some format error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update implementation history.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): rename CTRs' file name.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): remove detailed design.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
akagami-harsh pushed a commit to akagami-harsh/training-operator that referenced this pull request Jul 17, 2025

8 participants