KEP-2401: Kubeflow LLM Trainer V2 #2410
@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: saileshd1402, varshaprasad96, truc0, astefanutti, seanlaii. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. |
Pull Request Test Coverage Report for Build 13089269276 (Coveralls) |
|
Should security (hard multi-tenancy, Istio support, and the restricted Pod Security Standards profile) be part of the KEP? |
|
@juliusvonkohout We haven't considered it yet. Our initial goal is to introduce simple approaches, see how users use this feature, and make it as easy to use as possible. Maybe we could add them as tasks for the next stage. WDYT @franciscojavierarceo @kubeflow/wg-training-leads |
I would probably leave that out of scope. Not to say that it's not important of course. |
|
Hi Folks, just a friendly reminder that this Wednesday at 5pm UTC, we will discuss the cc @kubeflow/wg-training-leads @Electronic-Waste @franciscojavierarceo @joecummings @astefanutti @akshaychitneni @shravan-achar @janeyx99 @bigsur0 |
Signed-off-by: Electronic-Waste <2690692950@qq.com>
tenzen-y left a comment:
Thanks, mostly lgtm
Let's add a dedicated CRD API after the first iteration (current design), as we discussed in https://cloud-native.slack.com/archives/C0742LDFZ4K/p1741274269321069?thread_ts=1741263570.091899&cid=C0742LDFZ4K.
Note: If you find any blocker caused by not having a dedicated CRD API, it would be better to open a KEP update PR before you start the first-iteration implementation.
|
@tenzen-y Thanks for your kind reviews! I've addressed all of your comments. Please don't hesitate to let me know if you have any other suggestions! |
|
Thank you @Electronic-Waste! |
|
Thank you |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tenzen-y |
|
Thanks to everyone who reviewed this PR! We'll start the initial implementation of this KEP soon. I'm super excited to see the first version of Kubeflow LLM Trainer. |
* doc: add initial doc for KEP-2401.
* doc: update motivation.
* doc: add llm lifecycle picture.
* doc: add goals and non-goals.
* doc: add alternatives.
* doc: add proposal chapter.
* doc: add multiple frameworks support section in design chapter.
* doc: add data preprocess section in design chapter.
* doc: add fine-tuning config section in design details chapter.
* fix(doc): remote all trailing whitespaces.
* doc: update llm-trainer-v2-workflow img.
* doc: update goals and non-goals.
* doc: remove torchrun proposal to alternatives.
* doc: move torchrun design to alternatives.
* doc: move some fine-tuning config not support by torchtune to alternatives.
* doc: move torchtune sections to proposal and design chapters.
* doc: update proposal & move FSDP config to alternatives.
* doc: update fine-tuning config & unify lora/qlora/dora.
* fix(doc): update fine-tuning config & fix doc according to comments.
* doc: add model & dataset initialization / model exporting.
* doc: add dataset preprocess/tokenizer chapter.
* doc: fix some errors in doc.
* doc: update chapter name.
* doc: add type in the diagrams.
* doc: add optimizer and scheduler config.
* fix(doc): fix some errors in doc.
* doc: add initial parameter override.
* doc: update config override.
* doc: fix some errors in doc.
* fix(doc): add CustomTrainingConfig dataclass.
* fix(doc): integrate torchtune mutation logic into torch plugin.
* fix(doc): split torchtune config chapter.
* fix(doc): add two options for SDK & seperate LoRA chapter.
* doc: add an example to show parameters mutation.
* doc: add detailed design on mutation in torch plugin.
* doc: add dir structure for option 1.
* doc: add dir structure for option 2.
* doc: add Test Plans chapter.
* fix(doc): remove device parameter.
* fix(doc): fix typo error.
* fix(doc): fix code line format.
* fix(doc): update error in proposal example & add num_nodes and resources_per_node to TorchtuneConfig.
* fix(doc): update manifests dir in option 1.
* fix(doc): split complement torch plugin chapter.
* fix(doc): move option 1 (reserving recipe and config) to alternatives & reorganize structures.
* fix(doc): update goals & add description in propagate torchtune settings in SDK.
* chore(doc): complete map section in SDK.
* chore(doc): add maintaining ClusterTrainingRuntime chapter.
* fix(doc): update recipe selection.
* fix(doc): remove some CTRs & only reserve llama family.
* fix(doc): rename TorchtuneConfig to TorchTuneConfig.
* fix(doc): remove name prefix in CTRs.
* fix(doc): update TrainJob and CTR example.
* fix(doc): fix some typos & address comments.
* fix(doc): update complement torch plugin section.
* fix(doc): add gemma2 mistral qwen2_5 back.
* fix(doc): update implementation history.
* fix(doc): remove the name prefix in CTRs.
* fix(doc): update typo according to the review.
* chore(doc): add webhook section.
* chore(doc): add webhook func description.
* fix(doc): update item format.
* fix(doc): add the lifecyle of LLM fine-tuning with torchtune.
* fix(doc): remove diagram description.
* fix(doc): reorg and update the doc according to the review.
* fix(doc): fix some typos.
* fix(doc): fix some format error.
* fix(doc): update implementation history.
* fix(doc): rename CTRs' file name.
* fix(doc): remove detailed design.
---------
Signed-off-by: Electronic-Waste <2690692950@qq.com>
This is the Kubeflow Enhancement Proposal for Kubeflow LLM Trainer V2: http://bit.ly/4gp8JGd
Related: #2401 #2170
We are collecting final community feedback, and any suggestions are welcome!
Open Questions
`tune run` CLI to enable distributed training, instead of passing distributed parameters beginning with `PET_` to env variables. Do you prefer reusing the `torch` runtime plugin or creating a new one?
/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0
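To make the open question above concrete, here is a rough sketch of the two ways the distributed parameters could reach the training process: as `PET_`-prefixed env variables, or as flags on the `tune run` command line. The helper names `pet_env` and `tune_run_command` are hypothetical and are not part of the KEP, the SDK, or the runtime plugin; the recipe and config values in the usage lines are placeholders only.

```python
from typing import Dict, List


def pet_env(num_nodes: int, nproc_per_node: int) -> Dict[str, str]:
    """Env-variable style: distributed options are exposed as PET_* variables.

    Hypothetical helper, shown only for comparison with the CLI style below.
    """
    return {
        "PET_NNODES": str(num_nodes),
        "PET_NPROC_PER_NODE": str(nproc_per_node),
    }


def tune_run_command(
    num_nodes: int, nproc_per_node: int, recipe: str, config: str
) -> List[str]:
    """CLI style: the same options are passed as flags to `tune run`.

    Hypothetical helper; the distributed flags go before the recipe name.
    """
    return [
        "tune", "run",
        "--nnodes", str(num_nodes),
        "--nproc_per_node", str(nproc_per_node),
        recipe,
        "--config", config,
    ]


# Placeholder values, for illustration only.
env = pet_env(num_nodes=2, nproc_per_node=4)
cmd = tune_run_command(2, 4, "full_finetune_distributed", "my_config.yaml")
print(env)
print(" ".join(cmd))
```

Whichever plugin ends up owning this logic, the mutation would amount to rewriting the container command rather than injecting env variables, which is the difference the question is asking about.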