feat: KEP-2437 - PodGroup Creation for Volcano Scheduler #2729
Signed-off-by: Xinmin Du <2812493086@qq.com>
/area gsoc
@Doris-xm: GitHub didn't allow me to request PR reviews from the following users: rudeigerc. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.
# Conflicts: # pkg/runtime/framework/plugins/registry.go
Electronic-Waste left a comment:
@Doris-xm Thanks for this. I left my very initial reviews.
Can you rebase this branch to resolve the conflicts? @Doris-xm
/cc @kubeflow/kubeflow-trainer-team @astefanutti @rudeigerc @Monokaix @JesseStutler
astefanutti left a comment:
I only left a nit; otherwise LGTM once @tenzen-y's comments are addressed.
TrainingRuntimeContainerRuntimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
ClusterTrainingRuntimeContainerRuntimeClassKey = ".clusterTrainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"

These could probably be const.

It seems our tests need to intentionally override these keys to simulate missing indexers.
trainer/pkg/runtime/framework/core/framework_test.go, lines 140 to 143 in b918411
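To illustrate the reply above, here is a minimal sketch of why a package-level var (rather than a const) lets a test simulate a missing indexer. Only the key name and its value come from this PR. The registry map, lookupIndexer, and withKey are hypothetical helpers invented for the example and are not the project's actual test code.

```go
package main

import "fmt"

// Declared as a var, not a const, so that a test can reassign it.
var TrainingRuntimeContainerRuntimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"

// registered is a hypothetical stand-in for the set of field indexers
// that have been set up; it is not part of the real code base.
var registered = map[string]bool{
	".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName": true,
}

// lookupIndexer (hypothetical) fails when no indexer exists for the key.
func lookupIndexer(key string) error {
	if !registered[key] {
		return fmt.Errorf("indexer %q is not registered", key)
	}
	return nil
}

// withKey (hypothetical) temporarily overrides the key, runs fn,
// and restores the original value afterwards.
func withKey(key string, fn func()) {
	orig := TrainingRuntimeContainerRuntimeClassKey
	defer func() { TrainingRuntimeContainerRuntimeClassKey = orig }()
	TrainingRuntimeContainerRuntimeClassKey = key
	fn()
}

func main() {
	// With the real key, the lookup succeeds.
	fmt.Println(lookupIndexer(TrainingRuntimeContainerRuntimeClassKey) == nil)

	// A test can point the key at a non-existent index to hit the error path.
	withKey(".missing", func() {
		fmt.Println(lookupIndexer(TrainingRuntimeContainerRuntimeClassKey) != nil)
	})
}
```

With a const, the reassignment inside withKey would not compile, so the error path could only be reached by changing how indexers are registered.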
andreyvelich left a comment:
Thanks for this effort @Doris-xm!
Overall LGTM, just a small nit.
})
}

// Test Validate()

I think the Validate test can be a separate function, like we did for torch.
We don't have one for coscheduling, since that plugin has no validation.

Yes, I've tested it separately.
The UT now has only two functions: TestVolcano and TestValidate.
/lgtm
/lgtm Thanks
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com>
…cano-podgroup-build
53e8d51 to 4175fef
@Doris-xm Thank you 👍
/lgtm
/approve
I have seen @andreyvelich and @astefanutti added LGTM.
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, tenzen-y
* feat: api for volcano scheduling plugin
* feat: init volcano-plugin
* feat: init test file
* feat: register volcano plugin
* feat: deal with minTaskMember, minMember, NetworkTopo
* fix: calculate of minResource
* test: build PodGroup test
* refactor: separate to 2 prs(build&handler)
* test: add test for new&reconcile_builder
* fix: typo
* fix: trainer/v2 import
* fix: networktopo type
* fix: OpenAPI validation errors
* fix: remove minTaskMembers
* test: test coverage 100%
* feat: update apis
* feat: replace testify
* fix: registry Volcano CRDs to the scheme
* fix: add volcano to scheme
* fix: fix networktopo schema
* fix: add networktopo spec in trainer
* fix: unit test
* feat: import networkTopo directly
* fix: make generate
* fix: make generate
* fix: golangci-lint
* fix: golangci-lint
* feat: add volcano installation in integration test
* fix: filter volcano api
* fix: get volcano.podgroup with local version
* fix: init test env with volcano podgroup installed
* fix: check plugin in enforcePodgroupPolicy
* fix: group-name label in unit test
* fix: ReconcilerBuilders
* feat: add PodGroupHandler
* feat: unit test for handlers
* fix: group name annotation
* Update hack/swagger/main.go
* fix: no need to delete RBAC
* Update pkg/runtime/framework/plugins/volcano/indexer.go
* fix: nil checking for trainjob
* Update pkg/runtime/framework/plugins/volcano/volcano.go
* fix: make generate
* fix: index conflict
* Update pkg/runtime/framework/plugins/coscheduling/coscheduling.go
* fix: update volcano to v1.12.2
* feat: re-use indexer
* feat: add validate
* fix: no scheduler when coscheduling is nil
* fix: put group-name in annotations
* feat: validate if priorityClass installed
* feat: propagate annotations to pod
* feat: integration test for volcano
* fix: golangci-lint check
* feat: use shared indexer
* feat: remove indexer to runtime/
* Update hack/swagger/main.go
* Update hack/swagger/main.go
* fix: append owner reference & missing import
* fix: rewrite volcano UT
* feat: add copyright
* fix: sync RBAC to Helm charts
* fix: refactor UTs
* fix: test validation separately
* Update hack/swagger/main.go
* fix: refactor TestVolcano
* fix: refactor TestValidate
* Update pkg/runtime/framework/plugins/volcano/volcano_test.go
---------
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Shao Wang <2690692950@qq.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
What this PR does / why we need it:
This PR implements PodGroup creation and integration to support the Volcano Scheduler. It introduces a Volcano plugin that:
* Creates a PodGroup CR for each TrainJob based on the scheduling policy defined in TrainingRuntime.
* Sets the PodGroup fields:
  * minMember: the total minimum number of Pods required to start the job.
  * minTaskMember: minimum required replicas per PodSet (task).
  * minResources: total resource request across all PodSets.
  * queue, priorityClassName, and networkTopology, if explicitly defined.
Which issue(s) this PR fixes:
Part of #2671
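As a rough illustration of the minMember and minResources fields described in the PR summary, the sketch below sums Pod counts and resource requests across PodSets. It is a simplified assumption, not the plugin's code. The PodSet type here is a hypothetical stand-in, resources are plain integers rather than Kubernetes quantities, and multiplying per-Pod requests by the replica count is an assumed interpretation of "total resource request across all PodSets".

```go
package main

import "fmt"

// PodSet is a hypothetical, simplified stand-in for a group of identical Pods.
type PodSet struct {
	Name     string
	Count    int32
	Requests map[string]int64 // per-Pod requests, e.g. CPU in millicores, memory in MiB
}

// minMember returns the total number of Pods across all PodSets.
func minMember(podSets []PodSet) int32 {
	var total int32
	for _, ps := range podSets {
		total += ps.Count
	}
	return total
}

// minResources sums per-Pod requests multiplied by the replica count
// of each PodSet (an assumption about how the total is defined).
func minResources(podSets []PodSet) map[string]int64 {
	total := map[string]int64{}
	for _, ps := range podSets {
		for name, qty := range ps.Requests {
			total[name] += qty * int64(ps.Count)
		}
	}
	return total
}

func main() {
	podSets := []PodSet{
		{Name: "launcher", Count: 1, Requests: map[string]int64{"cpu": 500, "memory": 1024}},
		{Name: "node", Count: 4, Requests: map[string]int64{"cpu": 2000, "memory": 4096}},
	}
	fmt.Println(minMember(podSets))    // 5
	fmt.Println(minResources(podSets)) // map[cpu:8500 memory:17408]
}
```

Gang scheduling with these values means the scheduler would admit the job only when all 5 Pods and the summed resources can be satisfied together.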
Checklist: