Skip to content

feat: KEP-2437 - PodGroup Creation for Volcano Scheduler#2729

Merged
google-oss-prow[bot] merged 77 commits into
kubeflow:masterfrom
Doris-xm:volcano-podgroup-build
Oct 6, 2025
Merged

feat: KEP-2437 - PodGroup Creation for Volcano Scheduler#2729
google-oss-prow[bot] merged 77 commits into
kubeflow:masterfrom
Doris-xm:volcano-podgroup-build

Conversation

@Doris-xm
Copy link
Copy Markdown
Contributor

@Doris-xm Doris-xm commented Jul 14, 2025

What this PR does / why we need it:
This PR implements PodGroup creation and integration to support the Volcano Scheduler. It introduces a Volcano plugin that:

  • Creates PodGroup CR for each TrainJob based on the scheduling policy defined in TrainingRuntime.
  • Calculates and sets scheduling parameters:
    • minMember: the total minimum number of Pods required to start the job.
    • minTaskMember: minimum required replicas per PodSet (task).
    • minResources: total resource request across all PodSets.
  • Sets optional scheduling hints, like queue, priorityClassName, and networkTopology, if explicitly defined.
  • Unit Tests

Which issue(s) this PR fixes :
Part of #2671

Checklist:

  • Docs included if any changes are user facing

Doris-xm added 9 commits June 29, 2025 14:33
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm
Copy link
Copy Markdown
Contributor Author

/area gsoc

@google-oss-prow
Copy link
Copy Markdown

@Doris-xm: GitHub didn't allow me to request PR reviews from the following users: rudeigerc.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @Electronic-Waste @rudeigerc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Doris-xm
Copy link
Copy Markdown
Contributor Author

/cc @Electronic-Waste @rudeigerc

@google-oss-prow
Copy link
Copy Markdown

@Doris-xm: GitHub didn't allow me to request PR reviews from the following users: rudeigerc.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @Electronic-Waste @rudeigerc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Doris-xm added 3 commits July 14, 2025 20:53
Signed-off-by: Xinmin Du <2812493086@qq.com>
# Conflicts:
#	pkg/runtime/framework/plugins/registry.go
Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm Doris-xm changed the title KEP-2437: PodGroup creation for Volcano Scheduler feat: KEP-2437 - PodGroup Creation for Volcano Scheduler Jul 14, 2025
Doris-xm added 4 commits July 14, 2025 22:27
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Copy link
Copy Markdown
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Doris-xm Thanks for this. I left my very initial reviews.

Comment thread go.mod
Comment thread pkg/apis/trainer/v1alpha1/trainingruntime_types.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano.go
@Electronic-Waste
Copy link
Copy Markdown
Member

Can you rebase this branch to solve the conflicts? @Doris-xm

Comment thread go.mod Outdated
@Electronic-Waste
Copy link
Copy Markdown
Member

/cc @kubeflow/kubeflow-trainer-team @astefanutti @rudeigerc @Monokaix @JesseStutler

Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Copy link
Copy Markdown
Contributor

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only left a nit, otherwise LGTM once @tenzen-y comments will be addressed.

Comment on lines +35 to +36
TrainingRuntimeContainerRuntimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
ClusterTrainingRuntimeContainerRuntimeClassKey = ".clusterTrainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could probably be const.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems our tests need to intentionally override these keys to simulate missing indexers.

if tc.emptyCoSchedulingIndexerTrainingRuntimeContainerRuntimeClassKey {
originTrainingRuntimeRuntimeKey := coscheduling.TrainingRuntimeContainerRuntimeClassKey
coscheduling.TrainingRuntimeContainerRuntimeClassKey = ""
t.Cleanup(func() {

Signed-off-by: Xinmin Du <2812493086@qq.com>
Copy link
Copy Markdown
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this effort @Doris-xm!
Overall lgtm
just small nit.

})
}

// Test Validate()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for Validate test, we can have another function, like we did for torch:

func TestValidate(t *testing.T) {

We don't have it for coscheduling, since we don't have any validation for that plugin.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've tested it separately.

func TestValidate(t *testing.T) {

And the UT only have two fucntions, TestVolcano and TestValidate, now.

Signed-off-by: Xinmin Du <2812493086@qq.com>
@andreyvelich
Copy link
Copy Markdown
Member

/lgtm
/approve
/hold for the final approval from @tenzen-y and @astefanutti

@astefanutti
Copy link
Copy Markdown
Contributor

/lgtm

Thanks

Comment thread hack/swagger/main.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>
@google-oss-prow google-oss-prow Bot removed the lgtm label Oct 4, 2025
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm Doris-xm force-pushed the volcano-podgroup-build branch from 53e8d51 to 4175fef Compare October 5, 2025 04:55
Copy link
Copy Markdown
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y @Doris-xm Are there any additional comments before we can merge this PR?

Copy link
Copy Markdown
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm, thanks!

Comment thread pkg/runtime/framework/plugins/volcano/volcano_test.go Outdated
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>
Copy link
Copy Markdown
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Doris-xm Thank you 👍
/lgtm
/approve

I have seen @andreyvelich and @astefanutti added LGTM.

/hold cancel

@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [andreyvelich,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit 4e17ce5 into kubeflow:master Oct 6, 2025
26 checks passed
alexxfan pushed a commit to red-hat-data-services/trainer that referenced this pull request Nov 24, 2025
* feat: api for volcano scheduling plugin

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: init volcano-plugin

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: init test file

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: register volcano plugin

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: deal with minTaskMember, minMember, NetworkTopo

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: calculate of minResource

Signed-off-by: Xinmin Du <2812493086@qq.com>

* test: build PodGroup test

Signed-off-by: Xinmin Du <2812493086@qq.com>

* refactor: separate to 2 prs(build&handler)

Signed-off-by: Xinmin Du <2812493086@qq.com>

* test: add test for new&reconcile_builder

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: typo

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: trainer/v2 import

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: networktopo type

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: OpenAPI validation errors

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: remove minTaskMembers

Signed-off-by: Xinmin Du <2812493086@qq.com>

* test: test coverage 100%

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: update apis

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: replace testify

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: registry Volcano CRDs to the scheme

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: add volcano to scheme

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: fix networktopo schema

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: add networktopo spec in trainer

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: unit test

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: import networkTopo directly

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: make generate

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: make generate

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: golangci-lint

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: golangci-lint

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: add volcano installation in integration test

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: filter volcano api

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: get volcano.podgroup  with local version

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: init test env with volcano podgroup installed

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: check plugin in enforcePodgroupPolicy

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: group-name label in unit test

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: ReconcilerBuilders

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: add PodGroupHandler

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: unit test for handlers

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: group name annotation

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update hack/swagger/main.go

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* fix: no need to delete RBAC

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update pkg/runtime/framework/plugins/volcano/indexer.go

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* fix: nil checking for trainjob

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update pkg/runtime/framework/plugins/volcano/volcano.go

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* fix: make generate

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: index conflict

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update pkg/runtime/framework/plugins/coscheduling/coscheduling.go

Co-authored-by: Shao Wang <2690692950@qq.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* fix: update volcano to v1.12.2

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: re-use indexer

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: add validate

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: no scheduler when coscheduling is nil

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: put group-name in annotations

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: validate if priorityClass installed

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: propagate annotations to pod

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: integration test for volcano

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: golangci-lint check

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: use shared indexer

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: remove indexer to runtime/

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update hack/swagger/main.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* Update hack/swagger/main.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* fix: append owner reference & missing import

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: rewrite volcano UT

Signed-off-by: Xinmin Du <2812493086@qq.com>

* feat: add copyright

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: sync RBAC to Helm charts

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: refactor UTs

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: test validation separately

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update hack/swagger/main.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* fix: refactor TestVolcano

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: refactor TestValidate

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update pkg/runtime/framework/plugins/volcano/volcano_test.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

---------

Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Shao Wang <2690692950@qq.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants