Merged
77 commits
8867e71
feat: api for volcano scheduling plugin
Doris-xm Jun 16, 2025
0cd20fc
feat: init volcano-plugin
Doris-xm Jun 29, 2025
2e911ef
feat: init test file
Doris-xm Jun 29, 2025
ed0425f
feat: register volcano plugin
Doris-xm Jun 30, 2025
fca9f40
feat: deal with minTaskMember, minMember, NetworkTopo
Doris-xm Jul 13, 2025
8ab5d71
fix: calculate of minResource
Doris-xm Jul 13, 2025
bd60987
test: build PodGroup test
Doris-xm Jul 13, 2025
ec6c5a8
refactor: separate to 2 prs(build&handler)
Doris-xm Jul 14, 2025
f182b65
test: add test for new&reconcile_builder
Doris-xm Jul 14, 2025
ffa27f9
fix: typo
Doris-xm Jul 14, 2025
f8ea7dd
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Jul 14, 2025
4fa3b6a
fix: trainer/v2 import
Doris-xm Jul 14, 2025
1a247da
fix: networktopo type
Doris-xm Jul 14, 2025
73bb476
fix: OpenAPI validation errors
Doris-xm Jul 14, 2025
b92383a
fix: remove minTaskMembers
Doris-xm Jul 28, 2025
fe6ffd0
test: test coverage 100%
Doris-xm Jul 28, 2025
1c33eba
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Aug 3, 2025
7cbad55
feat: update apis
Doris-xm Aug 4, 2025
cdd9309
feat: replace testify
Doris-xm Aug 4, 2025
8e68bba
fix: registry Volcano CRDs to the scheme
Doris-xm Aug 4, 2025
79618cb
fix: add volcano to scheme
Doris-xm Aug 10, 2025
8814111
fix: fix networktopo schema
Doris-xm Aug 18, 2025
114f11b
fix: add networktopo spec in trainer
Doris-xm Aug 18, 2025
c8fa0fd
fix: unit test
Doris-xm Aug 20, 2025
f8d8912
feat: import networkTopo directly
Doris-xm Aug 20, 2025
c084ac8
fix: make generate
Doris-xm Aug 21, 2025
f65d7a7
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Aug 21, 2025
aa90695
fix: make generate
Doris-xm Aug 21, 2025
83c0585
fix: golangci-lint
Doris-xm Aug 21, 2025
6f24588
fix: golangci-lint
Doris-xm Aug 25, 2025
d2fd159
feat: add volcano installation in integration test
Doris-xm Aug 25, 2025
e95a158
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Aug 25, 2025
d582a81
fix: filter volcano api
Doris-xm Sep 1, 2025
fe19174
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Sep 1, 2025
deafc5d
fix: get volcano.podgroup with local version
Doris-xm Sep 1, 2025
d8bae4a
fix: init test env with volcano podgroup installed
Doris-xm Sep 1, 2025
cf578a2
fix: check plugin in enforcePodgroupPolicy
Doris-xm Sep 6, 2025
a332de4
fix: group-name label in unit test
Doris-xm Sep 6, 2025
a4f09e6
fix: ReconcilerBuilders
Doris-xm Sep 7, 2025
71996b6
feat: add PodGroupHandler
Doris-xm Sep 8, 2025
99e0462
feat: unit test for handlers
Doris-xm Sep 9, 2025
583b1d6
fix: group name annotation
Doris-xm Sep 15, 2025
e60a5a9
Update hack/swagger/main.go
Doris-xm Sep 15, 2025
ef25760
fix: no need to delete RBAC
Doris-xm Sep 15, 2025
28bcd41
Update pkg/runtime/framework/plugins/volcano/indexer.go
Doris-xm Sep 15, 2025
e2b1d89
fix: nil checking for trainjob
Doris-xm Sep 15, 2025
d037828
Update pkg/runtime/framework/plugins/volcano/volcano.go
Doris-xm Sep 15, 2025
aaf47a2
Merge remote-tracking branch 'origin/volcano-podgroup-build' into vol…
Doris-xm Sep 15, 2025
9c1c5e0
fix: make generate
Doris-xm Sep 15, 2025
8497e6b
fix: index conflict
Doris-xm Sep 16, 2025
67bbedb
Update pkg/runtime/framework/plugins/coscheduling/coscheduling.go
Doris-xm Sep 16, 2025
04aa3e8
fix: update volcano to v1.12.2
Doris-xm Sep 16, 2025
e044c4f
Merge remote-tracking branch 'origin/volcano-podgroup-build' into vol…
Doris-xm Sep 16, 2025
167a595
feat: re-use indexer
Doris-xm Sep 16, 2025
d52581e
feat: add validate
Doris-xm Sep 19, 2025
a8be8ca
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Sep 19, 2025
8756dc7
fix: no scheduler when coscheduling is nil
Doris-xm Sep 19, 2025
7ba807e
fix: put group-name in annotations
Doris-xm Sep 19, 2025
6075a0c
feat: validate if priorityClass installed
Doris-xm Sep 21, 2025
132bb16
feat: propagate annotations to pod
Doris-xm Sep 22, 2025
7065606
feat: integration test for volcano
Doris-xm Sep 23, 2025
e6c7646
fix: golangci-lint check
Doris-xm Sep 23, 2025
2eb8629
feat: use shared indexer
Doris-xm Sep 28, 2025
4ede544
feat: remove indexer to runtime/
Doris-xm Sep 28, 2025
ef69e38
Update hack/swagger/main.go
Doris-xm Oct 1, 2025
63393c5
Update hack/swagger/main.go
Doris-xm Oct 1, 2025
2fad831
fix: append owner reference & missing import
Doris-xm Oct 1, 2025
f0e5a4c
fix: rewrite volcano UT
Doris-xm Oct 1, 2025
9c1eae1
feat: add copyright
Doris-xm Oct 1, 2025
e185049
fix: sync RBAC to Helm charts
Doris-xm Oct 1, 2025
497a0b6
fix: refactor UTs
Doris-xm Oct 2, 2025
4030ce5
fix: test validation separately
Doris-xm Oct 3, 2025
c886583
Update hack/swagger/main.go
Doris-xm Oct 4, 2025
451bf6e
fix: refactor TestVolcano
Doris-xm Oct 5, 2025
49bd5b9
Merge remote-tracking branch 'origin/volcano-podgroup-build' into vol…
Doris-xm Oct 5, 2025
4175fef
fix: refactor TestValidate
Doris-xm Oct 5, 2025
1f05304
Update pkg/runtime/framework/plugins/volcano/volcano_test.go
Doris-xm Oct 6, 2025
3 changes: 3 additions & 0 deletions go.mod
@@ -8,6 +8,7 @@ require (
github.com/onsi/ginkgo/v2 v2.22.2
github.com/onsi/gomega v1.36.2
github.com/open-policy-agent/cert-controller v0.12.0
github.com/stretchr/testify v1.10.0
go.uber.org/zap v1.27.0
golang.org/x/crypto v0.36.0
k8s.io/api v0.32.2
@@ -23,6 +24,7 @@ require (
sigs.k8s.io/kind v0.27.0
sigs.k8s.io/scheduler-plugins v0.30.6
sigs.k8s.io/structured-merge-diff/v4 v4.5.0
volcano.sh/apis v1.12.1
)

require (
@@ -59,6 +61,7 @@ require (
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/pelletier/go-toml v1.9.5 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
github.com/prometheus/client_golang v1.21.0 // indirect
github.com/prometheus/client_model v0.6.1 // indirect
github.com/prometheus/common v0.62.0 // indirect
2 changes: 2 additions & 0 deletions go.sum
@@ -223,3 +223,5 @@ sigs.k8s.io/structured-merge-diff/v4 v4.5.0 h1:nbCitCK2hfnhyiKo6uf2HxUPTCodY6Qaf
sigs.k8s.io/structured-merge-diff/v4 v4.5.0/go.mod h1:N8f93tFZh9U6vpxwRArLiikrE5/2tiu1w1AGfACIGE4=
sigs.k8s.io/yaml v1.4.0 h1:Mk1wCc2gy/F0THH0TAp1QYyJNzRm2KCLy3o5ASXVI5E=
sigs.k8s.io/yaml v1.4.0/go.mod h1:Ejl7/uTz7PSA4eKMyQCUTnhZYNmLIl+5c2lQPGR2BPY=
volcano.sh/apis v1.12.1 h1:yq5dVj/g21vnWObCIKsJKPhMoThpzDrHDD/GMouYVxk=
volcano.sh/apis v1.12.1/go.mod h1:0XNNnIOevJSYNiXRmwhXUrYCcCcWcBeTY0nxrlkk03A=
23 changes: 22 additions & 1 deletion pkg/apis/trainer/v1alpha1/trainingruntime_types.go
@@ -21,6 +21,7 @@ import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/util/intstr"
jobsetv1alpha2 "sigs.k8s.io/jobset/api/jobset/v1alpha2"
volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

const (
@@ -134,7 +135,8 @@ type PodGroupPolicySource struct {
// Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling.
Coscheduling *CoschedulingPodGroupPolicySource `json:"coscheduling,omitempty"`

// TODO (andreyvelich): Add support for Volcano gang-scheduler.
// Volcano plugin for gang-scheduling.
Volcano *VolcanoPodGroupPolicySource `json:"volcano,omitempty"`
}

// CoschedulingPodGroupPolicySource represents configuration for coscheduling plugin.
@@ -147,6 +149,25 @@ type CoschedulingPodGroupPolicySource struct {
ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}

// VolcanoPodGroupPolicySource represents configuration for the Volcano gang-scheduler.
type VolcanoPodGroupPolicySource struct {

Hi, I would like to know why other fields are not present, such as queue, minMember, and minResource?

Contributor Author

As introduced here, we'll use an annotation to inject the queue.

- `Queue`: A collection of PodGroups, which adopts `FIFO`. It is also used as the basis for resource division.
- It is configured via the `scheduling.volcano.sh/queue-name` annotation. The value is initially set in the TrainingRuntime, but can **be overridden by the TrainJob**.
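
The override order described above can be sketched as follows. This is a minimal illustration using plain maps, not the plugin's actual code: `resolveQueue` is a hypothetical helper, and only the annotation key is taken from the comment.

```go
package main

import "fmt"

// queueAnnotation is the annotation key quoted in the comment above.
const queueAnnotation = "scheduling.volcano.sh/queue-name"

// resolveQueue is a hypothetical helper. It prefers the queue set on the
// TrainJob and falls back to the one set on the TrainingRuntime. The boolean
// reports whether any queue was configured at all.
func resolveQueue(runtimeAnnotations, trainJobAnnotations map[string]string) (string, bool) {
	if q := trainJobAnnotations[queueAnnotation]; q != "" {
		return q, true
	}
	if q := runtimeAnnotations[queueAnnotation]; q != "" {
		return q, true
	}
	return "", false
}

func main() {
	runtime := map[string]string{queueAnnotation: "research"}
	trainJob := map[string]string{queueAnnotation: "high-priority"}

	// Reading from a nil map is safe in Go and yields the zero value.
	q, _ := resolveQueue(runtime, nil)
	fmt.Println(q) // research

	q, _ = resolveQueue(runtime, trainJob)
	fmt.Println(q) // high-priority
}
```

Returning a boolean rather than a default name leaves the choice of fallback queue to the caller.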

minMember and minResource are calculated from the PodSet.

- `MinMember`: Defines the minimum number of members/tasks required to run the PodGroup. This is the total count of all Pods in the PodSet.
- `MinResources`: Defines the minimal resource of members/tasks to run the pod group. This is the sum of resource requests (such as CPU and memory) for all Pods in the PodSet.
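
As a rough sketch of the two calculations above: the `podSet` type below is a simplified, hypothetical stand-in (the real PodSet and Kubernetes resource quantities are richer), with CPU expressed in millicores and memory in bytes so that plain integers suffice.

```go
package main

import "fmt"

// podSet is a simplified, hypothetical stand-in for a PodSet: a named group
// of identical Pods with per-Pod resource requests.
type podSet struct {
	Name     string
	Count    int32
	Requests map[string]int64 // per-Pod requests, e.g. "cpu" in millicores
}

// minMember returns the total number of Pods across all pod sets.
func minMember(sets []podSet) int32 {
	var total int32
	for _, ps := range sets {
		total += ps.Count
	}
	return total
}

// minResources sums the requests of every Pod in every pod set.
func minResources(sets []podSet) map[string]int64 {
	total := map[string]int64{}
	for _, ps := range sets {
		for name, perPod := range ps.Requests {
			total[name] += perPod * int64(ps.Count)
		}
	}
	return total
}

func main() {
	sets := []podSet{
		{Name: "launcher", Count: 1, Requests: map[string]int64{"cpu": 500}},
		{Name: "node", Count: 4, Requests: map[string]int64{"cpu": 2000, "memory": 1 << 30}},
	}
	fmt.Println(minMember(sets))           // 5
	fmt.Println(minResources(sets)["cpu"]) // 8500
}
```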

Monokaix Sep 12, 2025

Great, so users cannot set a custom minMember; it will always be equal to the number of replicas?


@Doris-xm currently volcano uses 'scheduling.k8s.io/group-name' as group name annotation, not 'scheduling.volcano.sh/group-name', as can be seen in https://github.com/volcano-sh/volcano/blob/5007c6b010a44518ec2d946c6d126f2dfdadc980/pkg/controllers/podgroup/pg_controller_handler.go#L189 cc @Monokaix


Actually, I also want to know why all these attributes are not put in VolcanoPodGroupPolicySource?

Member

As we discussed in the KEP, we want to make it consistent with other schedulers like Coscheduling or Kueue: #2672 (comment)
Since we can set Queue name in the Pod labels and dynamically calculate the minMembers in the plugin, adding those fields in the VolcanoPodGroupPolicySource API is unnecessary.

// Queue is the name of the Volcano queue to which the PodGroup will be submitted.
// Defaults to the “default” queue, which has the lowest weight.
// +kubebuilder:default=default
Queue *string `json:"queue,omitempty"`

// PriorityClassName is the name of the Kubernetes PriorityClass to use for the PodGroup.
// e.g. system-node-critical, system-cluster-critical.
// This field is optional.
PriorityClassName *string `json:"priorityClassName,omitempty"`

// NetworkTopology defines the NetworkTopology config, this field works in conjunction with network topology feature and hyperNode CRD.
// +kubebuilder:validation:EmbeddedResource
// +kubebuilder:pruning:PreserveUnknownFields
// +optional
NetworkTopology *volcanov1beta1.NetworkTopologySpec `json:"networkTopology,omitempty"`
}

// MLPolicy represents configuration for the model training with ML-specific parameters.
// +kubebuilder:validation:XValidation:rule="!(has(self.numNodes) && (has(self.torch) && has(self.torch.elasticPolicy)))", message="numNodes should not be set if torch.elasticPolicy is configured"
// +kubebuilder:validation:XValidation:rule="!(has(self.torch) && has(self.mpi))", message="Only one of the policy can be configured"
2 changes: 2 additions & 0 deletions pkg/runtime/framework/plugins/registry.go
@@ -27,13 +27,15 @@ import (
"github.com/kubeflow/trainer/v2/pkg/runtime/framework/plugins/mpi"
"github.com/kubeflow/trainer/v2/pkg/runtime/framework/plugins/plainml"
"github.com/kubeflow/trainer/v2/pkg/runtime/framework/plugins/torch"
"github.com/kubeflow/trainer/v2/pkg/runtime/framework/plugins/volcano"
)

type Registry map[string]func(ctx context.Context, client client.Client, indexer client.FieldIndexer) (framework.Plugin, error)

func NewRegistry() Registry {
return Registry{
coscheduling.Name: coscheduling.New,
volcano.Name: volcano.New,
mpi.Name: mpi.New,
plainml.Name: plainml.New,
torch.Name: torch.New,
56 changes: 56 additions & 0 deletions pkg/runtime/framework/plugins/volcano/indexer.go
@@ -0,0 +1,56 @@
/*
Copyright 2024 The Kubeflow Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package volcano

import (
"sigs.k8s.io/controller-runtime/pkg/client"

trainer "github.com/kubeflow/trainer/v2/pkg/apis/trainer/v1alpha1"
)

Member

@Doris-xm @Electronic-Waste @tenzen-y @astefanutti Can we share the indexer across all gang-scheduling plugins?

Contributor Author

Doris-xm Sep 16, 2025

We'll hit an indexer conflict panic if we use the same index key.

--- FAIL: TestNew (0.09s)
    --- FAIL: TestNew/positive_case (0.09s)
panic: indexer conflict: field .trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName for GroupVersionKind trainer.kubeflow.org/v1alpha1, Kind=TrainingRuntime is already indexed [recovered]

The root cause is that the framework initializes all registered plugins. Both the Coscheduling and Volcano plugins are initialized during Framework.New:

fwk, err := fwkcore.New(ctx, c, fwkplugins.NewRegistry(), indexer)

for name, factory := range r {
	plugin, err := factory(ctx, c, indexer)

Currently, I use different keys to avoid conflicts.

TrainingRuntimeContainerRuntimeClassKey = ".volcano.trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
ClusterTrainingRuntimeContainerRuntimeClassKey = ".volcano.clusterTrainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"

Is there a better solution? @andreyvelich @Electronic-Waste @rudeigerc
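
The conflict described above can be reproduced in miniature. The sketch below is illustrative only: `toyIndexer`, `selfIndexing`, `noIndexing`, and `initAll` are hypothetical stand-ins, not the controller-runtime or Trainer types. It first shows two plugin constructors registering the same key, then one possible alternative in which the index is registered once before any plugin is constructed.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict mimics the "indexer conflict" failure quoted above.
var errConflict = errors.New("indexer conflict")

// toyIndexer is a hypothetical stand-in for a field indexer that rejects
// registering the same (kind, field) pair twice.
type toyIndexer struct {
	seen map[string]bool
}

func newToyIndexer() *toyIndexer { return &toyIndexer{seen: map[string]bool{}} }

// IndexField records the pair or returns errConflict if it already exists.
func (t *toyIndexer) IndexField(kind, field string) error {
	key := kind + "|" + field
	if t.seen[key] {
		return fmt.Errorf("%w: field %s for %s is already indexed", errConflict, field, kind)
	}
	t.seen[key] = true
	return nil
}

// factory is a simplified plugin constructor.
type factory func(idx *toyIndexer) error

const runtimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"

// selfIndexing models a plugin that registers the index in its constructor.
func selfIndexing(idx *toyIndexer) error {
	return idx.IndexField("TrainingRuntime", runtimeClassKey)
}

// noIndexing models a plugin that relies on an index registered elsewhere.
func noIndexing(*toyIndexer) error { return nil }

// initAll constructs every plugin in order and stops at the first error.
func initAll(idx *toyIndexer, factories []factory) error {
	for _, f := range factories {
		if err := f(idx); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Two plugins registering the same key: the second registration fails.
	err := initAll(newToyIndexer(), []factory{selfIndexing, selfIndexing})
	fmt.Println(errors.Is(err, errConflict)) // true

	// Register the shared index once, then construct plugins that only use it.
	idx := newToyIndexer()
	if err := idx.IndexField("TrainingRuntime", runtimeClassKey); err != nil {
		panic(err)
	}
	fmt.Println(initAll(idx, []factory{noIndexing, noIndexing})) // <nil>
}
```

With registration moved out of the constructors, a single key serves both plugins and the key prefix is no longer needed to keep them apart.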

Member

@Doris-xm Can you please explore whether we can create a single indexer that we can re-use across the Volcano and Coscheduling plugins?

Contributor Author

Yes, I will do that. But I still think the problem is the conflicting index key.

Contributor Author

I've created a single indexer:

But I keep two different keys to avoid conflicts as I stated above.

TrainingRuntimeContainerRuntimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
ClusterTrainingRuntimeContainerRuntimeClassKey = ".clusterTrainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
VolcanoTrainingRuntimeContainerRuntimeClassKey = ".volcano.trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
VolcanoClusterTrainingRuntimeContainerRuntimeClassKey = ".volcano.clusterTrainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"

I still wonder if there's a better solution.

var (
	TrainingRuntimeContainerRuntimeClassKey        = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
	ClusterTrainingRuntimeContainerRuntimeClassKey = ".clusterTrainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"
)

func IndexTrainingRuntimeContainerRuntimeClass(obj client.Object) []string {
	runtime, ok := obj.(*trainer.TrainingRuntime)
	if !ok {
		return nil
	}
	var runtimeClasses []string
	for _, rJob := range runtime.Spec.Template.Spec.ReplicatedJobs {
		if rJob.Template.Spec.Template.Spec.RuntimeClassName != nil {
			runtimeClasses = append(runtimeClasses, *rJob.Template.Spec.Template.Spec.RuntimeClassName)
		}
	}
	return runtimeClasses
}

func IndexClusterTrainingRuntimeContainerRuntimeClass(obj client.Object) []string {
	clRuntime, ok := obj.(*trainer.ClusterTrainingRuntime)
	if !ok {
		return nil
	}
	var runtimeClasses []string
	for _, rJob := range clRuntime.Spec.Template.Spec.ReplicatedJobs {
		if rJob.Template.Spec.Template.Spec.RuntimeClassName != nil {
			runtimeClasses = append(runtimeClasses, *rJob.Template.Spec.Template.Spec.RuntimeClassName)
		}
	}
	return runtimeClasses
}