Skip to content
Merged
Show file tree
Hide file tree
Changes from 72 commits
Commits
Show all changes
77 commits
Select commit Hold shift + click to select a range
8867e71
feat: api for volcano scheduling plugin
Doris-xm Jun 16, 2025
0cd20fc
feat: init volcano-plugin
Doris-xm Jun 29, 2025
2e911ef
feat: init test file
Doris-xm Jun 29, 2025
ed0425f
feat: register volcano plugin
Doris-xm Jun 30, 2025
fca9f40
feat: deal with minTaskMember, minMember, NetworkTopo
Doris-xm Jul 13, 2025
8ab5d71
fix: calculate of minResource
Doris-xm Jul 13, 2025
bd60987
test: build PodGroup test
Doris-xm Jul 13, 2025
ec6c5a8
refactor: separate to 2 prs(build&handler)
Doris-xm Jul 14, 2025
f182b65
test: add test for new&reconcile_builder
Doris-xm Jul 14, 2025
ffa27f9
fix: typo
Doris-xm Jul 14, 2025
f8ea7dd
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Jul 14, 2025
4fa3b6a
fix: trainer/v2 import
Doris-xm Jul 14, 2025
1a247da
fix: networktopo type
Doris-xm Jul 14, 2025
73bb476
fix: OpenAPI validation errors
Doris-xm Jul 14, 2025
b92383a
fix: remove minTaskMembers
Doris-xm Jul 28, 2025
fe6ffd0
test: test coverage 100%
Doris-xm Jul 28, 2025
1c33eba
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Aug 3, 2025
7cbad55
feat: update apis
Doris-xm Aug 4, 2025
cdd9309
feat: replace testify
Doris-xm Aug 4, 2025
8e68bba
fix: registry Volcano CRDs to the scheme
Doris-xm Aug 4, 2025
79618cb
fix: add volcano to scheme
Doris-xm Aug 10, 2025
8814111
fix: fix networktopo schema
Doris-xm Aug 18, 2025
114f11b
fix: add networktopo spec in trainer
Doris-xm Aug 18, 2025
c8fa0fd
fix: unit test
Doris-xm Aug 20, 2025
f8d8912
feat: import networkTopo directly
Doris-xm Aug 20, 2025
c084ac8
fix: make generate
Doris-xm Aug 21, 2025
f65d7a7
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Aug 21, 2025
aa90695
fix: make generate
Doris-xm Aug 21, 2025
83c0585
fix: golangci-lint
Doris-xm Aug 21, 2025
6f24588
fix: golangci-lint
Doris-xm Aug 25, 2025
d2fd159
feat: add volcano installation in integration test
Doris-xm Aug 25, 2025
e95a158
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Aug 25, 2025
d582a81
fix: filter volcano api
Doris-xm Sep 1, 2025
fe19174
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Sep 1, 2025
deafc5d
fix: get volcano.podgroup with local version
Doris-xm Sep 1, 2025
d8bae4a
fix: init test env with volcano podgroup installed
Doris-xm Sep 1, 2025
cf578a2
fix: check plugin in enforcePodgroupPolicy
Doris-xm Sep 6, 2025
a332de4
fix: group-name label in unit test
Doris-xm Sep 6, 2025
a4f09e6
fix: ReconcilerBuilders
Doris-xm Sep 7, 2025
71996b6
feat: add PodGroupHandler
Doris-xm Sep 8, 2025
99e0462
feat: unit test for handlers
Doris-xm Sep 9, 2025
583b1d6
fix: group name annotation
Doris-xm Sep 15, 2025
e60a5a9
Update hack/swagger/main.go
Doris-xm Sep 15, 2025
ef25760
fix: no need to delete RBAC
Doris-xm Sep 15, 2025
28bcd41
Update pkg/runtime/framework/plugins/volcano/indexer.go
Doris-xm Sep 15, 2025
e2b1d89
fix: nil checking for trainjob
Doris-xm Sep 15, 2025
d037828
Update pkg/runtime/framework/plugins/volcano/volcano.go
Doris-xm Sep 15, 2025
aaf47a2
Merge remote-tracking branch 'origin/volcano-podgroup-build' into vol…
Doris-xm Sep 15, 2025
9c1c5e0
fix: make generate
Doris-xm Sep 15, 2025
8497e6b
fix: index conflict
Doris-xm Sep 16, 2025
67bbedb
Update pkg/runtime/framework/plugins/coscheduling/coscheduling.go
Doris-xm Sep 16, 2025
04aa3e8
fix: update volcano to v1.12.2
Doris-xm Sep 16, 2025
e044c4f
Merge remote-tracking branch 'origin/volcano-podgroup-build' into vol…
Doris-xm Sep 16, 2025
167a595
feat: re-use indexer
Doris-xm Sep 16, 2025
d52581e
feat: add validate
Doris-xm Sep 19, 2025
a8be8ca
Merge branch 'refs/heads/master' into volcano-podgroup-build
Doris-xm Sep 19, 2025
8756dc7
fix: no scheduler when coscheduling is nil
Doris-xm Sep 19, 2025
7ba807e
fix: put group-name in annotations
Doris-xm Sep 19, 2025
6075a0c
feat: validate if priorityClass installed
Doris-xm Sep 21, 2025
132bb16
feat: propagate annotations to pod
Doris-xm Sep 22, 2025
7065606
feat: integration test for volcano
Doris-xm Sep 23, 2025
e6c7646
fix: golangci-lint check
Doris-xm Sep 23, 2025
2eb8629
feat: use shared indexer
Doris-xm Sep 28, 2025
4ede544
feat: remove indexer to runtime/
Doris-xm Sep 28, 2025
ef69e38
Update hack/swagger/main.go
Doris-xm Oct 1, 2025
63393c5
Update hack/swagger/main.go
Doris-xm Oct 1, 2025
2fad831
fix: append owner reference & missing import
Doris-xm Oct 1, 2025
f0e5a4c
fix: rewrite volcano UT
Doris-xm Oct 1, 2025
9c1eae1
feat: add copyright
Doris-xm Oct 1, 2025
e185049
fix: sync RBAC to Helm charts
Doris-xm Oct 1, 2025
497a0b6
fix: refactor UTs
Doris-xm Oct 2, 2025
4030ce5
fix: test validation separately
Doris-xm Oct 3, 2025
c886583
Update hack/swagger/main.go
Doris-xm Oct 4, 2025
451bf6e
fix: refactor TestVolcano
Doris-xm Oct 5, 2025
49bd5b9
Merge remote-tracking branch 'origin/volcano-podgroup-build' into vol…
Doris-xm Oct 5, 2025
4175fef
fix: refactor TestValidate
Doris-xm Oct 5, 2025
1f05304
Update pkg/runtime/framework/plugins/volcano/volcano_test.go
Doris-xm Oct 6, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,15 @@ scheduler-plugins-crd: ## Copy the CRDs from the Scheduler Plugins repository to
mkdir -p $(EXTERNAL_CRDS_DIR)/scheduler-plugins/
cp -f $(SCHEDULER_PLUGINS_ROOT)/manifests/coscheduling/* $(EXTERNAL_CRDS_DIR)/scheduler-plugins

VOLCANO_APIS_ROOT = $(shell go list -m -f "{{.Dir}}" volcano.sh/apis)
VOLCANO_VERSION = $(shell basename $(VOLCANO_APIS_ROOT) | cut -d'@' -f2)
VOLCANO_CRD_URL = https://raw.githubusercontent.com/volcano-sh/volcano/$(VOLCANO_VERSION)/config/crd/volcano/bases/scheduling.volcano.sh_podgroups.yaml
Comment thread
Doris-xm marked this conversation as resolved.

.PHONY: volcano-crd
volcano-crd: ## Copy the CRDs from Volcano repository to the manifests/external-crds directory.
mkdir -p $(EXTERNAL_CRDS_DIR)/volcano/
curl -sSL $(VOLCANO_CRD_URL) -o $(EXTERNAL_CRDS_DIR)/volcano/scheduling.volcano.sh_podgroups.yaml

# Instructions for code generation.
.PHONY: manifests
manifests: controller-gen ## Generate manifests.
Expand Down Expand Up @@ -155,7 +164,7 @@ test: ## Run Go unit test.
go test $(shell go list ./... | grep -Ev '/(test|cmd|hack|pkg/apis|pkg/client|pkg/util/testing)') -coverprofile cover.out

.PHONY: test-integration
test-integration: ginkgo envtest jobset-operator-crd scheduler-plugins-crd ## Run Go integration test.
test-integration: ginkgo envtest jobset-operator-crd scheduler-plugins-crd volcano-crd ## Run Go integration test.
KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(K8S_VERSION) -p path)" $(GINKGO) -v ./test/integration/...

.PHONY: test-python
Expand Down
44 changes: 44 additions & 0 deletions api/openapi-spec/swagger.json
Original file line number Diff line number Diff line change
Expand Up @@ -13504,6 +13504,20 @@
}
}
},
"scheduling.v1beta1.NetworkTopologySpec": {
"type": "object",
"properties": {
"highestTierAllowed": {
"description": "HighestTierAllowed specifies the highest tier that a job allowed to cross when scheduling.",
"type": "integer",
"format": "int32"
},
"mode": {
"description": "Mode specifies the mode of the network topology constrain.",
"type": "string"
}
}
},
"trainer.v1alpha1.ClusterTrainingRuntime": {
"description": "ClusterTrainingRuntime represents a training runtime which can be referenced as part of `runtimeRef` API in TrainJob. This resource is a cluster-scoped and can be referenced by TrainJob that created in *any* namespace.",
"type": "object",
Expand Down Expand Up @@ -13876,6 +13890,14 @@
"$ref": "#/components/schemas/trainer.v1alpha1.CoschedulingPodGroupPolicySource"
}
]
},
"volcano": {
"description": "Volcano plugin for gang-scheduling.",
"allOf": [
{
"$ref": "#/components/schemas/trainer.v1alpha1.VolcanoPodGroupPolicySource"
}
]
}
}
},
Expand All @@ -13890,6 +13912,14 @@
"$ref": "#/components/schemas/trainer.v1alpha1.CoschedulingPodGroupPolicySource"
}
]
},
"volcano": {
"description": "Volcano plugin for gang-scheduling.",
"allOf": [
{
"$ref": "#/components/schemas/trainer.v1alpha1.VolcanoPodGroupPolicySource"
}
]
}
}
},
Expand Down Expand Up @@ -14477,6 +14507,20 @@
]
}
}
},
"trainer.v1alpha1.VolcanoPodGroupPolicySource": {
"description": "VolcanoPodGroupPolicySource represents configuration for the Volcano gang-scheduler.",
"type": "object",
"properties": {
"networkTopology": {
"description": "NetworkTopology defines the NetworkTopology config, this field works in conjunction with network topology feature and hyperNode CRD.",
"allOf": [
{
"$ref": "#/components/schemas/scheduling.v1beta1.NetworkTopologySpec"
}
]
}
}
}
}
}
Expand Down
2 changes: 2 additions & 0 deletions api/python_api/kubeflow_trainer_api/models/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -352,6 +352,7 @@
from kubeflow_trainer_api.models.jobset_v1alpha2_replicated_job_status import JobsetV1alpha2ReplicatedJobStatus
from kubeflow_trainer_api.models.jobset_v1alpha2_startup_policy import JobsetV1alpha2StartupPolicy
from kubeflow_trainer_api.models.jobset_v1alpha2_success_policy import JobsetV1alpha2SuccessPolicy
from kubeflow_trainer_api.models.scheduling_v1beta1_network_topology_spec import SchedulingV1beta1NetworkTopologySpec
from kubeflow_trainer_api.models.trainer_v1alpha1_cluster_training_runtime import TrainerV1alpha1ClusterTrainingRuntime
from kubeflow_trainer_api.models.trainer_v1alpha1_cluster_training_runtime_list import TrainerV1alpha1ClusterTrainingRuntimeList
from kubeflow_trainer_api.models.trainer_v1alpha1_container_override import TrainerV1alpha1ContainerOverride
Expand Down Expand Up @@ -379,3 +380,4 @@
from kubeflow_trainer_api.models.trainer_v1alpha1_training_runtime import TrainerV1alpha1TrainingRuntime
from kubeflow_trainer_api.models.trainer_v1alpha1_training_runtime_list import TrainerV1alpha1TrainingRuntimeList
from kubeflow_trainer_api.models.trainer_v1alpha1_training_runtime_spec import TrainerV1alpha1TrainingRuntimeSpec
from kubeflow_trainer_api.models.trainer_v1alpha1_volcano_pod_group_policy_source import TrainerV1alpha1VolcanoPodGroupPolicySource
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# coding: utf-8

"""
Kubeflow Trainer OpenAPI Spec

No description provided (generated by Openapi Generator https://github.com/openapitools/openapi-generator)

The version of the OpenAPI document: unversioned
Generated by OpenAPI Generator (https://openapi-generator.tech)

Do not edit the class manually.
""" # noqa: E501


from __future__ import annotations
import pprint
import re # noqa: F401
import json

from pydantic import BaseModel, ConfigDict, Field, StrictInt, StrictStr
from typing import Any, ClassVar, Dict, List, Optional
from typing import Optional, Set
from typing_extensions import Self

class SchedulingV1beta1NetworkTopologySpec(BaseModel):
"""
SchedulingV1beta1NetworkTopologySpec
""" # noqa: E501
highest_tier_allowed: Optional[StrictInt] = Field(default=None, description="HighestTierAllowed specifies the highest tier that a job allowed to cross when scheduling.", alias="highestTierAllowed")
mode: Optional[StrictStr] = Field(default=None, description="Mode specifies the mode of the network topology constrain.")
__properties: ClassVar[List[str]] = ["highestTierAllowed", "mode"]

model_config = ConfigDict(
populate_by_name=True,
validate_assignment=True,
protected_namespaces=(),
)


def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.model_dump(by_alias=True))

def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
# TODO: pydantic v2: use .model_dump_json(by_alias=True, exclude_unset=True) instead
return json.dumps(self.to_dict())

@classmethod
def from_json(cls, json_str: str) -> Optional[Self]:
"""Create an instance of SchedulingV1beta1NetworkTopologySpec from a JSON string"""
return cls.from_dict(json.loads(json_str))

def to_dict(self) -> Dict[str, Any]:
"""Return the dictionary representation of the model using alias.

This has the following differences from calling pydantic's
`self.model_dump(by_alias=True)`:

* `None` is only added to the output dict for nullable fields that
were set at model initialization. Other fields with value `None`
are ignored.
"""
excluded_fields: Set[str] = set([
])

_dict = self.model_dump(
by_alias=True,
exclude=excluded_fields,
exclude_none=True,
)
return _dict

@classmethod
def from_dict(cls, obj: Optional[Dict[str, Any]]) -> Optional[Self]:
"""Create an instance of SchedulingV1beta1NetworkTopologySpec from a dict"""
if obj is None:
return None

if not isinstance(obj, dict):
return cls.model_validate(obj)

_obj = cls.model_validate({
"highestTierAllowed": obj.get("highestTierAllowed"),
"mode": obj.get("mode")
})
return _obj


Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
from pydantic import BaseModel, ConfigDict, Field
from typing import Any, ClassVar, Dict, List, Optional
from kubeflow_trainer_api.models.trainer_v1alpha1_coscheduling_pod_group_policy_source import TrainerV1alpha1CoschedulingPodGroupPolicySource
from kubeflow_trainer_api.models.trainer_v1alpha1_volcano_pod_group_policy_source import TrainerV1alpha1VolcanoPodGroupPolicySource
from typing import Optional, Set
from typing_extensions import Self

Expand All @@ -28,7 +29,8 @@ class TrainerV1alpha1PodGroupPolicy(BaseModel):
PodGroupPolicy represents a PodGroup configuration for gang-scheduling.
""" # noqa: E501
coscheduling: Optional[TrainerV1alpha1CoschedulingPodGroupPolicySource] = Field(default=None, description="Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling.")
__properties: ClassVar[List[str]] = ["coscheduling"]
volcano: Optional[TrainerV1alpha1VolcanoPodGroupPolicySource] = Field(default=None, description="Volcano plugin for gang-scheduling.")
__properties: ClassVar[List[str]] = ["coscheduling", "volcano"]

model_config = ConfigDict(
populate_by_name=True,
Expand Down Expand Up @@ -72,6 +74,9 @@ def to_dict(self) -> Dict[str, Any]:
# override the default output from pydantic by calling `to_dict()` of coscheduling
if self.coscheduling:
_dict['coscheduling'] = self.coscheduling.to_dict()
# override the default output from pydantic by calling `to_dict()` of volcano
if self.volcano:
_dict['volcano'] = self.volcano.to_dict()
return _dict

@classmethod
Expand All @@ -84,7 +89,8 @@ def from_dict(cls, obj: Optional[Dict[str, Any]]) -> Optional[Self]:
return cls.model_validate(obj)

_obj = cls.model_validate({
"coscheduling": TrainerV1alpha1CoschedulingPodGroupPolicySource.from_dict(obj["coscheduling"]) if obj.get("coscheduling") is not None else None
"coscheduling": TrainerV1alpha1CoschedulingPodGroupPolicySource.from_dict(obj["coscheduling"]) if obj.get("coscheduling") is not None else None,
"volcano": TrainerV1alpha1VolcanoPodGroupPolicySource.from_dict(obj["volcano"]) if obj.get("volcano") is not None else None
})
return _obj

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
from pydantic import BaseModel, ConfigDict, Field
from typing import Any, ClassVar, Dict, List, Optional
from kubeflow_trainer_api.models.trainer_v1alpha1_coscheduling_pod_group_policy_source import TrainerV1alpha1CoschedulingPodGroupPolicySource
from kubeflow_trainer_api.models.trainer_v1alpha1_volcano_pod_group_policy_source import TrainerV1alpha1VolcanoPodGroupPolicySource
from typing import Optional, Set
from typing_extensions import Self

Expand All @@ -28,7 +29,8 @@ class TrainerV1alpha1PodGroupPolicySource(BaseModel):
PodGroupPolicySource represents supported plugins for gang-scheduling. Only one of its members may be specified.
""" # noqa: E501
coscheduling: Optional[TrainerV1alpha1CoschedulingPodGroupPolicySource] = Field(default=None, description="Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling.")
__properties: ClassVar[List[str]] = ["coscheduling"]
volcano: Optional[TrainerV1alpha1VolcanoPodGroupPolicySource] = Field(default=None, description="Volcano plugin for gang-scheduling.")
__properties: ClassVar[List[str]] = ["coscheduling", "volcano"]

model_config = ConfigDict(
populate_by_name=True,
Expand Down Expand Up @@ -72,6 +74,9 @@ def to_dict(self) -> Dict[str, Any]:
# override the default output from pydantic by calling `to_dict()` of coscheduling
if self.coscheduling:
_dict['coscheduling'] = self.coscheduling.to_dict()
# override the default output from pydantic by calling `to_dict()` of volcano
if self.volcano:
_dict['volcano'] = self.volcano.to_dict()
return _dict

@classmethod
Expand All @@ -84,7 +89,8 @@ def from_dict(cls, obj: Optional[Dict[str, Any]]) -> Optional[Self]:
return cls.model_validate(obj)

_obj = cls.model_validate({
"coscheduling": TrainerV1alpha1CoschedulingPodGroupPolicySource.from_dict(obj["coscheduling"]) if obj.get("coscheduling") is not None else None
"coscheduling": TrainerV1alpha1CoschedulingPodGroupPolicySource.from_dict(obj["coscheduling"]) if obj.get("coscheduling") is not None else None,
"volcano": TrainerV1alpha1VolcanoPodGroupPolicySource.from_dict(obj["volcano"]) if obj.get("volcano") is not None else None
})
return _obj

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
# coding: utf-8

"""
Kubeflow Trainer OpenAPI Spec

No description provided (generated by Openapi Generator https://github.com/openapitools/openapi-generator)

The version of the OpenAPI document: unversioned
Generated by OpenAPI Generator (https://openapi-generator.tech)

Do not edit the class manually.
""" # noqa: E501


from __future__ import annotations
import pprint
import re # noqa: F401
import json

from pydantic import BaseModel, ConfigDict, Field
from typing import Any, ClassVar, Dict, List, Optional
from kubeflow_trainer_api.models.scheduling_v1beta1_network_topology_spec import SchedulingV1beta1NetworkTopologySpec
from typing import Optional, Set
from typing_extensions import Self

class TrainerV1alpha1VolcanoPodGroupPolicySource(BaseModel):
"""
VolcanoPodGroupPolicySource represents configuration for the Volcano gang-scheduler.
""" # noqa: E501
network_topology: Optional[SchedulingV1beta1NetworkTopologySpec] = Field(default=None, description="NetworkTopology defines the NetworkTopology config, this field works in conjunction with network topology feature and hyperNode CRD.", alias="networkTopology")
__properties: ClassVar[List[str]] = ["networkTopology"]

model_config = ConfigDict(
populate_by_name=True,
validate_assignment=True,
protected_namespaces=(),
)


def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.model_dump(by_alias=True))

def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
# TODO: pydantic v2: use .model_dump_json(by_alias=True, exclude_unset=True) instead
return json.dumps(self.to_dict())

@classmethod
def from_json(cls, json_str: str) -> Optional[Self]:
"""Create an instance of TrainerV1alpha1VolcanoPodGroupPolicySource from a JSON string"""
return cls.from_dict(json.loads(json_str))

def to_dict(self) -> Dict[str, Any]:
"""Return the dictionary representation of the model using alias.

This has the following differences from calling pydantic's
`self.model_dump(by_alias=True)`:

* `None` is only added to the output dict for nullable fields that
were set at model initialization. Other fields with value `None`
are ignored.
"""
excluded_fields: Set[str] = set([
])

_dict = self.model_dump(
by_alias=True,
exclude=excluded_fields,
exclude_none=True,
)
# override the default output from pydantic by calling `to_dict()` of network_topology
if self.network_topology:
_dict['networkTopology'] = self.network_topology.to_dict()
return _dict

@classmethod
def from_dict(cls, obj: Optional[Dict[str, Any]]) -> Optional[Self]:
"""Create an instance of TrainerV1alpha1VolcanoPodGroupPolicySource from a dict"""
if obj is None:
return None

if not isinstance(obj, dict):
return cls.model_validate(obj)

_obj = cls.model_validate({
"networkTopology": SchedulingV1beta1NetworkTopologySpec.from_dict(obj["networkTopology"]) if obj.get("networkTopology") is not None else None
})
return _obj


Original file line number Diff line number Diff line change
Expand Up @@ -627,6 +627,29 @@ spec:
format: int32
type: integer
type: object
volcano:
description: Volcano plugin for gang-scheduling.
properties:
networkTopology:
description: NetworkTopology defines the NetworkTopology config,
this field works in conjunction with network topology feature
and hyperNode CRD.
properties:
highestTierAllowed:
default: 1
description: HighestTierAllowed specifies the highest
tier that a job allowed to cross when scheduling.
type: integer
mode:
default: hard
description: Mode specifies the mode of the network topology
constrain.
enum:
- hard
- soft
type: string
type: object
type: object
type: object
template:
description: JobSet template which will be used by TrainJob.
Expand Down
Loading
Loading