
Commit c64eace

Doris-xm, andreyvelich, Electronic-Waste, tenzen-y
authored and committed
feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (kubeflow#2729)
* feat: api for volcano scheduling plugin Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: init volcano-plugin Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: init test file Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: register volcano plugin Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: deal with minTaskMember, minMember, NetworkTopo Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: calculate of minResource Signed-off-by: Xinmin Du <2812493086@qq.com>
* test: build PodGroup test Signed-off-by: Xinmin Du <2812493086@qq.com>
* refactor: separate to 2 prs(build&handler) Signed-off-by: Xinmin Du <2812493086@qq.com>
* test: add test for new&reconcile_builder Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: typo Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: trainer/v2 import Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: networktopo type Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: OpenAPI validation errors Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: remove minTaskMembers Signed-off-by: Xinmin Du <2812493086@qq.com>
* test: test coverage 100% Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: update apis Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: replace testify Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: registry Volcano CRDs to the scheme Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: add volcano to scheme Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: fix networktopo schema Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: add networktopo spec in trainer Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: unit test Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: import networkTopo directly Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: make generate Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: make generate Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: golangci-lint Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: golangci-lint Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: add volcano installation in integration test Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: filter volcano api Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: get volcano.podgroup with local version Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: init test env with volcano podgroup installed Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: check plugin in enforcePodgroupPolicy Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: group-name label in unit test Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: ReconcilerBuilders Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: add PodGroupHandler Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: unit test for handlers Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: group name annotation Signed-off-by: Xinmin Du <2812493086@qq.com>
* Update hack/swagger/main.go Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com>
* fix: no need to delete RBAC Signed-off-by: Xinmin Du <2812493086@qq.com>
* Update pkg/runtime/framework/plugins/volcano/indexer.go Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com>
* fix: nil checking for trainjob Signed-off-by: Xinmin Du <2812493086@qq.com>
* Update pkg/runtime/framework/plugins/volcano/volcano.go Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com>
* fix: make generate Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: index conflict Signed-off-by: Xinmin Du <2812493086@qq.com>
* Update pkg/runtime/framework/plugins/coscheduling/coscheduling.go Co-authored-by: Shao Wang <2690692950@qq.com> Signed-off-by: Du Xinmin <2812493086@qq.com>
* fix: update volcano to v1.12.2 Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: re-use indexer Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: add validate Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: no scheduler when coscheduling is nil Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: put group-name in annotations Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: validate if priorityClass installed Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: propagate annotations to pod Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: integration test for volcano Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: golangci-lint check Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: use shared indexer Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: remove indexer to runtime/ Signed-off-by: Xinmin Du <2812493086@qq.com>
* Update hack/swagger/main.go Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com>
* Update hack/swagger/main.go Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com>
* fix: append owner reference & missing import Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: rewrite volcano UT Signed-off-by: Xinmin Du <2812493086@qq.com>
* feat: add copyright Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: sync RBAC to Helm charts Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: refactor UTs Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: test validation separately Signed-off-by: Xinmin Du <2812493086@qq.com>
* Update hack/swagger/main.go Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com>
* fix: refactor TestVolcano Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: refactor TestValidate Signed-off-by: Xinmin Du <2812493086@qq.com>
* Update pkg/runtime/framework/plugins/volcano/volcano_test.go Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com>

---------

Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Shao Wang <2690692950@qq.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
1 parent 56ab163 commit c64eace

43 files changed

Lines changed: 3168 additions & 415 deletions


Makefile

Lines changed: 10 additions & 1 deletion
@@ -111,6 +111,15 @@ scheduler-plugins-crd: ## Copy the CRDs from the Scheduler Plugins repository to
 	mkdir -p $(EXTERNAL_CRDS_DIR)/scheduler-plugins/
 	cp -f $(SCHEDULER_PLUGINS_ROOT)/manifests/coscheduling/* $(EXTERNAL_CRDS_DIR)/scheduler-plugins
 
+VOLCANO_APIS_ROOT = $(shell go list -m -f "{{.Dir}}" volcano.sh/apis)
+VOLCANO_VERSION = $(shell basename $(VOLCANO_APIS_ROOT) | cut -d'@' -f2)
+VOLCANO_CRD_URL = https://raw.githubusercontent.com/volcano-sh/volcano/$(VOLCANO_VERSION)/config/crd/volcano/bases/scheduling.volcano.sh_podgroups.yaml
+
+.PHONY: volcano-crd
+volcano-crd: ## Copy the CRDs from Volcano repository to the manifests/external-crds directory.
+	mkdir -p $(EXTERNAL_CRDS_DIR)/volcano/
+	curl -sSL $(VOLCANO_CRD_URL) -o $(EXTERNAL_CRDS_DIR)/volcano/scheduling.volcano.sh_podgroups.yaml
+
 # Instructions for code generation.
 .PHONY: manifests
 manifests: controller-gen ## Generate manifests.
@@ -155,7 +164,7 @@ test: ## Run Go unit test.
 	go test $(shell go list ./... | grep -Ev '/(test|cmd|hack|pkg/apis|pkg/client|pkg/util/testing)') -coverprofile cover.out
 
 .PHONY: test-integration
-test-integration: ginkgo envtest jobset-operator-crd scheduler-plugins-crd ## Run Go integration test.
+test-integration: ginkgo envtest jobset-operator-crd scheduler-plugins-crd volcano-crd ## Run Go integration test.
 	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(K8S_VERSION) -p path)" $(GINKGO) -v ./test/integration/...
 
 .PHONY: test-python
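
The new Makefile variables derive the Volcano version from the module cache directory of `volcano.sh/apis` (everything after the `@` in its base name) and substitute it into the raw GitHub URL of the PodGroup CRD. A minimal Go sketch of the same derivation follows. The function names and the example path are hypothetical and are not part of this commit. Unlike `cut`, which passes a line without the delimiter through unchanged, the sketch returns an error when no `@` is present.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// crdURLFormat mirrors VOLCANO_CRD_URL from the Makefile; %s is the version.
const crdURLFormat = "https://raw.githubusercontent.com/volcano-sh/volcano/%s/config/crd/volcano/bases/scheduling.volcano.sh_podgroups.yaml"

// versionFromModuleDir (hypothetical helper) takes the base name of a module
// directory such as ".../volcano.sh/apis@v1.12.2" and returns the second
// '@'-separated field, like `basename ... | cut -d'@' -f2`.
func versionFromModuleDir(dir string) (string, error) {
	parts := strings.Split(filepath.Base(dir), "@")
	if len(parts) < 2 || parts[1] == "" {
		return "", fmt.Errorf("module dir %q has no @version suffix", dir)
	}
	return parts[1], nil
}

// podGroupCRDURL (hypothetical helper) builds the download URL for the CRD.
func podGroupCRDURL(dir string) (string, error) {
	version, err := versionFromModuleDir(dir)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf(crdURLFormat, version), nil
}

func main() {
	// Assumed example path; the real value comes from `go list -m -f "{{.Dir}}"`.
	url, err := podGroupCRDURL("/home/user/go/pkg/mod/volcano.sh/apis@v1.12.2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(url)
}
```

The derivation assumes the module resolves to a versioned directory in the module cache. With a local `replace` directive or vendoring, the directory name may carry no `@version` suffix, which is the case the explicit error is meant to surface.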

api/openapi-spec/swagger.json

Lines changed: 44 additions & 0 deletions
Some generated files are not rendered by default.

api/python_api/kubeflow_trainer_api/models/__init__.py

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default.

api/python_api/kubeflow_trainer_api/models/scheduling_v1beta1_network_topology_spec.py

Lines changed: 89 additions & 0 deletions
Some generated files are not rendered by default.

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_group_policy.py

Lines changed: 8 additions & 2 deletions
Some generated files are not rendered by default.

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_group_policy_source.py

Lines changed: 8 additions & 2 deletions
Some generated files are not rendered by default.

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_volcano_pod_group_policy_source.py

Lines changed: 91 additions & 0 deletions
Some generated files are not rendered by default.

charts/kubeflow-trainer/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml

Lines changed: 23 additions & 0 deletions
@@ -627,6 +627,29 @@ spec:
                         format: int32
                         type: integer
                     type: object
+                  volcano:
+                    description: Volcano plugin for gang-scheduling.
+                    properties:
+                      networkTopology:
+                        description: NetworkTopology defines the NetworkTopology config,
+                          this field works in conjunction with network topology feature
+                          and hyperNode CRD.
+                        properties:
+                          highestTierAllowed:
+                            default: 1
+                            description: HighestTierAllowed specifies the highest
+                              tier that a job allowed to cross when scheduling.
+                            type: integer
+                          mode:
+                            default: hard
+                            description: Mode specifies the mode of the network topology
+                              constrain.
+                            enum:
+                            - hard
+                            - soft
+                            type: string
+                        type: object
+                    type: object
                 type: object
               template:
                 description: JobSet template which will be used by TrainJob.
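
The schema above gives `networkTopology` two fields: `mode`, an enum of `hard` or `soft` defaulting to `hard`, and `highestTierAllowed`, an integer defaulting to `1`. The API server applies these defaults and the enum check from the CRD itself. The Go sketch below only mimics that behaviour to make the rules concrete. All type and function names are hypothetical local stand-ins, not the types from `volcano.sh/apis` or from this commit.

```go
package main

import "fmt"

// NetworkTopologyMode is a hypothetical stand-in for the `mode` enum.
type NetworkTopologyMode string

const (
	HardMode NetworkTopologyMode = "hard"
	SoftMode NetworkTopologyMode = "soft"
)

// NetworkTopology is a hypothetical mirror of the `networkTopology` schema.
// HighestTierAllowed is a pointer so that "unset" differs from an explicit 0.
type NetworkTopology struct {
	Mode               NetworkTopologyMode `json:"mode,omitempty"`
	HighestTierAllowed *int                `json:"highestTierAllowed,omitempty"`
}

// applyDefaults fills unset fields with the defaults declared in the CRD.
func applyDefaults(nt *NetworkTopology) {
	if nt.Mode == "" {
		nt.Mode = HardMode
	}
	if nt.HighestTierAllowed == nil {
		one := 1
		nt.HighestTierAllowed = &one
	}
}

// validate rejects any mode outside the enum listed in the CRD.
func validate(nt NetworkTopology) error {
	switch nt.Mode {
	case HardMode, SoftMode:
		return nil
	default:
		return fmt.Errorf("mode %q is not one of [hard soft]", nt.Mode)
	}
}

func main() {
	nt := NetworkTopology{} // both fields omitted by the user
	applyDefaults(&nt)
	fmt.Println(nt.Mode, *nt.HighestTierAllowed, validate(nt))

	// An assumed invalid value, used only to show the enum check failing.
	fmt.Println(validate(NetworkTopology{Mode: "strict"}))
}
```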
