KEP for GangConfig (to support Gang Scheduling)#1068
Conversation
✅ Deploy Preview for kubernetes-sigs-jobset canceled.
|
|
Hi @imreddy13. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
7a09bc3 to
da94057
Compare
|
/assign @kannon92 |
|
/assign @ahg-g |
|
@imreddy13: GitHub didn't allow me to assign the following users: ahg. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/unassign @danielvegamyhre |
|
/assign @ahg-g |
|
Thanks for this @imreddy13! IIUC, @soltysh and @helayoty discussed that changes to the Job API will be introduced at the later stage, after Workload API is available in v1.35: kubernetes/enhancements#5548 (comment) |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: imreddy13 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
@andreyvelich I discussed this with @wojtek-t and it's possible the Job API might differ in terms of naming. The contract is if a top level controller (JobSet/LWS) creates a Workload object, the Job controller will not create a new Workload or change the workloadRef in the pod spec. JobSet will not rely on Job controller to create Workloads since we only plan to have 1 Workload per JobSet in all modes. Is there anything specifically you are concerned about with Job? |
Are we going to validate that if users set gang scheduling policy in the JobSet spec, the Job API must be omitted ?
What is the recommendation for controllers on top of JobSet, like TrainJob controller ? Shall we rely on JobSet to create the Workload object, or top-level controllers should create it ?
My only concern is to make our APIs consistent with what we envision for Job controller in the future. |
|
|
||
| The JobSet API will support a new GangConfig field to specify if a JobSet, ReplicatedJob or Job replica should be gang scheduled. The user should set this field at the appropriate level (JobSet or ReplicatedJob) in the JobSet spec to indicate they require gang scheduling at that level. If the JobSet controller detects the GangConfig field in the JobSet spec, it will generate a single Workload object and associate the pod spec templates it generates with that Workload. | ||
|
|
||
| The Job controller should not create a new Workload or change the workloadRef in a pod spec if the JobSet controller has already set it. To differentiate the levels of gang scheduling (JobSet, ReplicatedJob or Job replica gangs), the JobSet controller will set different PodGroup and PodGroupReplicaIndex fields in the WorkloadReference spec per pod i.e.: |
There was a problem hiding this comment.
The Job controller should not create a new Workload or change the workloadRef in a pod spec if the JobSet controller has already set it.
Is this "Job controller" work in scope for this enhancement proposal ?
| - echo | ||
| - "started" | ||
| initialDelaySeconds: 5 | ||
| ``` |
There was a problem hiding this comment.
As an exercise in understanding the proposal: will this example results in all pods pointing to a Workload defined as the following?
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
name: sample-jobset
spec:
podGroups:
- name: podgroup
replicas: 1
policy:
kind: Gang
gang:
minCount: 16There was a problem hiding this comment.
+1
Also, it would be really helpful for understanding this KEP to have one example for each case (workload set by user, JobSet-level workload, Replicated-level workload, Job-level workload). Like, each example should have the JobSet manifest and the resulting objects.
There was a problem hiding this comment.
Added to a section "Translating JobSet Workloads to Workloads", ptal @ricardomaraschini
| - echo | ||
| - "started" | ||
| initialDelaySeconds: 5 | ||
| ``` |
There was a problem hiding this comment.
As an exercise in understanding the proposal: is this going to result in a Workload like this ?
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
name: sample-jobset
spec:
podGroups:
- name: replicated-job-1
replicas: 1
policy:
kind: Gang
gang:
minCount: 8
- name: replicated-job-2
replicas: 1
policy:
kind: Gang
gang:
minCount: 3With pods on replicated-job-1 and replicated-job-2 pointing to their respective (and different) pod groups ?
| - echo | ||
| - "started" | ||
| initialDelaySeconds: 5 | ||
| ``` |
There was a problem hiding this comment.
As an exercise in understanding the proposal: is this result in a Workload similar to the following ?
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
name: sample-jobset
spec:
podGroups:
- name: replicated-job-1
replicas: 2
policy:
kind: Gang
gang:
minCount: 4
- name: replicated-job-2
policy:
replicas: 3
kind: Gang
gang:
minCount: 3With pods on replicated-job-1 and replicated-job-2 pointing to their respective (and different) pod groups + a different PodGroupReplicaIndex for each replica ?
There was a problem hiding this comment.
It could help to better understand the mapping by adding workload and pod examples for all three (resp. two as the first use case is for users to provide) use cases right before ### Implementation so it's obvious how the corresponding mappings work. The pod manifest can be simplified by showing only the relevant fields. E.g. .name, .spec.workload
There was a problem hiding this comment.
Added a new section with examples: "Translating JobSet Workloads to Workloads"
|
|
||
| // GangConfig should be specified if all ReplicatedJobs in this JobSet | ||
| // should be scheduled as a gang (i.e. all at once). | ||
| GangConfig GangConfig json:"gangConfig,omitempty" |
There was a problem hiding this comment.
The examples above are using scheduleAtOnce and scheduleAtOncePerReplica instead.
|
|
||
| #### Workload Creation | ||
|
|
||
| The JobSet controller `Reconcile()` loop will be updated to generate a single Workload object per JobSet if the `GangConfig` field is specified on the JobSetSpec or ReplicatedJob. |
There was a problem hiding this comment.
The EP mentions above:
Note that in v1, we will only support the API change for ReplicatedJob and not JobSetSpec.
Is the plan to make the controller to create a Workload for the whole JobSetSpec when that is populated ?
|
|
||
| #### Workload Deletion | ||
|
|
||
| When the JobSet is deleted, the JobSet controller will also delete the Workload it created for that JobSet. |
There was a problem hiding this comment.
Was using owner references considered here ?
There was a problem hiding this comment.
+1 to using owner references. It's more robust and easier to implement.
| #### Defaulting/Validation | ||
|
|
||
| - `jobSetSpec.GangConfig.GangMode` and `replicatedJob.GangConfig.GangMode` cannot be specified at once (except to `GangModeOff`). | ||
| - `jobSetSpec.GangConfig.GangMode` and `replicatedJob.GangConfig.GangMode` are immutable. |
There was a problem hiding this comment.
jobSetSpec.GangConfig.GangMode == "Gang" with multiple replicaJobs who leverage DependsOn seems to be a needed validation here.
| name: sample-jobset | ||
| spec: | ||
| replicatedJobs: | ||
| - name: replicated-job-1 |
| 1. `jobSetSpec.GangConfig.GangMode` options: | ||
| - `GangModeOff`: no gang scheduling | ||
| - `GangModeGang`: all pods in the JobSet are in a gang | ||
| 2. `replicatedJob.GangConfig.GangMode` options: |
There was a problem hiding this comment.
replicatedJob.GangConfig.GangMode -> replicatedJob[].GangConfig.GangMode (to comply with jq accessing an array)
Might make sense to prepend beginning of each path with .. E.g. .jobSetSpec.GangConfig.GangMode
| - echo | ||
| - "started" | ||
| initialDelaySeconds: 5 | ||
| ``` |
There was a problem hiding this comment.
It could help to better understand the mapping by adding workload and pod examples for all three (resp. two as the first use case is for users to provide) use cases right before ### Implementation so it's obvious how the corresponding mappings work. The pod manifest can be simplified by showing only the relevant fields. E.g. .name, .spec.workload
|
|
||
| // GangConfig should be specified if all ReplicatedJobs in this JobSet | ||
| // should be scheduled as a gang (i.e. all at once). | ||
| GangConfig GangConfig json:"gangConfig,omitempty" |
|
Should the controller generate an event every time a workload object is created/updated/deleted? |
|
|
||
| ## Proposal | ||
|
|
||
| The JobSet API will support a new GangConfig field to specify if a JobSet, ReplicatedJob or Job replica should be gang scheduled. The user should set this field at the appropriate level (JobSet or ReplicatedJob) in the JobSet spec to indicate they require gang scheduling at that level. If the JobSet controller detects the GangConfig field in the JobSet spec, it will generate a single Workload object and associate the pod spec templates it generates with that Workload. |
There was a problem hiding this comment.
if a JobSet, ReplicatedJob or Job replica should be gang scheduled.
Worth mentioning a replicated job is a collection of jobs. Resp. a replicated job consists of one or two job replicas (I wonder what's the right definition here). So the hierarchy is quickly obvious. A diagram/picture of the hierarchy would help a lot in speeding up the first understanding. E.g. slightly updating https://kubernetes.io/blog/2025/03/23/introducing-jobset/#how-jobset-works and drawing the available gang groups.
That's a good question. IMO it makes sense to let the uppermost workload-like object (JobSet, Job, etc) to create the Workload object. For instance, if we have a stack like
it makes sense for JobSet to create the workload object. If instead we have
It makes sense for Job to create the workload object. |
| ### Graduation Criteria | ||
|
|
||
| - `kube-scheduler` changes and `Workload` API alpha are targeted to `1.35`. | ||
| - Support for `replicatedJob.GangConfig` will be alpha in 1.35 behind an alpha feature gate and graduated to stable when `Workload` API is stable. |
There was a problem hiding this comment.
We should have a feature gate for JobSet with this KEP that matches the name of the gang scheduling feature gate.
But we will want to keep this feature disabled by default until workload API is stable on all supported k8s versions of JobSet.
We will need to introduce feature gate handling in JobSet but I think this feature warrants this if we want to merge this in 1.35.
| 1. If `jobSetSpec.GangConfig.GangMode == GangModeGang`, all JobSet pods are one gang: | ||
| - Create a single `PodGroup` for the JobSet. | ||
| - Set `Replicas` to 1 | ||
| - Set the `minCount` in the `Gang` to `#replicated jobs * #job replicas * #pods per Job (parallelism)`. (`minCount` is the number of pods that should be ready before the gang is scheduled). |
There was a problem hiding this comment.
Technically it's sum(replicatedJob.replicas * replicatedJob.template.spec.parallelism for replicatedJob in jobSet.spec.replicatedJobs)
| spec: | ||
| workload: | ||
| name: w-sample-jobset | ||
| podGroup: pg-sample-jobset |
There was a problem hiding this comment.
nit: the indentation is off
| metadata: | ||
| name: sample-jobset | ||
| spec: | ||
| scheduleAtOnce: true |
There was a problem hiding this comment.
Rather than having this API in the rJob and JobSet level, could we just define PodGroupPolicies API which has the target jobs parameter? Something like this:
podGroupPolicies:
- name: initializer
targetReplicatedJobs: ["dataset-initializer"]
- name: mpi-trainer
targetReplicatedJobs: ["launcher", "node"]Single Workload for all rJob:
podGroupPolicies:
- name: gang-group
targetReplicatedJobs: []That will make us consistent with other JobSet APIs and VolumeClaimPolicies: #1062
WDYT @imreddy13 @GiuseppeTT @kannon92 @tenzen-y @astefanutti ?
|
@andreyvelich @imreddy13 @GiuseppeTT @kannon92 @tenzen-y @astefanutti I put some thoughts together about how the Workload API should be integrated with different true workoads (i.e., Job, JobSet, LWS, etc...) and I really appreciate your feedback. |
In thinking through this, I wanted to sketch out design/implementation for what I am thinking for Workload integration. |
|
/retest One of the failures is a flake. The other is due to the table of content not being up to date |
|
Hey @imreddy13 K8s should release 1.36 tomorrow so I think we can start this work up. Are you still able to take this on? |
|
@imreddy13: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Proposal to add a new field
GangConfigto ReplicatedJob API to support Gang Scheduling (KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling)Which issue(s) this PR fixes:
Fixes #969
Does this PR introduce a user-facing change?