[WIP] feat(api): KEP-3015: Workload Aware Scheduling for TrainJob #3219
andreyvelich wants to merge 9 commits
Conversation
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: mm4tt, vsoch, dom4ha, VassilisVassiliadis, helayoty, wojtek-t, klueska. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Pull request overview
Adds an initial KEP/proposal document for integrating Kubernetes Workload API–based workload-aware (gang) scheduling into Kubeflow Trainer’s TrainJob via a new PodGroupPolicy plugin, as groundwork for future implementation.
Changes:
- Introduces a new KEP (3015) describing Workload/PodGroup orchestration for TrainJob scheduling.
- Documents intended API surface (`podGroupPolicy.workload`) and controller/plugin responsibilities across build/enforcement/watch phases.
- Outlines lifecycle, RBAC, feature gates, and a test plan for the proposed integration.
Pull Request Test Coverage Report for Build 22145283120. Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
I'm a little caught up on this discussion: kubernetes-sigs/jobset#1068 (comment). If the Workload is intended to be integrated with JobSet, are there ever cases of creating a TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding it separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?
It's a hard discussion to follow. The general stance now is that the API that a user creates (sometimes known as the true workload) should probably manage the PodGroups / Workloads. For Trainer we could maybe consider delegating this to JobSet for JobSet-based TrainingRuntimes, but that doesn't solve the problem for Flux or whatever other backend we have.
The main outcome would be that if there exists a workloadRef or podGroupRef on the pod template, we are to assume that the WAS scheduling is "handled". But we haven't updated the KEP for JobSet. That was at least what we were discussing on all the features.
The FluxPolicy implementation is just tweaking a JobSet, so I think if support was added to that underlying abstraction it would be inherited by Flux. I want to make sure we are careful to (essentially) not implement this layer twice, but rather just once at the most appropriate level (maybe JobSet?)
Yes, @kannon92 is right – we’ve had several discussions about this over the past few months. In general, higher-level controllers should manage Workload and PodGroup objects, since they have the most complete understanding of the AI application they are orchestrating. For example:
For now, I suggest that we add validation in the TrainJob controller to ensure users do not configure the Workload API directly in JobSet or Job objects.
mm4tt
left a comment
Huge thanks for working on this integration. Having working examples of WAS being adopted by real-workload controllers gives us the confidence we need to push WAS to Beta in 1.37. Great stuff!
```go
// Workload plugin using native Kubernetes Workload API for gang-scheduling
// Requires Kubernetes v1.35+ with GenericWorkload feature gates enabled.
Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"`
```
As discussed, this approach differs from the Job KEP (kubernetes/enhancements#5871), where we deliberately avoided Job API changes and simply enabled gang-scheduling by default in specific cases.
That said, introducing an explicit opt-in API here makes sense for now, as it naturally fits into the existing structure (e.g., gang-scheduling via coscheduling or Volcano).
One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling. We want to avoid a fragmented ecosystem where every controller exposes its own API for WAS. Because of this, the API proposed here will likely need to be adjusted once we agree on the final shape of the common "WAS-controller API".
While I agree with what you wrote, I don't think this is a concern here.
The change to this API is not an opt-in for gang-scheduling or an opt-in for creation of the Workload API.
This is rather a decision about "which scheduling provider should be used".
@andreyvelich - please correct me if I'm wrong, but the way I understand it is that:
- The inherent nature of TrainJob is that it should be gang-scheduled.
- Up until now, there were two mechanisms to achieve it - either via the coscheduling plugin or via Volcano.
- In order to achieve the required semantics, users have to set it (in a way that reflects their cluster setup) - based on that value, corresponding objects were created/modified.
- With this KEP, we're not changing that nor introducing any new API. We're just saying that there is a new way of achieving the inherent gang-scheduling requirement. That underneath translates to creation of a Workload/PodGroup object, but that's an implementation detail.
At some point in the future, it would be great if we could make the Workload a default mode (because it will work out-of-the-box in any k8s cluster) and not require this field at all, but that's just a usability improvement.
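Concretely, the "provider choice" reading might look like this in a TrainingRuntime. Only `podGroupPolicy.workload` comes from this proposal; the exact shape of the coscheduling source shown alongside it is an assumption for illustration:

```yaml
# Sketch: today's provider choice (field names assumed for illustration)
podGroupPolicy:
  coscheduling:
    scheduleTimeoutSeconds: 120
---
# Sketch: the same inherent gang-scheduling requirement, backed by the
# Workload/PodGroup APIs instead
podGroupPolicy:
  workload: {}
```

In both cases the user states which provider fulfills the gang-scheduling requirement; the controller creates the corresponding objects behind the scenes.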
That's a fair point - it's a choice of scheduling provider. However, under the hood, it effectively acts as an opt-in for the Workload API.
If we establish a common opt-in/opt-out mechanism for WAS integration in the future (1.37), we wouldn't want TrainJob using a custom field while other controllers use the standard approach. Even if the final solution is exactly as you suggested - making Workload the default and dropping this field - it's still an API adjustment we'll need to make before Beta.
However, under the hood, it effectively acts as an opt-in for the Workload API.
My mental model is that it's actually not an opt-in for the Workload API.
For me it's "define my provider".
I will throw in a parallel example here. Let's imagine Karpenter vs CustomComputeClass. Underneath, for many functionalities you can achieve the same thing, but the API representation is very different.
If you have an API concept where you just state "karpenter" or "ccc" and the necessary objects are created for you behind the scenes - that's actually pretty compelling.
In other words - you want certain specific functionalities, but these may be backed by different implementations/components in different environments. And you need to tell your controller which one it should talk to to achieve what you need.
So for me it's actually not an opt-in for the Workload API.
One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling
This is pretty interesting, do we have any work in progress doc to explore this?
please correct me if I'm wrong, but the way I understand it is that:
@wojtek-t Yes, your understanding is correct (users can also get gang-scheduling for TrainJob with Kueue or KAI Scheduler).
As I mentioned in the future plans, we might want to re-use this API for other WAS features like Topology-Aware Scheduling / DRA, so we should see whether this API makes sense for future extendability.
wojtek-t
left a comment
Thanks a lot for helping us to validate the Workload/PodGroup design!
```yaml
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
```
In the Job integration KEP - we made this explicitly a `controller: true` ownerRef.
Same for PodGroup.
I think it would be highly beneficial to unify that.
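Unified with the Job-KEP convention, the ownerReference might look like the sketch below (the `blockOwnerDeletion` value is an assumption; the comment only calls for `controller: true`):

```yaml
# Sketch: Workload owned and controlled by its TrainJob, matching the Job KEP
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
  controller: true
  blockOwnerDeletion: true
```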
```yaml
spec:
  schedulingGroup:
    podGroupName: my-job-trainer-<hash>
```
Similarly - we add an additional ownerRef from Pods to the PodGroup.
That is useful to ensure that Pods will get deleted if the PodGroup gets deleted...
Good callout, I've added a similar diagram with the OwnerReference relationship as @helayoty added to the Job KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/5547-integrate-workload-with-job/README.md#ownerreferences-relationship
1. **Scheduling**: The kube-scheduler uses the Workload Scheduling Cycle to process entire PodGroups atomically, ensuring all Pods in a gang are scheduled together.
1. **Suspension**: When the TrainJob is suspended, Workload and PodGroup resources are preserved.
I think that eventually we should actually delete the PodGroup on suspension - a PodGroup represents a runtime unit, and when the TrainJob is suspended we don't really have any runtime instance.
The question is what to do with the Workload - given it's effectively a policy definition, I would say that we should keep it. But I don't have a very strong opinion yet. I think that @kannon92 was suggesting removing that too.
I don't think this comment is strictly an Alpha requirement, but I think we should rethink the pattern before Beta.
The API has changed so much in a month...
When minCount was in the policy definition (Workload), we had to clean it up on suspend. I have gotten requests that a suspended workload should be mutable. We have use cases where users want to right-size a workload to match the actual usage (probably difficult, but there are some internal services trying to guess how many GPUs you actually need for an LLM versus how many you requested).
I also want the ability to have an LLM (or user) potentially modify a suspended workload so the workload could be admitted sooner (say you request 8 nodes with 64 GPUs, but maybe your workload can be admitted with 4 pods and 32 GPUs sooner). So they should be able to modify the parallelism and the requests accordingly.
I have to check again with the Workload API (with TAS / DRA / etc.) about what information we would need to modify in the Workload API if a workload is suspended.
So they should be able to modify the parallelism and the requests accordingly.
Yes, we also discussed similar use-cases with @VassilisVassiliadis: #2599 (comment).
An external service (e.g. an LLM) can decide the optimal number of GPUs for a TrainJob.
That will require first submitting the TrainJob in a suspended state, triggering the classifier model, setting the correct container resources, and then un-suspending the TrainJob.
I think that eventually we should actually delete PodGroup on suspension - PodGroup is representing a runtime unit and when TrainJob is suspended we don't really have any runtime instance.
That makes sense, but this decision will depend on the mutability of Workload fields (e.g. minCount).
If we decide to keep the Workload API immutable, we might need to re-create it once resources are changed during TrainJob suspension.
I'm pretty convinced that minCount should be a mutable field (not only for suspended workloads).
But despite that, it's only about the Workload.
No matter what, I think we should always delete the PodGroup(s) if we suspend the workload.
1. Replace existing Coscheduling or Volcano plugins - this is an additional scheduling option
1. Support all Workload API features immediately - focus on core gang scheduling first
1. Support Kubernetes versions < 1.35 - Workload API requires v1.35+
Actually, in 1.36 the Workload API was fundamentally changed by splitting it into Workload and PodGroup APIs. I would recommend integrating with the 1.36 version of the API, as this is the shape we'll be supporting in the future.
Good point, let me change it!
```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroup
```
Yeah, so this is available only in 1.36+. You should update the text above that talks about supporting 1.35+.
```yaml
numNodes: 1
mpi:
  numProcPerNode: 4
  mpiImplementation: OpenMPI
```
Isn't this missing:

```yaml
podGroupPolicy:
  workload: {}
```

?
```go
func (w *Workload) EnforcePodGroupPolicy(info *runtime.Info, trainJob *trainv1alpha1.TrainJob) error {
	// 1. Set schedulingGroup.podGroupName in all Pod templates
```
This design works as long as there is a 1:1 mapping between a TrainJob and a PodGroup.
However, we must consider more advanced WAS requirements that go beyond a single global PodGroup.
I think a non-theoretical use case involves a replicatedJob (e.g., "trainer") with replicas: N (N>1). If we need each replica to be placed within its own isolated topology domain, such as a specific rack or a distinct NVLinkDomain, then we need N PodGroups - one for each job replica. And in such scenario we have no way of setting the schedulingGroup.podGroupName properly. The only way to do that is to have Job controller create the PodGroup. This is the scenario we're discussing in the https://docs.google.com/document/d/1VG7Zto9JYuPG4Anb01WMRryJlfV6met0jgob3T2NjZ4/edit?tab=t.0#heading=h.3s8c0yl3azl9
Hey @mm4tt, as we discussed, do you have any thoughts on this?
Also, would love to hear your thoughts on: #3219 (comment)
TBH, I no longer believe the vision presented in the WAS Controller Integration API Design is feasible within a reasonable timeframe. While it provides a clean long-term API surface, it suffers from the critical capability leaks and upstream dependency bottlenecks described in the WAS Controller API Alternatives Doc.
I will follow up with a KEP draft next week. Overall, my plan for this KEP is now structured as follows:
- Define reusable API building blocks (e.g., scheduling policies, topology constraints, disruption modes) under scheduling.k8s.io to be consumed on the controller side.
- Define integration recommendations, propose solutions to common architectural problems, and provide shared libraries to handle boilerplate.
- Leave the final API decisions to controller owners. They can follow our conventions or utilize established patterns native to their APIs (like targetReplicatedJobs for JobSet).
- While this approach results in greater API fragmentation, it is a necessary trade-off to bypass dependency bottlenecks that would block TrainJob integration for months while waiting for JobSet and Job support.
Going back to the specific issue of workloads with three or more levels: the only viable solution is delegating PodGroup creation to the child controller. The KEP will outline recommendations for passing the required context downstream, such as providing the Workload or PodGroup template to be used. Stay tuned!
@mm4tt @tenzen-y @astefanutti @robert-bell I am wondering if we should try to integrate Workload API into Trainer in the same way how we started with

At least that will allow us to iterate and play with the API. cc @kannon92
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Starting with a simple approach in alpha to get early feedback on how a given controller works with WAS APIs is reasonable. This aligns with other ongoing integrations, such as LWS PR #844 and KubeRay PR #4723. Once we have the KEP for the controller-side API for consuming WAS features, we can update these integrations to adopt it. This will provide users with more control over which specific WAS features are utilized.
That sounds good, @tenzen-y @robert-bell @VassilisVassiliadis @astefanutti @akshaychitneni any thoughts on this? We already have users who would benefit from this feature upstream.
Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Related: #3015
Ref issues: kubernetes/enhancements#4671, kubernetes/kubernetes#132192.
This is an initial KEP to support WAS in Kubeflow Trainer and TrainJob. We need to discuss whether we want to allow users to set the Workload spec in TrainJob as well.
The implementation should start after Kubernetes v1.36.
Project to track WAS in Kubernetes: Workload-aware & Topology-aware Workstream (view)
/assign @tenzen-y @astefanutti @akshaychitneni @robert-bell @kubeflow/kubeflow-trainer-team
/cc @kannon92 @helayoty @wojtek-t @klueska @mm4tt @vsoch @dom4ha @VassilisVassiliadis