
[WIP] feat(api): KEP-3015: Workload Aware Scheduling for TrainJob#3219

Open
andreyvelich wants to merge 9 commits into kubeflow:master from andreyvelich:was-kep

Conversation

@andreyvelich
Member

Related: #3015
Ref issues: kubernetes/enhancements#4671, kubernetes/kubernetes#132192.

This is the initial KEP to support WAS in Kubeflow Trainer and TrainJob. We need to discuss whether we want to allow users to set the Workload spec in TrainJob as well.

The implementation should start after Kubernetes v1.36.

Project to track WAS in Kubernetes: Workload-aware & Topology-aware Workstream (view)

/assign @tenzen-y @astefanutti @akshaychitneni @robert-bell @kubeflow/kubeflow-trainer-team

/cc @kannon92 @helayoty @wojtek-t @klueska @mm4tt @vsoch @dom4ha @VassilisVassiliadis

Copilot AI review requested due to automatic review settings February 17, 2026 19:08
@google-oss-prow google-oss-prow Bot requested a review from kannon92 February 17, 2026 19:08
@google-oss-prow

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: mm4tt, vsoch, dom4ha, VassilisVassiliadis, helayoty, wojtek-t, klueska.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Adds an initial KEP/proposal document for integrating Kubernetes Workload API–based workload-aware (gang) scheduling into Kubeflow Trainer’s TrainJob via a new PodGroupPolicy plugin, as groundwork for future implementation.

Changes:

  • Introduces a new KEP (3015) describing Workload/PodGroup orchestration for TrainJob scheduling.
  • Documents intended API surface (podGroupPolicy.workload) and controller/plugin responsibilities across build/enforcement/watch phases.
  • Outlines lifecycle, RBAC, feature gates, and a test plan for the proposed integration.

@coveralls

coveralls commented Feb 17, 2026

Pull Request Test Coverage Report for Build 22145283120

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.998%

Totals:
  • Change from base Build 22081023611: 0.0%
  • Covered Lines: 1391
  • Relevant Lines: 2484

💛 - Coveralls

@vsoch
Contributor

vsoch commented Feb 18, 2026

I'm a little caught up on this discussion:

kubernetes-sigs/jobset#1068 (comment)

If the Workload is intended to be integrated with JobSet, are there ever cases of creating TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?

@kannon92
Contributor

> I'm a little caught up on this discussion:
>
> kubernetes-sigs/jobset#1068 (comment)
>
> If the Workload is intended to be integrated with JobSet, are there ever cases of creating TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?

It's a hard discussion to follow.

The general stance now is that the API that a user creates (sometimes known as the true workload) should probably manage the PodGroups / Workloads.

For Trainer we could maybe consider delegating this to JobSet for JobSet-based TrainingRuntimes, but that doesn't solve the problem for Flux or whatever other backend we have.

> What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?

The main outcome would be that if there exists a workloadRef or PodGroupRef on the pod template, we are to assume that WAS scheduling is "handled". But we haven't updated the KEP for JobSet. That was at least what we were discussing on all the features.

@vsoch
Contributor

vsoch commented Feb 18, 2026

The FluxPolicy implementation is just tweaking a JobSet, so I think if support was added to that underlying abstraction it would be inherited by Flux. I want to make sure we are careful to (essentially) not implement this layer twice, but rather just once at the most appropriate level (maybe JobSet?).

@andreyvelich
Member Author

Yes, @kannon92 is right – we’ve had several discussions about this over the past few months.

In general, higher-level controllers should manage Workload and PodGroup objects, since they have the most complete understanding of the AI application they are orchestrating. For example:

For now, I suggest that we add validation in the TrainJob controller to ensure users do not configure the Workload API directly in JobSet or Job objects.

@mm4tt mm4tt left a comment

Huge thanks for working on this integration. Having working examples of WAS being adopted by real-workload controllers gives us the confidence we need to push WAS to Beta in 1.37. Great stuff!


```go
// Workload plugin using native Kubernetes Workload API for gang-scheduling
// Requires Kubernetes v1.35+ with GenericWorkload feature gates enabled.
Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"`
```
Copy link
Copy Markdown

@mm4tt mm4tt Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, this approach differs from the Job KEP (kubernetes/enhancements#5871), where we deliberately avoided Job API changes and simply enabled gang-scheduling by default in specific cases.

That said, introducing an explicit opt-in API here makes sense for now, as it naturally fits into the existing structure (e.g., gang-scheduling via coscheduling or Volcano).

One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling. We want to avoid a fragmented ecosystem where every controller exposes its own API for WAS. Because of this, the API proposed here will likely need to be adjusted once we agree on the final shape of the common "WAS-controller API".


While I agree with what you wrote, I don't think this is a concern here.

The change to this API is not an opt-in for gang-scheduling or an opt-in for creation of the Workload API.
This is rather a decision about "which scheduling provider should be used".

@andreyvelich - please correct me if I'm wrong, but the way I understand it is that:

  1. The inherent nature of TrainJob is that it should be gang-scheduled.
  2. Up until now, there were two mechanisms to achieve it - either via the coscheduling plugin or via Volcano.
  3. In order to achieve the required semantics, users have to set it (in a way that reflects their cluster setup) - based on that value, corresponding objects were created/modified.
  4. With this KEP, we're not changing that nor introducing any new API. We're just saying that there is a new way of achieving the inherent gang-scheduling requirement. That underneath translates to creation of the Workload/PodGroup objects, but that's an implementation detail.

At some point in the future, it would be great if we could make the Workload a default mode (because it will work out-of-the-box in any k8s cluster) and not require this field at all, but that's just a usability improvement.
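To make the "which scheduling provider" framing concrete, here is a sketch of how the selector could look in the runtime policy. The `coscheduling` shape follows the existing Trainer `podGroupPolicy`; the `workload: {}` variant and its exact field name are assumptions based on this KEP's proposal, not a final API:

```yaml
# Sketch: only the provider selector changes; the gang-scheduling intent is the same.
# Existing provider (current Trainer API):
podGroupPolicy:
  coscheduling:
    scheduleTimeoutSeconds: 120
---
# Proposed provider (field shape assumed from this KEP):
podGroupPolicy:
  workload: {}
```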


That's a fair point - it's a choice of scheduling provider. However, under the hood, it effectively acts as an opt-in for the Workload API.

If we establish a common opt-in/opt-out mechanism for WAS integration in the future (1.37), we wouldn't want TrainJob using a custom field while other controllers use the standard approach. Even if the final solution is exactly as you suggested—making Workload the default and dropping this field—it's still an API adjustment we'll need to make before Beta.


> However, under the hood, it effectively acts as an opt-in for the Workload API.

My mental model is that it's actually not an opt-in for the Workload API.
For me it's "define my provider".

I will throw a parallel example here. Let's imagine Karpenter vs CustomComputeClass. Underneath, for many functionalities you can achieve the same, but the API representation is very different.
If you have an API concept where you just state "karpenter" or "ccc" and the necessary objects are created for you behind the scenes - that's actually pretty compelling.

In other words - you want certain specific functionalities, but these may be backed by different implementations/components in different environments. And you need to tell your controller which one it should talk to to achieve what you need.

So for me it's actually not an opt-in for the Workload API.

Member Author

> One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling

This is pretty interesting, do we have any work-in-progress doc to explore this?

> please correct me if I'm wrong, but the way I understand it is that:

@wojtek-t Yes, your understanding is correct (users can also get gang-scheduling for TrainJob with Kueue or the KAI Scheduler).
As I mentioned in the future plans, we might want to re-use this API for other WAS features like Topology-Aware Scheduling / DRA, so we should see whether this API makes sense for future extensibility.

@wojtek-t wojtek-t left a comment

Thanks a lot for helping us to validate the Workload/PodGroup design!

```yaml
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
```

In the Job integration KEP, we made this explicitly a controller: true ownerRef.
Same for the PodGroup.

I think it would be highly beneficial to unify that.
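As a sketch, unifying with the Job integration KEP's convention would mean marking the TrainJob ownerRef on the Workload (and PodGroup) as the controlling owner; the `uid` below is a placeholder, and the exact shape is an illustration rather than final KEP text:

```yaml
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
  uid: "<trainjob-uid>"     # placeholder
  controller: true
  blockOwnerDeletion: true
```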

```yaml
spec:
  schedulingGroup:
    podGroupName: my-job-trainer-<hash>
```

Similarly, we add an additional ownerRef from the Pods to the PodGroup.

That is useful to ensure that Pods will get deleted if the PodGroup gets deleted...
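A sketch of what that could look like on the Pod side: a second, non-controlling ownerRef pointing at the PodGroup, so that deleting the PodGroup garbage-collects the Pods. The PodGroup name reuses the `my-job-trainer-<hash>` placeholder from the snippet above, and the `uid` is a placeholder:

```yaml
metadata:
  ownerReferences:
  # controlling ownerRef (e.g. the Job that created the Pod) omitted for brevity
  - apiVersion: scheduling.k8s.io/v1alpha1
    kind: PodGroup
    name: my-job-trainer-<hash>
    uid: "<podgroup-uid>"     # placeholder
```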

Member Author

@andreyvelich andreyvelich Feb 23, 2026



1. **Scheduling**: The kube-scheduler uses the Workload Scheduling Cycle to process entire PodGroups
atomically, ensuring all Pods in a gang are scheduled together.

1. **Suspension**: When the TrainJob is suspended, Workload and PodGroup resources are preserved.

I think that eventually we should actually delete the PodGroup on suspension - the PodGroup represents a runtime unit, and when the TrainJob is suspended we don't really have any runtime instance.

The question is what to do with the Workload - given it's effectively a policy definition, I would say that we should keep it. But I don't have a very strong opinion yet. I think that @kannon92 was suggesting removing that too.

I don't think that this comment is strictly an Alpha requirement, but I think we should rethink the pattern before Beta.

Contributor

The API has changed so much in a month..

When MinCount was in the policy definition (workload), we had to clean it up on suspend. I have gotten requests that a suspended workload should be mutable. We have use cases where users want to right-size a workload to match the actual usage (probably difficult, but there are some internal services trying to guess how many GPUs you actually need for an LLM versus how much you requested).

I also want the ability to have an LLM (or user) potentially modify a suspended workload so the workload could be admitted sooner (say you request 8 nodes with 64 GPUs, but maybe your workload can be admitted with 4 pods and 32 GPUs sooner). So they should be able to modify the parallelism and the requests accordingly.

I have to check again with the Workload API (with TAS / DRA / etc.) about what information we would need to modify in the Workload API if a workload is suspended.

Member Author

> So they should be able to modify the parallelism and the requests accordingly.

Yes, we also discussed similar use-cases with @VassilisVassiliadis: #2599 (comment).
An external service (e.g. an LLM) can decide the optimal number of GPUs for a TrainJob.
That will require first submitting the TrainJob in a suspended state, triggering the classifier model, setting the correct container resources, and un-suspending the TrainJob.

> I think that eventually we should actually delete PodGroup on suspension - PodGroup is representing a runtime unit and when TrainJob is suspended we don't really have any runtime instance.

That makes sense, but this decision will depend on the mutability of Workload fields (e.g. minCount).
If we decide to keep the Workload API immutable, we might need to re-create it once resources are changed during TrainJob suspension.


I'm pretty convinced that minCount should be a mutable field (not only for suspended workloads).
But despite that, it's only about the Workload.

But I think no matter what, we should always delete the PodGroup(s) if we suspend the workload.

Member Author

That makes sense!


1. Replace existing Coscheduling or Volcano plugins - this is an additional scheduling option
1. Support all Workload API features immediately - focus on core gang scheduling first
1. Support Kubernetes versions < 1.35 - Workload API requires v1.35+

Actually, in 1.36 the Workload API was fundamentally changed by splitting it into Workload and PodGroup APIs. I would recommend integrating with the 1.36 version of the API, as this is the shape we'll be supporting in the future.
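For orientation, a minimal sketch of the split 1.36 shape: a Workload carrying the policy definition plus a PodGroup as the runtime unit. The API is alpha, and everything here beyond `apiVersion`/`kind` is an assumption that may not match the final shape:

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: my-job            # policy definition (e.g. where minCount placement is discussed)
---
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-job-trainer    # runtime unit consumed by the scheduler
```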

Member Author

Good point, let me change it!


```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroup
```

Yeah, so this is available only in 1.36+. You should update the text above that talks about supporting 1.35+.

```yaml
numNodes: 1
mpi:
  numProcPerNode: 4
  mpiImplementation: OpenMPI
```

Isn't this missing:

```yaml
podGroupPolicy:
  workload: {}
```

?


```go
func (w *Workload) EnforcePodGroupPolicy(info *runtime.Info, trainJob *trainv1alpha1.TrainJob) error {
	// 1. Set schedulingGroup.podGroupName in all Pod templates
```

This design works as long as there is a 1:1 mapping between a TrainJob and a PodGroup.

However, we must consider more advanced WAS requirements that go beyond a single global PodGroup.

I think a non-theoretical use case involves a replicatedJob (e.g., "trainer") with replicas: N (N>1). If we need each replica to be placed within its own isolated topology domain, such as a specific rack or a distinct NVLinkDomain, then we need N PodGroups - one for each job replica. And in such a scenario we have no way of setting schedulingGroup.podGroupName properly. The only way to do that is to have the Job controller create the PodGroup. This is the scenario we're discussing in https://docs.google.com/document/d/1VG7Zto9JYuPG4Anb01WMRryJlfV6met0jgob3T2NjZ4/edit?tab=t.0#heading=h.3s8c0yl3azl9
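To illustrate the N-replica case: with replicas: 2 on a "trainer" replicatedJob and per-replica topology isolation, two PodGroups would be needed, which a single TrainJob-level podGroupName cannot express. The names below are purely illustrative:

```yaml
# Illustrative only: one PodGroup per job replica for isolated topology domains
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-job-trainer-0    # pods of replica 0, e.g. pinned to one rack
---
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-job-trainer-1    # pods of replica 1, pinned to a different rack
```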

Member Author

Yeah, that’s a good point. @mm4tt @wojtek-t, do you still prefer the integration approach where users can configure schedulingPolicy across the different abstraction layers (TrainJob, JobSet, Job)?

Member Author

Hey @mm4tt, as we discussed, do you have any thoughts on this?
Also, would love to hear your thoughts on: #3219 (comment)


TBH, I no longer believe the vision presented in the WAS Controller Integration API Design is feasible within a reasonable timeframe. While it provides a clean long-term API surface, it suffers from the critical capability leaks and upstream dependency bottlenecks described in the WAS Controller API Alternatives Doc.

I will follow up with a KEP draft next week. Overall, my plan for this KEP is now structured as follows:

  1. Define reusable API building blocks (e.g., scheduling policies, topology constraints, disruption modes) under scheduling.k8s.io to be consumed on the controller side.
  2. Define integration recommendations, propose solutions to common architectural problems, and provide shared libraries to handle boilerplate.
  3. Leave the final API decisions to controller owners. They can follow our conventions or utilize established patterns native to their APIs (like targetReplicatedJobs for JobSet).
  4. While this approach results in greater API fragmentation, it is a necessary trade-off to bypass dependency bottlenecks that would block TrainJob integration for months while waiting for JobSet and Job support.

Going back to the specific issue of workloads with three or more levels: the only viable solution is delegating PodGroup creation to the child controller. The KEP will outline recommendations for passing the required context downstream, such as providing the Workload or PodGroup template to be used. Stay tuned!

Member Author

Looks great, thanks for your work @mm4tt!

@andreyvelich
Member Author

andreyvelich commented Apr 21, 2026

@mm4tt @tenzen-y @astefanutti @robert-bell I am wondering if we should try to integrate the Workload API into Trainer in the same way we started with batch/Job initially?
E.g. when the FeatureFlag: WorkloadAwareScheduling is enabled, we just create the default Workload + PodGroup based on the TrainJob and ClusterTrainingRuntime configuration.
Initially, we only create it for the GangScheduling capability.

At least that will allow us to iterate and play with the API.

cc @kannon92

andreyvelich and others added 8 commits April 22, 2026 00:44
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@helayoty helayoty moved this from Backlog to In Review in Workload-aware & Topology-aware Workstream May 12, 2026
@mm4tt

mm4tt commented May 14, 2026

> @mm4tt @tenzen-y @astefanutti @robert-bell I am wondering if we should try to integrate Workload API into Trainer in the same way how we started with batch/Job initially? E.g. when FeatureFlag: WorkloadAwareScheduling is enabled, we just create the default Workload + PodGroup based on TrainJob and ClusterTrainingRuntime configuration. Initially, we only create it for GangScheduling capability.
>
> At least that will allow us to iterate and play with the API.
>
> cc @kannon92

Starting with a simple approach in alpha to get early feedback on how a given controller works with WAS APIs is reasonable. This aligns with other ongoing integrations, such as LWS PR #844 and KubeRay PR #4723.

Once we have the KEP for the controller-side API for consuming WAS features, we can update these integrations to adopt it. This will provide users with more control over which specific WAS features are utilized.

@andreyvelich
Member Author

> Starting with a simple approach in alpha to get early feedback on how a given controller works with WAS APIs is reasonable. This aligns with other ongoing integrations, such as kubernetes-sigs/lws#844 and ray-project/kuberay#4723.

That sounds good, @tenzen-y @robert-bell @VassilisVassiliadis @astefanutti @akshaychitneni any thoughts on this?
Shall we target this as part of the Trainer v2.3 release?

We already have users who would benefit from this feature upstream.

Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>


9 participants