
KEP for GangConfig (to support Gang Scheduling) #1068

Open
imreddy13 wants to merge 5 commits into kubernetes-sigs:main from imreddy13:main

Conversation

@imreddy13
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Proposal to add a new field GangConfig to ReplicatedJob API to support Gang Scheduling (KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling)

Which issue(s) this PR fixes:

Fixes #969

Does this PR introduce a user-facing change?

     NONE

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Oct 15, 2025
@netlify

netlify Bot commented Oct 15, 2025

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit 9710b12
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-jobset/deploys/69b9a3aa3b53bb0007955a4d

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 15, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 15, 2025
@k8s-ci-robot
Contributor

Hi @imreddy13. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 15, 2025
@imreddy13 imreddy13 mentioned this pull request Oct 15, 2025
3 tasks
@imreddy13 imreddy13 marked this pull request as draft October 15, 2025 22:16
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 15, 2025
@imreddy13 imreddy13 force-pushed the main branch 2 times, most recently from 7a09bc3 to da94057 Compare October 15, 2025 22:41
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 15, 2025
@imreddy13 imreddy13 marked this pull request as ready for review October 15, 2025 22:44
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 15, 2025
@imreddy13
Contributor Author

/assign @kannon92

@imreddy13
Contributor Author

imreddy13 commented Oct 15, 2025

/assign @ahg-g

@k8s-ci-robot
Contributor

@imreddy13: GitHub didn't allow me to assign the following users: ahg.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

/assign @AHG

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@imreddy13
Contributor Author

/unassign @danielvegamyhre

@imreddy13
Contributor Author

/assign @ahg-g

@andreyvelich
Member

Thanks for this @imreddy13!
I am wondering how we can stay aligned with what is happening with the Job API?

IIUC, @soltysh and @helayoty discussed that changes to the Job API will be introduced at a later stage, after the Workload API is available in v1.35: kubernetes/enhancements#5548 (comment)

cc @tenzen-y @astefanutti

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: imreddy13
Once this PR has been reviewed and has the lgtm label, please ask for approval from ahg-g. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@imreddy13
Contributor Author

Thanks for this @imreddy13! I am wondering how we can stay aligned with what is happening with the Job API?

IIUC, @soltysh and @helayoty discussed that changes to the Job API will be introduced at a later stage, after the Workload API is available in v1.35: kubernetes/enhancements#5548 (comment)

cc @tenzen-y @astefanutti

@andreyvelich I discussed this with @wojtek-t and it's possible the Job API might differ in terms of naming. The contract is if a top level controller (JobSet/LWS) creates a Workload object, the Job controller will not create a new Workload or change the workloadRef in the pod spec. JobSet will not rely on Job controller to create Workloads since we only plan to have 1 Workload per JobSet in all modes.

Is there anything specifically you are concerned about with Job?

@andreyvelich
Member

The contract is if a top level controller (JobSet/LWS) creates a Workload object, the Job controller will not create a new Workload or change the workloadRef in the pod spec.

Are we going to validate that, if users set a gang scheduling policy in the JobSet spec, the Job API must be omitted?

The contract is if a top level controller (JobSet/LWS) creates a Workload object

What is the recommendation for controllers on top of JobSet, like the TrainJob controller? Shall we rely on JobSet to create the Workload object, or should top-level controllers create it?

Is there anything specifically you are concerned about with Job?

My only concern is to make our APIs consistent with what we envision for Job controller in the future.


The JobSet API will support a new GangConfig field to specify if a JobSet, ReplicatedJob or Job replica should be gang scheduled. The user should set this field at the appropriate level (JobSet or ReplicatedJob) in the JobSet spec to indicate they require gang scheduling at that level. If the JobSet controller detects the GangConfig field in the JobSet spec, it will generate a single Workload object and associate the pod spec templates it generates with that Workload.

The Job controller should not create a new Workload or change the workloadRef in a pod spec if the JobSet controller has already set it. To differentiate the levels of gang scheduling (JobSet, ReplicatedJob or Job replica gangs), the JobSet controller will set different PodGroup and PodGroupReplicaIndex fields in the WorkloadReference spec per pod i.e.:
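The example this sentence introduces is not shown in this excerpt. As a hedged sketch, a pod generated for a ReplicatedJob-level gang might carry a workload reference along these lines (the field names `PodGroup` and `PodGroupReplicaIndex` come from the paragraph above; the exact manifest shape and values are assumptions, not the KEP's final API):

```yaml
# Hypothetical pod fragment; only the workload-reference fields are shown.
apiVersion: v1
kind: Pod
metadata:
  name: sample-jobset-replicated-job-1-0-0
spec:
  workload:
    name: sample-jobset          # the single Workload created by the JobSet controller
    podGroup: replicated-job-1   # the PodGroup this pod belongs to
    podGroupReplicaIndex: 0      # differentiates Job replicas within the group
```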

The Job controller should not create a new Workload or change the workloadRef in a pod spec if the JobSet controller has already set it.

Is this "Job controller" work in scope for this enhancement proposal ?


As an exercise in understanding the proposal: will this example result in all pods pointing to a Workload defined as follows?

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: sample-jobset
spec:
  podGroups:
  - name: podgroup
    replicas: 1
    policy:
      kind: Gang
      gang:
        minCount: 16
```

Contributor

@GiuseppeTT GiuseppeTT Oct 17, 2025


+1

Also, it would really help understanding of this KEP to have one example for each case (Workload set by the user, JobSet-level Workload, ReplicatedJob-level Workload, Job-level Workload). Each example should include the JobSet manifest and the resulting objects.

Contributor Author


Added to a section "Translating JobSet Workloads to Workloads", ptal @ricardomaraschini


As an exercise in understanding the proposal: is this going to result in a Workload like this?

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: sample-jobset
spec:
  podGroups:
  - name: replicated-job-1
    replicas: 1
    policy:
      kind: Gang
      gang:
        minCount: 8
  - name: replicated-job-2
    replicas: 1
    policy:
      kind: Gang
      gang:
        minCount: 3
```

With pods on replicated-job-1 and replicated-job-2 pointing to their respective (and different) pod groups ?


As an exercise in understanding the proposal: will this result in a Workload similar to the following?

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: sample-jobset
spec:
  podGroups:
  - name: replicated-job-1
    replicas: 2
    policy:
      kind: Gang
      gang:
        minCount: 4
  - name: replicated-job-2
    replicas: 3
    policy:
      kind: Gang
      gang:
        minCount: 3
```

With pods on replicated-job-1 and replicated-job-2 pointing to their respective (and different) pod groups + a different PodGroupReplicaIndex for each replica ?


@ingvagabund ingvagabund Oct 17, 2025


It could help to better understand the mapping by adding workload and pod examples for all three (resp. two as the first use case is for users to provide) use cases right before ### Implementation so it's obvious how the corresponding mappings work. The pod manifest can be simplified by showing only the relevant fields. E.g. .name, .spec.workload

Contributor Author


Added a new section with examples: "Translating JobSet Workloads to Workloads"


```go
// GangConfig should be specified if all ReplicatedJobs in this JobSet
// should be scheduled as a gang (i.e. all at once).
GangConfig GangConfig `json:"gangConfig,omitempty"`
```

The examples above are using scheduleAtOnce and scheduleAtOncePerReplica instead.


+1 for unifying the keywords


#### Workload Creation

The JobSet controller `Reconcile()` loop will be updated to generate a single Workload object per JobSet if the `GangConfig` field is specified on the JobSetSpec or ReplicatedJob.

The EP mentions above:

Note that in v1, we will only support the API change for ReplicatedJob and not JobSetSpec.

Is the plan for the controller to create a Workload for the whole JobSetSpec when that field is populated?


#### Workload Deletion

When the JobSet is deleted, the JobSet controller will also delete the Workload it created for that JobSet.

Was using owner references considered here ?

Contributor

+1 to using owner references. It's more robust and easier to implement.

#### Defaulting/Validation

- `jobSetSpec.GangConfig.GangMode` and `replicatedJob.GangConfig.GangMode` cannot be specified at once (except to `GangModeOff`).
- `jobSetSpec.GangConfig.GangMode` and `replicatedJob.GangConfig.GangMode` are immutable.

`jobSetSpec.GangConfig.GangMode == "Gang"` combined with multiple replicatedJobs that leverage DependsOn seems to need a validation here.

name: sample-jobset
spec:
replicatedJobs:
- name: replicated-job-1

nit: more indentation here

1. `jobSetSpec.GangConfig.GangMode` options:
- `GangModeOff`: no gang scheduling
- `GangModeGang`: all pods in the JobSet are in a gang
2. `replicatedJob.GangConfig.GangMode` options:

replicatedJob.GangConfig.GangMode -> replicatedJob[].GangConfig.GangMode (to comply with jq accessing an array)

Might make sense to prepend beginning of each path with .. E.g. .jobSetSpec.GangConfig.GangMode
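For concreteness, a minimal sketch of how a ReplicatedJob-level gang mode might look in a manifest (field casing follows the `gangConfig` json tag shown in the diff; the `gangMode` value is illustrative, since the full list of ReplicatedJob-level modes is truncated in this excerpt):

```yaml
# Hypothetical manifest fragment; field shapes and the enum value are assumptions.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: sample-jobset
spec:
  replicatedJobs:
  - name: replicated-job-1
    replicas: 2
    gangConfig:
      gangMode: GangModeGang   # illustrative; one of the GangMode options above
    template:
      spec:
        parallelism: 4
```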

@ingvagabund ingvagabund Oct 17, 2025

It could help to better understand the mapping by adding workload and pod examples for all three (resp. two as the first use case is for users to provide) use cases right before ### Implementation so it's obvious how the corresponding mappings work. The pod manifest can be simplified by showing only the relevant fields. E.g. .name, .spec.workload


```go
// GangConfig should be specified if all ReplicatedJobs in this JobSet
// should be scheduled as a gang (i.e. all at once).
GangConfig GangConfig `json:"gangConfig,omitempty"`
```

+1 for unifying the keywords

@ingvagabund

Should the controller generate an event every time a workload object is created/updated/deleted?


## Proposal

The JobSet API will support a new GangConfig field to specify if a JobSet, ReplicatedJob or Job replica should be gang scheduled. The user should set this field at the appropriate level (JobSet or ReplicatedJob) in the JobSet spec to indicate they require gang scheduling at that level. If the JobSet controller detects the GangConfig field in the JobSet spec, it will generate a single Workload object and associate the pod spec templates it generates with that Workload.
@ingvagabund ingvagabund Oct 17, 2025

if a JobSet, ReplicatedJob or Job replica should be gang scheduled.

Worth mentioning a replicated job is a collection of jobs. Resp. a replicated job consists of one or two job replicas (I wonder what's the right definition here). So the hierarchy is quickly obvious. A diagram/picture of the hierarchy would help a lot in speeding up the first understanding. E.g. slightly updating https://kubernetes.io/blog/2025/03/23/introducing-jobset/#how-jobset-works and drawing the available gang groups.

@GiuseppeTT
Contributor

What is the recommendation for controllers on top of JobSet, like TrainJob controller ? Shall we rely on JobSet to create the Workload object, or top-level controllers should create it ?

That's a good question.

IMO it makes sense to let the uppermost workload-like object (JobSet, Job, etc.) create the Workload object.

For instance, if we have a stack like

  • TrainJob
  • JobSet (uppermost workload-like object)
  • Jobs
  • Pods

it makes sense for JobSet to create the workload object.

If instead we have

  • TrainJob
  • Job without a parent JobSet (uppermost workload-like object)
  • Pods

It makes sense for Job to create the workload object.

### Graduation Criteria

- `kube-scheduler` changes and `Workload` API alpha are targeted to `1.35`.
- Support for `replicatedJob.GangConfig` will be alpha in 1.35 behind an alpha feature gate and graduated to stable when `Workload` API is stable.
Contributor

We should have a feature gate for JobSet with this KEP that matches the name of the gang scheduling feature gate.

But we will want to keep this feature disabled by default until workload API is stable on all supported k8s versions of JobSet.

We will need to introduce feature gate handling in JobSet but I think this feature warrants this if we want to merge this in 1.35.

1. If `jobSetSpec.GangConfig.GangMode == GangModeGang`, all JobSet pods are one gang:
- Create a single `PodGroup` for the JobSet.
- Set `Replicas` to 1
- Set the `minCount` in the `Gang` to `#replicated jobs * #job replicas * #pods per Job (parallelism)`. (`minCount` is the number of pods that should be ready before the gang is scheduled).
Contributor

Technically it's sum(replicatedJob.replicas * replicatedJob.template.spec.parallelism for replicatedJob in jobSet.spec.replicatedJobs)
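The corrected formula can be sketched as a small helper (hypothetical names; `replicas` and `parallelism` mirror the fields named in the comment above):

```python
# Hypothetical helper illustrating the corrected minCount formula:
# sum over all ReplicatedJobs of replicas * parallelism.
def jobset_gang_min_count(replicated_jobs):
    """replicated_jobs: list of dicts with 'replicas' and 'parallelism' keys."""
    return sum(rj["replicas"] * rj["parallelism"] for rj in replicated_jobs)

# Example: two ReplicatedJobs -> 2*4 + 3*1 = 11 pods in the JobSet-level gang.
print(jobset_gang_min_count([
    {"replicas": 2, "parallelism": 4},
    {"replicas": 3, "parallelism": 1},
]))  # → 11
```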

spec:
workload:
name: w-sample-jobset
podGroup: pg-sample-jobset
Contributor

nit: the indentation is off

metadata:
name: sample-jobset
spec:
scheduleAtOnce: true
Member

Rather than having this API in the rJob and JobSet level, could we just define PodGroupPolicies API which has the target jobs parameter? Something like this:

```yaml
podGroupPolicies:
  - name: initializer
    targetReplicatedJobs: ["dataset-initializer"]
  - name: mpi-trainer
    targetReplicatedJobs: ["launcher", "node"]
```

Single Workload for all rJob:

```yaml
podGroupPolicies:
  - name: gang-group
    targetReplicatedJobs: []
```

That will make us consistent with other JobSet APIs and VolumeClaimPolicies: #1062

WDYT @imreddy13 @GiuseppeTT @kannon92 @tenzen-y @astefanutti ?

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 20, 2025
@helayoty

helayoty commented Dec 8, 2025

@andreyvelich @imreddy13 @GiuseppeTT @kannon92 @tenzen-y @astefanutti I put some thoughts together about how the Workload API should be integrated with different true workloads (i.e., Job, JobSet, LWS, etc.) and I would really appreciate your feedback.
https://docs.google.com/document/d/1oGa_zA1HSlvoAsR-Sks9NBlHe6ytDh5EdxMqmmJNJbE/edit?usp=sharing

@kannon92
Contributor

@andreyvelich @imreddy13 @GiuseppeTT @kannon92 @tenzen-y @astefanutti I put some thoughts together about how the Workload API should be integrated with different true workloads (i.e., Job, JobSet, LWS, etc.) and I would really appreciate your feedback. https://docs.google.com/document/d/1oGa_zA1HSlvoAsR-Sks9NBlHe6ytDh5EdxMqmmJNJbE/edit?usp=sharing

In thinking through this, I wanted to sketch out design/implementation for what I am thinking for Workload integration.

#1111

@GiuseppeTT
Contributor

/retest

One of the failures is a flake.

The other is due to the table of content not being up to date

Checking table of contents are up to date...
2026/01/30 22:04:23 keps/808-GangConfig/README.md: changes found:
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [API proposal](#api-proposal)
  - [Implementation](#implementation)
    - [Workload Creation](#workload-creation)
    - [Pod Workload Reference Creation](#pod-workload-reference-creation)
    - [Workload Deletion](#workload-deletion)
    - [Defaulting/Validation](#defaultingvalidation)
  - [Translating JobSet Workloads to Workloads](#translating-jobset-workloads-to-workloads)
    - [JobSet Level Gang](#jobset-level-gang)
    - [ReplicatedJob Level Gang](#replicatedjob-level-gang)
    - [Replica Level Gang](#replica-level-gang)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
    - [Multiple Workload Objects per ReplicatedJob and Replica](#multiple-workload-objects-per-replicatedjob-and-replica)
Table of content not up to date. If this failed silently and you are on mac, try 'brew install grep'

@kannon92 kannon92 moved this from Needs Review to In Review in Workload-aware & Topology-aware Workstream Mar 16, 2026
@kannon92 kannon92 moved this from In Review to In Progress in Workload-aware & Topology-aware Workstream Mar 16, 2026
@kannon92 kannon92 moved this from In Progress to Backlog in Workload-aware & Topology-aware Workstream Mar 16, 2026
@kannon92
Contributor

Hey @imreddy13

K8s should release 1.36 tomorrow so I think we can start this work up. Are you still able to take this on?

@k8s-ci-robot
Contributor

@imreddy13: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-jobset-verify-main 9710b12 link true /test pull-jobset-verify-main
pull-jobset-test-e2e-main-1-36 9710b12 link true /test pull-jobset-test-e2e-main-1-36

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

Gang Scheduling of JobSets

9 participants