[KEP-5440]: Mutable container resources on PodTemplates for suspended jobs #5441

Open. kannon92 wants to merge 1 commit into master from mutable-pod-template-on-suspend.

kannon92
Contributor

  • One-line PR description: Initial KEP draft for KEP-5440
  • Other comments: Need to reach consensus within sig-apps first.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 28, 2025
@k8s-ci-robot k8s-ci-robot requested review from kow3ns and soltysh June 28, 2025 17:18
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please assign janetkuo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Jun 28, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Jun 28, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 28, 2025
@kannon92 kannon92 force-pushed the mutable-pod-template-on-suspend branch from 2c9abe0 to f6e6006 Compare June 28, 2025 17:19
@kannon92
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 28, 2025
Introduce a new KEP to allow updating container resource
specifications (CPU, memory, GPU, extended resources) for suspended jobs.
Key features:

- Enable dynamic resource allocation for suspended jobs only
- Support CPU, memory, and GPU resource mutations
- Include extended resources (nvidia.com/gpu, amd.com/gpu, tpu-v4, etc.)
- Allow queue controllers to optimize resource allocation based on
  cluster conditions
- Feature gate: MutableJobPodResourcesForSuspendedJobs
- Focus on batch workload optimization scenarios

This proposal enables better cluster utilization and cost optimization
by allowing queue controllers to adjust job resource requirements before
execution based on real-time cluster capacity and resource availability.
Particularly valuable for expensive GPU and specialized hardware resources.
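
The rule described above (container resources in the Job's pod template may change only while the Job is suspended and the feature gate is on) can be sketched as a small validation check. This is an illustration only, not the KEP's implementation: the function names, the error strings, and the assumption that only the pre-update `spec.suspend` value matters are all hypothetical. The Job is modeled as a plain dictionary shaped like its manifest, so no Kubernetes client is needed.

```python
import copy

# Gate name taken from the commit message above.
FEATURE_GATE = "MutableJobPodResourcesForSuspendedJobs"


def container_resources(job):
    """Map container name -> resources dict from the Job's pod template."""
    containers = job["spec"]["template"]["spec"].get("containers", [])
    return {c["name"]: c.get("resources", {}) for c in containers}


def validate_resource_update(old_job, new_job, gate_enabled):
    """Return a list of error strings; an empty list means the update is allowed.

    Hypothetical rule (an assumption, not confirmed by the KEP text):
    resources may change only if the gate is enabled and the Job was
    suspended at the time of the update.
    """
    if container_resources(old_job) == container_resources(new_job):
        return []  # no resource change, nothing to validate here
    if not gate_enabled:
        return [f"pod template resources are immutable ({FEATURE_GATE} disabled)"]
    if not old_job["spec"].get("suspend", False):
        return ["pod template resources may only change while the Job is suspended"]
    return []


def example_job(suspend=True):
    """Build a minimal Job-shaped dict with one GPU container (illustrative values)."""
    return {
        "spec": {
            "suspend": suspend,
            "template": {"spec": {"containers": [
                {"name": "train",
                 "resources": {"requests": {"cpu": "4", "nvidia.com/gpu": "2"}}},
            ]}},
        }
    }


def with_gpu_request(job, count):
    """Return a copy of the Job with the first container's GPU request changed."""
    updated = copy.deepcopy(job)
    requests = updated["spec"]["template"]["spec"]["containers"][0]["resources"]["requests"]
    requests["nvidia.com/gpu"] = count
    return updated
```

A queue controller that wants to shrink a suspended Job from two GPUs to one would, under these assumptions, pass the check with `validate_resource_update(job, with_gpu_request(job, "1"), gate_enabled=True)`, while the same change on a running Job or with the gate disabled would be rejected.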
@kannon92 kannon92 force-pushed the mutable-pod-template-on-suspend branch from f6e6006 to b76ca88 Compare July 3, 2025 15:16
Labels
  • cncf-cla: yes. Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/hold. Indicates that a PR should not merge because someone has issued a /hold command.
  • kind/kep. Categorizes KEP tracking issues and PRs modifying the KEP directory.
  • sig/apps. Categorizes an issue or PR as relevant to SIG Apps.
  • size/XL. Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Triage
2 participants