diff --git a/keps/sig-apps/5440-mutable-job-pod-resource-updates/README.md b/keps/sig-apps/5440-mutable-job-pod-resource-updates/README.md new file mode 100644 index 00000000000..c20cc201730 --- /dev/null +++ b/keps/sig-apps/5440-mutable-job-pod-resource-updates/README.md @@ -0,0 +1,537 @@ + +# KEP-5440: Allow updating pod template resources (CPU, memory, GPU, extended resources) of suspended jobs + + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +In [#2232](https://github.com/kubernetes/enhancements/issues/2232) we added a new flag +to allow suspending jobs to control when the Pods of a Job get created by controller-manager. +This was proposed as a primitive to allow a higher-level queue controller to implement +job queuing: the queue controller unsuspends the job when resources become available. 
+ +To complement the above capability, a queue controller may also want to control the +resource requirements of a job based on current cluster capacity or resource availability. +For example, it may want to adjust CPU, memory, and GPU requests/limits based on available +node capacity, allocate specific extended resources like TPUs or FPGAs, optimize resource +allocation for better cluster utilization, or modify resource requirements based on queue +priority and cluster load. + +This is a proposal to relax update validation on suspended jobs to allow mutating +resource specifications in the job's pod template, specifically CPU, memory, GPU, +and other extended resource requests and limits. This enables a higher-level queue +controller to optimize resource allocation before un-suspending a job based on +current cluster conditions and resource availability. + +## Motivation + +Most Kubernetes batch workloads have dynamic resource requirements that may not be +known at job creation time. The optimal resource allocation for a job often depends +on current cluster conditions, available capacity, and queue priorities that change +over time. This is especially true for GPU and other specialized hardware resources +which are expensive and have limited availability. + +We made the first step towards achieving better resource management by introducing the +`suspend` flag to the Job API, which allowed a queue controller to decide when a job +should start. However, once a job's resource requirements are set at creation time, +there's no way to optimize them based on actual cluster conditions when the job is +ready to run. + +Adding the ability to mutate a job's resource requirements while it's suspended gives +a queue controller the ability to optimize resource allocation based on real-time +cluster conditions, improve overall cluster utilization, and ensure jobs are sized +appropriately for current capacity constraints. 
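
The relaxed update rule described in the Summary can be sketched roughly as follows. This is an illustrative Python model over plain dictionaries, not the kube-apiserver implementation (which is written in Go). The function names, the `gate_enabled` parameter, the restriction to `containers`, and the choice to read `suspend` from the stored (old) object are all assumptions made for the example.

```python
import copy


def strip_requests_and_limits(template: dict) -> dict:
    """Return a deep copy of a pod template without container requests/limits."""
    stripped = copy.deepcopy(template)
    for container in stripped.get("spec", {}).get("containers", []):
        resources = container.get("resources")
        if resources is None:
            continue
        resources.pop("requests", None)
        resources.pop("limits", None)
        # Treat an empty resources block the same as a missing one.
        if not resources:
            del container["resources"]
    return stripped


def template_update_allowed(old_job: dict, new_job: dict, gate_enabled: bool) -> bool:
    """Model of the proposed rule: on a suspended job, only requests/limits may change."""
    old_tpl = old_job["spec"]["template"]
    new_tpl = new_job["spec"]["template"]
    if old_tpl == new_tpl:
        return True  # nothing in the template changed
    # Assumption: suspension is checked on the stored (old) object.
    suspended = old_job["spec"].get("suspend") is True
    if not (gate_enabled and suspended):
        return False  # template stays immutable, as it is today
    return strip_requests_and_limits(old_tpl) == strip_requests_and_limits(new_tpl)


def make_job(suspend: bool, cpu: str, image: str = "busybox") -> dict:
    """Hypothetical helper that builds a minimal Job-shaped dictionary."""
    container = {"name": "main", "image": image, "resources": {"requests": {"cpu": cpu}}}
    return {"spec": {"suspend": suspend, "template": {"spec": {"containers": [container]}}}}
```

Under these assumptions, changing the CPU request of a suspended job is accepted when the gate is on, while the same change on a running job, or a change to any other template field such as the image, is rejected.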
+ + +### Goals + +- Allow mutating CPU, memory, GPU, and extended resource requests and limits of containers within the PodTemplate of a suspended job. +- Enable queue controllers to optimize resource allocation based on cluster conditions. +- Improve cluster resource utilization through dynamic resource sizing, especially for expensive GPU and specialized hardware. + +### Non-Goals + +- Implement a queue controller. +- Allow mutating resource specifications of jobs that are currently running. This could + disrupt running workloads and complicate resource management. +- Allow mutating resource specifications of pods directly. +- Allow mutating other job specifications beyond container resource requirements. +- Support in-place pod resource updates (this is covered by separate KEPs). +- Allow mutating of Pod Resources. + +## Proposal + +The proposal is to relax update validation for container resource specifications +(CPU, memory, GPU, and extended resource requests and limits) in the pod template of suspended jobs. + +This change has minimal impact on the job-controller, as the job controller will +use the updated resource specifications when creating new pods for the job. + +### User Stories (Optional) + +#### Story 1 + +I want to build a controller that implements job queueing with dynamic resource optimization. +Users create v1.Job objects, and to control when the job can run, I have a webhook that +forces the jobs to be created in a suspended state. The controller analyzes current +cluster capacity and adjusts job resource requirements to optimize cluster utilization +before unsuspending them. + +At job creation time, users may specify conservative resource estimates or may not know +the optimal resource allocation for current cluster conditions. The queue controller can +analyze available capacity, other queued jobs, and cluster utilization patterns to +determine optimal CPU, memory, and GPU allocations. 
For example, it might adjust the number of GPUs based on current +availability. By updating the job's resource requirements before unsuspending it, the +controller ensures efficient resource utilization and better cluster throughput. + +### Risks and Mitigations + +- New API calls from queue controllers to update resource specifications. The mitigation + is for such controllers to make a single API call for both updating resources and + unsuspending the job. + +- Potential for resource specification changes to make a job unschedulable if the + updated requirements exceed available cluster capacity. Queue controllers should + validate resource availability before making changes. + +- A race condition could theoretically happen if a job is unsuspended and then quickly + suspended again before resource updates, though this is not a typical usage pattern. + +## Design Details + +The pod template validation logic in the API server needs to be updated to relax the validation +of the Job's Template field. Currently the template is immutable, but we need to make +container resource specifications (CPU, memory, GPU, and extended resource requests and limits) mutable for suspended jobs. + +The condition we will check to verify that the job is suspended is `Job.Spec.Suspend=true`. + +We will allow updates to the following fields in container specifications within the pod template: +- `resources.requests.cpu` +- `resources.requests.memory` +- `resources.requests.*` (for extended resources like `nvidia.com/gpu`, `amd.com/gpu`, `tpu-v4`, etc.) +- `resources.limits.cpu` +- `resources.limits.memory` +- `resources.limits.*` (for extended resources like `nvidia.com/gpu`, `amd.com/gpu`, `tpu-v4`, etc.) + +### Test Plan + +- Unit and integration tests verifying that: + - Container resource specifications are not mutable for active (non-suspended) jobs. + - Container resource specifications (CPU, memory, GPU, extended resources) are mutable only for suspended jobs. 
+ - Job controller observes the resource updates and creates pods with the new resource specifications. + - Resource validation still applies (e.g., limits >= requests) for all resource types including extended resources. + +#### Unit tests + +- `k8s.io/kubernetes/pkg/registry/batch/job/`: `1/30/2023` - `76.8%` + +#### Integration tests + +We will add the following test scenarios to kubernetes/test/integration/jobs. + +- When a job is suspended and the feature gate is enabled, resources can be mutated. +- When a job is not suspended and the feature gate is enabled, resource updates are rejected. +- When the feature gate is disabled and the job is suspended, mutations are not allowed. + +#### e2e tests + +Integration tests offer enough coverage. + +### Graduation Criteria + +We will release the feature directly in Beta state. Because the feature is opt-in and doesn't add +a new field, there is no benefit in having an alpha release. + +#### Beta + +- Feature implemented behind a feature flag +- Unit and integration tests passing + +#### GA + +- Fix any reported bugs + + + +### Upgrade / Downgrade Strategy + +No changes are required to existing clusters to use this feature. + +### Version Skew Strategy + +N/A. This feature doesn't impact nodes. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: MutableJobPodResourcesForSuspendedJobs + - Components depending on the feature gate: kube-apiserver +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + +Yes, it relaxes validation of updates to jobs while they are suspended. 
Specifically, it will allow +mutating the container resource specifications (CPU, memory, GPU, and extended resource +requests and limits) in the pod template of suspended jobs. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. If disabled, kube-apiserver will start rejecting updates to container resource +specifications in job pod templates. + +###### What happens if we reenable the feature if it was previously rolled back? + +kube-apiserver will accept container resource specification updates for suspended jobs. + +###### Are there any tests for feature enablement/disablement? + +No. There are unit tests verifying behavior with feature gate on and off. + +We have integration tests verifying the behavior with the feature gate on and off. + +See [integration-tests](#integration-tests) for more details. + +### Rollout, Upgrade and Rollback Planning + +###### How can a rollout or rollback fail? Can it impact already running workloads? + +The change is opt-in and only affects suspended jobs, so it doesn't impact already +running workloads. However, problems with the updated validation logic may cause +crashes in the apiserver. + +###### What specific metrics should inform a rollback? + +Crashes in the apiserver because of potential problems with the updated validation logic. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + +Will be done after beta. In 1.36, we will perform the following test: + +- create a kind cluster with feature gate off + - verify suspend and patching of resources is forbidden + +- create a kind cluster with feature gate on + - verify suspend and patching of resources is allowed. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + +No. + +### Monitoring Requirements + +###### How can an operator determine if the feature is in use by workloads? + +N/A. 
This is not a feature that workloads use directly. + +###### How can someone using this feature know that it is working for their instance? + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [X] Other (treat as last resort) + - Details: Create a suspended job then update the container resource specifications (CPU/memory/GPU/extended resource requests/limits) of the pod template. + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + +N/A + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [x] Metrics + - Metric name: apiserver_request_total[resource=job, group=batch, verb=UPDATE, code=400] + - [Optional] Aggregation method: + - Components exposing the metric: kube-apiserver +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + +N/A + +### Dependencies + +###### Does this feature depend on any specific services running in the cluster? + +No. + +### Scalability + +###### Will enabling / using this feature result in any new API calls? + +The feature itself doesn't generate API calls. But it will allow the +apiserver to accept update requests to mutate container resource specifications +(CPU, memory, GPU, and extended resources) in job pod templates, which will +encourage implementing controllers that do this. + +###### Will enabling / using this feature result in introducing new API types? + +No. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +No. 
+ +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +No. + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? + +Update requests will be rejected. + +###### What are other known failure modes? + +In a multi-master setup, when the cluster has skewed apiservers, some update requests +may get accepted and some may get rejected. + +###### What steps should be taken if SLOs are not being met to determine the problem? + +N/A. + +## Implementation History + +- July 3rd: draft of KEP + +## Drawbacks + +This allows for more mutability of Jobs, particularly around resource specifications which could impact resource planning and scheduling behavior. + +## Alternatives + +NA +## Infrastructure Needed (Optional) + +NA \ No newline at end of file diff --git a/keps/sig-apps/5440-mutable-job-pod-resource-updates/kep.yaml b/keps/sig-apps/5440-mutable-job-pod-resource-updates/kep.yaml new file mode 100644 index 00000000000..80e89bec9dd --- /dev/null +++ b/keps/sig-apps/5440-mutable-job-pod-resource-updates/kep.yaml @@ -0,0 +1,41 @@ +title: Mutable Pod Resources for Suspended Jobs +kep-number: 5440 +authors: + - "@kannon92" +owning-sig: sig-apps +participating-sigs: + - sig-node +status: provisional +creation-date: 2025-06-21 +reviewers: + - "TBD" +approvers: + - "@Huang-Wei" + +see-also: + - "/keps/sig-apps/2232-suspend-jobs" + +# The target maturity stage in the current dev cycle for this KEP. +stage: beta + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.35" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + beta: "v1.35" + stable: "v1.38" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: MutableJobPodResourcesForSuspendedJobs + components: + - kube-apiserver +disable-supported: true + +# The following PRR answers are required at beta release +metrics: +