KEP-5517: DRA Node Allocatable Resources alpha2 changes by pravk03 · Pull Request #6082 · kubernetes/enhancements

pravk03 · 2026-05-15T06:30:22Z

One-line PR description: DRA Node Allocatable Resources update for alpha2

Issue link: DRA: Node Allocatable Resource Requests #5517

Other comments:

k8s-ci-robot · 2026-05-15T06:30:24Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

pravk03 · 2026-05-15T18:06:55Z

/sig node
/cc @johnbelamaric @liggitt @pohly (for DRA API review)
/cc @dom4ha @macsko (for sig-scheduling)
/cc @tallclair @ffromani @kad (for sig-node)
/cc @ndixita @natasha41575 (for IPPR and PLR integrations)

natasha41575

For posterity: we'd previously discussed a potential implementation where we use the allocation manager as the source of truth, by overwriting the allocated pod spec limits to be the aggregate spec limits + DRA resources, and leaving the kuberuntime manager code untouched. But I think the proposal as-is cleanly handles the complex use cases of shared claims. It does however mean we have to be more careful to ensure we cover every part of kubelet that may be making assumptions about the pod allocation.

We might need to revisit some of this when DRA is made mutable, but we don't need to worry about that now.

natasha41575 · 2026-05-20T21:04:30Z

+    *   **Behavior with DRA**: No change. To ensure steady-state reconciliation loops (`computePodResizeAction`)
+        do not trigger unnecessary CRI updates or cgroup resets, Kubelet maintains the internal `actuated`
+        checkpoint strictly limited to standard Spec requests and limits. DRA allocations are excluded from
+        the checkpoint.


This solution is very elegant. However, we need to be mindful that this introduces an intentional, hidden abstraction layer where the internal actuation checkpoint deliberately diverges from the actual cgroup limits. It works perfectly for preventing reconciliation loops, but we should explicitly call out and carefully document this structural divergence.

I've added a note to capture that this divergence is a deliberate design choice.

I understand why the divergence exists. However, this could result in an observability gap. Do you think it makes sense to add the DRA-allocated resources to the PodSpec resources in PodStatus though?

> Do you think it makes sense to add the DRA-allocated resources to the PodSpec resources in PodStatus though?

DRA allocations are currently surfaced in PodStatus under status.nodeAllocatableResourceClaimStatuses and the status limits will also reflect the DRA resources.

Are you suggesting we should also include them in the allocatedResources (e.g., status.containerStatuses[*].allocatedResources), or did you have a different tracking field in mind?

If yes, I think we'd have to nest it under a new field, since the existing allocatedResources field only uses requests (not limits). Do you think it's necessary to duplicate the DRA limits under allocatedResources?

We emit events when a pod is resized, and the event emits the entire pod allocation. Do we emit any events when a new pod is admitted? It might be enough for observability to ensure that when such an event is emitted, the emitted allocation in the event includes DRA resources.

> If yes, I think we'd have to nest it under a new field, since the existing allocatedResources field only uses requests (not limits). Do you think it's necessary to duplicate the DRA limits under allocatedResources?

+1. Because allocatedResources is a flat map (map[ResourceName]resource.Quantity), introducing DRA-specific fields inside it is not possible, and adding a new struct field like status.allocatedDRAResources would overlap redundantly with .status.nodeAllocatableResourceClaimStatuses. I would prefer that the allocatedResources status (at both the pod and container level) remain strictly Spec based, and that the resources status (at both the pod and container level) reflect what is read from the cgroup (which includes the DRA driver's enforcement). To make this clear, we could update the API documentation to state explicitly that allocatedResources does not include DRA-based requests and to add a reference to status.nodeAllocatableResourceClaimStatuses for DRA information.

cc @tallclair
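
For readers following this thread, the divergence under discussion can be sketched in Go as below. This is a minimal illustration, not kubelet code: the `Limits` type and the `cgroupLimits` and `needsResize` helpers are hypothetical names, and the only behavior taken from the thread is that the actuated checkpoint records Spec values while the cgroup also carries the DRA quantities.

```go
package main

import "fmt"

// Limits holds pod-level limits in base units (milliCPU, bytes).
type Limits struct {
	MilliCPU    int64
	MemoryBytes int64
}

// cgroupLimits is what would be written to the pod cgroup in this sketch:
// Spec limits plus the DRA-derived quantities. Hypothetical helper.
func cgroupLimits(spec, dra Limits) Limits {
	return Limits{
		MilliCPU:    spec.MilliCPU + dra.MilliCPU,
		MemoryBytes: spec.MemoryBytes + dra.MemoryBytes,
	}
}

// needsResize mimics the steady-state comparison: it looks only at the
// desired Spec limits and the actuated checkpoint, never at DRA quantities.
func needsResize(desiredSpec, actuated Limits) bool {
	return desiredSpec != actuated
}

func main() {
	spec := Limits{MilliCPU: 1000, MemoryBytes: 1 << 30}
	dra := Limits{MilliCPU: 2000, MemoryBytes: 512 << 20}

	actuated := spec // the checkpoint records Spec values only
	applied := cgroupLimits(spec, dra)

	fmt.Println("resize needed:", needsResize(spec, actuated))
	fmt.Println("checkpoint equals cgroup:", actuated == applied)
}
```

Because the comparison never sees the DRA quantities, an unchanged Spec yields no resize action even though the checkpoint and the applied cgroup values differ, which is the gap the documentation note is meant to make explicit.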

natasha41575 · 2026-05-20T21:25:29Z

+    *   **Behavior with DRA**: No change. Because Kubelet does not update cgroup requests based on DRA claims
+        (keeping CPU shares pure to standard Spec), the `allocated` checkpoint and reported `allocatedResources`
+        remain strictly limited to standard Spec requests and limits. DRA allocations are completely excluded.


Related to my other comment: This solution is simple and elegant but may be a bit unexpected. We are not checkpointing DRA resources as part of the allocation, but DRA resources are inherently evaluated when kubelet performs its allocation (resource fit) checks via canAdmitPod. This means that the DRA resources are practically part of the pod's allocation; just not recorded explicitly in the checkpoint.

This works because DRA resources are immutable. We will need to take care to audit all parts of kubelet that may be making assumptions that the kubelet allocation checkpoint is the source of truth for the entire pod footprint. I think your KEP does a good job covering it, but it should be documented very clearly in the code too.

We'd also have to audit the output messages in Pending / Deferred conditions and events to make sure they are still saying something reasonable and meaningful to users too.

I considered suggesting that we add the DRA resources to the allocation / actuation checkpoint for the sake of cleanly tracking, but given that it's not necessary at this point in time, it's probably not worth it.

Good point. Yes, we rely on the fact that DRA resources are (currently) immutable after allocation and this simplifies the design a lot. I have updated the ### Kubelet Admission Control section to include the above details. I will also make sure to include these design decisions as comments in the code.
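
The point about admission can be made concrete with a small Go sketch. It assumes a simplified, CPU-only fit check; `podFootprint` and `canAdmit` are illustrative names and do not correspond to the real `canAdmitPod` signature. The only idea taken from the thread is that DRA quantities count toward the fit check even though they are not written to the allocation checkpoint.

```go
package main

import "fmt"

// podFootprint separates what is checkpointed from what is only derived.
type podFootprint struct {
	specMilliCPU int64 // recorded in the allocation checkpoint
	draMilliCPU  int64 // derived from resolved DRA claims; not checkpointed
}

// total is the footprint the fit check has to consider.
func (p podFootprint) total() int64 { return p.specMilliCPU + p.draMilliCPU }

// canAdmit reports whether the candidate fits next to the running pods.
func canAdmit(allocatableMilliCPU int64, running []podFootprint, candidate podFootprint) bool {
	used := int64(0)
	for _, p := range running {
		used += p.total()
	}
	return used+candidate.total() <= allocatableMilliCPU
}

func main() {
	running := []podFootprint{
		{specMilliCPU: 2000, draMilliCPU: 2000},
		{specMilliCPU: 1000},
	}
	fits := podFootprint{specMilliCPU: 1000, draMilliCPU: 2000}
	tooBig := podFootprint{specMilliCPU: 1000, draMilliCPU: 3000}

	fmt.Println(canAdmit(8000, running, fits))
	fmt.Println(canAdmit(8000, running, tooBig))
}
```

Any code path that sums only the checkpointed Spec values would undercount the node's usage in this model, which is why the audit mentioned above matters.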

ffromani

made a first full pass. Great work here. The only unavoidable challenge is that the sheer size of this work makes it hard to keep everything in mind. But I'm confident a few more passes will help me.
I may have added comments that actually have answers later in the doc. Sorry about that, I can only say this is another byproduct of the size of the KEP. Please bear with me :)
The part I'm most wary of at the moment (and the main/only reason is my limited experience in the area) is the interaction with IPPR, but I see SMEs already engaging, so I think we're good.

Overall this is great and I didn't spot any obvious issue, but more passes are warranted anyway to ensure the work gets the careful review it deserves.

ffromani · 2026-05-21T14:43:49Z

-* Until Kubelet is made DRA-aware for node allocatable resources (a non-goal for Alpha), QoS and node-level
-  enforcement will not fully reflect DRA allocations. This is an accepted limitation for the initial
-  Alpha scope.
+* While the Kubelet considers DRA for cgroup enforcement, QoS class classification remains purely based on the standard Spec.


true, but not clear why this is a risk and, if so, what's the mitigation

moving my comment from elsewhere to here

As we change the resource model, I feel existing QoS class classification is becoming outdated and no longer honors its original intention. In the past, QoS class could tell you something about how the pod is consuming physical resources and therefore the Kubelet could make a reasonable decision about who to evict, but between this KEP and the proposal to change QoS shape I feel that our direction is making QoS class a bit meaningless / arbitrary.

It's been proposed before, but a mitigation could be to allow users to explicitly set a QoS class. I think revisiting what QoS class means and how it should be enforced should be discussed as a separate effort / KEP though.

cc @tallclair

Thanks @ffromani. I can include a section about the risks of not updating QoS and how they might be addressed.

@natasha41575 I completely agree. This has come up multiple times in the initial discussions on this KEP. Instead of adding even more variables to how QoS class is inferred (container spec, pod-level resources, IPPR, DRA), I would prefer we revisit the decision to implicitly infer QoS and explore making it explicitly definable in the pod spec.

@kad @dchen1107

@natasha41575 from my side, the current implicit determination of QoS is not a good design overall, but it is deep in tech debt and can't easily be changed without breakage. My intention a few years back with KEP 3008 was exactly to make QoS explicitly declarable, and this KEP would be able to benefit from it... but for the sake of this KEP's scope, I think it is better not to touch QoS calculations. Let it stay compatible with the existing ecosystem.

> but for sake of scope of this kep, I think better not to touch QoS calculations. let it be compatible with existing ecosystem.

100% agree that it should be left out of scope of this KEP.
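
To illustrate what "purely based on the standard Spec" means in this thread, here is a simplified Go sketch of implicit QoS inference. It is an approximation written for this discussion, not the kubelet implementation: the `Container` type and `qosClass` function are hypothetical, only `cpu` and `memory` are considered, and corner cases such as zero quantities are ignored. DRA quantities deliberately do not appear as an input.

```go
package main

import "fmt"

// Container carries Spec requests and limits in base units.
type Container struct {
	Requests map[string]int64
	Limits   map[string]int64
}

// qosClass infers a class from Spec values only (simplified).
func qosClass(containers []Container) string {
	anySet, guaranteed := false, true
	for _, c := range containers {
		for _, res := range []string{"cpu", "memory"} {
			req, hasReq := c.Requests[res]
			lim, hasLim := c.Limits[res]
			if hasReq || hasLim {
				anySet = true
			}
			// Guaranteed needs a limit for every resource, with any
			// explicit request equal to that limit.
			if !hasLim || (hasReq && req != lim) {
				guaranteed = false
			}
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	pod := []Container{{
		Requests: map[string]int64{"cpu": 500, "memory": 1 << 30},
		Limits:   map[string]int64{"cpu": 1000, "memory": 1 << 30},
	}}
	draMilliCPU := int64(2000) // exclusive CPUs from a claim; not consulted
	fmt.Println(qosClass(pod), "with", draMilliCPU, "mCPU from DRA ignored")
}
```

In this model a pod that receives exclusive CPUs through a claim is still classified from its Spec alone, which is the behavior the reviewers agree to leave unchanged within this KEP.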

ffromani · 2026-05-22T08:03:00Z

+    containers: ["c1", "c2"]
+    overhead:
+    - name: cpu
+      quantity: "2" # Pre-resolved total sum: 1 CPU per pod + (500m * 2 containers)


the example made me realize this value is slightly confusing because we kinda conflate the originating values from NodeAllocatableOverhead with their computed value quantity, which is IMO more akin to a status field, or anyway a derived value of sorts.
Honestly: maybe it's just me. I can't say if it's worth another level of nesting to separate the computed values. Probably not?

Good point. I don't have a strong preference, but here are a few more options we could consider:

We could remove the quantity field entirely from this struct. perPodReference and perContainerReference are sufficient to calculate the cgroup settings. But this forces us to do the math to figure out the total allocated quantity: perPodReference + numContainerRef * perContainerReference.

Would renaming the quantity field make it more obvious?

Alternatives: total, allocated

cc @liggitt (for API review feedback on this)

After thinking a bit more, having duplicate information is confusing. I have removed the quantity field; we can get all the information from the perPodReference and perContainerReference fields.
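
The arithmetic that replaces the removed field is small enough to show directly. The Go sketch below uses a hypothetical `totalOverheadMilli` helper and plain milli-unit integers instead of `resource.Quantity`, and reuses the numbers from the quoted example (1 CPU per pod, 500m per container, two containers).

```go
package main

import "fmt"

// totalOverheadMilli returns perPodReference + numContainers * perContainerReference,
// with all quantities expressed in milli-units.
func totalOverheadMilli(perPodRef, perContainerRef int64, numContainers int) int64 {
	return perPodRef + int64(numContainers)*perContainerRef
}

func main() {
	containers := []string{"c1", "c2"}
	total := totalOverheadMilli(1000, 500, len(containers))
	fmt.Printf("cpu overhead: %dm\n", total) // 1000m + 2*500m
}
```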

pravk03 · 2026-05-26T16:18:08Z

+2.  **Kube-Scheduler Changes**: Modifications in `NodeResourcesFit` and `DynamicResources` plugins to synchronize node resource usage tracking, delegating authoritative node-fit checks to the `DynamicResources` plugin when a pod utilizes DRA claims.
+3.  **Kubelet Changes**: Updates in Kubelet to take into account resources allocated through DRA in the cgroup enforcement.
+
+### Conceptual Mapping: Pod Spec Requests and Limits with DRA


@johnbelamaric @liggitt FYI. I have updated the KEP and added this section since our last discussion. The DRA based allocation would be considered as both requests and limits at the scheduler and the node.

pravk03 · 2026-05-26T17:49:36Z

/cc @mrunalp (for SIG Node approvals)

pravk03 · 2026-05-26T21:08:52Z

/assign @tallclair

johnbelamaric · 2026-05-27T21:14:48Z

+    // blocks sharing direct-mapped device claims across multiple pods.
+    // +optional
+    // +oneOf=MappingType
+    Direct *NodeAllocatableDirectMapping `json:"direct,omitempty" protobuf:"bytes,1,opt,name=direct"`


I think this name is OK. An alternative could be to name the parent field nodeAllocatableResources and rename "direct" to "mapping", something like:

    device:
      nodeAllocatableResources:
        "cpu":
          mapping:
            capacityKey: "dra.example.com/cores"
            allocationMultiplier: "2"
        "memory":
          overhead:
            perPodReference: "1Gi"

Thanks. Good suggestion. I am ok with either option here.

johnbelamaric · 2026-05-27T21:36:45Z

-fields, such as `AccountingPolicy`, to the `NodeAllocatableResourceMapping` struct to specify the desired policy. The impact of 
-these accounting policies on existing features like Pod Level Resources and In-Place Pod Vertical Scaling also
-needs more consideration.
+Once the scheduler selects a node and resolves DRA claim allocations, it sums the pod spec standard requests with the newly calculated DRA cgroup-burst resource requests. It evaluates this unified footprint against the remaining namespace `ResourceQuota`. If the computed usage exceeds the remaining quota, the node is filtered out during the scheduling cycle.


It seems quota probably should be taken into account during the Prioritized List device evaluation process. cc @mortent @pohly
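
The quota evaluation described in the quoted KEP text can be sketched as follows. This is an assumption-laden illustration: `ResourceList` here is a plain map of integers rather than the Kubernetes type, and `overQuota` is a hypothetical helper, not a scheduler plugin API. It only shows the comparison of the unified footprint (Spec requests plus DRA-derived requests) against the remaining quota.

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceList maps a resource name to a quantity in base units.
type ResourceList map[string]int64

// overQuota returns the resources for which current usage plus the pod's
// unified footprint would exceed the hard quota, sorted by name.
func overQuota(specRequests, draRequests, used, hard ResourceList) []string {
	var exceeded []string
	for name, limit := range hard {
		footprint := specRequests[name] + draRequests[name]
		if used[name]+footprint > limit {
			exceeded = append(exceeded, name)
		}
	}
	sort.Strings(exceeded)
	return exceeded
}

func main() {
	spec := ResourceList{"cpu": 1000, "memory": 1 << 30}
	dra := ResourceList{"cpu": 2000}
	used := ResourceList{"cpu": 6000, "memory": 4 << 30}
	hard := ResourceList{"cpu": 8000, "memory": 8 << 30}

	fmt.Println("with DRA:", overQuota(spec, dra, used, hard))
	fmt.Println("spec only:", overQuota(spec, nil, used, hard))
}
```

Since the DRA-derived part depends on which devices are chosen, the same check would presumably have to run per candidate when several device alternatives are evaluated, which is the concern raised for Prioritized List.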

ffromani

did another pass on the node enforcement portion.
Apologies if I'm missing something in the doc about disabling the CPU quota for integral exclusive CPUs

ffromani · 2026-05-28T11:51:42Z

+CPU Shares      = MilliCPUToShares( Sum(Spec.Requests[cpu]) + DRADirectMapped(cpu) + DRAOverheadMappedPodTotal(cpu) )
+CPU Quota       = Sum(Spec.Limits[cpu]) + DRADirectMapped(cpu) + DRAOverheadMappedPodTotal(cpu)
+Memory Limit    = Sum(Spec.Limits[memory]) + DRADirectMapped(memory) + DRAOverheadMappedPodTotal(memory)
+HugePages Limit = Sum(Spec.Limits[hugepages-<size>]) + DRADirectMapped(hugepages-<size>) + DRAOverheadMappedPodTotal(hugepages-<size>)


I'm torn on whether we should mention that the hugepages limit translates to the hugetlb.<size>.max setting AND the hugetlb.<size>.rsvd.max setting. Could be just a note somewhere once. Probably not worth it?
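
The pod-level formulas quoted in this thread can be read as the following Go sketch. The conversion constants (1024 shares per CPU, a 100ms period, the minimum and maximum bounds) are the conventional CFS values as I understand them and should be treated as assumptions; `PodResources`, `milliCPUToShares`, `milliCPUToQuotaUs` and `podCgroup` are local illustrative names, not kubelet functions. The sketch also assumes every container sets a CPU limit, so the case where the quota is skipped is not modeled.

```go
package main

import "fmt"

const (
	minShares     = 2
	maxShares     = 262144
	sharesPerCPU  = 1024
	milliPerCPU   = 1000
	quotaPeriodUs = 100000 // assumed 100ms CFS period
	minQuotaUs    = 1000
)

// PodResources carries pod-level sums in milliCPU and bytes.
type PodResources struct {
	SpecCPURequests, SpecCPULimits int64
	SpecMemoryLimit                int64
	DRADirectCPU, DRAOverheadCPU   int64
	DRADirectMem, DRAOverheadMem   int64
}

// milliCPUToShares converts milliCPU to CPU shares, clamped to the bounds.
func milliCPUToShares(milliCPU int64) uint64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := milliCPU * sharesPerCPU / milliPerCPU
	if shares < minShares {
		return minShares
	}
	if shares > maxShares {
		return maxShares
	}
	return uint64(shares)
}

// milliCPUToQuotaUs converts milliCPU to a CFS quota for the assumed period.
func milliCPUToQuotaUs(milliCPU int64) int64 {
	if milliCPU == 0 {
		return 0
	}
	quota := milliCPU * quotaPeriodUs / milliPerCPU
	if quota < minQuotaUs {
		return minQuotaUs
	}
	return quota
}

// podCgroup applies the quoted formulas: Spec sums plus both DRA terms.
func podCgroup(r PodResources) (shares uint64, quotaUs, memLimit int64) {
	draCPU := r.DRADirectCPU + r.DRAOverheadCPU
	shares = milliCPUToShares(r.SpecCPURequests + draCPU)
	quotaUs = milliCPUToQuotaUs(r.SpecCPULimits + draCPU)
	memLimit = r.SpecMemoryLimit + r.DRADirectMem + r.DRAOverheadMem
	return shares, quotaUs, memLimit
}

func main() {
	shares, quota, mem := podCgroup(PodResources{
		SpecCPURequests: 500, SpecCPULimits: 1000, SpecMemoryLimit: 1 << 30,
		DRADirectCPU: 2000, DRAOverheadCPU: 500, DRAOverheadMem: 256 << 20,
	})
	fmt.Println("shares:", shares, "quota(us):", quota, "memory limit:", mem)
}
```

The hugepages line follows the same additive shape as the memory limit, so it is omitted here.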

ffromani · 2026-05-28T12:23:59Z

+
+###### Long-Term Mitigation - Explicit QoS Class
+
+A robust long-term solution would be to allow workloads to declare an explicit QoS class directly in the Pod Spec, rather than relying on implicit derivations inside Kubelet. 


question: is KEP 3008 relevant here also? cc @marquiz @kad

I wasn't aware of this work. Thanks for sharing. Yes, it seems very relevant, and I will read more.

pravk03 · 2026-05-28T18:03:33Z

/assign @dom4ha @johnbelamaric @ffromani
/label api-review

k8s-ci-robot · 2026-05-29T01:17:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pravk03
Once this PR has been reviewed and has the lgtm label, please ask for approval from dom4ha. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

keps/sig-scheduling/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ffromani · 2026-06-03T16:43:21Z

+**Risk**: With Kubelet enforcing quotas while the DRA driver allocates exclusive physical CPUs, the workload could experience the same throttling issues as in [issue 70585](https://github.com/kubernetes/kubernetes/issues/70585).
+**Mitigation**: While the DRA driver can use container-level hooks to override Kubelet's defaults and set the container cgroup to unlimited, it cannot modify Kubelet-managed 
+pod-level parent cgroups. To mitigate this, the container requesting exclusive CPUs through the DRA claim can skip setting limits in the container spec. Under this configuration, Kubelet's cgroup manager natively skips quota configuration at both container and pod levels and they remain unlimited (`cpu.max = -1`).


Thanks for adding this section. While the mitigation won't avoid the regression, I concur there's not much we can do in the scope of this already massive KEP. We can perhaps revisit this point in the beta cycle, because after all it's a regression whose best mitigation creates an awkward UX.

I believe a proper fix would involve some more handshakes between the DRA drivers and the kubelet, handshakes which we will need to design.

Yes, I agree. I left a note about the long term mitigation.

kad · 2026-06-04T14:52:00Z

+    // Direct is used when the device directly models a node allocatable resource like standard CPU or memory
+    // (e.g., with a CPU DRA driver). The calculated quantity is accounted for exactly once per claim instance
+    // on the node. To prevent node cgroup isolation friction, the scheduler explicitly
+    // blocks sharing direct-mapped device claims across multiple pods.


This blocking is not really good. Use case: fabric-attached memory. It maps practically 1:1 to the native memory resource and can be shared between pods that share the same claim, but it is not an "overhead" type, as it neither "overflows" nor is "shared" with the rest of the system (separate NUMA node).

The reason for blocking sharing right now is that we currently do not have a mechanism to update cgroup settings for already-running pods when a new pod attaches to an existing claim.

The topic of sharable and non-sharable claims seems to extend beyond node allocatable devices, as it came up in a few other threads as well.

@pohly @johnbelamaric Do you think it's worth exploring a common solution for configuring this? In this specific case, we would need the configuration at the device level, and another control we want is the ability to prevent sharing across pods, or even within containers of the same pod.

Seems like it keeps coming up. We have allowMultipleAllocations at the device level for sharing devices between claims. This would be some control that decides if a claim is sharable between containers and pods. Maybe we need a sharingPolicy: {None, SamePod, MultiPod} or something, at the Device level, that controls the shareability of claims that contain that device?

this would be a separate KEP, I think
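
As a discussion aid only, the `sharingPolicy` idea floated above might look like the Go sketch below. Nothing here exists in the DRA API: the `SharingPolicy` type, its three values, the `consumer` struct and `canAttach` are hypothetical, taken from the `{None, SamePod, MultiPod}` suggestion, and the semantics (for example, that `None` allows exactly one consumer) are my assumption.

```go
package main

import "fmt"

// SharingPolicy is a hypothetical device-level knob for claim shareability.
type SharingPolicy string

const (
	SharingNone     SharingPolicy = "None"     // a single consumer only
	SharingSamePod  SharingPolicy = "SamePod"  // containers of one pod
	SharingMultiPod SharingPolicy = "MultiPod" // any pod may attach
)

// consumer identifies one user of a claim.
type consumer struct {
	podUID    string
	container string
}

// canAttach reports whether next may join the existing consumers of a claim.
func canAttach(policy SharingPolicy, existing []consumer, next consumer) bool {
	if len(existing) == 0 {
		return true
	}
	switch policy {
	case SharingMultiPod:
		return true
	case SharingSamePod:
		for _, e := range existing {
			if e.podUID != next.podUID {
				return false
			}
		}
		return true
	default: // SharingNone and unknown values are treated as exclusive
		return false
	}
}

func main() {
	existing := []consumer{{podUID: "pod-a", container: "c1"}}
	fmt.Println(canAttach(SharingSamePod, existing, consumer{"pod-a", "c2"}))
	fmt.Println(canAttach(SharingSamePod, existing, consumer{"pod-b", "c1"}))
	fmt.Println(canAttach(SharingNone, existing, consumer{"pod-a", "c2"}))
}
```

Under this reading, the current behavior for direct-mapped devices would correspond to `SamePod` or `None`, and the fabric-attached memory use case would need `MultiPod`.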

ffromani · 2026-06-04T14:59:22Z

provisional LGTM from my side for the node enforcement portion. Will have a final pass ASAP, hopefully the end of this week.

ffromani · 2026-06-05T10:20:19Z

LGTM for the node enforcement portion. This is a good incremental update.

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 15, 2026

k8s-ci-robot requested review from dom4ha and macsko May 15, 2026 06:30

k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 15, 2026

github-project-automation Bot added this to SIG Scheduling May 15, 2026

github-project-automation Bot moved this to Needs Triage in SIG Scheduling May 15, 2026

k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 15, 2026

pravk03 mentioned this pull request May 15, 2026

DRA: Node Allocatable Resource Requests #5517

pravk03 force-pushed the native-dra-alpha2 branch 2 times, most recently from a1ba25a to e2e59ce Compare May 15, 2026 17:55

pravk03 marked this pull request as ready for review May 15, 2026 18:00

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2026

k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 15, 2026

pravk03 force-pushed the native-dra-alpha2 branch 4 times, most recently from 2961cb8 to 78582e3 Compare May 19, 2026 23:42

natasha41575 reviewed May 20, 2026

pravk03 force-pushed the native-dra-alpha2 branch 3 times, most recently from c98aec5 to 2354869 Compare May 21, 2026 01:25

bart0sh reviewed May 21, 2026

pravk03 force-pushed the native-dra-alpha2 branch from 2354869 to 4aecbc2 Compare May 21, 2026 17:14

ffromani reviewed May 22, 2026

natasha41575 reviewed May 22, 2026

pravk03 force-pushed the native-dra-alpha2 branch from 4aecbc2 to d50b636 Compare May 22, 2026 23:23

pravk03 commented May 26, 2026

k8s-ci-robot assigned tallclair May 26, 2026

pravk03 force-pushed the native-dra-alpha2 branch 2 times, most recently from 2d86e86 to f811f47 Compare May 27, 2026 02:23

pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation May 27, 2026

johnbelamaric reviewed May 27, 2026

pravk03 force-pushed the native-dra-alpha2 branch from f811f47 to 61a223e Compare May 28, 2026 00:36

ffromani reviewed May 28, 2026

k8s-ci-robot assigned dom4ha May 28, 2026

k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label May 28, 2026

k8s-ci-robot assigned ffromani and johnbelamaric May 28, 2026

github-project-automation Bot added this to API Reviews May 28, 2026

pravk03 force-pushed the native-dra-alpha2 branch from 61a223e to 47dbb5e Compare May 29, 2026 01:17

pravk03 force-pushed the native-dra-alpha2 branch 2 times, most recently from 8e162eb to eb1a85e Compare May 29, 2026 17:24

bart0sh reviewed Jun 3, 2026

bart0sh reviewed Jun 3, 2026

bart0sh reviewed Jun 3, 2026

ffromani reviewed Jun 3, 2026

pravk03 force-pushed the native-dra-alpha2 branch from eb1a85e to 18c6a4a Compare June 3, 2026 22:39

KEP-5517: DRA Node Allocatable Resources alpha2 changes

59429a1

pravk03 force-pushed the native-dra-alpha2 branch from 18c6a4a to 59429a1 Compare June 3, 2026 22:56

kad reviewed Jun 4, 2026
