KEP-5517: DRA Node Allocatable Resources alpha2 changes#6082
Conversation
pravk03
commented
May 15, 2026
- One-line PR description: DRA Node Allocatable Resources update for alpha2
- Issue link: DRA: Node Allocatable Resource Requests #5517
- Other comments:
|
Skipping CI for Draft Pull Request. |
a1ba25a to
e2e59ce
Compare
|
/sig node |
2961cb8 to
78582e3
Compare
natasha41575
left a comment
There was a problem hiding this comment.
For posterity: we'd previously discussed a potential implementation where we use the allocation manager as the source of truth, by overwriting the allocated pod spec limits to be the aggregate spec limits + dra resources, and leaving the kuberuntime manager code untouched. But I think the proposal as-is cleanly handles the complex use cases of shared claims. It does however mean we have to be more careful to ensure we cover every part of kubelet that may be taking assumptions about the pod allocation.
We might need to revisit some of this when dra is made mutable, but we don't need to worry about that now.
| * **Behavior with DRA**: No change. To ensure steady-state reconciliation loops (`computePodResizeAction`) | ||
| do not trigger unnecessary CRI updates or cgroup resets, Kubelet maintains the internal `actuated` | ||
| checkpoint strictly limited to standard Spec requests and limits. DRA allocations are excluded from | ||
| the checkpoint. |
There was a problem hiding this comment.
This solution is very elegant. However, we need to be mindful that this introduces an intentional, hidden abstraction layer where the internal actuation checkpoint deliberately diverges from the actual cgroup limits. It works perfectly for preventing reconciliation loops, but we should explicitly call out and carefully document this structural divergence.
There was a problem hiding this comment.
I've added a note to capture that this divergence is a deliberate design choice.
There was a problem hiding this comment.
I understand why the divergence exist. However this could result in observability gap. Do you think it makes sense to add the DRA-allocated resources to the PodSpec resources in PodStatus though?
There was a problem hiding this comment.
Do you think it makes sense to add the DRA-allocated resources to the PodSpec resources in PodStatus though?
DRA allocations are currently surfaced in PodStatus under status.nodeAllocatableResourceClaimStatuses and the status limits will also reflect the DRA resources.
Are you suggesting we should also include them in the allocatedResources (e.g., status.containerStatuses[*].allocatedResources), or did you have a different tracking field in mind?
If yes, I think we'd have to nest it under a new field, since existing allocatedResources field only use requests (not limits). Do you think it's necessary to duplicate the DRA limits under allocatedResources?
We emit events when a pod is resized, and the event emits the entire pod allocation. Do we emit any events when a new pod is admitted? It might be enough for observability to ensure that when such an event is emitted, the emitted allocation in the event includes DRA resources.
There was a problem hiding this comment.
If yes, I think we'd have to nest it under a new field, since existing allocatedResources field only use requests (not limits). Do you think it's necessary to duplicate the DRA limits under allocatedResources?
+1. Because allocatedResources is a flat map (map[ResourceName]resource.Quantity), introducing DRA specific fields inside it is not possible, and adding a new struct field like status.allocatedDRAResources would introduce a redundant overlap with .status.nodeAllocatableResourceClaimStatuses. I would prefer for allocatedResources status (at both the pod and container level) to remain strictly Spec based, and resources status (at both the pod and container level) reflects what is read from the cgroup (which includes the DRA driver's enforcement). To make this clear, we could update the API documentation to explicitly state that allocatedResources does not include DRA-based requests and a reference to status.nodeAllocatableResourceClaimStatuses for DRA information.
| * **Behavior with DRA**: No change. Because Kubelet does not update cgroup requests based on DRA claims | ||
| (keeping CPU shares pure to standard Spec), the `allocated` checkpoint and reported `allocatedResources` | ||
| remain strictly limited to standard Spec requests and limits. DRA allocations are completely excluded. |
There was a problem hiding this comment.
Related to my other comment: This solution is simple and elegant but may be a bit unexpected. We are not checkpointing DRA resources as part of the allocation, but DRA resources are inherently evaluated when kubelet performs its allocation (resource fit) checks via canAdmitPod. This means that the DRA resources are practically part of the pod's allocation; just not recorded explicitly in the checkpoint.
This works because DRA resources are immutable. We will need to take care to audit all parts of kubelet that may be making assumptions that the kubelet allocation checkpoint is the source of truth for the entire pod footprint. I think your KEP does a good job covering it, but it should be documented very clearly in the code too.
We'd also have to audit the output messages in Pending / Deferred conditions and events to make sure they are still saying something reasonable and meaningful to users too.
I considered suggesting that we add the DRA resources to the allocation / actuation checkpoint for the sake of cleanly tracking, but given that it's not necessary at this point in time, it's probably not worth it.
There was a problem hiding this comment.
Good point. Yes, we rely on the fact that DRA resources are (currently) immutable after allocation and this simplifies the design a lot. I have updated the ### Kubelet Admission Control section to include the above details. I will also make sure to include these design decisions as comments in the code.
c98aec5 to
2354869
Compare
2354869 to
4aecbc2
Compare
ffromani
left a comment
There was a problem hiding this comment.
made a first full pass. Great work here. The only unavoidable challenge is that the sheer size of this work makes it hard to keep everything in mind. But I'm confident few more passes will help me.
I may have added comments that actually have answers later in the doc. Sorry about that, I can only say this is another byproduct of the size of the KEP. Please bear with me :)
The part I'm most wary atm (and the main/only reason is my limited experience in the area) is interaction with IPPR, but I see SMEs already engaging, so I think we're good it.
Overall this is great and I didn't spot any obvious issue, but more passes are warranted anyway to ensure the work gets the careful review it deserves.
| * Until Kubelet is made DRA-aware for node allocatable resources (a non-goal for Alpha), QoS and node-level | ||
| enforcement will not fully reflect DRA allocations. This is an accepted limitation for the initial | ||
| Alpha scope. | ||
| * While the Kubelet considers DRA for cgroup enforcement, QoS class classification remains purely based on the standard Spec. |
There was a problem hiding this comment.
true, but not clear why this is a risk and, if so, what's the mitigation
There was a problem hiding this comment.
moving my comment from elsewhere to here
As we change the resource model, I feel existing QoS class classification is becoming outdated and no longer honors its original intention. In the past, QoS class could tell you something about how the pod is consuming physical resources and therefore the Kubelet could make a reasonable decision about who to evict, but between this KEP and the proposal to change QoS shape I feel that our direction is making QoS class a bit meaningless / arbitrary.
It's been proposed before, but a mitigation could be to allow users to explicitly set a QoS class. I think revisiting what QoS class means and how it should be enforced should be discussed as a separate effort / KEP though.
cc @tallclair
There was a problem hiding this comment.
Thanks @ffromani. I can include a section about risks of not updating QOS and potential sectiosn.
@natasha41575 I completely agree. This has come up multiple times in the initial discussions on this KEP. Instead of adding even more variables to how QoS class is inferred (container spec, pod-level resources, IPPR, DRA), I would prefer we revisit the decision to implicitly infer QoS and explore making it explicitly definable in the pod spec.
There was a problem hiding this comment.
@natasha41575 from my side, the current implicit determination of QoS is not a good design in overall, but it is something that is deep in tech debt and can't be easily changed without breakages. My intention few years back with KEP 3008 was exactly to have QoS explicitly declarable, and this KEP would be able to benefit from it... but for sake of scope of this kep, I think better not to touch QoS calculations. let it be compatible with existing ecosystem.
There was a problem hiding this comment.
but for sake of scope of this kep, I think better not to touch QoS calculations. let it be compatible with existing ecosystem.
100% agree that it should be left out of scope of this KEP.
| containers: ["c1", "c2"] | ||
| overhead: | ||
| - name: cpu | ||
| quantity: "2" # Pre-resolved total sum: 1 CPU per pod + (500m * 2 containers) |
There was a problem hiding this comment.
the example made me realize this value is slightly confusing because we kinda conflate the originating values from NodeAllocatableOverhead with their computed value quantity which is IMO more akin a status field or anyway a derived value of sorts.
Honestly: maybe it's just me. I can't say if it's worth another level of nesting to separate the computed values. Probably not?
There was a problem hiding this comment.
Good point. I don't have a strong preference, but here are a few more options we could consider:
- We could remove the
quantityfield entirely from this strict.perPodReferenceandperContainerReferenceare sufficient to calculate the cgroup settings. But this forces us to do the math to figure out the total allocated quantity:perPodReference + numContianerRef * perContianerReference. - Would renaming the
quantityfield make it more obvious ?- Alternatives: total, allocated
cc @liggitt (for API review feedback on this)
There was a problem hiding this comment.
After thinking a bit more, having duplicate information is confusing. I have removed the quantity field, we can get all the information from perPodReference and perContainerReference fields.
| * Until Kubelet is made DRA-aware for node allocatable resources (a non-goal for Alpha), QoS and node-level | ||
| enforcement will not fully reflect DRA allocations. This is an accepted limitation for the initial | ||
| Alpha scope. | ||
| * While the Kubelet considers DRA for cgroup enforcement, QoS class classification remains purely based on the standard Spec. |
There was a problem hiding this comment.
moving my comment from elsewhere to here
As we change the resource model, I feel existing QoS class classification is becoming outdated and no longer honors its original intention. In the past, QoS class could tell you something about how the pod is consuming physical resources and therefore the Kubelet could make a reasonable decision about who to evict, but between this KEP and the proposal to change QoS shape I feel that our direction is making QoS class a bit meaningless / arbitrary.
It's been proposed before, but a mitigation could be to allow users to explicitly set a QoS class. I think revisiting what QoS class means and how it should be enforced should be discussed as a separate effort / KEP though.
cc @tallclair
4aecbc2 to
d50b636
Compare
| 2. **Kube-Scheduler Changes**: Modifications in `NodeResourcesFit` and `DynamicResources` plugins to synchronize node resource usage tracking, delegating authoritative node-fit checks to the `DynamicResources` plugin when a pod utilizes DRA claims. | ||
| 3. **Kubelet Changes**: Updates in Kubelet to take into account resources allocated through DRA in the cgroup enforcement. | ||
|
|
||
| ### Conceptual Mapping: Pod Spec Requests and Limits with DRA |
There was a problem hiding this comment.
@johnbelamaric @liggitt FYI. I have updated the KEP and added this section since our last discussion. The DRA based allocation would be considered as both requests and limits at the scheduler and the node.
|
/cc @mrunalp (for SIG Node approvals) |
|
/assign @tallclair |
2d86e86 to
f811f47
Compare
| // blocks sharing direct-mapped device claims across multiple pods. | ||
| // +optional | ||
| // +oneOf=MappingType | ||
| Direct *NodeAllocatableDirectMapping `json:"direct,omitempty" protobuf:"bytes,1,opt,name=direct"` |
There was a problem hiding this comment.
I think this name is OK. An alternative could be to name the parent field nodeAllocatableResources and rename "direct" to "mapping", something like:
device:
nodeAllocatableResources:
"cpu":
mapping:
capacityKey: "dra.example.com/cores"
allocationMultiplier: "2"
"memory":
overhead:
perPodReference: "1Gi"There was a problem hiding this comment.
Thanks. Good suggestion. I am ok with either option here.
| fields, such as `AccountingPolicy`, to the `NodeAllocatableResourceMapping` struct to specify the desired policy. The impact of | ||
| these accounting policies on existing features like Pod Level Resources and In-Place Pod Vertical Scaling also | ||
| needs more consideration. | ||
| Once the scheduler selects a node and resolves DRA claim allocations, it sums the pod spec standard requests with the newly calculated DRA cgroup-burst resource requests. It evaluates this unified footprint against the remaining namespace `ResourceQuota`. If the computed usage exceeds the remaining quota, the node is filtered out during the scheduling cycle. |
f811f47 to
61a223e
Compare
ffromani
left a comment
There was a problem hiding this comment.
did another pass to the node enforcement portion.
Apologies if I'm missing something in the doc about the cpu quota disable for integral exclusive CPUS
| CPU Shares = MilliCPUToShares( Sum(Spec.Requests[cpu]) + DRADirectMapped(cpu) + DRAOverheadMappedPodTotal(cpu) ) | ||
| CPU Quota = Sum(Spec.Limits[cpu]) + DRADirectMapped(cpu) + DRAOverheadMappedPodTotal(cpu) | ||
| Memory Limit = Sum(Spec.Limits[memory]) + DRADirectMapped(memory) + DRAOverheadMappedPodTotal(memory) | ||
| HugePages Limit = Sum(Spec.Limits[hugepages-<size>]) + DRADirectMapped(hugepages-<size>) + DRAOverheadMappedPodTotal(hugepages-<size>) |
There was a problem hiding this comment.
I'm torn if we should mention that hugepages limit translates to the hugetlb.<size>.max setting AND the hugetlb.<size>.rsvd.max setting. Could be just a note somewhere once. Probably not worth?
|
|
||
| ###### Long-Term Mitigation - Explicit QoS Class | ||
|
|
||
| A robust long-term solution would be to allow workloads to declare an explicit QoS class directly in the Pod Spec, rather than relying on implicit derivations inside Kubelet. |
There was a problem hiding this comment.
I wasn't aware of this work. Thanks for sharing. Yes, It seems very relevant, and I will read more.
|
/assign @dom4ha @johnbelamaric @ffromani |
61a223e to
47dbb5e
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: pravk03 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
8e162eb to
eb1a85e
Compare
| **Risk**: With Kubelet enforcing quotas while the DRA driver allocates exclusive physical CPUs, the workload could experience the same throttling issues as in [issue 70585](https://github.com/kubernetes/kubernetes/issues/70585). | ||
| **Mitigation**: While the DRA driver can use container-level hooks to override Kubelet's defaults and set the container cgroup to unlimited, it cannot modify Kubelet-managed | ||
| pod-level parent cgroups. To mitigate this, the container requesting exclusive CPUs through the DRA claim can skip setting limits in the container spec. Under this configuration, Kubelet's cgroup manager natively skips quota configuration at both container and pod levels and they remain unlimited (`cpu.max = -1`). |
There was a problem hiding this comment.
Thanks for adding this section. While the mitigation won't avoid the regression, I concur there's not much we can do in the scope of this already massive KEP. We can perhaps revisit this point in the beta cycle, because after all it's a regression whose best mitigation creates a awkward UX.
I believe a proper fix would involve some more handshakes between the DRA drivers and the kubelet, handshakes which we will need to design.
There was a problem hiding this comment.
Yes, I agree. I left a note about the long term mitigation.
eb1a85e to
18c6a4a
Compare
18c6a4a to
59429a1
Compare
| // Direct is used when the device directly models a node allocatable resource like standard CPU or memory | ||
| // (e.g., with a CPU DRA driver). The calculated quantity is accounted for exactly once per claim instance | ||
| // on the node. To prevent node cgroup isolation friction, the scheduler explicitly | ||
| // blocks sharing direct-mapped device claims across multiple pods. |
There was a problem hiding this comment.
This blocking is not really good. Use case: fabric attached memory. It is mapped 1:1 practically to native memory resource, it can be shared between pods that are sharing same claim, but it is not an "overhead" type as it not "overflows" or not "shared" with rest of the system (separate NUMA node).
There was a problem hiding this comment.
The reason for blocking sharing right now is that we currently do not have a mechanism to update cgroup settings for already-running pods when a new pod attaches to an existing claim.
The topic of sharable and non-sharable claims seems to extend beyond node allocatable devices, as it came up in few other threads as well.
@pohly @johnbelamaric Do you think its worth exploring a common solution for configuring this ?. In this specific case, we would need the configuration at the device level, and another control we want is the ability to prevent sharing across pods, or even within containers of the same pod.
There was a problem hiding this comment.
Seems like it keeps coming up. We have allowMultipleAllocations at the device level for sharing devices between claims. This would be some control that decides if a claim is sharable between containers and pods. Maybe we need a sharingPolicy: {None, SamePod, MultiPod} or something, at the Device level, that controls the shareability of claims that contain that device?
There was a problem hiding this comment.
this would be a separate KEP, I think
|
provisional LGTM from my side for the node enforcement portion. Will have a final pass ASAP, hopefully the end of this week. |
|
LGTM for the node enforcement portion. This is a good incremental update. |