
Commit 5db2605

Addressed more comments
1 parent 2c98d9d commit 5db2605

File tree

1 file changed (+46 -29)
  • keps/sig-scheduling/5194-reserved-for-workloads


keps/sig-scheduling/5194-reserved-for-workloads/README.md

Lines changed: 46 additions & 29 deletions
@@ -336,7 +336,7 @@ references a `ResourceClaim` with an empty `status.ReservedFor` list, it knows t
 is the first pod that will be consuming the claim.

 If the `spec.ReservedFor` field in the ResourceClaim is not set, the scheduler will handle
-the `ResourceClaim` in the same was as now, and will add the `Pod` to the `ReservedFor` list
+the `ResourceClaim` in the same way as now, and will add the `Pod` to the `ReservedFor` list
 if devices could be allocated for the claim. Any additional pods that reference the `ResourceClaim`
 will also be added to the list.
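To make the allocation-time rule above concrete, here is a minimal sketch of the decision the scheduler makes when reserving a claim for a pod. The local types and the `reserve` helper are simplified stand-ins for illustration only; the real logic lives in the DRA scheduler plugin and the `resource.k8s.io` API, and the `spec.ReservedFor` field is the one proposed by this KEP.

```go
package sketch

// ConsumerReference is a hypothetical, pared-down version of a ReservedFor entry.
type ConsumerReference struct {
	APIGroup string // empty for core resources such as Pods
	Resource string // plural resource name, e.g. "pods" or "jobs"
	Name     string
	UID      string
}

// Claim is a pared-down ResourceClaim with the proposed spec.ReservedFor field.
type Claim struct {
	SpecReservedFor   *ConsumerReference  // proposed spec.ReservedFor: the workload consuming the claim
	StatusReservedFor []ConsumerReference // existing status.ReservedFor list
}

// reserve sketches the scheduler-side rule: if spec.ReservedFor is set, record
// that single workload reference once (when the list is still empty) instead of
// individual pods; otherwise fall back to today's behavior and append each pod.
func reserve(claim *Claim, pod ConsumerReference) {
	if claim.SpecReservedFor != nil {
		if len(claim.StatusReservedFor) == 0 {
			claim.StatusReservedFor = []ConsumerReference{*claim.SpecReservedFor}
		}
		return // later pods referencing the claim are not added individually
	}
	for _, ref := range claim.StatusReservedFor {
		if ref.UID == pod.UID {
			return // pod already recorded
		}
	}
	claim.StatusReservedFor = append(claim.StatusReservedFor, pod)
}
```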

@@ -348,24 +348,24 @@ list, it will not add a reference to the pod.

 ##### Deallocation
 The resourceclaim controller will remove Pod references from the `ReservedFor` list just
-like it does now using the same logic. For non-Pod references, the controller will recognize
-a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
-remove the reference from the list when those resources are removed. For other types,
-it will be the responsibility of the workload controller/user that created the `ResourceClaim`
-to remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
+like it does now using the same logic. But for non-Pod references, it
+will be the responsibility of the controller/user that created the `ResourceClaim` to
+remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
 are consuming the `ResourceClaim` and no new pods will be created that references
 the `ResourceClaim`.

 The resourceclaim controller will then discover that the `ReservedFor` list is empty
 and therefore know that it is safe to deallocate the `ResourceClaim`.

-This requires that the resourceclaim controller watches the workload types that will
-be supported. For other types of workloads, there will be a requirement that the workload
-controller has permissions to update the status subresource of the `ResourceClaim`. The
-resourceclaim controller will also try to detect if an unknown resource referenced in the
-`ReservedFor` list has been deleted from the cluster, but that requires that the controller
-has permissions to get or list resources of the type. If the resourceclaim controller is
-not able to check, it will just wait until the reference in the `ReservedFor` list is removed.
+This requires that the controller/user has permissions to update the status
+subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if
+the resource referenced in the `ReservedFor` list has been deleted from the cluster, but
+that requires that the controller has permissions to get or list resources of the type. If the
+resourceclaim controller is not able to check, it will just wait until the reference in
+the `ReservedFor` list is removed. The resourceclaim controller will not have a watch
+on the workload resource, so there is no guarantee that the controller will realize that
+the resource has been deleted. This is an extra check since it is the responsibility of
+the workload controller to update the claim.

 ##### Finding pods using a ResourceClaim
 If the reference in the `ReservedFor` list is to a non-Pod resource, controllers can no longer
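As an illustration of the cleanup responsibility described in the Deallocation section above, here is a minimal sketch of a workload controller (or user-written tool) removing its own non-Pod reference from `status.ReservedFor` once no pods consume the claim anymore. It assumes the `resource.k8s.io/v1beta1` types and a standard client-go clientset; the `releaseClaim` helper and its signature are illustrative, not something defined by the KEP.

```go
package sketch

import (
	"context"

	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// releaseClaim drops the workload's entry (matched by UID) from the
// ReservedFor list and writes the result back through the status subresource,
// so the resourceclaim controller can observe an empty list and deallocate.
// The caller needs RBAC permission to update resourceclaims/status.
func releaseClaim(ctx context.Context, cs kubernetes.Interface, namespace, claimName string, workloadUID types.UID) error {
	claim, err := cs.ResourceV1beta1().ResourceClaims(namespace).Get(ctx, claimName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := make([]resourcev1beta1.ResourceClaimConsumerReference, 0, len(claim.Status.ReservedFor))
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID != workloadUID {
			kept = append(kept, ref)
		}
	}
	claim.Status.ReservedFor = kept
	// In a real controller this update would be wrapped in a conflict retry.
	_, err = cs.ResourceV1beta1().ResourceClaims(namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```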
@@ -386,7 +386,7 @@ but haven't yet been scheduled. This distinction is important for some of the us
 1. If the DRA scheduler plugin is trying to find candidates for deallocation in
 the `PostFilter` function and sees a `ResourceClaim` with a non-Pod reference, it will not
 attempt to deallocate. The plugin has no way to know how many pods are actually consuming
-the `ResourceClaim` without the explit list in the `ReservedFor` list and therefore it will
+the `ResourceClaim` without the explicit list in the `ReservedFor` list and therefore it will
 not be safe to deallocate.

 1. The device_taint_eviction controller will use the list of pods referencing the `ResourceClaim`
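The check that the `PostFilter` case above relies on amounts to asking whether any entry in `ReservedFor` points at something other than a Pod. A minimal sketch, assuming the `resource.k8s.io/v1beta1` consumer reference type (the helper name is illustrative):

```go
package sketch

import resourcev1beta1 "k8s.io/api/resource/v1beta1"

// hasNonPodConsumer reports whether any ReservedFor entry references a
// resource other than a core/v1 Pod; if so, the plugin skips deallocation.
func hasNonPodConsumer(refs []resourcev1beta1.ResourceClaimConsumerReference) bool {
	for _, ref := range refs {
		if ref.APIGroup != "" || ref.Resource != "pods" {
			return true
		}
	}
	return false
}
```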
@@ -473,6 +473,19 @@ The API server will no longer accept the new fields and the other components wil
 not know what to do with them. So the result is that the `ReservedFor` list will only
 have references to pod resources like today.

+Any ResourceClaims that have already been allocated when the feature was active will
+have non-pod references in the `ReservedFor` list after a downgrade, but the controllers
+will not know how to handle it. There are two problems that will arise as a result of
+this:
+- The workload controller will also have been downgraded if it is in-tree, meaning that
+it will not remove the reference to the workload resource from the `ReservedFor` list, thus
+leading to a situation where the claim will never be deallocated.
+- For new pods that get scheduled, the scheduler will add pod references in the
+`ReservedFor` list, despite there being a non-pod reference here. So it ends up with
+both pod and non-pod references in the list. We need to make sure the system can
+handle this, as it might also happen as a result of disablement and then re-enablement
+of the feature.
+
 ### Version Skew Strategy

 If the kubelet is on a version that doesn't support the feature but the rest of the
@@ -482,10 +495,9 @@ since it will still check whether the `Pod` is references in the `ReservedFor` l
 If the API server is on a version that supports the feature, but the scheduler
 is not, the scheduler will not know about the new fields added, so it will
 put the reference to the `Pod` in the `ReservedFor` list rather than the reference
-in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but
-it will be subject to the 256 limit on the size of the `ReservedFor` list and the
-controller creating the `ResourceClaim` will not find the reference it expects
-in the `ReservedFor` list when it tries to remove it.
+in the `spec.ReservedFor` list. It will do this even if there is already a non-pod
+reference in the `spec.ReservedFor` list. This leads to the challenge described
+in the previous section.

 ## Production Readiness Review Questionnaire

@@ -543,18 +555,21 @@ No

 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

-Applications that were already running will continue to run and the allocated
-devices will remain so.
-For the resource types supported directly, the resource claim controller will not remove the
-reference in the `ReservedFor` list, meaning the devices will not be deallocated. If the workload
-controller is responsible for removing the reference, deallocation will work as long as the
-feature isn't also disabled in the controllers. If they are, deallocation will not happen in this
-situation either.
+Applications that were already running will continue to run. But if a pod has to be
+re-admitted by a kubelet where the feature has been disabled, it will not be able to, since
+the kubelet will not find a reference to the pod in the `ReservedFor` list.
+
+The feature will also be disabled for in-tree workload controllers, meaning that they will
+not remove the reference to the workload from the `ReservedFor` list. This means the list will never
+be empty and the resourceclaim controller will never deallocate the claim.

 ###### What happens if we reenable the feature if it was previously rolled back?

 It will take affect again and will impact how the `ReservedFor` field is used during allocation
-and deallocation.
+and deallocation. Since this scenario allows a ResourceClaim with the `spec.ReservedFor` field
+to be set and then have the scheduler populate the `ReservedFor` list with pods when the feature
+is disabled, we will end up in a situation where the `ReservedFor` list can contain both non-pod
+and pod references. We need to make sure all components can handle that.

 ###### Are there any tests for feature enablement/disablement?

@@ -723,7 +738,10 @@ No

 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?

-No
+Yes and no. We are adding two new fields to the ResourceClaim type, but neither are of a collection type
+so they should have limited impact on the total size of the objects. However, this feature means that
+we no longer need to keep a complete list of all pods using a ResourceClaim, which can significantly
+reduce the size of ResourceClaim objects shared by many pods.

 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

@@ -732,8 +750,7 @@ No
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

 It might require some additional memory usage in the resourceclaim controller since it will need to keep an index
-of ResourceClaim to Pods. The resourceclaim controller will also have watches for Deployments, StatefulSets, and
-Jobs which might also cause a slight increase in memory usage.
+of ResourceClaim to Pods.

 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

0 commit comments
