@@ -336,7 +336,7 @@ references a `ResourceClaim` with an empty `status.ReservedFor` list, it knows t
is the first pod that will be consuming the claim.

If the `spec.ReservedFor` field in the ResourceClaim is not set, the scheduler will handle
- the `ResourceClaim` in the same was as now, and will add the `Pod` to the `ReservedFor` list
+ the `ResourceClaim` in the same way as now, and will add the `Pod` to the `ReservedFor` list
if devices could be allocated for the claim. Any additional pods that reference the `ResourceClaim`
will also be added to the list.
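To make the proposed field concrete, here is a minimal sketch of the reference a workload controller might place in the proposed `spec.ReservedFor` field. It assumes the field reuses the existing `ResourceClaimConsumerReference` type from `resource.k8s.io/v1beta1`; the Deployment name and helper function are purely illustrative.

```go
package example

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// workloadReference builds the single reference a workload controller could
// set when it creates a ResourceClaim on behalf of a Deployment, instead of
// having every Pod added to the ReservedFor list.
func workloadReference(name string, uid types.UID) resourceapi.ResourceClaimConsumerReference {
	return resourceapi.ResourceClaimConsumerReference{
		APIGroup: "apps",        // API group of the consuming workload
		Resource: "deployments", // lowercase plural resource name
		Name:     name,          // name of the Deployment
		UID:      uid,           // UID lets stale references be detected
	}
}
```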
@@ -348,24 +348,24 @@ list, it will not add a reference to the pod.

##### Deallocation
The resourceclaim controller will remove Pod references from the `ReservedFor` list just
- like it does now using the same logic. For non-Pod references, the controller will recognize
- a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
- remove the reference from the list when those resources are removed. For other types,
- it will be the responsibility of the workload controller/user that created the `ResourceClaim`
- to remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
+ like it does now using the same logic. But for non-Pod references, it
+ will be the responsibility of the controller/user that created the `ResourceClaim` to
+ remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
are consuming the `ResourceClaim` and no new pods will be created that references
the `ResourceClaim`.

The resourceclaim controller will then discover that the `ReservedFor` list is empty
and therefore know that it is safe to deallocate the `ResourceClaim`.
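As a rough sketch of what this responsibility could look like in a workload controller, the following removes the controller's own reference from `status.ReservedFor` once it knows no pods consume the claim anymore. The clientset version (`resource.k8s.io/v1beta1`) and the function name are assumptions for illustration only.

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// releaseClaim drops the workload's own reference from the ReservedFor list so
// that the resourceclaim controller can observe an empty list and deallocate.
func releaseClaim(ctx context.Context, cs kubernetes.Interface, ns, claimName string, workloadUID types.UID) error {
	claim, err := cs.ResourceV1beta1().ResourceClaims(ns).Get(ctx, claimName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := claim.Status.ReservedFor[:0]
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID != workloadUID { // keep everything except our own reference
			kept = append(kept, ref)
		}
	}
	claim.Status.ReservedFor = kept
	// ReservedFor lives in status, so the update goes through the status
	// subresource.
	_, err = cs.ResourceV1beta1().ResourceClaims(ns).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```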

- This requires that the resourceclaim controller watches the workload types that will
- be supported. For other types of workloads, there will be a requirement that the workload
- controller has permissions to update the status subresource of the `ResourceClaim`. The
- resourceclaim controller will also try to detect if an unknown resource referenced in the
- `ReservedFor` list has been deleted from the cluster, but that requires that the controller
- has permissions to get or list resources of the type. If the resourceclaim controller is
- not able to check, it will just wait until the reference in the `ReservedFor` list is removed.
+ This requires that the controller/user has permissions to update the status
+ subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if
+ the resource referenced in the `ReservedFor` list has been deleted from the cluster, but
+ that requires that the controller has permissions to get or list resources of the type. If the
+ resourceclaim controller is not able to check, it will just wait until the reference in
+ the `ReservedFor` list is removed. The resourceclaim controller will not have a watch
+ on the workload resource, so there is no guarantee that the controller will realize that
+ the resource has been deleted. This is an extra check since it is the responsibility of
+ the workload controller to update the claim.
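For the permission requirement above, here is a sketch of the RBAC rules such a controller/user would plausibly need; the exact verbs depend on the controller, and `resourceclaims/status` is the status subresource mentioned above.

```go
package example

import rbacv1 "k8s.io/api/rbac/v1"

// resourceClaimRules sketches the access a workload controller needs to read
// claims and to clear its reference through the status subresource.
var resourceClaimRules = []rbacv1.PolicyRule{
	{
		APIGroups: []string{"resource.k8s.io"},
		Resources: []string{"resourceclaims"},
		Verbs:     []string{"get", "list", "watch"},
	},
	{
		APIGroups: []string{"resource.k8s.io"},
		Resources: []string{"resourceclaims/status"},
		Verbs:     []string{"update", "patch"},
	},
}
```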

##### Finding pods using a ResourceClaim
If the reference in the `ReservedFor` list is to a non-Pod resource, controllers can no longer
@@ -386,7 +386,7 @@ but haven't yet been scheduled. This distinction is important for some of the us
1. If the DRA scheduler plugin is trying to find candidates for deallocation in
the `PostFilter` function and sees a `ResourceClaim` with a non-Pod reference, it will not
attempt to deallocate. The plugin has no way to know how many pods are actually consuming
- the `ResourceClaim` without the explit list in the `ReservedFor` list and therefore it will
+ the `ResourceClaim` without the explicit list in the `ReservedFor` list and therefore it will
not be safe to deallocate.

1. The device_taint_eviction controller will use the list of pods referencing the `ResourceClaim`
@@ -473,6 +473,19 @@ The API server will no longer accept the new fields and the other components wil
not know what to do with them. So the result is that the `ReservedFor` list will only
have references to pod resources like today.

+ Any ResourceClaims that have already been allocated while the feature was active will
+ have non-pod references in the `ReservedFor` list after a downgrade, but the controllers
+ will not know how to handle them. There are two problems that will arise as a result of
+ this:
+ - The workload controller will also have been downgraded if it is in-tree, meaning that
+ it will not remove the reference to the workload resource from the `ReservedFor` list, thus
+ leading to a situation where the claim will never be deallocated.
+ - For new pods that get scheduled, the scheduler will add pod references to the
+ `ReservedFor` list, despite there already being a non-pod reference in it. So the claim ends up
+ with both pod and non-pod references in the list. We need to make sure the system can
+ handle this, as it might also happen as a result of disablement and re-enablement
+ of the feature.
+
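To illustrate the mixed state described above, a small sketch of how a component could tell the two kinds of entries apart, assuming pod references keep using the core API group and the `pods` resource as they do today:

```go
package example

import resourceapi "k8s.io/api/resource/v1beta1"

// splitReferences separates pod references from non-pod (workload) references
// in a ReservedFor list that may contain both after a downgrade or a
// disable/re-enable cycle.
func splitReferences(refs []resourceapi.ResourceClaimConsumerReference) (pods, workloads []resourceapi.ResourceClaimConsumerReference) {
	for _, ref := range refs {
		if ref.APIGroup == "" && ref.Resource == "pods" {
			pods = append(pods, ref)
		} else {
			workloads = append(workloads, ref)
		}
	}
	return pods, workloads
}
```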

### Version Skew Strategy

If the kubelet is on a version that doesn't support the feature but the rest of the
@@ -482,10 +495,9 @@ since it will still check whether the `Pod` is references in the `ReservedFor` l
If the API server is on a version that supports the feature, but the scheduler
is not, the scheduler will not know about the new fields added, so it will
put the reference to the `Pod` in the `ReservedFor` list rather than the reference
- in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but
- it will be subject to the 256 limit on the size of the `ReservedFor` list and the
- controller creating the `ResourceClaim` will not find the reference it expects
- in the `ReservedFor` list when it tries to remove it.
+ in the `spec.ReservedFor` list. It will do this even if there is already a non-pod
+ reference in the `spec.ReservedFor` list. This leads to the challenge described
+ in the previous section.

## Production Readiness Review Questionnaire

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

- Applications that were already running will continue to run and the allocated
- devices will remain so.
- For the resource types supported directly, the resource claim controller will not remove the
- reference in the `ReservedFor` list, meaning the devices will not be deallocated. If the workload
- controller is responsible for removing the reference, deallocation will work as long as the
- feature isn't also disabled in the controllers. If they are, deallocation will not happen in this
- situation either.
+ Applications that were already running will continue to run. But if a pod has to be
+ re-admitted by a kubelet where the feature has been disabled, it will not be admitted, since
+ the kubelet will not find a reference to the pod in the `ReservedFor` list.
+
+ The feature will also be disabled for in-tree workload controllers, meaning that they will
+ not remove the reference to the workload resource from the `ReservedFor` list. This means the
+ list will never be empty and the resourceclaim controller will never deallocate the claim.

###### What happens if we reenable the feature if it was previously rolled back?

It will take effect again and will impact how the `ReservedFor` field is used during allocation
- and deallocation.
+ and deallocation. Since this scenario allows a ResourceClaim with the `spec.ReservedFor` field
+ to be set and then have the scheduler populate the `ReservedFor` list with pods when the feature
+ is disabled, we will end up in a situation where the `ReservedFor` list can contain both non-pod
+ and pod references. We need to make sure all components can handle that.

###### Are there any tests for feature enablement/disablement?

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

- No
+ Yes and no. We are adding two new fields to the ResourceClaim type, but neither is of a collection type,
+ so they should have limited impact on the total size of the objects. However, this feature means that
+ we no longer need to keep a complete list of all pods using a ResourceClaim, which can significantly
+ reduce the size of ResourceClaim objects shared by many pods.

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

It might require some additional memory usage in the resourceclaim controller since it will need to keep an index
- of ResourceClaim to Pods. The resourceclaim controller will also have watches for Deployments, StatefulSets, and
- Jobs which might also cause a slight increase in memory usage.
+ of ResourceClaim to Pods.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?