UPSTREAM: 132028: podresources: list: use active pods in list
The podresources API `List` implementation uses the internal data of the
resource managers as the source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager for its pod->devices assignment.
This is needed because otherwise the device manager (like all the other
resource managers) would do the cleanup asynchronously, so the `List` call
would return incorrect data.
But we don't do this syncing for either CPUs or memory,
so when we report these resources we can return stale data, as issue kubernetes#132020 demonstrates.
For the CPU manager, however, we have the reconcile loop, which cleans up the stale data periodically.
It turns out this timing interplay was the reason the existing issue kubernetes#119423 seemed fixed
(see: kubernetes#119423 (comment)).
But it was just timing: if in the reproducer we set the `cpuManagerReconcilePeriod` to a
very high value (>= 5 minutes), the issue still reproduces against the current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
Taking a step back, we can see multiple problems:
1. we do not sync the resource managers' internal data before querying
for pod assignments (no removeStaleState calls); but, most importantly,
2. the `List` call iterates over all the pods known to the kubelet. But the
resource managers do NOT hold resources for non-running pods, so it is
better, and in fact correct, to iterate only over the active pods.
This also avoids problem 1 above.
Furthermore, the resource managers all iterate over the active pods
anyway:
`List` uses all the pods the kubelet knows about:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L3135 goes in
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/pod/pod_manager.go#L215
But all the resource managers use the list of active pods:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L1666 goes in
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet_pods.go#L198
So this change also makes the `List` view consistent with the
resource managers' view, which is a promise of the API that is currently
broken.
We also need to acknowledge the warning in the docstring of GetActivePods.
Arguably, having the endpoint use a different pod set from the resource managers,
with the related desync, causes far more harm than good.
And arguably, it is better to fix this issue in just one place rather than
have `List` use a different pod set for unclear reasons.
For these reasons, while the warning is important, I don't think it per se
invalidates this change.
We also need to acknowledge that the `List` endpoint has used the full pod
list since its inception. So, we add a Feature Gate to disable this
fix and restore the old behavior. We plan to keep this Feature Gate for
quite a long time (at least 4 more releases), considering how long-standing the
old behavior was. Should a consumer of the API be broken by this change,
we have the option to restore the old behavior and to craft a more
elaborate fix.
The old `v1alpha1` endpoint is intentionally left unmodified.
***RELEASE-4.19 BACKPORT NOTE***
dropped the versioned feature gate entry as we don't have versioned
feature gates in this version.
Signed-off-by: Francesco Romani <[email protected]>