
Commit 4aa7318

UPSTREAM: 132028: podresources: list: use active pods in list
The podresources API `List` implementation uses the internal data of the resource managers as its source of truth. Looking at the implementation (https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60), we take care of syncing the device allocation data before querying the device manager for its pod-to-devices assignment. This is needed because otherwise the device manager (like all the other resource managers) would do its cleanup asynchronously, so the `List` call would return incorrect data. But we do no such syncing for either CPUs or memory, so when we report these we get stale data, as issue kubernetes#132020 demonstrates.

For the CPU manager we do, however, have the reconcile loop, which cleans up stale data periodically. It turns out this timing interplay is the reason the existing issue kubernetes#119423 seemed fixed (see kubernetes#119423 (comment)); it was really just timing. If, in the reproducer (https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983), we set `cpuManagerReconcilePeriod` to a very high value (>= 5 minutes), the issue still reproduces against the current master branch.

Taking a step back, we can see multiple problems:

1. We do not sync the resource managers' internal data before querying for pod assignments (no removeStaleState calls).
2. Most importantly, the `List` call iterates over all the pods known to the kubelet. But the resource managers do NOT hold resources for non-running pods, so it is better (actually, correct) to iterate only over the active pods. This also avoids problem 1.

Furthermore, the resource managers already iterate over the active pods. `List` uses all the pods the kubelet knows about:

1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L3135 goes into
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/pod/pod_manager.go#L215

while all the resource managers use the list of active pods:

1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L1666 goes into
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet_pods.go#L198

So this change also makes the `List` view consistent with the resource managers' view, which is a promise of the API that is currently broken.

We also need to acknowledge the warning in the docstring of GetActivePods. Arguably, having the endpoint use a different pod set than the resource managers, with the related desync, causes far more harm than good. And arguably, it is better to fix this issue in just one place instead of having `List` use a different pod set for unclear reasons. For these reasons, while the warning is important, I don't think it per se invalidates this change.

We further need to acknowledge that the `List` endpoint has used the full pod list since its inception. So we add a Feature Gate to disable this fix and restore the old behavior. We plan to keep this Feature Gate for quite a long time (at least 4 more releases), considering how stable this change has been. Should a consumer of the API be broken by this change, we have the option to restore the old behavior and craft a more elaborate fix.

The old `v1alpha1` endpoint is intentionally left unmodified.

***RELEASE-4.19 BACKPORT NOTE*** dropped the versioned feature gate entry, as we don't have versioned feature gates in this version.

Signed-off-by: Francesco Romani <[email protected]>
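The gated pod selection described above can be sketched in isolation. This is a minimal, hypothetical Go model (the `pod` struct and `listPods` helper are simplified stand-ins for illustration, not the actual kubelet types):

```go
package main

import "fmt"

// pod is a tiny stand-in for v1.Pod (hypothetical, for illustration only).
type pod struct {
	name   string
	active bool // true while the pod still holds node resources
}

// listPods mirrors the gating this commit adds to List: with the
// KubeletPodResourcesListUseActivePods gate on (the default), only
// active pods are reported; with it off, the old full-list behavior
// is restored.
func listPods(all []pod, useActivePods bool) []pod {
	if !useActivePods {
		return all
	}
	active := make([]pod, 0, len(all))
	for _, p := range all {
		if p.active {
			active = append(active, p)
		}
	}
	return active
}

func main() {
	pods := []pod{{"running", true}, {"terminated", false}}
	fmt.Println(len(listPods(pods, true)), len(listPods(pods, false))) // prints "1 2"
}
```

The point of the sketch is that the stale-data bug lives entirely in which pod set is handed to the reporting loop; the per-resource reporting logic is unchanged.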
1 parent 38c60a5 commit 4aa7318

File tree

7 files changed: +265 −2 lines changed

pkg/features/kube_features.go

Lines changed: 21 additions & 0 deletions

@@ -370,6 +370,27 @@ const (
 	// Enable POD resources API with Get method
 	KubeletPodResourcesGet featuregate.Feature = "KubeletPodResourcesGet"
 
+	// owner: @ffromani
+	// Deprecated: v1.34
+	//
+	// issue: https://github.com/kubernetes/kubernetes/issues/119423
+	// Disables restricted output for the podresources API list endpoint.
+	// "Restricted" output only includes the pods which are actually running and thus
+	// hold resources. It turns out this was originally the intended behavior, see:
+	// https://github.com/kubernetes/kubernetes/pull/79409#issuecomment-505975671
+	// This behavior was lost over time, and the interaction with the memory manager creates
+	// an unfixable bug because the endpoint returns spurious stale information the clients
+	// cannot filter out, because the API doesn't provide enough context. See:
+	// https://github.com/kubernetes/kubernetes/issues/132020
+	// The endpoint has been returning extra information for a long time, but that information
+	// is also useless for the purpose of this API. Nevertheless, we are changing a long-established
+	// albeit buggy behavior, so users observing any regressions can use the
+	// KubeletPodResourcesListUseActivePods feature gate (default on) to restore the old behavior.
+	// Please file issues if you hit problems and have to use this Feature Gate.
+	// The Feature Gate will be locked to true in +4 releases (1.38) and then removed (1.39)
+	// if there are no bugs reported.
+	KubeletPodResourcesListUseActivePods featuregate.Feature = "KubeletPodResourcesListUseActivePods"
+
 	// owner: @kannon92
 	// kep: https://kep.k8s.io/4191
 	//
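For operators: if a consumer of the API regresses under the new behavior, the gate can be turned off in the kubelet configuration. A minimal fragment (field names follow the standard KubeletConfiguration schema; only the featureGates entry is specific to this commit):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Set to false to restore the pre-fix List behavior (full pod list).
  KubeletPodResourcesListUseActivePods: false
```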

pkg/kubelet/apis/podresources/server_v1.go

Lines changed: 12 additions & 1 deletion

@@ -22,6 +22,7 @@ import (
 
 	v1 "k8s.io/api/core/v1"
 	utilfeature "k8s.io/apiserver/pkg/util/feature"
+	"k8s.io/klog/v2"
 	podutil "k8s.io/kubernetes/pkg/api/v1/pod"
 	kubefeatures "k8s.io/kubernetes/pkg/features"
 	"k8s.io/kubernetes/pkg/kubelet/metrics"

@@ -36,17 +37,21 @@ type v1PodResourcesServer struct {
 	cpusProvider             CPUsProvider
 	memoryProvider           MemoryProvider
 	dynamicResourcesProvider DynamicResourcesProvider
+	useActivePods            bool
 }
 
 // NewV1PodResourcesServer returns a PodResourcesListerServer which lists pods provided by the PodsProvider
 // with device information provided by the DevicesProvider
 func NewV1PodResourcesServer(providers PodResourcesProviders) podresourcesv1.PodResourcesListerServer {
+	useActivePods := utilfeature.DefaultFeatureGate.Enabled(kubefeatures.KubeletPodResourcesListUseActivePods)
+	klog.InfoS("podresources", "method", "list", "useActivePods", useActivePods)
 	return &v1PodResourcesServer{
 		podsProvider:             providers.Pods,
 		devicesProvider:          providers.Devices,
 		cpusProvider:             providers.Cpus,
 		memoryProvider:           providers.Memory,
 		dynamicResourcesProvider: providers.DynamicResources,
+		useActivePods:            useActivePods,
 	}
 }

@@ -55,7 +60,13 @@ func (p *v1PodResourcesServer) List(ctx context.Context, req *podresourcesv1.Lis
 	metrics.PodResourcesEndpointRequestsTotalCount.WithLabelValues("v1").Inc()
 	metrics.PodResourcesEndpointRequestsListCount.WithLabelValues("v1").Inc()
 
-	pods := p.podsProvider.GetPods()
+	var pods []*v1.Pod
+	if p.useActivePods {
+		pods = p.podsProvider.GetActivePods()
+	} else {
+		pods = p.podsProvider.GetPods()
+	}
+
 	podResources := make([]*podresourcesv1.PodResources, len(pods))
 	p.devicesProvider.UpdateAllocatedDevices()

pkg/kubelet/apis/podresources/server_v1_test.go

Lines changed: 157 additions & 0 deletions

@@ -19,10 +19,12 @@ package podresources
 import (
 	"context"
 	"fmt"
+	"sort"
 	"testing"
 
 	"github.com/google/go-cmp/cmp"
 	"github.com/google/go-cmp/cmp/cmpopts"
+	"github.com/stretchr/testify/mock"
 
 	v1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

@@ -221,6 +223,7 @@ func TestListPodResourcesV1(t *testing.T) {
 	mockDynamicResourcesProvider := podresourcetest.NewMockDynamicResourcesProvider(t)
 
 	mockPodsProvider.EXPECT().GetPods().Return(tc.pods).Maybe()
+	mockPodsProvider.EXPECT().GetActivePods().Return(tc.pods).Maybe()
 	mockDevicesProvider.EXPECT().GetDevices(string(podUID), containerName).Return(tc.devices).Maybe()
 	mockCPUsProvider.EXPECT().GetCPUs(string(podUID), containerName).Return(tc.cpus).Maybe()
 	mockMemoryProvider.EXPECT().GetMemory(string(podUID), containerName).Return(tc.memory).Maybe()

@@ -249,6 +252,159 @@
 	}
 }
 
+func makePod(idx int) *v1.Pod {
+	podNamespace := "pod-namespace"
+	podName := fmt.Sprintf("pod-name-%d", idx)
+	podUID := types.UID(fmt.Sprintf("pod-uid-%d", idx))
+	containerName := fmt.Sprintf("container-name-%d", idx)
+	containers := []v1.Container{
+		{
+			Name: containerName,
+		},
+	}
+	return &v1.Pod{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      podName,
+			Namespace: podNamespace,
+			UID:       podUID,
+		},
+		Spec: v1.PodSpec{
+			Containers: containers,
+		},
+	}
+}
+
+func collectNamespacedNamesFromPods(pods []*v1.Pod) []string {
+	ret := make([]string, 0, len(pods))
+	for _, pod := range pods {
+		ret = append(ret, pod.Namespace+"/"+pod.Name)
+	}
+	sort.Strings(ret)
+	return ret
+}
+
+func collectNamespacedNamesFromPodResources(prs []*podresourcesapi.PodResources) []string {
+	ret := make([]string, 0, len(prs))
+	for _, pr := range prs {
+		ret = append(ret, pr.Namespace+"/"+pr.Name)
+	}
+	sort.Strings(ret)
+	return ret
+}
+
+func TestListPodResourcesUsesOnlyActivePodsV1(t *testing.T) {
+	numaID := int64(1)
+
+	// we abuse the fact that we don't care about the assignments,
+	// so we reuse the same for all pods, which is actually wrong.
+	devs := []*podresourcesapi.ContainerDevices{
+		{
+			ResourceName: "resource",
+			DeviceIds:    []string{"dev0"},
+			Topology:     &podresourcesapi.TopologyInfo{Nodes: []*podresourcesapi.NUMANode{{ID: numaID}}},
+		},
+	}
+
+	cpus := []int64{1, 9}
+
+	mems := []*podresourcesapi.ContainerMemory{
+		{
+			MemoryType: "memory",
+			Size_:      1073741824,
+			Topology:   &podresourcesapi.TopologyInfo{Nodes: []*podresourcesapi.NUMANode{{ID: numaID}}},
+		},
+		{
+			MemoryType: "hugepages-1Gi",
+			Size_:      1073741824,
+			Topology:   &podresourcesapi.TopologyInfo{Nodes: []*podresourcesapi.NUMANode{{ID: numaID}}},
+		},
+	}
+
+	for _, tc := range []struct {
+		desc       string
+		pods       []*v1.Pod
+		activePods []*v1.Pod
+	}{
+		{
+			desc:       "no pods",
+			pods:       []*v1.Pod{},
+			activePods: []*v1.Pod{},
+		},
+		{
+			desc: "no differences",
+			pods: []*v1.Pod{
+				makePod(1),
+				makePod(2),
+				makePod(3),
+				makePod(4),
+				makePod(5),
+			},
+			activePods: []*v1.Pod{
+				makePod(1),
+				makePod(2),
+				makePod(3),
+				makePod(4),
+				makePod(5),
+			},
+		},
+		{
+			desc: "some terminated pods",
+			pods: []*v1.Pod{
+				makePod(1),
+				makePod(2),
+				makePod(3),
+				makePod(4),
+				makePod(5),
+				makePod(6),
+				makePod(7),
+			},
+			activePods: []*v1.Pod{
+				makePod(1),
+				makePod(3),
+				makePod(4),
+				makePod(5),
+				makePod(6),
+			},
+		},
+	} {
+		t.Run(tc.desc, func(t *testing.T) {
+			mockDevicesProvider := podresourcetest.NewMockDevicesProvider(t)
+			mockPodsProvider := podresourcetest.NewMockPodsProvider(t)
+			mockCPUsProvider := podresourcetest.NewMockCPUsProvider(t)
+			mockMemoryProvider := podresourcetest.NewMockMemoryProvider(t)
+			mockDynamicResourcesProvider := podresourcetest.NewMockDynamicResourcesProvider(t)
+
+			mockPodsProvider.EXPECT().GetPods().Return(tc.pods).Maybe()
+			mockPodsProvider.EXPECT().GetActivePods().Return(tc.activePods).Maybe()
+			mockDevicesProvider.EXPECT().GetDevices(mock.Anything, mock.Anything).Return(devs).Maybe()
+			mockCPUsProvider.EXPECT().GetCPUs(mock.Anything, mock.Anything).Return(cpus).Maybe()
+			mockMemoryProvider.EXPECT().GetMemory(mock.Anything, mock.Anything).Return(mems).Maybe()
+			mockDevicesProvider.EXPECT().UpdateAllocatedDevices().Return().Maybe()
+			mockCPUsProvider.EXPECT().GetAllocatableCPUs().Return([]int64{}).Maybe()
+			mockDevicesProvider.EXPECT().GetAllocatableDevices().Return([]*podresourcesapi.ContainerDevices{}).Maybe()
+			mockMemoryProvider.EXPECT().GetAllocatableMemory().Return([]*podresourcesapi.ContainerMemory{}).Maybe()
+
+			providers := PodResourcesProviders{
+				Pods:             mockPodsProvider,
+				Devices:          mockDevicesProvider,
+				Cpus:             mockCPUsProvider,
+				Memory:           mockMemoryProvider,
+				DynamicResources: mockDynamicResourcesProvider,
+			}
+			server := NewV1PodResourcesServer(providers)
+			resp, err := server.List(context.TODO(), &podresourcesapi.ListPodResourcesRequest{})
+			if err != nil {
+				t.Errorf("want err = %v, got %q", nil, err)
+			}
+			expectedNames := collectNamespacedNamesFromPods(tc.activePods)
+			gotNames := collectNamespacedNamesFromPodResources(resp.GetPodResources())
+			if diff := cmp.Diff(expectedNames, gotNames, cmpopts.EquateEmpty()); diff != "" {
+				t.Fatal(diff)
+			}
+		})
+	}
+}
+
 func TestListPodResourcesWithInitContainersV1(t *testing.T) {
 	featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, pkgfeatures.KubeletPodResourcesDynamicResources, true)
 
@@ -530,6 +686,7 @@ func TestListPodResourcesWithInitContainersV1(t *testing.T) {
 	mockDynamicResourcesProvider := podresourcetest.NewMockDynamicResourcesProvider(t)
 
 	mockPodsProvider.EXPECT().GetPods().Return(tc.pods).Maybe()
+	mockPodsProvider.EXPECT().GetActivePods().Return(tc.pods).Maybe()
 	tc.mockFunc(tc.pods, mockDevicesProvider, mockCPUsProvider, mockMemoryProvider, mockDynamicResourcesProvider)
 
 	providers := PodResourcesProviders{

pkg/kubelet/apis/podresources/testing/pods_provider.go

Lines changed: 47 additions & 0 deletions
Some generated files are not rendered by default.

pkg/kubelet/apis/podresources/types.go

Lines changed: 1 addition & 0 deletions

@@ -34,6 +34,7 @@ type DevicesProvider interface {
 
 // PodsProvider knows how to provide the pods admitted by the node
 type PodsProvider interface {
+	GetActivePods() []*v1.Pod
 	GetPods() []*v1.Pod
 	GetPodByName(namespace, name string) (*v1.Pod, bool)
 }

pkg/kubelet/kubelet.go

Lines changed: 17 additions & 1 deletion

@@ -3058,6 +3058,22 @@ func (kl *Kubelet) ListenAndServeReadOnly(address net.IP, port uint, tp trace.Tr
 	server.ListenAndServeKubeletReadOnlyServer(kl, kl.resourceAnalyzer, kl.containerManager.GetHealthCheckers(), address, port, tp)
 }
 
+type kubeletPodsProvider struct {
+	kl *Kubelet
+}
+
+func (pp *kubeletPodsProvider) GetActivePods() []*v1.Pod {
+	return pp.kl.GetActivePods()
+}
+
+func (pp *kubeletPodsProvider) GetPods() []*v1.Pod {
+	return pp.kl.podManager.GetPods()
+}
+
+func (pp *kubeletPodsProvider) GetPodByName(namespace, name string) (*v1.Pod, bool) {
+	return pp.kl.podManager.GetPodByName(namespace, name)
+}
+
 // ListenAndServePodResources runs the kubelet podresources grpc service
 func (kl *Kubelet) ListenAndServePodResources() {
 	endpoint, err := util.LocalEndpoint(kl.getPodResourcesDir(), podresources.Socket)

@@ -3067,7 +3083,7 @@
 	}
 
 	providers := podresources.PodResourcesProviders{
-		Pods:             kl.podManager,
+		Pods:             &kubeletPodsProvider{kl: kl},
 		Devices:          kl.containerManager,
 		Cpus:             kl.containerManager,
 		Memory:           kl.containerManager,
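The kubeletPodsProvider adapter exists because the pod manager satisfies GetPods and GetPodByName, but GetActivePods lives on the Kubelet itself; a small wrapper struct unifies them behind the extended PodsProvider interface. A stripped-down sketch of the same pattern (all types here are hypothetical stand-ins, not the real kubelet types):

```go
package main

import "fmt"

// provider is a stand-in for the extended PodsProvider interface.
type provider interface {
	GetPods() []string
	GetActivePods() []string
}

// manager models the pod manager: it only knows the full pod list.
type manager struct{ pods []string }

func (m *manager) GetPods() []string { return m.pods }

// node models the Kubelet: it owns the manager and the active-pods view.
type node struct {
	mgr    *manager
	active []string
}

// nodePodsProvider adapts node and manager to the provider interface,
// mirroring how kubeletPodsProvider bridges Kubelet and podManager.
type nodePodsProvider struct{ n *node }

func (pp *nodePodsProvider) GetPods() []string       { return pp.n.mgr.GetPods() }
func (pp *nodePodsProvider) GetActivePods() []string { return pp.n.active }

func main() {
	n := &node{mgr: &manager{pods: []string{"a", "b"}}, active: []string{"a"}}
	var p provider = &nodePodsProvider{n: n}
	fmt.Println(p.GetPods(), p.GetActivePods())
}
```

The design choice here is to keep the podresources server decoupled from the Kubelet type: it only sees the small provider interface, and the adapter decides which backing object answers each call.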

test/featuregates_linter/test_data/versioned_feature_list.yaml

Lines changed: 10 additions & 0 deletions

@@ -662,6 +662,16 @@
     lockToDefault: false
     preRelease: Alpha
     version: "1.27"
+- name: KubeletPodResourcesListUseActivePods
+  versionedSpecs:
+  - default: false
+    lockToDefault: false
+    preRelease: GA
+    version: "1.0"
+  - default: true
+    lockToDefault: false
+    preRelease: Deprecated
+    version: "1.34"
 - name: KubeletRegistrationGetOnExistsOnly
   versionedSpecs:
   - default: false
