UPSTREAM: 132028: podresources: list: use active pods in list
The podresources API `List` implementation uses the internal data of the
resource managers as the source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager for its pod->devices assignment.
This is needed because otherwise the device manager (like all the other
resource managers) would do the cleanup asynchronously, so the `List` call
would return incorrect data.
But we don't do this syncing for either CPUs or memory,
so when we report these resources we can return stale data, as issue kubernetes#132020 demonstrates.
For the CPU manager, however, we have the reconcile loop, which cleans up the stale data periodically.
It turns out this timing interplay was the reason the existing issue kubernetes#119423 seemed fixed
(see: kubernetes#119423 (comment)).
But it was just timing: if in the reproducer we set the `cpuManagerReconcilePeriod` to a
very high value (>= 5 minutes), the issue still reproduces against the current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
Taking a step back, we can see multiple problems:
1. we do not sync the resource managers' internal data before querying
for pod assignments (no removeStaleState calls); but, most importantly,
2. the `List` call iterates over all the pods known to the kubelet. But the
resource managers do NOT hold resources for non-running pods, so it is
better, and in fact correct, to iterate only over the active pods.
This also avoids problem 1 above.
Furthermore, the resource managers all iterate over the active pods
anyway:
`List` uses all the pods the kubelet knows about:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L3135 goes in
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/pod/pod_manager.go#L215
But all the resource managers use the list of active pods:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L1666 goes in
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet_pods.go#L198
So this change also makes the `List` view consistent with the
resource managers' view, which is a promise of the API that is currently
broken.
We also need to acknowledge the warning in the docstring of GetActivePods.
Arguably, having the endpoint use a different pod set from the resource managers,
with the related desync, causes far more harm than good.
And arguably, it is better to fix this issue in just one place rather than
have `List` use a different pod set for unclear reasons.
For these reasons, while the warning is important, I don't think it per se
invalidates this change.
We also need to acknowledge that the `List` endpoint has used the full pod
list since its inception. So, we add a Feature Gate to disable this
fix and restore the old behavior. We plan to keep this Feature Gate for
quite a long time (at least 4 more releases), considering how long-standing the
old behavior was. Should a consumer of the API be broken by this change,
we have the option to restore the old behavior and to craft a more
elaborate fix.
The old `v1alpha1` endpoint is intentionally left unmodified.
***RELEASE-4.19 BACKPORT NOTE***
dropped the versioned feature gate entry as we don't have versioned
feature gates in this version.
Signed-off-by: Francesco Romani <[email protected]>