CNF-18275: CNF-18276: e2e: serial: more cache stall tests #1378

ffromani · 2025-06-10T08:46:46Z

add more test coverage for scheduler stalls, using targeted naked (= not managed by a controller) pods.
the key trigger for scheduler stall turned out to be guaranteed pods with restartPolicy != Always

openshift-ci-robot · 2025-06-10T08:46:48Z

@ffromani: No Jira issue with key CNF-18725 exists in the tracker at https://issues.redhat.com/.
Once a valid jira issue is referenced in the title of this pull request, request a refresh with /jira refresh.

In response to this:

add more test coverage for scheduler stalls, using targeted naked (= not managed by a controller) pods.
the key trigger for scheduler stall turned out to be guaranteed pods with restartPolicy != Always

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-06-10T08:47:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ffromani]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-06-10T08:47:20Z

@ffromani: This pull request references CNF-18275 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

add more test coverage for scheduler stalls, using targeted naked (= not managed by a controller) pods.
the key trigger for scheduler stall turned out to be guaranteed pods with restartPolicy != Always

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

ffromani · 2025-06-10T08:47:24Z

/jira refresh

openshift-ci-robot · 2025-06-10T08:47:28Z

@ffromani: This pull request references CNF-18275 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

in `internal/wait` we have a couple functions with inconsistent signature: - ForPodListAllRunning - ForPodListAllDeleted the former takes and return slices of pointers to `v1.Pod`; the latter takes a slice of `v1.Pod`. This was likely an implementation oversight. The problem however run deeper: what's the correct signature? Let's take a step back. `internal/podlist` builds upon `sigs.k8s.io`'s controller-runtime, which uses` PodList to hold `Items`, which is a slice of `v1.Pod`. There is no reason to differ here, so for us a `podlist` should be akin to a slice of `v1.Pod`. Thus, functions which operate on a slice of pointers to `v1.Pod` should use a different name, like maybe `Pods`. Going back to the original question, and considering all the above, the most correct API seems to be the one using pointers. `ForPodListAllDeleted` should thus be updated to ensure the return values of `ForPodListAllRunning` can be used directly. Finally, both functions should be renamed to move from PodList to Pods. Signed-off-by: Francesco Romani <[email protected]>

ffromani · 2025-06-11T16:42:48Z

/hold

let's wait for the fix for https://issues.redhat.com/browse/OCPBUGS-56785 to be available before to move forward

A common ginkgo gotcha is to initialize variables on var definition. We should never do that. Always initialize in a BeforeEach block. Signed-off-by: Francesco Romani <[email protected]>

Pod event checker should be a klog message, not a By message, to reduce noise in the runs. Signed-off-by: Francesco Romani <[email protected]>

remove conflicting redundant labels. The labels should be moved in the highest meaningful level. Signed-off-by: Francesco Romani <[email protected]>

Add more stall tests using targeted naked (not using jobs) pods. Signed-off-by: Francesco Romani <[email protected]>

clarify logs for a better troubleshoot experience. No intended changes in behavior. Signed-off-by: Francesco Romani <[email protected]>

rshemtov13 · 2025-07-30T12:22:04Z

test/e2e/serial/tests/scheduler_cache_stall.go

+				Entry("with non-restartable pods with guaranteed pod with partially terminated app containers", context.TODO(), func(podNamespace string, idx int) *corev1.Pod {
+					return &corev1.Pod{
+						ObjectMeta: metav1.ObjectMeta{
+							Namespace:    podNamespace,
+							GenerateName: "generic-pause-pod-",
+							Labels: map[string]string{
+								"test-pod-index": strconv.FormatInt(int64(idx), 10),
+							},
+						},
+						Spec: corev1.PodSpec{
+							Containers: []corev1.Container{
+								{
+									Name:    "generic-main-cnt-run-1",
+									Image:   images.GetPauseImage(),
+									Command: []string{"/bin/sleep"},
+									Args:    []string{"42d"}, // "forever" in our test timescale
+									Resources: corev1.ResourceRequirements{
+										Limits: corev1.ResourceList{
+											corev1.ResourceCPU:    resource.MustParse("100m"),
+											corev1.ResourceMemory: resource.MustParse("256Mi"),
+										},
+									},
+								},
+								{
+									Name:    "generic-main-cnt-run-2",
+									Image:   images.GetPauseImage(),
+									Command: []string{"/bin/sleep"},
+									Args:    []string{"1s"},
+									Resources: corev1.ResourceRequirements{
+										Limits: corev1.ResourceList{
+											corev1.ResourceCPU:    resource.MustParse("100m"),
+											corev1.ResourceMemory: resource.MustParse("256Mi"),
+										},
+									},
+								},
+							},
+							RestartPolicy: corev1.RestartPolicyNever,
+						},
+					}
+				}),
+				Entry("with non-restartable pods with guaranteed pod with partially terminated app containers", context.TODO(), func(podNamespace string, idx int) *corev1.Pod {
+					return &corev1.Pod{
+						ObjectMeta: metav1.ObjectMeta{
+							Namespace:    podNamespace,
+							GenerateName: "generic-pause-pod-",
+							Labels: map[string]string{
+								"test-pod-index": strconv.FormatInt(int64(idx), 10),
+							},
+						},
+						Spec: corev1.PodSpec{
+							Containers: []corev1.Container{
+								{
+									Name:    "generic-main-cnt-run-1",
+									Image:   images.GetPauseImage(),
+									Command: []string{"/bin/sleep"},
+									Args:    []string{"42d"}, // "forever" in our test timescale
+									Resources: corev1.ResourceRequirements{
+										Limits: corev1.ResourceList{
+											corev1.ResourceCPU:    resource.MustParse("100m"),
+											corev1.ResourceMemory: resource.MustParse("256Mi"),
+										},
+									},
+								},
+								{
+									Name:    "generic-main-cnt-run-2",
+									Image:   images.GetPauseImage(),
+									Command: []string{"/bin/sleep"},
+									Args:    []string{"1s"},
+									Resources: corev1.ResourceRequirements{
+										Limits: corev1.ResourceList{
+											corev1.ResourceCPU:    resource.MustParse("100m"),
+											corev1.ResourceMemory: resource.MustParse("256Mi"),
+										},
+									},
+								},
+							},
+							RestartPolicy: corev1.RestartPolicyNever,
+						},
+					}
+				}),


These two entries are identical

thanks, I need to rebase anyway so I'll fix with the next upload

openshift-ci bot requested review from shajmakh and swatisehgal June 10, 2025 08:46

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 10, 2025

ffromani changed the title ~~CNF-18725: CNF-18726: e2e: serial: more cache stall tests~~ CNF-18275: CNF-18276: e2e: serial: more cache stall tests Jun 10, 2025

ffromani force-pushed the e2e-sched-stall-extra-tests-2 branch from 9bc3492 to 0825ac0 Compare June 11, 2025 16:42

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2025

ffromani added 5 commits June 11, 2025 18:54

e2e: serial: ginkgo: always initialize in BeforeEach

a4a7cae

A common ginkgo gotcha is to initialize variables on var definition. We should never do that. Always initialize in a BeforeEach block. Signed-off-by: Francesco Romani <[email protected]>

test: internal: declass logs

587a5ff

Pod event checker should be a klog message, not a By message, to reduce noise in the runs. Signed-off-by: Francesco Romani <[email protected]>

e2e: serial: cache: stall: remove conflicting labels

8c54615

remove conflicting redundant labels. The labels should be moved in the highest meaningful level. Signed-off-by: Francesco Romani <[email protected]>

e2e: serial: cache: stall: more stall tests

18408b1

Add more stall tests using targeted naked (not using jobs) pods. Signed-off-by: Francesco Romani <[email protected]>

e2e: serial: nonreg: improve logs

4a6a874

clarify logs for a better troubleshoot experience. No intended changes in behavior. Signed-off-by: Francesco Romani <[email protected]>

ffromani force-pushed the e2e-sched-stall-extra-tests-2 branch from 0825ac0 to 4a6a874 Compare June 11, 2025 16:56

rshemtov13 reviewed Jul 30, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CNF-18275: CNF-18276: e2e: serial: more cache stall tests #1378

CNF-18275: CNF-18276: e2e: serial: more cache stall tests #1378

Uh oh!

ffromani commented Jun 10, 2025

Uh oh!

openshift-ci-robot commented Jun 10, 2025

Uh oh!

openshift-ci bot commented Jun 10, 2025

Uh oh!

openshift-ci-robot commented Jun 10, 2025

Uh oh!

ffromani commented Jun 10, 2025

Uh oh!

openshift-ci-robot commented Jun 10, 2025

Uh oh!

ffromani commented Jun 11, 2025

Uh oh!

rshemtov13 Jul 30, 2025

Uh oh!

ffromani Jul 30, 2025

Uh oh!

Uh oh!

CNF-18275: CNF-18276: e2e: serial: more cache stall tests #1378

Are you sure you want to change the base?

CNF-18275: CNF-18276: e2e: serial: more cache stall tests #1378

Uh oh!

Conversation

ffromani commented Jun 10, 2025

Uh oh!

openshift-ci-robot commented Jun 10, 2025

Uh oh!

openshift-ci bot commented Jun 10, 2025

Uh oh!

openshift-ci-robot commented Jun 10, 2025

Uh oh!

ffromani commented Jun 10, 2025

Uh oh!

openshift-ci-robot commented Jun 10, 2025

Uh oh!

ffromani commented Jun 11, 2025

Uh oh!

rshemtov13 Jul 30, 2025

Choose a reason for hiding this comment

Uh oh!

ffromani Jul 30, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!