Skip to content

Conversation

ffromani
Copy link
Member

add more test coverage for scheduler stalls, using targeted naked (= not managed by a controller) pods.
the key trigger for scheduler stall turned out to be guaranteed pods with restartPolicy != Always

@openshift-ci-robot
Copy link

@ffromani: No Jira issue with key CNF-18725 exists in the tracker at https://issues.redhat.com/.
Once a valid jira issue is referenced in the title of this pull request, request a refresh with /jira refresh.

In response to this:

add more test coverage for scheduler stalls, using targeted naked (= not managed by a controller) pods.
the key trigger for scheduler stall turned out to be guaranteed pods with restartPolicy != Always

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from shajmakh and swatisehgal June 10, 2025 08:46
Copy link
Contributor

openshift-ci bot commented Jun 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 10, 2025
@ffromani ffromani changed the title CNF-18725: CNF-18726: e2e: serial: more cache stall tests CNF-18275: CNF-18276: e2e: serial: more cache stall tests Jun 10, 2025
@openshift-ci-robot
Copy link

@ffromani: This pull request references CNF-18275 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

add more test coverage for scheduler stalls, using targeted naked (= not managed by a controller) pods.
the key trigger for scheduler stall turned out to be guaranteed pods with restartPolicy != Always

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ffromani
Copy link
Member Author

/jira refresh

@openshift-ci-robot
Copy link

@ffromani: This pull request references CNF-18275 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

in `internal/wait` we have a couple functions with inconsistent
signature:
- ForPodListAllRunning
- ForPodListAllDeleted

the former takes and return slices of pointers to `v1.Pod`; the latter
takes a slice of `v1.Pod`. This was likely an implementation oversight.

The problem however run deeper: what's the correct signature?
Let's take a step back.
`internal/podlist` builds upon `sigs.k8s.io`'s controller-runtime,
which uses` PodList to hold `Items`, which is a slice of `v1.Pod`.
There is no reason to differ here, so for us a `podlist` should be
akin to a slice of `v1.Pod`.

Thus, functions which operate on a slice of pointers to `v1.Pod` should
use a different name, like maybe `Pods`.

Going back to the original question, and considering all the above,
the most correct API seems to be the one using pointers.
`ForPodListAllDeleted` should thus be updated to ensure the return
values of `ForPodListAllRunning` can be used directly.

Finally, both functions should be renamed to move from PodList to Pods.

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the e2e-sched-stall-extra-tests-2 branch from 9bc3492 to 0825ac0 Compare June 11, 2025 16:42
@ffromani
Copy link
Member Author

/hold

let's wait for the fix for https://issues.redhat.com/browse/OCPBUGS-56785 to be available before to move forward

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2025
ffromani added 5 commits June 11, 2025 18:54
A common ginkgo gotcha is to initialize variables on
var definition. We should never do that.
Always initialize in a BeforeEach block.

Signed-off-by: Francesco Romani <[email protected]>
Pod event checker should be a klog message, not a By message,
to reduce noise in the runs.

Signed-off-by: Francesco Romani <[email protected]>
remove conflicting redundant labels. The labels should
be moved in the highest meaningful level.

Signed-off-by: Francesco Romani <[email protected]>
Add more stall tests using targeted naked (not using jobs) pods.

Signed-off-by: Francesco Romani <[email protected]>
clarify logs for a better troubleshoot experience.
No intended changes in behavior.

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the e2e-sched-stall-extra-tests-2 branch from 0825ac0 to 4a6a874 Compare June 11, 2025 16:56
Comment on lines +327 to +406
Entry("with non-restartable pods with guaranteed pod with partially terminated app containers", context.TODO(), func(podNamespace string, idx int) *corev1.Pod {
return &corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Namespace: podNamespace,
GenerateName: "generic-pause-pod-",
Labels: map[string]string{
"test-pod-index": strconv.FormatInt(int64(idx), 10),
},
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{
{
Name: "generic-main-cnt-run-1",
Image: images.GetPauseImage(),
Command: []string{"/bin/sleep"},
Args: []string{"42d"}, // "forever" in our test timescale
Resources: corev1.ResourceRequirements{
Limits: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("100m"),
corev1.ResourceMemory: resource.MustParse("256Mi"),
},
},
},
{
Name: "generic-main-cnt-run-2",
Image: images.GetPauseImage(),
Command: []string{"/bin/sleep"},
Args: []string{"1s"},
Resources: corev1.ResourceRequirements{
Limits: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("100m"),
corev1.ResourceMemory: resource.MustParse("256Mi"),
},
},
},
},
RestartPolicy: corev1.RestartPolicyNever,
},
}
}),
Entry("with non-restartable pods with guaranteed pod with partially terminated app containers", context.TODO(), func(podNamespace string, idx int) *corev1.Pod {
return &corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Namespace: podNamespace,
GenerateName: "generic-pause-pod-",
Labels: map[string]string{
"test-pod-index": strconv.FormatInt(int64(idx), 10),
},
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{
{
Name: "generic-main-cnt-run-1",
Image: images.GetPauseImage(),
Command: []string{"/bin/sleep"},
Args: []string{"42d"}, // "forever" in our test timescale
Resources: corev1.ResourceRequirements{
Limits: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("100m"),
corev1.ResourceMemory: resource.MustParse("256Mi"),
},
},
},
{
Name: "generic-main-cnt-run-2",
Image: images.GetPauseImage(),
Command: []string{"/bin/sleep"},
Args: []string{"1s"},
Resources: corev1.ResourceRequirements{
Limits: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("100m"),
corev1.ResourceMemory: resource.MustParse("256Mi"),
},
},
},
},
RestartPolicy: corev1.RestartPolicyNever,
},
}
}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two entries are identical

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I need to rebase anyway so I'll fix with the next upload

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants