[release-4.19] controller: sched: conditionally enable Shared informer by default #1932

shajmakh · 2025-09-04T10:19:06Z

openshift/kubernetes#2391 has landed in
openshift, which means that the behavior described is enabled by default and we want
the scheduler to adapt to this behavior by default iff the user didn't
explicitly set the informer mode in the CR.

Until now, the scheduler works in a dedicated informer mode which is meant to take into
account the pods in terminal state (such as pods that ran and completed)
in the PFP computation to ensure it matches the PFP computed by the RTE
and reported in NRT.

The intended behavior which the new kubelet behavior is about ignoring
such pods and accounting only for active pods, so if this behavior is
enabled in the RTE-NRT, while kept the default dedicated in the
scheduler there will be misalignment in the computed vs the expected PFP
from scheduler's POV vs NRT's POV and a scheduler stall will happen that
will never recover.

In this commit we adjust the informer default value to Shared (instead
of Dedicated) only if both below conditions are met:

the cluster version supports the fixed kubelet which is met if the
cluster version is equal or greater to the known-to-be fixed OCP version.
the user didn't set the Spec.SchedulerInformer field in the NRS CR
This modification will enable the shared mode which in turn includes only
running pods (= active pods) in the PFP computation from POV allowing
ultimately PFP alignment with NRT.

Signed-off-by: Shereen Haj [email protected]
(cherry picked from commit 6a56840)

The above is the core of this PR and it is followed by related enhancements as described in the different commits that were cherry-picked into this PR. The commits form a complete follow-up for the core commit and that is why I chose to put them in one PR. Please review each one to obtain more details about the changes.

Original PRs:
#1315
#1715
#1734
#1915

Changes specific to 4.19

These involve modifications to 4.19 build numbers and which is derived from the fact that 4.19 is GA and the fix landed in a zstream. Thus, there is no dev-preview nor release candidate builds for already GA releases.

openshift/kubernetes#2391 has landed in openshift, which means that the behavior described is enabled by default and we want the scheduler to adapt to this behavior by default iff the user didn't explicitly set the informer mode in the CR. Until now, the scheduler works in a dedicated informer mode which is meant to take into account the pods in terminal state (such as pods that ran and completed) in the PFP computation to ensure it matches the PFP computed by the RTE and reported in NRT. The intended behavior which the new kubelet behavior is about ignoring such pods and accounting only for active pods, so if this behavior is enabled in the RTE-NRT, while kept the default dedicated in the scheduler there will be misalignment in the computed vs the expected PFP from scheduler's POV vs NRT's POV and a scheduler stall will happen that will never recover. In this commit we adjust the informer default value to Shared (instead of Dedicated) only if both below conditions are met: 1. the cluster version supports the fixed kubelet which is met if the cluster version is equal or greater to the known-to-be fixed OCP version. 2. the user didn't set the Spec.SchedulerInformer field in the NRS CR This modification will enable the shared mode which in turn includes only running pods (= active pods) in the PFP computation from POV allowing ultimately PFP alignment with NRT. Signed-off-by: Shereen Haj <[email protected]> (cherry picked from commit 6a56840)

syncNUMASchedulerResources() is constructing the sched status and later in the reconcile loop this status is overriding the main instance status. Concerning status conditions, the only situation where the conditions are not set is on first creation of the scheduler CR, the rest of the cases the conditions are there but they are continuously overridden with nil in syncNUMASchedulerResources(). This in turn leads to updating the conditions in r.updateStatus() whenever it is invoked regardless of the actual existing conditions. In this commit, ensure that the original current status is the base of the new one. Signed-off-by: Shereen Haj <[email protected]> (cherry picked from commit f3893f3)

So far updatinf the status was done in every reconcilation iteration even if the spec fields reflected the same values. This commits saves a copy of the initial resource status and uses it as a reference to identify if the resource requires a status update or not and act accordingly. Signed-off-by: Shereen Haj <[email protected]> (cherry picked from commit 558833b)

We want to be able to see the informer mode of the scheduler as a preparation step for having the informer default change according to the kubelet version. This commit enables reflecting the informer value (default or configured) as part of the status conditions which is less direct than reporting it in spec and is valid. Signed-off-by: Shereen Haj <[email protected]> (cherry picked from commit 09d76b9)

The fix has landed in OCP 4.19, make sure this is reflected to consume the new default of schedulerInformer in case it is not set by the user. Previously the version parsing relied on the assumption that the platform version is of type stable (e.i 4.20.1, 4.20.2 ..) but this is incorrect because the cluster may be deployed with different types of builds and each requires its own processing logic, which must know the first supported build for each type. Ensure that the different types are recognized and processed as required. Design notes: moved platforminfo in a separate package, still internal. Previously, it was a private controller helper. Moved into a new packge mostly because didn't want to pollute controllers with version check. Added a Properties sub-struct vs add straight into the PlatformInfo the value. This is debatable. We don't plan to add more properties, so why to bother? Thing is, I can't remember a time on which I added a specific flag or field top-level and never regret later on. At very least we have namespaced properties which is nicer. I acknowledge the tension between not overcomplicate things, not do things "just in case" (there is no future expansion on the radar) but still if we don't namespace things early on, it's harder to retrofit. So went with this approach this time, but again, is debatable. The same line of thought lead to implement the version check in a specific, non-generic way tailor made for this case, exactly because we don't see more usecases coming so generalizing felt overkill. The summarization is thus that the **internal** API was made extensible (even if we don't see extensions happening) because APIs are harder to change, so "just in case" has more merit here. The implementation, on the other hand, is simpler to change and addresses only the problem at hand. PlatformInfo is heavily overused. Should probably be ClusterInfo. Co-authored-by: Shereen Haj <[email protected]> Signed-off-by: Francesco Romani <[email protected]> (cherry picked from commit b4608c3)

ffromani

/approve

minorish thing

ffromani · 2025-09-05T11:01:59Z

internal/platforminfo/platforminfo_test.go

+		{
+			description: "next Z-stream fixed, must never regress", // at time of writing
+			plat:        platform.OpenShift,
+			ver:         mustParseVersion("4.20.1"),


this should be 4.19.11 or so

openshift-ci · 2025-09-05T11:02:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, shajmakh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ffromani,shajmakh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shajmakh added 4 commits September 4, 2025 13:15

openshift-ci bot requested review from ffromani and swatisehgal September 4, 2025 10:19

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 4, 2025

shajmakh force-pushed the sched-adapt-4.19 branch from 743ec3d to 3024ef3 Compare September 4, 2025 13:10

ffromani reviewed Sep 5, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[release-4.19] controller: sched: conditionally enable Shared informer by default #1932

[release-4.19] controller: sched: conditionally enable Shared informer by default #1932

Uh oh!

shajmakh commented Sep 4, 2025 •

edited

Loading

Uh oh!

ffromani left a comment

Uh oh!

ffromani Sep 5, 2025

Uh oh!

openshift-ci bot commented Sep 5, 2025

Uh oh!

Uh oh!

[release-4.19] controller: sched: conditionally enable Shared informer by default #1932

Are you sure you want to change the base?

[release-4.19] controller: sched: conditionally enable Shared informer by default #1932

Uh oh!

Conversation

shajmakh commented Sep 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes specific to 4.19

Uh oh!

ffromani left a comment

Choose a reason for hiding this comment

Uh oh!

ffromani Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Sep 5, 2025

Uh oh!

Uh oh!

shajmakh commented Sep 4, 2025 •

edited

Loading