Skip to content

Conversation

shajmakh
Copy link
Member

@shajmakh shajmakh commented Sep 4, 2025

openshift/kubernetes#2391 has landed in
openshift, which means that the behavior described is enabled by default and we want
the scheduler to adapt to this behavior by default iff the user didn't
explicitly set the informer mode in the CR.

Until now, the scheduler works in a dedicated informer mode which is meant to take into
account the pods in terminal state (such as pods that ran and completed)
in the PFP computation to ensure it matches the PFP computed by the RTE
and reported in NRT.

The intended behavior which the new kubelet behavior is about ignoring
such pods and accounting only for active pods, so if this behavior is
enabled in the RTE-NRT, while kept the default dedicated in the
scheduler there will be misalignment in the computed vs the expected PFP
from scheduler's POV vs NRT's POV and a scheduler stall will happen that
will never recover.

In this commit we adjust the informer default value to Shared (instead
of Dedicated) only if both below conditions are met:

  1. the cluster version supports the fixed kubelet which is met if the
    cluster version is equal or greater to the known-to-be fixed OCP version.
  2. the user didn't set the Spec.SchedulerInformer field in the NRS CR
    This modification will enable the shared mode which in turn includes only
    running pods (= active pods) in the PFP computation from POV allowing
    ultimately PFP alignment with NRT.

Signed-off-by: Shereen Haj [email protected]
(cherry picked from commit 6a56840)

The above is the core of this PR and it is followed by related enhancements as described in the different commits that were cherry-picked into this PR. The commits form a complete follow-up for the core commit and that is why I chose to put them in one PR. Please review each one to obtain more details about the changes.

Original PRs:
#1315
#1715
#1734
#1915

Changes specific to 4.19

These involve modifications to 4.19 build numbers and which is derived from the fact that 4.19 is GA and the fix landed in a zstream. Thus, there is no dev-preview nor release candidate builds for already GA releases.

openshift/kubernetes#2391 has landed in
openshift, which means that the behavior described is enabled by default and we want
the scheduler to adapt to this behavior by default iff the user didn't
explicitly set the informer mode in the CR.

Until now, the scheduler works in a dedicated informer mode which is meant to take into
account the pods in terminal state (such as pods that ran and completed)
in the PFP computation to ensure it matches the PFP computed by the RTE
and reported in NRT.

The intended behavior which the new kubelet behavior is about ignoring
such pods and accounting only for active pods, so if this behavior is
enabled in the RTE-NRT, while kept the default dedicated in the
scheduler there will be misalignment in the computed vs the expected PFP
from scheduler's POV vs NRT's POV and a scheduler stall will happen that
will never recover.

In this commit we adjust the informer default value to Shared (instead
of Dedicated) only if both below conditions are met:
1. the cluster version supports the fixed kubelet which is met if the
   cluster version is equal or greater to the known-to-be fixed OCP version.
2. the user didn't set the Spec.SchedulerInformer field in the NRS CR
This modification will enable the shared mode which in turn includes only
running pods (= active pods) in the PFP computation from POV allowing
ultimately PFP alignment with NRT.

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit 6a56840)
syncNUMASchedulerResources() is constructing the sched status and later
in the reconcile loop this status is overriding the main instance status.
Concerning status conditions, the only situation where the
conditions are not set is on first creation of the scheduler CR,
the rest of the cases the conditions are there but they are continuously
overridden with nil in syncNUMASchedulerResources(). This in turn leads
to updating the conditions in r.updateStatus() whenever it is invoked
regardless of the actual existing conditions.

In this commit, ensure that the original current status is the base of
the new one.

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit f3893f3)
So far updatinf the status was done in every reconcilation iteration
even if the spec fields reflected the same values.

This commits saves a copy of the initial resource status and uses it as
a reference to identify if the resource requires a status update or not
and act accordingly.

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit 558833b)
We want to be able to see the informer mode of the scheduler as a
preparation step for having the informer default change according to the
kubelet version.
This commit enables reflecting the informer value (default or
configured) as part of the status conditions which is less direct than
reporting it in spec and is valid.

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit 09d76b9)
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 4, 2025
The fix has landed in OCP 4.19, make sure this is reflected to
consume the new default of schedulerInformer in case it is not set by
the user.

Previously the version parsing relied on the assumption that the
platform version is of type stable (e.i 4.20.1, 4.20.2 ..) but this is
incorrect because the cluster may be deployed with different types of
builds and each requires its own processing logic, which must know the
first supported build for each type. Ensure that the different types are
recognized and processed as required.

Design notes:
moved platforminfo in a separate package, still internal.
Previously, it was a private controller helper. Moved into a new packge
mostly because didn't want to pollute controllers with version check.

Added a Properties sub-struct vs add straight into the PlatformInfo
the value. This is debatable. We don't plan to add more properties,
so why to bother? Thing is, I can't remember a time on which I added
a specific flag or field top-level and never regret later on.
At very least we have namespaced properties which is nicer.
I acknowledge the tension between not overcomplicate things, not do
things "just in case" (there is no future expansion on the radar)
but still if we don't namespace things early on, it's harder to
retrofit. So went with this approach this time, but again, is debatable.

The same line of thought lead to implement the version check in a
specific, non-generic way tailor made for this case, exactly because
we don't see more usecases coming so generalizing felt overkill.

The summarization is thus that the **internal** API was made extensible
(even if we don't see extensions happening) because APIs are harder to
change, so "just in case" has more merit here. The implementation, on
the other hand, is simpler to change and addresses only the problem at
hand.

PlatformInfo is heavily overused. Should probably be ClusterInfo.

Co-authored-by: Shereen Haj <[email protected]>
Signed-off-by: Francesco Romani <[email protected]>
(cherry picked from commit b4608c3)
Copy link
Member

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

minorish thing

{
description: "next Z-stream fixed, must never regress", // at time of writing
plat: platform.OpenShift,
ver: mustParseVersion("4.20.1"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be 4.19.11 or so

Copy link
Contributor

openshift-ci bot commented Sep 5, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, shajmakh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants