-
Notifications
You must be signed in to change notification settings - Fork 18
[release-4.19] controller: sched: conditionally enable Shared informer by default #1932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release-4.19
Are you sure you want to change the base?
Conversation
openshift/kubernetes#2391 has landed in openshift, which means that the behavior described is enabled by default and we want the scheduler to adapt to this behavior by default iff the user didn't explicitly set the informer mode in the CR. Until now, the scheduler works in a dedicated informer mode which is meant to take into account the pods in terminal state (such as pods that ran and completed) in the PFP computation to ensure it matches the PFP computed by the RTE and reported in NRT. The intended behavior which the new kubelet behavior is about ignoring such pods and accounting only for active pods, so if this behavior is enabled in the RTE-NRT, while kept the default dedicated in the scheduler there will be misalignment in the computed vs the expected PFP from scheduler's POV vs NRT's POV and a scheduler stall will happen that will never recover. In this commit we adjust the informer default value to Shared (instead of Dedicated) only if both below conditions are met: 1. the cluster version supports the fixed kubelet which is met if the cluster version is equal or greater to the known-to-be fixed OCP version. 2. the user didn't set the Spec.SchedulerInformer field in the NRS CR This modification will enable the shared mode which in turn includes only running pods (= active pods) in the PFP computation from POV allowing ultimately PFP alignment with NRT. Signed-off-by: Shereen Haj <[email protected]> (cherry picked from commit 6a56840)
syncNUMASchedulerResources() is constructing the sched status and later in the reconcile loop this status is overriding the main instance status. Concerning status conditions, the only situation where the conditions are not set is on first creation of the scheduler CR, the rest of the cases the conditions are there but they are continuously overridden with nil in syncNUMASchedulerResources(). This in turn leads to updating the conditions in r.updateStatus() whenever it is invoked regardless of the actual existing conditions. In this commit, ensure that the original current status is the base of the new one. Signed-off-by: Shereen Haj <[email protected]> (cherry picked from commit f3893f3)
So far updatinf the status was done in every reconcilation iteration even if the spec fields reflected the same values. This commits saves a copy of the initial resource status and uses it as a reference to identify if the resource requires a status update or not and act accordingly. Signed-off-by: Shereen Haj <[email protected]> (cherry picked from commit 558833b)
We want to be able to see the informer mode of the scheduler as a preparation step for having the informer default change according to the kubelet version. This commit enables reflecting the informer value (default or configured) as part of the status conditions which is less direct than reporting it in spec and is valid. Signed-off-by: Shereen Haj <[email protected]> (cherry picked from commit 09d76b9)
743ec3d
to
3024ef3
Compare
The fix has landed in OCP 4.19, make sure this is reflected to consume the new default of schedulerInformer in case it is not set by the user. Previously the version parsing relied on the assumption that the platform version is of type stable (e.i 4.20.1, 4.20.2 ..) but this is incorrect because the cluster may be deployed with different types of builds and each requires its own processing logic, which must know the first supported build for each type. Ensure that the different types are recognized and processed as required. Design notes: moved platforminfo in a separate package, still internal. Previously, it was a private controller helper. Moved into a new packge mostly because didn't want to pollute controllers with version check. Added a Properties sub-struct vs add straight into the PlatformInfo the value. This is debatable. We don't plan to add more properties, so why to bother? Thing is, I can't remember a time on which I added a specific flag or field top-level and never regret later on. At very least we have namespaced properties which is nicer. I acknowledge the tension between not overcomplicate things, not do things "just in case" (there is no future expansion on the radar) but still if we don't namespace things early on, it's harder to retrofit. So went with this approach this time, but again, is debatable. The same line of thought lead to implement the version check in a specific, non-generic way tailor made for this case, exactly because we don't see more usecases coming so generalizing felt overkill. The summarization is thus that the **internal** API was made extensible (even if we don't see extensions happening) because APIs are harder to change, so "just in case" has more merit here. The implementation, on the other hand, is simpler to change and addresses only the problem at hand. PlatformInfo is heavily overused. Should probably be ClusterInfo. Co-authored-by: Shereen Haj <[email protected]> Signed-off-by: Francesco Romani <[email protected]> (cherry picked from commit b4608c3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
minorish thing
{ | ||
description: "next Z-stream fixed, must never regress", // at time of writing | ||
plat: platform.OpenShift, | ||
ver: mustParseVersion("4.20.1"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be 4.19.11 or so
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ffromani, shajmakh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
openshift/kubernetes#2391 has landed in
openshift, which means that the behavior described is enabled by default and we want
the scheduler to adapt to this behavior by default iff the user didn't
explicitly set the informer mode in the CR.
Until now, the scheduler works in a dedicated informer mode which is meant to take into
account the pods in terminal state (such as pods that ran and completed)
in the PFP computation to ensure it matches the PFP computed by the RTE
and reported in NRT.
The intended behavior which the new kubelet behavior is about ignoring
such pods and accounting only for active pods, so if this behavior is
enabled in the RTE-NRT, while kept the default dedicated in the
scheduler there will be misalignment in the computed vs the expected PFP
from scheduler's POV vs NRT's POV and a scheduler stall will happen that
will never recover.
In this commit we adjust the informer default value to Shared (instead
of Dedicated) only if both below conditions are met:
cluster version is equal or greater to the known-to-be fixed OCP version.
This modification will enable the shared mode which in turn includes only
running pods (= active pods) in the PFP computation from POV allowing
ultimately PFP alignment with NRT.
Signed-off-by: Shereen Haj [email protected]
(cherry picked from commit 6a56840)
The above is the core of this PR and it is followed by related enhancements as described in the different commits that were cherry-picked into this PR. The commits form a complete follow-up for the core commit and that is why I chose to put them in one PR. Please review each one to obtain more details about the changes.
Original PRs:
#1315
#1715
#1734
#1915
Changes specific to 4.19
These involve modifications to 4.19 build numbers and which is derived from the fact that 4.19 is GA and the fix landed in a zstream. Thus, there is no dev-preview nor release candidate builds for already GA releases.