[WIP]NO-JIRA: New rules about CO's conditions #2469

hongkailiu · 2025-09-05T20:30:10Z

This pull request introduces a few new rules about ClusterOperator conditions.

Operators MUST NOT go Available=False or Degraded=True in an HA cluster during an uneventful CI upgrade. This has already been exercised in CI under OTA-700 and TRT-1578. We expect effort to be invested in delivering fixes for the bugs found there.
Operators MUST complete their upgrade within 30m (with an exception for MCO, which has 90m) in a cluster of up to 250 nodes. This formalizes the change introduced in cluster-version-operator#1165, where CVO begins complaining (Failing=Unknown) whenever an operator takes longer to upgrade than the given time. We plan to extend CI coverage for this as well, spot violating cases, and work on fixes with the component teams.

After this pull request merges, I will use the API docs to update the dev-guide.

The essence of the new rules is that operators MUST NOT go Available=False or Degraded=True in an HA cluster during an uneventful CI upgrade. These rules have been enforced in CI for a while [1, 2], and OCPBugs have been filed in this area. Because many of those bugs are still open, many exceptions have been added to the tests [3, 4] to keep CI from failing. We expect effort to be invested in delivering fixes for those bugs.

[1] https://issues.redhat.com/browse/OTA-700
[2] https://issues.redhat.com/browse/TRT-1578
[3] https://github.com/openshift/origin/blob/2af38a7807699b3046a73f931884152a11271d21/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L102
[4] openshift/origin#27231

The essence of the new rule is that operators MUST complete their upgrade within 30 minutes in a cluster of up to 250 nodes, except for the Machine Config Operator, which has 90 minutes. This formalizes the change introduced in cluster-version-operator#1165 [1], where CVO begins complaining (Failing=Unknown) whenever an operator takes longer to upgrade than the given time.

[1] openshift/cluster-version-operator#1165

openshift-ci-robot · 2025-09-05T20:30:14Z

@hongkailiu: This pull request explicitly references no jira issue.


openshift-ci · 2025-09-05T20:30:25Z

Hello @hongkailiu! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci · 2025-09-05T20:31:17Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu
Once this PR has been reviewed and has the lgtm label, please assign everettraven for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2025-09-05T21:24:58Z

config/v1/types_cluster_operator.go

 	// is functional and available in the cluster. Available=False means at least
 	// part of the component is non-functional, and that the condition requires
 	// immediate administrator intervention.
+	// A component must not report unavailable during the course of a normal upgrade except it is a single-node cluster.


I don't think we need a single-node-cluster exception, because while they may satisfy "at least part of the component is non-functional" during a healthy update, I don't see how they'd trip "the condition requires immediate administrator intervention" during a healthy update. I understand that there's some ambiguous space around distinguishing "this isn't great, but might resolve on its own" and "this requires immediate admin intervention", but I think it's worth cluster operators investing in work to attempt that distinction, with "doesn't trip into Available=False during a CI update that completed smoothly and automatically without issues" as a minimal compliance level.
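To make the "minimal compliance level" concrete, a sketch of such a check might scan condition samples collected during an update and list the operators that ever reported Available=False. The `Observation` type and the function name are hypothetical, and this is not the monitor-test code in openshift/origin.

```go
package main

// Observation is a hypothetical record of one condition sample taken while
// a CI update was running.
type Observation struct {
	Operator string // ClusterOperator name
	Type     string // condition type, e.g. "Available" or "Degraded"
	Status   string // "True", "False", or "Unknown"
}

// availabilityViolations returns the operators that reported Available=False
// at least once, in the order they were first seen.
func availabilityViolations(samples []Observation) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range samples {
		if s.Type == "Available" && s.Status == "False" && !seen[s.Operator] {
			seen[s.Operator] = true
			out = append(out, s.Operator)
		}
	}
	return out
}
```

An update that completed smoothly would be expected to yield an empty result, on single-node clusters as well as HA ones.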

wking · 2025-09-05T21:30:46Z

config/v1/types_cluster_operator.go

 	// does not match its desired state over a period of time resulting in a lower
 	// quality of service. The period of time may vary by component, but a Degraded
-	// state represents persistent observation of a condition. As a result, a
+	// state represents persistent observation of a condition, and it may require
+	// immediate administrator intervention. As a result, a


Degraded=True doesn't require immediate admin intervention. It feeds the warning ClusterOperatorDegraded alert, where timely (e.g. next business-day) response is expected.

Available!=True, in contrast, triggers the critical ClusterOperatorDown, and that's calling for a midnight-page immediate response.
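The mapping described above can be sketched as a small lookup. This is an illustration only: the function name is made up, and the real alerting rules also involve hold durations that are ignored here.

```go
package main

// alertFor maps a ClusterOperator condition to the alert name and severity
// described in the comment above. It returns empty strings when the
// condition would fire neither alert.
func alertFor(condType, status string) (alert, severity string) {
	switch {
	case condType == "Available" && status != "True":
		// Available=False and Available=Unknown both count as "not True".
		return "ClusterOperatorDown", "critical"
	case condType == "Degraded" && status == "True":
		return "ClusterOperatorDegraded", "warning"
	}
	return "", ""
}
```

Under this reading, documenting Degraded=True as possibly needing immediate intervention would blur the line between the warning and the critical alert.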

wking · 2025-09-05T21:34:28Z

config/v1/types_cluster_operator.go

@@ -158,12 +158,15 @@ const (
 	OperatorAvailable ClusterStatusConditionType = "Available"

 	// OperatorProgressing indicates that the component (operator and all configured operands)
-	// is actively rolling out new code, propagating config changes, or otherwise
+	// is actively rolling out new code, propagating config changes (e.g, a version change), or otherwise
 	// moving from one steady state to another. Operators should not report
 	// progressing when they are reconciling (without action) a previously known
 	// state. If the observed cluster state has changed and the component is
 	// reacting to it (scaling up for instance), Progressing should become true
 	// since it is moving from one steady state to another.


Did you want to crisp up this "scaling up for instance" sentence while you're here? Some components currently go Progressing=True when a DaemonSet launches new Pods in response to Node scaling. That is "the component is reacting to it", but it is not the cluster operator controller reacting directly; it happens down at the DaemonSet-controller level. Going Progressing=True on Node scale-up can happen a lot in heavily autoscaled clusters, and it makes Progressing less useful for other consumers (e.g. those who hope to use it to talk about an in-progress ClusterVersion update).
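One possible tightening, purely as a sketch of the distinction and not something the PR or the API currently specifies, would be to classify why operand Pods are changing and leave Node scaling out of Progressing. The `ChangeCause` type and its values are hypothetical.

```go
package main

// ChangeCause is a hypothetical label for why operand Pods are changing.
type ChangeCause int

const (
	CauseVersionChange ChangeCause = iota // new code is being rolled out
	CauseConfigChange                     // configuration is being propagated
	CauseNodeScaling                      // DaemonSet Pods are following Node scale-up or scale-down
)

// reportProgressing sketches one possible policy: report Progressing=True for
// rollouts the operator itself drives, and stay Progressing=False when the
// only change is DaemonSet Pods tracking the Node count.
func reportProgressing(cause ChangeCause) bool {
	return cause != CauseNodeScaling
}
```

Whether Node scaling should be excluded this way is exactly the open question; the sketch only shows what a crisper rule could look like.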

openshift-ci · 2025-09-05T21:39:12Z

@hongkailiu: all tests passed!


hongkailiu added 2 commits September 5, 2025 16:19

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 5, 2025

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2025

openshift-ci bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 5, 2025

openshift-ci bot requested review from everettraven and JoelSpeed September 5, 2025 20:31

wking reviewed Sep 5, 2025
