Skip to content

Conversation

hongkailiu
Copy link
Member

@hongkailiu hongkailiu commented Sep 5, 2025

This pull introduces a few new rules about Cluster Operator's conditions

  • Operators MUST not go Available=False or Degraded=True in an HA cluster during an uneventful CI upgrade. This has been exercised in CI OTA-700 and TRT-1578. It is expected to invest effort to deliver the fixes of those bugs.

  • Operators MUST complete their upgrade within 30m (with an exception of MCO within 90m) in a cluster up to 250 nodes. This formalizes the changes introduced from cluster-version-operator#1165 where CVO begins complaining (Failing=Unknown) whenever an operator takes longer to upgrade than the given time. We plan to extend the CI coverage for this as well and spot violating cases and work on their fixes with the component team.

After this pull gets in, I will use API docs to update dev-guide.

The essence of the new rules is that operators MUST not go Available=False
or Degraded=True in an HA cluster during an uneventful CI upgrade.

Those rules have applied in CI for a while [1, 2] and OCPBugs have been
filed in this area. In order to avoid CI failing, many exceptions have been
added in the tests [3, 4] as many of those bugs are still open. It is expected
to invest effort to deliver the fixes of those bugs.

[1]. https://issues.redhat.com/browse/OTA-700

[2]. https://issues.redhat.com/browse/TRT-1578

[3]. https://github.com/openshift/origin/blob/2af38a7807699b3046a73f931884152a11271d21/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L102

[4]. openshift/origin#27231
The essence of the new rule is that operators MUST complete their
upgrade within 30 minutes in a cluster up to 250 nodes in size, except
for Machine Config Operator which has 90 minutes.

This formalizes the changes introduced from cluster-version-operator#1165
where CVO begins complaining (Failing=Unknown) whenever an operator
takes longer to upgrade than the given time.

[1]. openshift/cluster-version-operator#1165
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 5, 2025
@openshift-ci-robot
Copy link

@hongkailiu: This pull request explicitly references no jira issue.

In response to this:

This pull introduces a few new rules about Cluster Operator's conditions

  • Operators MUST not go Available=False or Degraded=True in an HA cluster during an uneventful CI upgrade. This has been exercised in CI OTA-700 and TRT-1578. It is expected
    to invest effort to deliver the fixes of those bugs.

  • Operators MUST complete their upgrade within 30m (with an exception of MCO within 90m) in a cluster up to 250 nodes. This formalizes the changes introduced from cluster-version-operator#1165 where CVO begins complaining (Failing=Unknown) whenever an operator takes longer to upgrade than the given time. We plan to extend the CI coverage for this as well and spot violating cases and work on their fixes with the component team.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2025
Copy link
Contributor

openshift-ci bot commented Sep 5, 2025

Hello @hongkailiu! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 5, 2025
Copy link
Contributor

openshift-ci bot commented Sep 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu
Once this PR has been reviewed and has the lgtm label, please assign everettraven for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

// is functional and available in the cluster. Available=False means at least
// part of the component is non-functional, and that the condition requires
// immediate administrator intervention.
// A component must not report unavailable during the course of a normal upgrade except it is a single-node cluster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a single-node-cluster exception, because while they may satisfy "at least part of the component is non-functional" during a healthy update, I don't see how they'd trip "the condition requires immediate administrator intervention" during a healthy update. I understand that there's some ambiguous space around distinguishing "this isn't great, but might resolve on its own" and "this requires immediate admin intervention", but I think it's worth cluster operators investing in work to attempt that distinction, with "doesn't trip into Available=False during a CI update that completed smoothly and automatically without issues" as a minimal compliance level.

// does not match its desired state over a period of time resulting in a lower
// quality of service. The period of time may vary by component, but a Degraded
// state represents persistent observation of a condition. As a result, a
// state represents persistent observation of a condition, and it may require
// immediate administrator intervention. As a result, a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Degraded=True doesn't require immediate admin intervention. It feeds the warning ClusterOperatorDegraded alert, where timely (e.g. next business-day) response is expected.

Available!=True, in contrast, triggers the critical ClusterOperatorDown, and that's calling for a midnight-page immediate response.

@@ -158,12 +158,15 @@ const (
OperatorAvailable ClusterStatusConditionType = "Available"

// OperatorProgressing indicates that the component (operator and all configured operands)
// is actively rolling out new code, propagating config changes, or otherwise
// is actively rolling out new code, propagating config changes (e.g, a version change), or otherwise
// moving from one steady state to another. Operators should not report
// progressing when they are reconciling (without action) a previously known
// state. If the observed cluster state has changed and the component is
// reacting to it (scaling up for instance), Progressing should become true
// since it is moving from one steady state to another.
Copy link
Member

@wking wking Sep 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to crisp up this scaling up for instance sentence, while you're here? Some components currently go Progressing=True when a DaemonSet launches new Pods in response to Node scaling. That's "the component is reacting to it", although it's not "the cluster operator controller is reacting" directly, it's down at the "the DaemonSet controller is reacting" level. But going Progressing=True on Node-scale-up can happen a lot in heavily-autoscaled clusters, and it makes Progressing less useful for other consumers (e.g. those who hope to use it to talk about an in-progress ClusterVersion update).

Copy link
Contributor

openshift-ci bot commented Sep 5, 2025

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants