[WIP]NO-JIRA: New rules about CO's conditions #2469
base: master
Conversation
The essence of the new rules is that operators MUST NOT go Available=False or Degraded=True in an HA cluster during an uneventful CI upgrade. Those rules have applied in CI for a while [1, 2], and OCPBugs have been filed in this area. To keep CI from failing, many exceptions have been added to the tests [3, 4] while many of those bugs remain open. We expect to invest effort in delivering fixes for those bugs.
[1]. https://issues.redhat.com/browse/OTA-700
[2]. https://issues.redhat.com/browse/TRT-1578
[3]. https://github.com/openshift/origin/blob/2af38a7807699b3046a73f931884152a11271d21/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L102
[4]. openshift/origin#27231
The essence of the new rule is that operators MUST complete their upgrade within 30 minutes in a cluster of up to 250 nodes, except for the Machine Config Operator, which is allowed 90 minutes. This formalizes the change introduced in cluster-version-operator#1165 [1], where the CVO begins complaining (Failing=Unknown) whenever an operator takes longer than the allotted time to upgrade.
[1]. openshift/cluster-version-operator#1165
// is functional and available in the cluster. Available=False means at least
// part of the component is non-functional, and that the condition requires
// immediate administrator intervention.
// A component must not report unavailable during the course of a normal upgrade, except in a single-node cluster.
I don't think we need a single-node-cluster exception, because while they may satisfy "at least part of the component is non-functional" during a healthy update, I don't see how they'd trip "the condition requires immediate administrator intervention" during a healthy update. I understand that there's some ambiguous space around distinguishing "this isn't great, but might resolve on its own" and "this requires immediate admin intervention", but I think it's worth cluster operators investing in work to attempt that distinction, with "doesn't trip into Available=False during a CI update that completed smoothly and automatically without issues" as a minimal compliance level.
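The distinction drawn in this comment can be sketched as follows; the `componentState` type and `availableStatus` helper are hypothetical, not the openshift/api types:

```go
package main

import "fmt"

// componentState is an illustrative summary of an operator's health, per the
// discussion above: a transient partial outage during a smooth upgrade is not
// the same thing as a state that needs an administrator right now.
type componentState struct {
	PartiallyDown     bool // some operands non-functional, may self-recover
	NeedsIntervention bool // cannot recover without an administrator
}

// availableStatus returns the value such an operator would report for the
// Available condition: "False" is reserved for states requiring immediate
// administrator intervention, so PartiallyDown alone does not flip it.
func availableStatus(s componentState) string {
	if s.NeedsIntervention {
		return "False"
	}
	return "True"
}

func main() {
	midUpgrade := componentState{PartiallyDown: true} // expected mid-upgrade
	broken := componentState{PartiallyDown: true, NeedsIntervention: true}
	fmt.Println("mid-upgrade rollout:", availableStatus(midUpgrade))
	fmt.Println("needs admin:", availableStatus(broken))
}
```

The design point is that the partial-outage signal deliberately does not feed into Available, which is what keeps a healthy single-node update from tripping Available=False.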
// does not match its desired state over a period of time resulting in a lower
// quality of service. The period of time may vary by component, but a Degraded
// state represents persistent observation of a condition. As a result, a
// state represents persistent observation of a condition, and it may require
// immediate administrator intervention. As a result, a
Degraded=True doesn't require immediate admin intervention. It feeds the warning ClusterOperatorDegraded alert, where timely (e.g. next-business-day) response is expected. Available!=True, in contrast, triggers the critical ClusterOperatorDown alert, and that's calling for a midnight-page immediate response.
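The condition-to-alert mapping described in this comment could be sketched like so; ClusterOperatorDown and ClusterOperatorDegraded are the real alert names, but the helper itself is illustrative and omits the `for`-duration windows the real alerting rules apply:

```go
package main

import "fmt"

// alertFor maps an operator's Available and Degraded condition statuses to
// the alert they would eventually feed, per the discussion above. Illustrative
// sketch only: real rules also require the condition to persist for a window.
func alertFor(available, degraded string) string {
	if available != "True" {
		// Midnight-page territory: immediate response expected.
		return "ClusterOperatorDown (critical)"
	}
	if degraded == "True" {
		// Timely, e.g. next-business-day, response expected.
		return "ClusterOperatorDegraded (warning)"
	}
	return "none"
}

func main() {
	fmt.Println(alertFor("False", "False")) // unavailable wins: critical
	fmt.Println(alertFor("True", "True"))   // degraded only: warning
	fmt.Println(alertFor("True", "False"))  // healthy: no alert
}
```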
@@ -158,12 +158,15 @@ const (
OperatorAvailable ClusterStatusConditionType = "Available"

// OperatorProgressing indicates that the component (operator and all configured operands)
// is actively rolling out new code, propagating config changes, or otherwise
// is actively rolling out new code, propagating config changes (e.g., a version change), or otherwise
// moving from one steady state to another. Operators should not report
// progressing when they are reconciling (without action) a previously known
// state. If the observed cluster state has changed and the component is
// reacting to it (scaling up for instance), Progressing should become true
// since it is moving from one steady state to another.
Did you want to crisp up this "scaling up for instance" sentence while you're here? Some components currently go Progressing=True when a DaemonSet launches new Pods in response to Node scaling. That's "the component is reacting to it", although it's not "the cluster operator controller is reacting" directly; it's down at the "the DaemonSet controller is reacting" level. But going Progressing=True on Node scale-up can happen a lot in heavily autoscaled clusters, and it makes Progressing less useful for other consumers (e.g. those who hope to use it to talk about an in-progress ClusterVersion update).
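One way to read the stricter interpretation in this comment is that Progressing should key off transitions between desired states (a version or config change) rather than off routine reconciliation. A minimal sketch, with entirely hypothetical field names and example version strings:

```go
package main

import "fmt"

// observed is an illustrative snapshot of what an operator is converging on.
// Node scale-up changes neither field below, so under this reading it would
// not flip Progressing, addressing the autoscaling concern raised above.
type observed struct {
	DesiredVersion    string
	CurrentVersion    string
	DesiredGeneration int64 // config generation the admin asked for
	SyncedGeneration  int64 // config generation last fully rolled out
}

// progressing reports true only while moving between steady states: a version
// rollout or an unsynced config change, not steady-state reconciliation.
func progressing(o observed) bool {
	return o.DesiredVersion != o.CurrentVersion || o.DesiredGeneration != o.SyncedGeneration
}

func main() {
	steady := observed{DesiredVersion: "4.17.1", CurrentVersion: "4.17.1", DesiredGeneration: 7, SyncedGeneration: 7}
	updating := observed{DesiredVersion: "4.17.2", CurrentVersion: "4.17.1", DesiredGeneration: 7, SyncedGeneration: 7}
	fmt.Println("steady-state reconcile:", progressing(steady))
	fmt.Println("version rollout:", progressing(updating))
}
```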
This pull introduces a few new rules about Cluster Operator's conditions:

- Operators MUST NOT go Available=False or Degraded=True in an HA cluster during an uneventful CI upgrade. This has been exercised in CI (OTA-700 and TRT-1578). We expect to invest effort in delivering fixes for the bugs filed in this area.
- Operators MUST complete their upgrade within 30m (with an exception for MCO, which gets 90m) in a cluster of up to 250 nodes. This formalizes the change introduced in cluster-version-operator#1165, where the CVO begins complaining (Failing=Unknown) whenever an operator takes longer than the allotted time to upgrade. We plan to extend CI coverage here as well, spot violating cases, and work on fixes with the component teams.

After this pull gets in, I will use the API docs to update dev-guide.