[WIP]NO-JIRA: New rules about CO's conditions #2469
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -150,40 +150,45 @@ type ClusterOperatorStatusCondition struct { | |
type ClusterStatusConditionType string | ||
|
||
const ( | ||
// Available indicates that the component (operator and all configured operands) | ||
// OperatorAvailable indicates that the component (operator and all configured operands) | ||
// is functional and available in the cluster. Available=False means at least | ||
// part of the component is non-functional, and that the condition requires | ||
// immediate administrator intervention. | ||
// A component must not report unavailable during the course of a normal upgrade unless it runs on a single-node cluster. | |
OperatorAvailable ClusterStatusConditionType = "Available" | ||
|
||
// Progressing indicates that the component (operator and all configured operands) | ||
// is actively rolling out new code, propagating config changes, or otherwise | ||
// OperatorProgressing indicates that the component (operator and all configured operands) | ||
// is actively rolling out new code, propagating config changes (e.g., a version change), or otherwise | |
// moving from one steady state to another. Operators should not report | ||
// progressing when they are reconciling (without action) a previously known | ||
// state. If the observed cluster state has changed and the component is | ||
// reacting to it (scaling up for instance), Progressing should become true | ||
// since it is moving from one steady state to another. | ||
Did you want to crisp up this |
||
// A component in a cluster with fewer than 250 nodes must complete a version | |
// change within a limited period of time: 90 minutes for Machine Config Operator and 30 minutes for others. | ||
// Machine Config Operator is given more time as it needs to restart control planes. | ||
OperatorProgressing ClusterStatusConditionType = "Progressing" | ||
|
||
// Degraded indicates that the component (operator and all configured operands) | ||
// OperatorDegraded indicates that the component (operator and all configured operands) | ||
// does not match its desired state over a period of time resulting in a lower | ||
// quality of service. The period of time may vary by component, but a Degraded | ||
// state represents persistent observation of a condition. As a result, a | ||
// state represents persistent observation of a condition, and it may require | ||
// immediate administrator intervention. As a result, a | ||
|
||
// component should not oscillate in and out of Degraded state. A component may | ||
// be Available even if it's degraded. For example, a component may desire 3 | |
// running pods, but 1 pod is crash-looping. The component is Available but | ||
// Degraded because it may have a lower quality of service. A component may be | ||
// Progressing but not Degraded because the transition from one state to | ||
// another does not persist over a long enough period to report Degraded. A | ||
// component should not report Degraded during the course of a normal upgrade. | ||
// component must not report Degraded during the course of a normal upgrade unless it runs on a single-node cluster. | |
// A component may report Degraded in response to a persistent infrastructure | ||
// failure that requires eventual administrator intervention. For example, if | ||
// a control plane host is unhealthy and must be replaced. A component should | ||
// report Degraded if unexpected errors occur over a period, but the | ||
// expectation is that all unexpected errors are handled as operators mature. | ||
OperatorDegraded ClusterStatusConditionType = "Degraded" | ||
|
||
// Upgradeable indicates whether the component (operator and all configured | ||
// OperatorUpgradeable indicates whether the component (operator and all configured | ||
// operands) is safe to upgrade based on the current cluster state. When | ||
// Upgradeable is False, the cluster-version operator will prevent the | ||
// cluster from performing impacted updates unless forced. When set on | ||
|
I don't think we need a single-node-cluster exception, because while they may satisfy "at least part of the component is non-functional" during a healthy update, I don't see how they'd trip "the condition requires immediate administrator intervention" during a healthy update. I understand that there's some ambiguous space around distinguishing "this isn't great, but might resolve on its own" and "this requires immediate admin intervention", but I think it's worth cluster operators investing in work to attempt that distinction, with "doesn't trip into `Available=False` during a CI update that completed smoothly and automatically without issues" as a minimal compliance level.