19 changes: 12 additions & 7 deletions config/v1/types_cluster_operator.go
@@ -150,40 +150,45 @@ type ClusterOperatorStatusCondition struct {
type ClusterStatusConditionType string

const (
- // Available indicates that the component (operator and all configured operands)
+ // OperatorAvailable indicates that the component (operator and all configured operands)
// is functional and available in the cluster. Available=False means at least
// part of the component is non-functional, and that the condition requires
// immediate administrator intervention.
+ // A component must not report unavailable during the course of a normal upgrade except it is a single-node cluster.
Member

I don't think we need a single-node-cluster exception, because while they may satisfy "at least part of the component is non-functional" during a healthy update, I don't see how they'd trip "the condition requires immediate administrator intervention" during a healthy update. I understand that there's some ambiguous space around distinguishing "this isn't great, but might resolve on its own" and "this requires immediate admin intervention", but I think it's worth cluster operators investing in work to attempt that distinction, with "doesn't trip into Available=False during a CI update that completed smoothly and automatically without issues" as a minimal compliance level.

OperatorAvailable ClusterStatusConditionType = "Available"

- // Progressing indicates that the component (operator and all configured operands)
- // is actively rolling out new code, propagating config changes, or otherwise
+ // OperatorProgressing indicates that the component (operator and all configured operands)
+ // is actively rolling out new code, propagating config changes (e.g, a version change), or otherwise
// moving from one steady state to another. Operators should not report
// progressing when they are reconciling (without action) a previously known
// state. If the observed cluster state has changed and the component is
// reacting to it (scaling up for instance), Progressing should become true
// since it is moving from one steady state to another.
Member

@wking wking Sep 5, 2025

Did you want to crisp up this "scaling up for instance" sentence, while you're here? Some components currently go Progressing=True when a DaemonSet launches new Pods in response to Node scaling. That's "the component is reacting to it", although it's not "the cluster operator controller is reacting" directly; it's down at the "the DaemonSet controller is reacting" level. But going Progressing=True on Node-scale-up can happen a lot in heavily-autoscaled clusters, and it makes Progressing less useful for other consumers (e.g. those who hope to use it to talk about an in-progress ClusterVersion update).

+ // A component in a cluster with less than 250 nodes must complete a version
+ // change within a limited period of time: 90 minutes for Machine Config Operator and 30 minutes for others.
+ // Machine Config Operator is given more time as it needs to restart control planes.
OperatorProgressing ClusterStatusConditionType = "Progressing"

- // Degraded indicates that the component (operator and all configured operands)
+ // OperatorDegraded indicates that the component (operator and all configured operands)
// does not match its desired state over a period of time resulting in a lower
// quality of service. The period of time may vary by component, but a Degraded
- // state represents persistent observation of a condition. As a result, a
+ // state represents persistent observation of a condition, and it may require
+ // immediate administrator intervention. As a result, a
Member

Degraded=True doesn't require immediate admin intervention. It feeds the warning ClusterOperatorDegraded alert, where timely (e.g. next business-day) response is expected.

Available!=True, in contrast, triggers the critical ClusterOperatorDown, and that's calling for a midnight-page immediate response.
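
The distinction drawn in this comment can be sketched as a mapping from condition status to alert and severity. This is an illustration only: the alert names and severities come from the comment, while the `ConditionStatus` type and the `alertFor` function are hypothetical. Returning only the most severe alert is a simplification; nothing in the thread says the two alerts are mutually exclusive.

```go
package main

import "fmt"

// ConditionStatus is a local stand-in for the True/False/Unknown condition
// values (defined here so the sketch is self-contained).
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// alertFor (hypothetical helper) returns the most severe alert implied by the
// two conditions, or empty strings when neither alert applies.
func alertFor(available, degraded ConditionStatus) (alert, severity string) {
	switch {
	case available != ConditionTrue:
		// Available!=True: critical, immediate response expected.
		return "ClusterOperatorDown", "critical"
	case degraded == ConditionTrue:
		// Degraded=True: warning, timely (e.g. next business-day) response.
		return "ClusterOperatorDegraded", "warning"
	default:
		return "", ""
	}
}

func main() {
	alert, severity := alertFor(ConditionTrue, ConditionTrue)
	fmt.Println(alert, severity)
}
```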

// component should not oscillate in and out of Degraded state. A component may
// be Available even if its degraded. For example, a component may desire 3
// running pods, but 1 pod is crash-looping. The component is Available but
// Degraded because it may have a lower quality of service. A component may be
// Progressing but not Degraded because the transition from one state to
// another does not persist over a long enough period to report Degraded. A
- // component should not report Degraded during the course of a normal upgrade.
+ // component must not report Degraded during the course of a normal upgrade except it is a single-node cluster.
// A component may report Degraded in response to a persistent infrastructure
// failure that requires eventual administrator intervention. For example, if
// a control plane host is unhealthy and must be replaced. A component should
// report Degraded if unexpected errors occur over a period, but the
// expectation is that all unexpected errors are handled as operators mature.
OperatorDegraded ClusterStatusConditionType = "Degraded"
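
The "3 desired pods, 1 crash-looping" example in the comment above can be sketched as follows. This is a rough illustration under stated assumptions, not how any particular operator computes its status: `summarize` is a hypothetical helper, treating "at least one ready pod" as Available is an assumption, and `shortfallPersisted` stands in for whatever component-specific observation period the comment refers to.

```go
package main

import "fmt"

// summarize (hypothetical helper) derives Available and Degraded from pod
// counts.
//
// Assumptions made for this sketch:
//   - the component counts as Available while at least one pod is ready;
//   - a shortfall only counts as Degraded once it has persisted for the
//     component's own observation period (shortfallPersisted).
func summarize(desired, ready int, shortfallPersisted bool) (available, degraded bool) {
	available = ready > 0
	degraded = ready < desired && shortfallPersisted
	return available, degraded
}

func main() {
	// 3 desired pods, 1 crash-looping for a sustained period.
	available, degraded := summarize(3, 2, true)
	fmt.Println("Available:", available, "Degraded:", degraded)
}
```

With a short-lived shortfall (`shortfallPersisted` false), the same counts give Available without Degraded, which matches the comment's point that a brief transition should not be reported as Degraded.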

- // Upgradeable indicates whether the component (operator and all configured
+ // OperatorUpgradeable indicates whether the component (operator and all configured
// operands) is safe to upgrade based on the current cluster state. When
// Upgradeable is False, the cluster-version operator will prevent the
// cluster from performing impacted updates unless forced. When set on