
[epic] Add retry limits and a new reason to the Progressing status condition that is present when retry limit is reached. #1440


Open
everettraven opened this issue Nov 8, 2024 · 5 comments
Labels
epic, lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.), v1.x (Issues related to OLMv1 features that come after 1.0)

Comments

@everettraven
Contributor

As part of the v1 API stabilization effort, a few of us (@joelanford, @grokspawn, @LalatenduMohanty, and I) identified that it would be a better user experience if the Progressing condition, when set to True, were treated as a "happy" state: generally everything is okay, and we are either actively attempting to progress towards a future state, or we are at the desired state and ready to progress towards any future desired states. We felt that the Retrying reason is currently a mix of "sad" and "happy", but is generally more reflective of "we encountered a hiccup while progressing" than of something requiring user intervention.

In an effort to make the Retrying reason read as "happier", we concluded that it would be best to add some form of retry limit that, when reached, would result in the Progressing condition being set to False with a reason along the lines of RetryLimitExceeded, signaling to users that the ClusterExtension/ClusterCatalog is no longer in a "happy" state.
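
For illustration only, a Progressing condition that has hit such a limit might look roughly like the following (the exact reason name and message wording are placeholders, not a settled design):

  - lastTransitionTime: <time>
    message: "retry limit reached; last error: <detailed error message>"
    observedGeneration: 2
    reason: RetryLimitExceeded
    status: "False"
    type: Progressing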

@azych
Contributor

azych commented Mar 6, 2025

Looking at the Progressing condition from a user perspective, I think I'd expect it to be True when reconciliation is going on and we're trying to transition to a new desired state, and False otherwise. I am not sure True should signify anything beyond that simple state.

Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between an error that's not recoverable and one that's recoverable. This seems to require domain knowledge that only the specific controller driving the transition actually has.

This may be naive, but what if we introduced a new ~ProgressingError condition?

In case of an upgrade error, our status might look like:

  - lastTransitionTime: <time>
    message: <detailed error message>
    observedGeneration: 2
    reason: Upgrade # or Rollback etc. - the action which is failing
    status: "True"
    type: ProgressingError
  - lastTransitionTime: <time>
    message: "Upgrading x to version x.x.x"
    observedGeneration: 2
    reason: Retrying
    status: "True"
    type: Progressing

With this I think we would be able to:

  • clearly communicate state in a predictable way
  • have a straightforward means of detecting that something is wrong, which is the main issue here

WDYT?

@everettraven @joelanford @grokspawn @LalatenduMohanty

@everettraven
Contributor Author

everettraven commented Mar 6, 2025

Looking at the Progressing condition from a user perspective, I think I'd expect it to be True when reconciliation is going on and we're trying to transition to a new desired state, and False otherwise. I am not sure True should signify anything beyond that simple state.

Depends on the user :). The Deployment API in Kubernetes sets Progressing to True even when the rollout is completed: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment

The idea of following this pattern was so that tooling like kubectl wait --for=Condition=Status could be used to check whether a ClusterExtension or ClusterCatalog is in a good or bad state. If you overload Progressing=False as both a happy and a sad state, it makes it harder to use something like kubectl wait ... for that evaluation, because you then have to look at the reason. So Progressing=True as a catch-all for the happy state and Progressing=False as a catch-all for the sad state enables user workflows that use a quick glance at Condition+Status to know if something is generally OK or needs more attention.
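
As a purely hypothetical example of that workflow (the resource name is a placeholder, and this assumes the happy/sad convention described above), a user or CI script could gate on the happy state with:

  kubectl wait clusterextension/<name> --for=condition=Progressing=True --timeout=5m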

Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between an error that's not recoverable and one that's recoverable. This seems to require domain knowledge that only the specific controller driving the transition actually has.

I believe there is existing logic within both operator-controller and catalogd related to "terminal" (non-recoverable) and recoverable errors. If a retry limit is too tricky to implement, you could instead consider a progression deadline: if you have been in the Progressing=True, Retrying state for a duration that exceeds the deadline, you go to Progressing=False, ProgressionDeadlineExceeded.
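
Sketching that variant in the same shape as the condition example above (again purely illustrative; only the reason name comes from this suggestion):

  - lastTransitionTime: <time>
    message: "progression deadline exceeded; last error: <detailed error message>"
    observedGeneration: 2
    reason: ProgressionDeadlineExceeded
    status: "False"
    type: Progressing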

This may be naive, but what if we introduced a new ~ProgressingError condition?

Why not just use the existing Progressing condition's message field to show the error?

@joelanford
Member

joelanford commented Mar 11, 2025

I am not sure we'd be able to clearly differentiate between an error that's not recoverable and one that's recoverable.

IIRC, if an error is not recoverable, we can immediately set Progressing=False with the error. If the error is recoverable, then we wouldn't set Progressing=False until spec.progressDeadlineSeconds expires.

Here's the deployment.spec.progressDeadlineSeconds API doc:

progressDeadlineSeconds
The maximum time in seconds for a deployment to make progress before it is
considered to be failed. The deployment controller will continue to process
failed deployments and a condition with a ProgressDeadlineExceeded reason
will be surfaced in the deployment status. Note that progress will not be
estimated during the time a deployment is paused. Defaults to 600s.
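
If we borrowed that knob, a hypothetical ClusterExtension-level analog might look like the sketch below; the field does not exist in the current API, and the name and default are copied from Deployment purely for illustration:

  spec:
    # Hypothetical field mirroring deployment.spec.progressDeadlineSeconds:
    # the maximum time to stay in Progressing=True, Retrying before the
    # controller reports Progressing=False with a deadline-exceeded reason.
    progressDeadlineSeconds: 600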

@azych
Contributor

azych commented Mar 12, 2025

Looking at the Progressing condition from a user perspective, I think I'd expect it to be True when reconciliation is going on and we're trying to transition to a new desired state, and False otherwise. I am not sure True should signify anything beyond that simple state.

Depends on the user :). The Deployment API in Kubernetes sets Progressing to True even when the rollout is completed: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment

That's interesting, I did not know this was how state signaling worked in Deployments. It's a little unintuitive to me (Progressing being "True" could mean you are ready to progress rather than actually progressing ;)), but I guess once you know that's how it works, it makes more sense.

The idea of following this pattern was so that tooling like kubectl wait --for=Condition=Status could be used to check whether a ClusterExtension or ClusterCatalog is in a good or bad state. If you overload Progressing=False as both a happy and a sad state, it makes it harder to use something like kubectl wait ... for that evaluation, because you then have to look at the reason. So Progressing=True as a catch-all for the happy state and Progressing=False as a catch-all for the sad state enables user workflows that use a quick glance at Condition+Status to know if something is generally OK or needs more attention.

Yes, I wasn't questioning the idea of moving away from overloading Progressing, just thinking about an alternative, potentially more intuitive way to approach it.

Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between an error that's not recoverable and one that's recoverable. This seems to require domain knowledge that only the specific controller driving the transition actually has.

I believe there is existing logic within both operator-controller and catalogd related to "terminal" (non-recoverable) and recoverable errors. If a retry limit is too tricky to implement, you could instead consider a progression deadline: if you have been in the Progressing=True, Retrying state for a duration that exceeds the deadline, you go to Progressing=False, ProgressionDeadlineExceeded.

My thinking was that while there might be an error happening for some time, the issue causing that error might be removed externally at any point, which could allow the reconciliation to progress to completion. So would the idea here be that even though we'd set Progressing=False, conveying that this is a 'terminal' error, we'd still keep retrying and reconciling in the controller?

This may be naive, but what if we introduced a new ~ProgressingError condition?

Why not just use the existing Progressing condition's message field to show the error?

This is exactly how it works now, and it means Progressing is overloaded with good/bad state, so you have to look for that error in the message rather than having a clear way of knowing there's an issue.
With the newly proposed condition you'd have a clear (and IMO intuitive) separation, and you'd be able to convey additional useful information, like which action you're progressing on (upgrade, installation, etc.).


Issues go stale after 90 days of inactivity. If there is no further activity, the issue will be closed in another 30 days.

github-actions bot added the lifecycle/stale label on Jun 11, 2025