
[epic] Add retry limits and a new reason to the Progressing status condition that is present when retry limit is reached. #1440


Open
everettraven opened this issue Nov 8, 2024 · 5 comments
Labels
epic, lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.), v1.x (Issues related to OLMv1 features that come after 1.0)

Comments

@everettraven
Contributor

As part of the v1 API stabilization effort, a few of us (@joelanford, @grokspawn, @LalatenduMohanty, and I) identified that it would be a better user experience if the Progressing condition, when set to True, were treated as a "happy" state: generally everything is okay, and we are either actively attempting to progress towards a future state, or we are at the desired state and ready to progress towards any future desired states. We felt that the Retrying reason is currently a mix of "sad" and "happy", but is generally more reflective of "we encountered a hiccup while progressing" than of something requiring user intervention.

In an effort to make the Retrying reason read as "happier", we concluded that it would be best to add some form of retry limit that, when reached, would result in the Progressing condition being set to False with a reason along the lines of RetryLimitExceeded, signaling to users that the ClusterExtension/ClusterCatalog is no longer in a "happy" state.
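
For illustration only, a Progressing condition that has hit such a limit might look roughly like the following (the exact reason name and message wording are placeholders, not a settled design):

  - lastTransitionTime: <time>
    message: "retry limit reached; last error: <detailed error message>"
    observedGeneration: 2
    reason: RetryLimitExceeded
    status: "False"
    type: Progressing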

@azych
Contributor

azych commented Mar 6, 2025

Looking at the Progressing condition from a user perspective, I think I'd expect it to be True when reconciliation is going on and we're trying to transition to a new desired state, and False otherwise. I am not sure True should signify anything beyond that simple state.

Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between an error that's not recoverable and one that's recoverable. This seems to require domain knowledge that only the specific controller driving the transition actually has.

This may be naive, but what if we introduced a new ~ProgressingError condition?

In case of an upgrade error, our status might look like:

  - lastTransitionTime: <time>
    message: <detailed error message>
    observedGeneration: 2
    reason: Upgrade # or Rollback etc. - the action which is failing
    status: "True"
    type: ProgressingError
  - lastTransitionTime: <time>
    message: "Upgrading x to version x.x.x"
    observedGeneration: 2
    reason: Retrying
    status: "True"
    type: Progressing

With this I think we would be able to:

  • clearly communicate state in a predictable way
  • have a straightforward means of detecting that something is wrong, which is the main issue here

WDYT?

@everettraven @joelanford @grokspawn @LalatenduMohanty

@everettraven
Contributor Author

everettraven commented Mar 6, 2025

Looking at the Progressing condition from a user perspective, I think I'd expect it to be True when reconciliation is going on and we're trying to transition to a new desired state, and False otherwise. I am not sure True should signify anything beyond that simple state.

Depends on the user :). The Deployment API in Kubernetes sets Progressing to True even when the rollout is completed: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment

The idea of following this pattern was so that tooling like kubectl wait --for=Condition=Status could be used to check whether a ClusterExtension or ClusterCatalog is in a good or bad state. If you overload Progressing=False as both a happy and a sad state, it makes it harder to use something like kubectl wait ... for that evaluation, because you then have to look at the reason. So Progressing=True as a catch-all for the happy state and Progressing=False as a catch-all for the sad state enables user workflows that use a quick glance at Condition+Status to know if something is generally OK or needs more attention.
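
As a purely hypothetical example of that workflow (the resource name is a placeholder, and this assumes the happy/sad convention described above), a user or CI script could gate on the happy state with:

  kubectl wait clusterextension/<name> --for=condition=Progressing=True --timeout=5m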

Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between an error that's not recoverable and one that's recoverable. This seems to require domain knowledge that only the specific controller driving the transition actually has.

I believe there is existing logic within both operator-controller and catalogd related to "terminal" (non-recoverable) and recoverable errors. If a retry limit is too tricky to implement, you could instead consider a progression deadline: if you have been in the Progressing=True, Retrying state for a duration that exceeds the deadline, you go to Progressing=False, ProgressionDeadlineExceeded.
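
Sketching that variant in the same shape as the condition example above (again purely illustrative; only the reason name comes from this suggestion):

  - lastTransitionTime: <time>
    message: "progression deadline exceeded; last error: <detailed error message>"
    observedGeneration: 2
    reason: ProgressionDeadlineExceeded
    status: "False"
    type: Progressing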

This may be naive, but what if we introduced a new ~ProgressingError condition?

Why not just use the existing Progressing condition's message field to show the error?

@joelanford
Member

joelanford commented Mar 11, 2025

I am not sure we'd be able to clearly differentiate between an error that's not recoverable and one that's recoverable.

IIRC, if an error is not recoverable, we can immediately set Progressing=False with the error. If the error is recoverable, then we wouldn't set Progressing=False until spec.progressDeadlineSeconds expires.

Here's the deployment.spec.progressDeadlineSeconds API doc:

progressDeadlineSeconds
The maximum time in seconds for a deployment to make progress before it is
considered to be failed. The deployment controller will continue to process
failed deployments and a condition with a ProgressDeadlineExceeded reason
will be surfaced in the deployment status. Note that progress will not be
estimated during the time a deployment is paused. Defaults to 600s.
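
If we borrowed that knob, a hypothetical ClusterExtension-level analog might look like the sketch below; the field does not exist in the current API, and the name and default are copied from Deployment purely for illustration:

  spec:
    # Hypothetical field mirroring deployment.spec.progressDeadlineSeconds:
    # the maximum time to stay in Progressing=True, Retrying before the
    # controller reports Progressing=False with a deadline-exceeded reason.
    progressDeadlineSeconds: 600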

@azych
Contributor

azych commented Mar 12, 2025

Looking at the Progressing condition from a user perspective, I think I'd expect it to be True when reconciliation is going on and we're trying to transition to a new desired state, and False otherwise. I am not sure True should signify anything beyond that simple state.

Depends on the user :). The Deployment API in Kubernetes sets Progressing to True even when the rollout is completed: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment

That's interesting, I did not know this was how state signaling worked in Deployments. It's a little unintuitive to me (Progressing being "True" could mean you are ready to progress rather than actually progressing ;)), but I guess once you know that's how it works, it makes more sense.

The idea of following this pattern was so that tooling like kubectl wait --for=Condition=Status could be used to check whether a ClusterExtension or ClusterCatalog is in a good or bad state. If you overload Progressing=False as both a happy and a sad state, it makes it harder to use something like kubectl wait ... for that evaluation, because you then have to look at the reason. So Progressing=True as a catch-all for the happy state and Progressing=False as a catch-all for the sad state enables user workflows that use a quick glance at Condition+Status to know if something is generally OK or needs more attention.

Yes, I wasn't questioning the idea of moving away from overloading Progressing, just thinking about an alternative, potentially more intuitive way to approach it.

Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between an error that's not recoverable and one that's recoverable. This seems to require domain knowledge that only the specific controller driving the transition actually has.

I believe there is existing logic within both operator-controller and catalogd related to "terminal" (non-recoverable) and recoverable errors. If a retry limit is too tricky to implement, you could instead consider a progression deadline: if you have been in the Progressing=True, Retrying state for a duration that exceeds the deadline, you go to Progressing=False, ProgressionDeadlineExceeded.

My thinking was that while there might be an error happening for some time, the issue causing that error might be removed externally at any point, which could allow the reconciliation to progress to completion. So would the idea here be that even though we'd set Progressing=False, conveying that this is a 'terminal' error, we'd still keep retrying and reconciling in the controller?

This may be naive, but what if we introduced a new ~ProgressingError condition?

Why not just use the existing Progressing condition's message field to show the error?

This is exactly how it works now, and it means Progressing is overloaded with good/bad state, so you have to look for that error in the message rather than having a clear way of knowing there's an issue.
With the newly proposed condition you'd have a clear (and IMO intuitive) separation, and you'd be able to convey additional useful information, like which action you're progressing on (upgrade, installation, etc.).


Issues go stale after 90 days of inactivity. If there is no further activity, the issue will be closed in another 30 days.

github-actions bot added the lifecycle/stale label on Jun 11, 2025