[epic] Add retry limits and a new reason to the Progressing status condition that is present when retry limit is reached. #1440
Comments
Looking at …

Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between an error that's not recoverable and an error that's recoverable. This seems to fall into domain knowledge that only the specific controller we're trying to transition might actually have.

This may be naive, but what if we introduced a new condition type (e.g. `ProgressingError`)? In case of an upgrade error, our status might look like:

```yaml
- lastTransitionTime: <time>
  message: <detailed error message>
  observedGeneration: 2
  reason: Upgrade # or Rollback etc. - the action which is failing
  status: "True"
  type: ProgressingError
- lastTransitionTime: <time>
  message: "Upgrading x to version x.x.x"
  observedGeneration: 2
  reason: Retrying
  status: "True"
  type: Progressing
```

With this I think we would be able to surface the error in a dedicated condition without overloading `Progressing`.

WDYT?
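For illustration, a minimal sketch of how a reconciler could set the two conditions from the example above, assuming the standard apimachinery condition helpers; the `setUpgradeError` helper is hypothetical, and the reason/message strings are simply copied from the example status:

```go
// Sketch only: not existing operator-controller code.
package status

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setUpgradeError records the failing action in a dedicated ProgressingError
// condition while keeping Progressing=True/Retrying as the "happy" signal.
func setUpgradeError(conds *[]metav1.Condition, generation int64, err error) {
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "ProgressingError",
		Status:             metav1.ConditionTrue,
		Reason:             "Upgrade", // or Rollback etc. - the action which is failing
		Message:            err.Error(),
		ObservedGeneration: generation,
	})
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Retrying",
		Message:            "Upgrading x to version x.x.x",
		ObservedGeneration: generation,
	})
}
```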
Depends on the user :). The idea of following this pattern was so that tooling like …
I believe there is existing logic within both operator-controller and catalogd related to "terminal" (non-recoverable) and recoverable errors. If a retry limit is too tricky to implement, you could always consider a progression deadline: if you remain in a retrying state for longer than the deadline, the `Progressing` condition gets flipped, roughly as in the sketch below.
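A rough sketch of that deadline check, assuming the standard apimachinery condition helpers; the deadline value, the `ProgressDeadlineExceeded` reason, and the helper names are illustrative assumptions, not existing operator-controller API:

```go
// Sketch only: illustrates the "progression deadline" alternative.
package status

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const progressDeadline = 10 * time.Minute // hypothetical default

// exceededProgressDeadline reports whether we have been retrying for longer
// than the deadline, based on the Progressing condition's lastTransitionTime.
func exceededProgressDeadline(conds []metav1.Condition, now time.Time) bool {
	c := meta.FindStatusCondition(conds, "Progressing")
	if c == nil || c.Reason != "Retrying" {
		return false
	}
	return now.Sub(c.LastTransitionTime.Time) > progressDeadline
}

// markProgressDeadlineExceeded flips Progressing to False once the deadline passes.
func markProgressDeadlineExceeded(conds *[]metav1.Condition, generation int64) {
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionFalse,
		Reason:             "ProgressDeadlineExceeded", // hypothetical reason
		Message:            "retries did not succeed within the progress deadline",
		ObservedGeneration: generation,
	})
}
```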
Why not just use the existing …?
IIRC, if an error is not recoverable, we can immediately set Progressing=False with the error. If the error is recoverable, then we wouldn't set Progressing=False until … Here's the …
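In case it helps, here is a rough sketch of that recoverable vs. terminal split, assuming the standard apimachinery condition helpers; the `terminalError` marker type and the `Blocked` reason are illustrative placeholders, not the actual classification used in operator-controller/catalogd:

```go
// Sketch only: recoverable errors keep Progressing=True/Retrying,
// terminal errors flip Progressing to False immediately.
package status

import (
	"errors"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// terminalError marks errors that no amount of retrying will fix.
// A terminal failure would be returned as &terminalError{err}.
type terminalError struct{ error }

func setProgressingOnError(conds *[]metav1.Condition, generation int64, err error) {
	var terminal *terminalError
	if errors.As(err, &terminal) {
		// Non-recoverable: give up immediately and surface the failure.
		meta.SetStatusCondition(conds, metav1.Condition{
			Type:               "Progressing",
			Status:             metav1.ConditionFalse,
			Reason:             "Blocked", // illustrative reason for a terminal failure
			Message:            err.Error(),
			ObservedGeneration: generation,
		})
		return
	}
	// Recoverable: keep Progressing=True and note that we are retrying.
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Retrying",
		Message:            err.Error(),
		ObservedGeneration: generation,
	})
}
```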
That's interesting, I did not know this was how state signaling worked in …
Yes, I wasn't questioning the idea of moving away from overloading `Progressing`.
My thinking was that while there might be an error happening for some time, the issue causing that error might be removed externally at any point, which could allow the reconciliation to progress to completion. So would the idea here be that even though we'd set Progressing=False, conveying that this is a 'terminal' error, we'd still keep retrying and reconciling in the controller?
This is exactly how it works now, and it means Progressing is overloaded with good/bad state, so you have to look for that error in the message rather than have a clear way of knowing there's an issue.
Issues go stale after 90 days of inactivity. If there is no further activity, the issue will be closed in another 30 days.
As part of the v1 API stabilization effort, a few of us (@joelanford, @grokspawn, @LalatenduMohanty, and I) identified that it would be a better user experience if the `Progressing` condition, when set to `True`, was considered a "happy" state where generally everything is okay and we are either actively attempting to progress towards a future state OR are at the desired state and ready to progress towards any future desired states. We felt that `Retrying` is currently a mix of a "sad" and "happy" reason, but is generally more reflective of "we encountered a hiccup in progressing" as opposed to requiring user intervention.

In an effort to make the `Retrying` reason be considered "happier", it was concluded that it would be best to add some form of a retry limit that, when reached, would result in the `Progressing` condition being set to `False` with a reason along the lines of `RetryLimitExceeded`, to signal to users that the ClusterExtension/ClusterCatalog is no longer in a "happy" state.
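For concreteness, a minimal sketch of the proposed behaviour, assuming the standard apimachinery condition helpers; where the retry count is tracked, the limit value, and the helper name are assumptions, while the `Progressing` condition type and the `RetryLimitExceeded` reason come from the proposal above:

```go
// Sketch only: retries below the limit keep Progressing=True/Retrying,
// reaching the limit flips it to False/RetryLimitExceeded.
package status

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const retryLimit = 5 // hypothetical limit

func setProgressingWithRetryLimit(conds *[]metav1.Condition, generation int64, retries int, err error) {
	if retries >= retryLimit {
		meta.SetStatusCondition(conds, metav1.Condition{
			Type:               "Progressing",
			Status:             metav1.ConditionFalse,
			Reason:             "RetryLimitExceeded",
			Message:            fmt.Sprintf("giving up after %d retries: %v", retries, err),
			ObservedGeneration: generation,
		})
		return
	}
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Retrying",
		Message:            err.Error(),
		ObservedGeneration: generation,
	})
}
```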