-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Fix retry for wrapped failures #490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix retry for wrapped failures #490
Conversation
service/history/retry.go
Outdated
|
|
||
| func getCauseFailure(failure *failurepb.Failure) *failurepb.Failure { | ||
| for ; failure.GetCause() != nil; failure = failure.GetCause() { | ||
| // Unwrap failures till the first ApplicationFailure because only first ApplicationFailure controls retryable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to unwrap up to the first application failure. We want to get immediate cause and if it is not an application failure then stop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rethink this logic myself: we will unwrap only ChildWorkflowExectutionFailure and ActivityFailure.
| } | ||
|
|
||
| if failure.GetApplicationFailureInfo() != nil { | ||
| if failure.GetApplicationFailureInfo().GetNonRetryable() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] why the different style between line 106 (returning !failure.GetServerFailureInfo()) vs here (if true then return false)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ServerFailure doesn't have type field. Therefore retryable is controlled only by NonRetryable flag (if it is true, IsRetryable is false and vice versa). ApplicationFailure also has a type field which needs to be checked against nonRetryableTypes. So if NonRetryable is set we can return right away. If not, we need to check type.
service/history/retry.go
Outdated
| } | ||
|
|
||
| if failure.GetTimeoutFailureInfo() != nil { | ||
| if failure.GetTimeoutFailureInfo().GetTimeoutType() != enumspb.TIMEOUT_TYPE_START_TO_CLOSE && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] can we just do return TimeoutType() == START_TO_CLOSE || TimeoutType() == TIMEOUT_TYPE_HEARTBEAT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am so used to golang not having ?: operator, so I forgot that such returns are actually possible.
service/history/retry.go
Outdated
|
|
||
| func getCauseFailure(failure *failurepb.Failure) *failurepb.Failure { | ||
| for ; failure.GetCause() != nil; failure = failure.GetCause() { | ||
| // Extract cause for ChildWorkflowExecutionFailure and ActivityFailure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I understand this - the intent here is to keep going until we hit the innermost childworkflowexecutionfailure or activityfailureinfo, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is vice versa. We moving "down" the chain while failure type is child execution of activity and there is an inner cause. As soon as we got something different from child execution or activity, we stop and return it.
This is actual fix to the problem you reported. Before it goes down the last failure in the chain, which in your case, was generic go error converted to retryable ApplicationFailure.
What changed?
getCauseFailurestops at firstApplicationFailureInfo.Closes temporalio/sdk-go#175.
Why?
Non-retryable behaviour doesn't work correctly if non-retryable error has retryable error as a cause.
How did you test it?
Added bunch of unit tests to cover all possible cases.
Potential risks
No risks.