Skip to content

Conversation

@yycptt
Copy link
Member

@yycptt yycptt commented Aug 14, 2022

What changed?

  • Do not mark all activities errors as non-retryable 😞
  • Change the condition for what errors is non-retryable for workflow deletion, more specifically, workflow not ready error and context errors (like context deadline exceeded, since each new attempt get a new startToClose timeout) should be retryable.

Why?

  • Bug fix

How did you test it?

  • Tested locally
  • Updated unit test

Potential risks

Is hotfix candidate?

  • Yes.

@yycptt yycptt requested review from yiminc and yux0 August 14, 2022 07:21
@yycptt yycptt requested a review from a team as a code owner August 14, 2022 07:21
if !common.IsPersistenceTransientError(err) {

if _, ok := err.(*serviceerror.WorkflowNotReady); !ok {
logger := tagLoggerWithHistoryRequest(tagLoggerWithActivityInfo(container.Logger, activity.GetInfo(ctx)), &request)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppress logs for notReady error which is expected for deleteWorkflowExecution call. The error will still be logged if all retry fails.

err = workflow.ExecuteLocalActivity(localActCtx, deleteHistoryActivity, *request).Get(localActCtx, nil)
if err != nil {
logger.Error("deleting history failed, this means zombie histories are left", tag.Error(err))
logger.Error("deleting workflow execution failed all retires, skip workflow deletion", tag.Error(err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the workflow data stay in Db if it just skip the deletion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Archival workflow will only retry up to 5min for both uploading and deletion and then give up. This is the limitation of the existing archival design. The issue will be gone once we have a separate archival queue.

For now, user should monitor the metrics for archival delete non-retryable error and use admin wf del command to manually delete those workflows from DB.

Comment on lines 66 to 67
} else {
err = temporal.NewApplicationError(err.Error(), "", nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is default error to ApplicationError conversion. I would just return err here.

Suggested change
} else {
err = temporal.NewApplicationError(err.Error(), "", nil)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And everywhere bellow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Then no change is needed to err and now the defer looks like the following:

	defer func() {
		sw.Stop()
		if err == errUploadNonRetryable {
			scope.IncCounter(metrics.ArchiverNonRetryableErrorCount)
		}
	}()

@yycptt yycptt merged commit 4873231 into temporalio:master Aug 15, 2022
@yycptt yycptt deleted the fix-archival-del branch August 15, 2022 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants