
Conversation

@samarabbas
Contributor

Moved the logic to force terminate workflow execution to
workflowExecutionContext to make sure it is triggered for all cases
when workflow execution exceeds the size limit.

Previously we were only performing this check on workflow task
completion. This could result in scenarios where the workflow would
never be force terminated.

Removed the logic to fail the workflow on command processing, as
it is redundant after moving the force termination logic to a more
generic location.

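To make the intended behavior concrete, here is a minimal sketch of the check hoisted into the execution context; all type, constant, and function names (and the limit values) are illustrative stand-ins, not the actual server internals.

```go
// Minimal sketch of the size-limit check hoisted into the execution context.
// Names and limit values are illustrative, not the temporal server's actual
// types or configuration.
package workflow

import "log"

const (
	historySizeLimitWarn   = 10 * 1024 * 1024 // bytes
	historySizeLimitError  = 50 * 1024 * 1024
	historyCountLimitWarn  = 50_000 // events
	historyCountLimitError = 200_000
)

type mutableState struct {
	historySize  int64
	historyCount int64
	running      bool
}

// enforceSizeLimit runs on every persistence update made through the execution
// context (workflow task completion, signals, timers, ...), so the limit is
// enforced in all cases, not only on workflow task completion.
func enforceSizeLimit(ms *mutableState) (forceTerminated bool, err error) {
	if ms.historySize > historySizeLimitWarn || ms.historyCount > historyCountLimitWarn {
		// Warn even for closed executions so users know they are near the limit.
		log.Printf("history size/count exceeds warn limit: size=%d count=%d",
			ms.historySize, ms.historyCount)
	}

	if ms.historySize > historySizeLimitError || ms.historyCount > historyCountLimitError {
		if !ms.running {
			return false, nil
		}
		// Only mark the termination in mutable state here; the caller's
		// subsequent update call performs the storage write that actually
		// terminates the workflow.
		ms.running = false
		return true, nil
	}
	return false, nil
}
```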
@samarabbas samarabbas force-pushed the force-terminate-workflow-execution branch from 2f7491e to 818018f on September 20, 2020 00:24
@samarabbas samarabbas force-pushed the force-terminate-workflow-execution branch from 6c3a04d to f8deac1 on September 20, 2020 17:43
@wxing1292
Contributor

LGTM

return true, nil
}

if historySize > historySizeLimitWarn || historyCount > historyCountLimitWarn {
Member

Can we also do the following:

1. Separate, independent logs for size vs. count.
2. In the log, can we also trace out the warn threshold? That just makes it a bit more useful from a debuggability point of view.

We should do the same thing for the error threshold above as well.

Member

Also, we will trace here even if the workflow execution is not running. Is that the intent, or should we exit this entire function early when the workflow is not running?

Contributor Author

Yes, it is intentional to warn even for closed executions. This is to let users know that they are running close to the size limit.
I kind of disagree on having separate logs for size and count; we include both as tags, so it's easy to filter on either.
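For example, a single structured warning can carry both dimensions plus their warn thresholds as fields and stay filterable on either one; this sketch uses the standard log/slog package with illustrative field names rather than the server's actual logging tags.

```go
package workflow

import "log/slog"

// logSizeLimitWarning emits one structured warning carrying both dimensions
// and their thresholds, so logs can be filtered on size or count independently.
// Field names are illustrative, not the server's actual tag names.
func logSizeLimitWarning(historySize, historyCount, sizeWarn, countWarn int64) {
	slog.Warn("history size/count exceeds warn limit",
		"historySize", historySize,
		"historySizeLimitWarn", sizeWarn,
		"historyCount", historyCount,
		"historyCountLimitWarn", countWarn,
	)
}
```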

return false, err
}

// Return true to caller to indicate workflow is forced terminated
Member

[More a question for my understanding] - this comment indicates that the workflow WILL be force terminated, right? All we are doing here is setting up the state, and then the actual update call will do the storage write to actually terminate it?

Contributor Author

That is correct. Updated comment.

}
}
// Signalling workflow should result in force terminating the workflow execution and returns with ResourceExhausted
// error. ResourceExhaused is retried by the client and eventually results in NotFound error returned back to the
Member

[Nit] typo - "ResourceExhaused"

Also, that seems hard to diagnose from the client side, because the client won't straight-up get a ResourceExhausted error, but a NotFound error which could be confusing. Do we have a separate item to introduce a different error type so the client explicitly knows not to retry?

Contributor Author

Having it as a retryable error is the safest thing to do right now. We can revisit this after the V1 release.
The right long-term fix is to have each caller be aware of forced termination and explicitly handle the force-terminate scenario based on the semantics of the API call. There are legitimate use cases where a retried API call may succeed after force termination, like SignalWithStart.
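For reference, a hedged sketch of what a Go SDK caller observes during such a force termination; the signal name and classification helper are placeholders, and the error types are assumed to come from go.temporal.io/api/serviceerror.

```go
package example

import (
	"context"
	"errors"

	"go.temporal.io/api/serviceerror"
	"go.temporal.io/sdk/client"
)

// signalAndClassify shows what a caller observes while a workflow is being
// force terminated for exceeding the size limit: ResourceExhausted first
// (retryable), then NotFound once the termination has been persisted.
func signalAndClassify(ctx context.Context, c client.Client, workflowID, runID string) (retryable bool, err error) {
	err = c.SignalWorkflow(ctx, workflowID, runID, "my-signal", nil)
	switch {
	case err == nil:
		return false, nil
	case errors.As(err, new(*serviceerror.ResourceExhausted)):
		// Safe to retry; the SDK's default retry policy typically does this.
		return true, err
	case errors.As(err, new(*serviceerror.NotFound)):
		// The workflow has been force terminated; further retries won't help.
		return false, err
	default:
		return false, err
	}
}
```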

@samarabbas samarabbas merged commit d572abd into temporalio:master Sep 21, 2020
@samarabbas samarabbas deleted the force-terminate-workflow-execution branch September 21, 2020 18:53