-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Force terminate workflow execution when exceeds system limit #743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Force terminate workflow execution when exceeds system limit #743
Conversation
Moved the logic to force terminate workflow execution to workflowExecutionContext to make sure it is triggered for all cases when workflow execution exceeds size limit. Previously we are only performing this check on workflow task completion. This could result in scenarios where workflow would never be forced terminated. Removed the logic to fail the workflow on command processing as it is redundant after moving force termination logic to more generic location.
2f7491e to
818018f
Compare
6c3a04d to
f8deac1
Compare
|
LGTM |
| return true, nil | ||
| } | ||
|
|
||
| if historySize > historySizeLimitWarn || historyCount > historyCountLimitWarn { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also do the following:
Separate, independent logs for size vs count
In the log, can we also trace out the warn threshold? That just makes it a bit more useful from a debuggability point of view.
We should do the same thing for error threshold above as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, we will trace here if the workflow execution is not running. Is that the intent? or should we exit this entire function early if the workflow is not running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is intentional to warn even for closed execution. This is to let users know that they are running close to size limit.
I kind of disagree on having separate logs for size and count. We include both as tags so it's easy to filter on either.
| return false, err | ||
| } | ||
|
|
||
| // Return true to caller to indicate workflow is forced terminated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[More a question for my understanding] - this comment indicates that the Workflow WILL be force terminated right? All we are doing here is setting up the state and then the actual update call will do the storage write to actually terminate it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. Updated comment.
host/sizelimit_test.go
Outdated
| } | ||
| } | ||
| // Signalling workflow should result in force terminating the workflow execution and returns with ResourceExhausted | ||
| // error. ResourceExhaused is retried by the client and eventually results in NotFound error returned back to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] typo - "ResourceExhaused"
Also, that seems hard to diagnose from the client side, because the client won't straight-up get a ResourceExhausted error, but a NotFound error which could be confusing. Do we have a separate item to introduce a different error type so the client explicitly knows not to retry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having it as a retryable error is the safest thing to do right now. We can revisit this after V1 release.
The right long term fix is to have each caller be aware of forced termination and explicitly handle force terminate scenario based on semantics of the API call. There are legitimate use cases when retried api call on force termination may succeed, like SignalWithStart.
Moved the logic to force terminate workflow execution to
workflowExecutionContext to make sure it is triggered for all cases
when workflow execution exceeds size limit.
Previously we are only performing this check on workflow task
completion. This could result in scenarios where workflow would
never be forced terminated.
Removed the logic to fail the workflow on command processing as
it is redundant after moving force termination logic to more generic
location.