Force terminate workflow execution when exceeds system limit #743

samarabbas · 2020-09-20T00:23:51Z

Moved the logic to force terminate workflow execution to
workflowExecutionContext to make sure it is triggered for all cases
when workflow execution exceeds size limit.

Previously we are only performing this check on workflow task
completion. This could result in scenarios where workflow would
never be forced terminated.

Removed the logic to fail the workflow on command processing as
it is redundant after moving force termination logic to more generic
location.

Moved the logic to force terminate workflow execution to workflowExecutionContext to make sure it is triggered for all cases when workflow execution exceeds size limit. Previously we are only performing this check on workflow task completion. This could result in scenarios where workflow would never be forced terminated. Removed the logic to fail the workflow on command processing as it is redundant after moving force termination logic to more generic location.

wxing1292 · 2020-09-20T18:53:12Z

LGTM

service/history/workflowExecutionContext.go

mastermanu · 2020-09-20T22:00:40Z

service/history/workflowExecutionContext.go

+		return true, nil
+	}
+
+	if historySize > historySizeLimitWarn || historyCount > historyCountLimitWarn {


Can we also do the following:

Separate, independent logs for size vs count
In the log, can we also trace out the warn threshold? That just makes it a bit more useful from a debuggability point of view.

We should do the same thing for error threshold above as well

also, we will trace here if the workflow execution is not running. Is that the intent? or should we exit this entire function early if the workflow is not running?

Yes it is intentional to warn even for closed execution. This is to let users know that they are running close to size limit.
I kind of disagree on having separate logs for size and count. We include both as tags so it's easy to filter on either.

mastermanu · 2020-09-20T22:07:08Z

service/history/workflowExecutionContext.go

+			return false, err
+		}
+
+		// Return true to caller to indicate workflow is forced terminated


[More a question for my understanding] - this comment indicates that the Workflow WILL be force terminated right? All we are doing here is setting up the state and then the actual update call will do the storage write to actually terminate it?

That is correct. Updated comment.

mastermanu · 2020-09-20T22:08:43Z

host/sizelimit_test.go

+		}
+	}
+	// Signalling workflow should result in force terminating the workflow execution and returns with ResourceExhausted
+	// error.  ResourceExhaused is retried by the client and eventually results in NotFound error returned back to the


[Nit] typo - "ResourceExhaused"

Also, that seems hard to diagnose from the client side, because the client won't straight-up get a ResourceExhausted error, but a NotFound error which could be confusing. Do we have a separate item to introduce a different error type so the client explicitly knows not to retry?

Having it as a retryable error is the safest thing to do right now. We can revisit this after V1 release.
The right long term fix is to have each caller be aware of forced termination and explicitly handle force terminate scenario based on semantics of the API call. There are legitimate use cases when retried api call on force termination may succeed, like SignalWithStart.

samarabbas requested review from alexshtin and mastermanu September 20, 2020 00:23

samarabbas force-pushed the force-terminate-workflow-execution branch from 2f7491e to 818018f Compare September 20, 2020 00:24

samarabbas added 2 commits September 20, 2020 10:42

Change return error to ResourceExhausted

c0db723

Merge branch 'master' into force-terminate-workflow-execution

f8deac1

samarabbas force-pushed the force-terminate-workflow-execution branch from 6c3a04d to f8deac1 Compare September 20, 2020 17:43

mastermanu approved these changes Sep 20, 2020

View reviewed changes

address cr feedback

8ecdb2f

samarabbas merged commit d572abd into temporalio:master Sep 21, 2020

samarabbas deleted the force-terminate-workflow-execution branch September 21, 2020 18:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Force terminate workflow execution when exceeds system limit #743

Force terminate workflow execution when exceeds system limit #743

Uh oh!

samarabbas commented Sep 20, 2020

Uh oh!

wxing1292 commented Sep 20, 2020

Uh oh!

Uh oh!

mastermanu Sep 20, 2020

Uh oh!

mastermanu Sep 20, 2020

Uh oh!

samarabbas Sep 21, 2020

Uh oh!

mastermanu Sep 20, 2020

Uh oh!

samarabbas Sep 21, 2020

Uh oh!

mastermanu Sep 20, 2020

Uh oh!

samarabbas Sep 21, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Force terminate workflow execution when exceeds system limit #743

Force terminate workflow execution when exceeds system limit #743

Uh oh!

Conversation

samarabbas commented Sep 20, 2020

Uh oh!

wxing1292 commented Sep 20, 2020

Uh oh!

Uh oh!

mastermanu Sep 20, 2020

Choose a reason for hiding this comment

Uh oh!

mastermanu Sep 20, 2020

Choose a reason for hiding this comment

Uh oh!

samarabbas Sep 21, 2020

Choose a reason for hiding this comment

Uh oh!

mastermanu Sep 20, 2020

Choose a reason for hiding this comment

Uh oh!

samarabbas Sep 21, 2020

Choose a reason for hiding this comment

Uh oh!

mastermanu Sep 20, 2020

Choose a reason for hiding this comment

Uh oh!

samarabbas Sep 21, 2020

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants