[Bug] Replay after UnhandledCommand can cause main workflow body to complete before signals are handled #427

@robcao

What are you really trying to do?

I have a workflow that continues as new and also receives signals. Sometimes, when the workflow tries to continue as new while new signals are arriving for it, the input passed to the continued run does not contain all of the signals.

Describe the bug

The problem is that the main workflow function is completing before all of the signal handlers run. Admittedly, I am not certain if this is a bug in the SDK or in the workflow implementation, but if it is the latter, this is very unintuitive behavior.

The problem happens when multiple signals arrive while the workflow is sending the continue-as-new command back to the server.

This causes the workflow to replay after receiving all of the new signals, but wait conditions are checked after every individual signal handler runs.

This can cause the following sequence of events.

  • Workflow main function is blocked on a wait condition
  • First signal handler runs
  • Workflow main function is now unblocked, workflow main function runs to completion
  • Second signal handler runs
    • Because the workflow main function is already completed, the results of this handler will not be included in scenarios like Continue as New
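
The sequence above can be sketched with a toy model that does not use the Temporal SDK at all. Everything here (`InterleavingSketch`, `Run`, `checkAfterEachHandler`) is a hypothetical name made up for illustration, and the loop is only an assumption about the ordering described in this report, not the SDK's actual scheduler:

```csharp
using System;
using System.Collections.Generic;

// Toy model only: none of these types or names come from the Temporal SDK.
public static class InterleavingSketch
{
    // Returns the signals the simulated main body would pass to continue-as-new.
    public static List<string> Run(string[] incomingSignals, bool checkAfterEachHandler)
    {
        var signals = new List<string>();   // state mutated by signal handlers
        var snapshot = new List<string>();  // what the main body captures when it resumes
        bool mainCompleted = false;
        var pending = new Queue<string>(incomingSignals);

        // Stand-in for the main body waking up once "Signals.Count > 0" holds.
        void TryResumeMain()
        {
            if (!mainCompleted && signals.Count > 0)
            {
                snapshot.AddRange(signals);
                mainCompleted = true;
            }
        }

        while (pending.Count > 0)
        {
            signals.Add(pending.Dequeue()); // run one signal handler
            if (checkAfterEachHandler)
            {
                TryResumeMain();            // conditions re-checked between handlers
            }
        }

        TryResumeMain();                    // conditions checked once all handlers ran
        return snapshot;
    }
}
```

In this model, `Run(new[] { "first", "second" }, checkAfterEachHandler: true)` returns only `first`, which mirrors the lost signal, while passing `false` returns both signals.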

Minimal Reproduction

I created a repository here with the reproduction steps. There is a README file with instructions.

https://github.com/robcao/temporal-dotnet-missing-signal-sample

What appears to be happening is that after the first signal handler runs, both of the wait conditions in the main function body are re-evaluated, both evaluate to true, and so the main function body runs before the second signal handler:

await Workflow.WhenAnyAsync(Workflow.DelayAsync(TimeSpan.FromSeconds(5)), Workflow.WaitConditionAsync(() => Signals.Count > 0));
await Workflow.WaitConditionAsync(() => Workflow.AllHandlersFinished);

For ease of viewing, here is the full workflow definition:

using Microsoft.Extensions.Logging;
using Temporalio.Workflows;

[Workflow]
public class SleepThenReturn
{
	internal List<string> Signals { get; init; } = new();

	[WorkflowRun]
	public async Task<string[]> RunAsync(string[] input)
	{
		if (!string.IsNullOrWhiteSpace(Workflow.Info.ContinuedRunId))
		{
			Workflow.Logger.LogInformation("Now continuing as new, there are {count} signals.", input.Length);

			return input;
		}

		await Workflow.WhenAnyAsync(Workflow.DelayAsync(TimeSpan.FromSeconds(5)), Workflow.WaitConditionAsync(() => Signals.Count > 0));

		await Workflow.WaitConditionAsync(() => Workflow.AllHandlersFinished);

		List<string> next = new();

		foreach (string signal in Signals)
		{
			next.Add(signal);
		}

		throw Workflow.CreateContinueAsNewException<SleepThenReturn>(wf => wf.RunAsync(next.ToArray()));
	}

	[WorkflowSignal]
	public Task SendSignal(string signal)
	{
		Workflow.Logger.LogInformation("Handling signal input {signal}.", signal);
		Signals.Add(signal);
		return Task.CompletedTask;
	}
}

Environment/Versions

  • OS and processor: x64 Windows
  • Temporal Version: 1.5.0
  • Are you using Docker or Kubernetes or building Temporal from source? Running directly on Windows

Additional context

It seems like what we desire here is for handlers to be marked in progress earlier than they currently are, before the event loop runs, perhaps somewhere here: https://github.com/temporalio/sdk-dotnet/blob/main/src/Temporalio/Worker/WorkflowInstance.cs#L562
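
To make the suggestion concrete, here is a minimal sketch of the two orderings in a toy model. It does not reflect how `WorkflowInstance.cs` is actually written; `EarlyMarkingSketch`, `Run`, `markBeforeEventLoop`, and the `inProgress` counter are all hypothetical, and the counter is only an assumed stand-in for whatever backs `Workflow.AllHandlersFinished`:

```csharp
using System;
using System.Collections.Generic;

// Toy model only: these names are hypothetical and are not Temporal SDK types.
public static class EarlyMarkingSketch
{
    // Returns the signals the simulated main body would pass to continue-as-new.
    public static List<string> Run(string[] incomingSignals, bool markBeforeEventLoop)
    {
        var signals = new List<string>();
        var snapshot = new List<string>();
        bool mainCompleted = false;
        int inProgress = 0;

        // Suggested ordering: count every delivered signal as in progress up front.
        if (markBeforeEventLoop)
        {
            inProgress = incomingSignals.Length;
        }

        foreach (string signal in incomingSignals)
        {
            // Ordering described in the report: a handler only counts once it starts.
            if (!markBeforeEventLoop)
            {
                inProgress++;
            }

            signals.Add(signal); // handler body
            inProgress--;        // handler finished

            // Wait conditions are re-checked after each handler.
            bool allHandlersFinished = inProgress == 0;
            if (!mainCompleted && signals.Count > 0 && allHandlersFinished)
            {
                snapshot.AddRange(signals);
                mainCompleted = true;
            }
        }

        return snapshot;
    }
}
```

With two signals, `markBeforeEventLoop: false` lets the simulated main body complete after the first handler, while `true` keeps the all-handlers-finished check false until the second handler has also run.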

Another user noticed similar behavior to what we see in this Slack thread: https://temporalio.slack.com/archives/C012SHMPDDZ/p1738636421721339
