Send default tasks to unversioned queue when user data disabled #4610

dnr · 2023-07-10T22:52:48Z

What changed?
Fixes the intended behavior when user data loading is disabled by dynamic config: to drop versioned tasks and send unversioned and "default" tasks to run on the unversioned queue. This previously worked for spooled tasks but not new tasks.

Also remove frontend logs for polls when this situation occurs. (They are still logged by the matching client at Info level though.)

Why?
When the config is switched off, we want workflows and workers that don't use the versioning feature to be completely unaffected. So we should continue to match tasks without an assigned version (both "unversioned" and "default") to workers not using versioning (the "unversioned queue").

For versioned tasks, all the choices have bad tradeoffs. So far we have picked the option to drop versioned tasks, which essentially blocks versioned workflows and requires manual recovery (using the admin RPC RefreshWorkflowTasks).
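The routing rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual matching engine code: `routeWhenDisabled` and `errTaskDropped` are hypothetical names, and the sketch assumes that tasks without an assigned version carry an empty build ID, as the `buildId == ""` check in the diff suggests for "default" tasks.

```go
package main

import (
	"errors"
	"fmt"
)

// errTaskDropped is a hypothetical error standing in for "drop this task".
var errTaskDropped = errors.New("versioned task dropped: user data disabled")

// routeWhenDisabled sketches the routing decision while user data loading is
// disabled. An empty buildID is assumed to mean "no assigned version".
func routeWhenDisabled(taskQueue, buildID string) (string, error) {
	if buildID == "" {
		// Unversioned and "default" tasks keep flowing to the unversioned queue.
		return taskQueue, nil
	}
	// Versioned tasks are dropped and need manual recovery later.
	return "", fmt.Errorf("build id %q: %w", buildID, errTaskDropped)
}

func main() {
	for _, buildID := range []string{"", "v1"} {
		queue, err := routeWhenDisabled("orders", buildID)
		fmt.Printf("buildID=%q queue=%q err=%v\n", buildID, queue, err)
	}
}
```

The point of the sketch is only the asymmetry: tasks with no assigned version are unaffected by the kill switch, while versioned tasks are not delivered at all.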

How did you test it?
Updated some integration tests and unit tests, ran other integration tests with user data loading disabled.

Potential risks
Dropping versioned tasks requires awkward manual recovery.

Is hotfix candidate?

bergundy · 2023-07-11T04:40:54Z

service/matching/matching_engine.go

-		return taskQueue, userDataChanged, err
+		if errors.Is(err, errUserDataDisabled) && buildId == "" {
+			// When user data disabled, send "default" tasks to unversioned queue.
+			return taskQueue, nil, nil


I think you'll still want to return the userDataChanged channel here instead of explicitly returning nil because user data may later be enabled (when we start periodically reading this config).

Ah yeah, that's true
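A rough sketch of the suggestion, under simplifying assumptions: `redirect` below is a hypothetical stand-in for the function in the diff, `lookupErr` simulates the result of loading user data, and the channel type is assumed. Only `errUserDataDisabled`, `buildId`, `taskQueue`, and `userDataChanged` come from the diff itself.

```go
package main

import (
	"errors"
	"fmt"
)

// errUserDataDisabled mirrors the sentinel named in the diff; defining it
// here is an assumption made so the sketch compiles on its own.
var errUserDataDisabled = errors.New("user data disabled")

// redirect is a simplified, hypothetical version of the redirect step.
func redirect(taskQueue, buildID string, userDataChanged <-chan struct{}, lookupErr error) (string, <-chan struct{}, error) {
	if errors.Is(lookupErr, errUserDataDisabled) && buildID == "" {
		// Send "default" tasks to the unversioned queue, but keep the
		// channel so callers can react if user data is enabled later.
		return taskQueue, userDataChanged, nil
	}
	return taskQueue, userDataChanged, lookupErr
}

func main() {
	changed := make(chan struct{})
	queue, ch, err := redirect("orders", "", changed, errUserDataDisabled)
	fmt.Println(queue, err)

	close(changed) // simulate user data being enabled later
	select {
	case <-ch:
		fmt.Println("caller observed the user data change")
	default:
		fmt.Println("caller missed the user data change")
	}
}
```

Had the function returned a nil channel, the `select` would always take the `default` branch, because a receive on a nil channel never becomes ready.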

bergundy · 2023-07-11T04:46:44Z

tests/versioning_test.go

+		BuildID:                          "v1",
+		UseBuildIDForVersioning:          false,


Seems irrelevant

I thought "use versioning: false" would be nice to be explicit. I can remove it.

bergundy · 2023-07-11T04:54:24Z

tests/versioning_test.go

-	s.Require().ErrorAs(err, &timeoutError)
+
+	// should not run on versioned worker
+	time.Sleep(2 * time.Second)


I have a couple of issues here:

1. Relying on sleeps in tests is a recipe for flakiness, I'd rather avoid that if possible.
2. More importantly, should we treat versioned workers as unversioned when user data is disabled? If we ever turn on the kill switch, it's going to be very painful for users, requiring new worker deployments.

I mean, sure, ideally. But we're pretty far gone down the timing road in these tests. Do you have a specific suggestion for avoiding sleeps here? Note that the existing test also implicitly relied on a sleep: the 5s timeout.

Well, the current version of the PR does not send versioned tasks to unversioned anymore. So it's still going to be painful but in a different way (refreshing tasks). I'm open to ideas here but we discussed them all before and I'm not sure we're going to come to a different conclusion

I've been playing with the DLQ idea for versioned tasks and I think it's a better solution. I'll do it in a follow-up PR. This one should be merged first since it fixes an existing bug and adds more tests.

> Note that the existing test also implicitly relied on a sleep: the 5s timeout.

That's not a flaky condition though; it should deterministically fail.
I don't have a good idea, though, for an integration test that can't inspect the internal state of the system.

For (2), let's keep the decision not to break the semantics.

The timeout is just as flaky as this 2s sleep: both of them can spuriously pass, but not spuriously fail. It's just the history service noticing that 5s has passed without the workflow completing yet, vs this test noticing that 2s has passed without the workflow running yet.
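One way to picture the tradeoff discussed in this thread is a polling "never within a window" check instead of a bare sleep. The helper below is a hypothetical sketch, not code from this PR. It does not remove the timing dependence: as noted above, such a check can still spuriously pass if the event happens after the window, but it cannot spuriously fail, and it fails as soon as the unwanted condition is seen rather than after the full wait. testify, which these tests already use via `s.Require()`, has a similar `Never` assertion.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// neverWithin polls cond every tick. It returns false as soon as cond
// reports true, and true only if cond stayed false for the whole window.
func neverWithin(cond func() bool, window, tick time.Duration) bool {
	deadline := time.Now().Add(window)
	for time.Now().Before(deadline) {
		if cond() {
			return false
		}
		time.Sleep(tick)
	}
	// One last check at the end of the window.
	return !cond()
}

func main() {
	// ranOnVersionedWorker is a hypothetical flag a test worker would set.
	var ranOnVersionedWorker atomic.Bool

	ok := neverWithin(ranOnVersionedWorker.Load, 200*time.Millisecond, 10*time.Millisecond)
	fmt.Println("task stayed off the versioned worker:", ok)

	ranOnVersionedWorker.Store(true)
	ok = neverWithin(ranOnVersionedWorker.Load, 200*time.Millisecond, 10*time.Millisecond)
	fmt.Println("task stayed off the versioned worker:", ok)
}
```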

MichaelSnowden · 2023-07-12T17:45:51Z

service/frontend/workflow_handler.go

+			ctxDeadline, ok := ctx.Deadline()
+			if ok {


nit: `if ctxDeadline, ok := ctx.Deadline(); ok` so that we don't accidentally use this outside of the scope if it's not present

(reverted now, but fyi this wasn't new code, I just indented it. you might want to look with whitespace diffs off, it's easier to read)
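For reference, the scoping idiom from the nit looks roughly like this in isolation. The `remaining` and `demo` functions are hypothetical and unrelated to the handler code in this PR; only `ctx.Deadline()` and the `ctxDeadline, ok` names come from the diff.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// remaining reports how much time is left before ctx's deadline, if it has one.
func remaining(ctx context.Context) (time.Duration, bool) {
	// ctxDeadline and ok exist only inside this if statement, so the zero
	// time.Time returned when no deadline is set cannot leak out by accident.
	if ctxDeadline, ok := ctx.Deadline(); ok {
		return time.Until(ctxDeadline), true
	}
	return 0, false
}

// demo reports whether a deadline was found for a context without one and
// for a context created with a one-minute timeout.
func demo() (withoutDeadline, withDeadline bool) {
	_, withoutDeadline = remaining(context.Background())

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	_, withDeadline = remaining(ctx)
	return withoutDeadline, withDeadline
}

func main() {
	withoutDeadline, withDeadline := demo()
	fmt.Println(withoutDeadline, withDeadline)
}
```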

dnr added the release/1.21.2 label Jul 10, 2023

dnr requested a review from a team as a code owner July 10, 2023 22:52

Send default tasks to unversioned queue when user data disabled

edab52f

dnr changed the title ~~Send tasks to unversioned queue when user data disabled~~ Send default tasks to unversioned queue when user data disabled Jul 11, 2023

dnr force-pushed the ver34 branch from cfc2789 to edab52f Compare July 11, 2023 04:00

remove more noisy logs

f3034a3

bergundy reviewed Jul 11, 2023

review comments

0c67971

bergundy approved these changes Jul 12, 2023

MichaelSnowden reviewed Jul 12, 2023

dnr added 3 commits July 12, 2023 12:59

use errors.As

5760df0

review comments

88cf932

Merge branch 'master' of github.com:temporalio/temporal into ver34

67fedad

MichaelSnowden approved these changes Jul 12, 2023

dnr merged commit 1b82336 into temporalio:master Jul 12, 2023

dnr deleted the ver34 branch July 12, 2023 21:40

dnr added a commit that referenced this pull request Jul 12, 2023

Send default tasks to unversioned queue when user data disabled (#4610)

74b51d8

3 participants
