Skip to content

Conversation

@dnr
Copy link
Contributor

@dnr dnr commented Jul 10, 2023

What changed?
Fixes the intended behavior when user data loading is disabled by dynamic config: to drop versioned tasks and send unversioned and "default" tasks to run on the unversioned queue. This previously worked for spooled tasks but not new tasks.

Also remove frontend logs for polls when this situation occurs. (They are still logged by the matching client at Info level though.)

Why?
When the config is switched off, we want workflows and workers not using the versioning feature to not be affected at all. So we should continue to match tasks without an assigned version (both "unversioned" and "default") to workers not using versioning (the "unversioned queue").

For versioned tasks, all the choice have various bad tradeoffs. So far we picked the option to drop versioned tasks, which essentially blocks versioned workflows and requires manual recovery (using admin rpc RefreshWorkflowTasks).

How did you test it?
Updated some integration tests and unit tests, ran other integration tests with user data loading disabled.

Potential risks
Dropping versioned tasks requires awkward manual recovery

Is hotfix candidate?

@dnr dnr requested a review from a team as a code owner July 10, 2023 22:52
@dnr dnr changed the title Send tasks to unversioned queue when user data disabled Send default tasks to unversioned queue when user data disabled Jul 11, 2023
return taskQueue, userDataChanged, err
if errors.Is(err, errUserDataDisabled) && buildId == "" {
// When user data disabled, send "default" tasks to unversioned queue.
return taskQueue, nil, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll still want to return the userDataChanged channel here instead of explicitly returning nil because user data may later be enabled (when we start periodically reading this config).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, that's true

Comment on lines 1579 to 1580
BuildID: "v1",
UseBuildIDForVersioning: false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems irrelevant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought "use versioning: false" would be nice to be explicit. I can remove

s.Require().ErrorAs(err, &timeoutError)

// should not run on versioned worker
time.Sleep(2 * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of issues here:

  1. Relying on sleeps in tests is a recipe for flakiness, I'd rather avoid that if possible.
  2. More importantly, should we treat versioned workers as unversioned when user data is disabled? If we ever turn on the kill switch, it's going to be very painful for users, requiring new worker deployments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I mean, sure, ideally. But we're pretty far gone down the timing road in these tests. Do you have a specific suggestion for avoiding sleeps here? Note that the existing test also implicitly relied on a sleep: the 5s timeout.
  2. Well, the current version of the PR does not send versioned tasks to unversioned anymore. So it's still going to be painful but in a different way (refreshing tasks). I'm open to ideas here but we discussed them all before and I'm not sure we're going to come to a different conclusion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been playing with the DLQ idea for versioned tasks and I think it's a better solution. I'll do it in a follow-up PR. This one should be merged first since it fixes an existing bug and adds more tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the existing test also implicitly relied on a sleep: the 5s timeout.

That's not a flaky condition though, it should deterministically fail.
I don't have good idea though for an integration test that can't inspect the internal state of the system.

For (2), let's keep the decision not to break the semantics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout is just as flaky as this 2s sleep: both of them can spuriously pass, but not spuriously fail. It's just the history service noticing that 5s has passed without the workflow completing yet, vs this test noticing that 2s has passed without the workflow running yet.

Comment on lines 890 to 891
ctxDeadline, ok := ctx.Deadline()
if ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if ctxDeadline, ok := ctx.Deadline(); ok so that we don't accidentally use this outside of the scope if it's not present

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(reverted now, but fyi this wasn't new code, I just indented it. you might want to look with whitespace diffs off, it's easier to read)

@dnr dnr merged commit 1b82336 into temporalio:master Jul 12, 2023
@dnr dnr deleted the ver34 branch July 12, 2023 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants