-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Send default tasks to unversioned queue when user data disabled #4610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
service/matching/matching_engine.go
Outdated
| return taskQueue, userDataChanged, err | ||
| if errors.Is(err, errUserDataDisabled) && buildId == "" { | ||
| // When user data disabled, send "default" tasks to unversioned queue. | ||
| return taskQueue, nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll still want to return the userDataChanged channel here instead of explicitly returning nil because user data may later be enabled (when we start periodically reading this config).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, that's true
tests/versioning_test.go
Outdated
| BuildID: "v1", | ||
| UseBuildIDForVersioning: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems irrelevant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought "use versioning: false" would be nice to be explicit. I can remove
| s.Require().ErrorAs(err, &timeoutError) | ||
|
|
||
| // should not run on versioned worker | ||
| time.Sleep(2 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple of issues here:
- Relying on sleeps in tests is a recipe for flakiness, I'd rather avoid that if possible.
- More importantly, should we treat versioned workers as unversioned when user data is disabled? If we ever turn on the kill switch, it's going to be very painful for users, requiring new worker deployments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I mean, sure, ideally. But we're pretty far gone down the timing road in these tests. Do you have a specific suggestion for avoiding sleeps here? Note that the existing test also implicitly relied on a sleep: the 5s timeout.
- Well, the current version of the PR does not send versioned tasks to unversioned anymore. So it's still going to be painful but in a different way (refreshing tasks). I'm open to ideas here but we discussed them all before and I'm not sure we're going to come to a different conclusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been playing with the DLQ idea for versioned tasks and I think it's a better solution. I'll do it in a follow-up PR. This one should be merged first since it fixes an existing bug and adds more tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the existing test also implicitly relied on a sleep: the 5s timeout.
That's not a flaky condition though, it should deterministically fail.
I don't have good idea though for an integration test that can't inspect the internal state of the system.
For (2), let's keep the decision not to break the semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout is just as flaky as this 2s sleep: both of them can spuriously pass, but not spuriously fail. It's just the history service noticing that 5s has passed without the workflow completing yet, vs this test noticing that 2s has passed without the workflow running yet.
service/frontend/workflow_handler.go
Outdated
| ctxDeadline, ok := ctx.Deadline() | ||
| if ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if ctxDeadline, ok := ctx.Deadline(); ok so that we don't accidentally use this outside of the scope if it's not present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(reverted now, but fyi this wasn't new code, I just indented it. you might want to look with whitespace diffs off, it's easier to read)
What changed?
Fixes the intended behavior when user data loading is disabled by dynamic config: to drop versioned tasks and send unversioned and "default" tasks to run on the unversioned queue. This previously worked for spooled tasks but not new tasks.
Also remove frontend logs for polls when this situation occurs. (They are still logged by the matching client at Info level though.)
Why?
When the config is switched off, we want workflows and workers not using the versioning feature to not be affected at all. So we should continue to match tasks without an assigned version (both "unversioned" and "default") to workers not using versioning (the "unversioned queue").
For versioned tasks, all the choice have various bad tradeoffs. So far we picked the option to drop versioned tasks, which essentially blocks versioned workflows and requires manual recovery (using admin rpc
RefreshWorkflowTasks).How did you test it?
Updated some integration tests and unit tests, ran other integration tests with user data loading disabled.
Potential risks
Dropping versioned tasks requires awkward manual recovery
Is hotfix candidate?