clear drainage status of a rollbacked version workflow #8119
Conversation
carlydf left a comment:
Nice job with the repro and the fix!
I know it's annoying, but I really think we should have solid replay tests for this patch, particularly since it's going into the OSS patch release.
I think it's worth doing the patch test ticket before merging, so that this change can be replay tested.
What changed?
- When a deactivated (drained/draining) version is rolled back to being active again, the drainage information associated with it is now cleared.
- This is required because if that version is later deactivated again, the code needs to start the drainage process on it afresh. Previously it did not: since we were not clearing the drainage information, a fresh drainage tracker was never started for this version!

*NOTE*: This change requires a patch to be present. When this scenario plays out in production, it results in two new history calls (a syncSummary as well as a CAN of the version workflow). This will specifically happen when the rolled-back version eventually gets deactivated, over [here](https://github.com/temporalio/temporal/blob/ada5d0bbee45150888b63a132be8239fd5a87212/service/worker/workerdeployment/version_workflow.go#L546) - note how previously, since the drainage information was not cleared, [this](https://github.com/temporalio/temporal/blob/ada5d0bbee45150888b63a132be8239fd5a87212/service/worker/workerdeployment/version_workflow.go#L261) was never executed!

*How did I verify that this requires a patch*:
- Ran the replay test script before the fix introduced in this PR and observed:
<img width="793" height="79" alt="image" src="https://github.com/user-attachments/assets/bb975d26-4ad1-4847-b477-b74c7d2f55e4" />
- Ran the replay test script after the fix introduced in this PR and observed:
<img width="911" height="99" alt="image" src="https://github.com/user-attachments/assets/ad3ff3e1-666a-49b4-bda4-e05f3412d984" />

In other words, the total number of history events in our entity workflows went up after introducing this change, and thus it requires a patch gate!

Why?
- We found a bug in the versioning entity workflows, and this PR is an effort to fix it.

How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [x] added new functional test(s)

Potential risks
- Yes, there is a risk of NDEs happening in production, but I mitigated that risk by placing the fix behind a version gate.
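The idea behind the fix can be sketched in plain Go (the repository's language). The types and function names below are hypothetical simplifications for illustration, not the actual version-workflow code; they only show why clearing the drainage information on rollback lets a later deactivation start a fresh drainage tracker:

```go
package main

import "fmt"

// DrainageInfo is a stand-in for the drainage state tracked for a
// deactivated version. (Hypothetical simplification of the real state.)
type DrainageInfo struct {
	Started bool
}

// VersionState models a version workflow's local state.
type VersionState struct {
	Active       bool
	DrainageInfo *DrainageInfo
}

// deactivate starts a drainage tracker only if none exists yet,
// mirroring the guard in the version workflow.
func (v *VersionState) deactivate() {
	v.Active = false
	if v.DrainageInfo == nil {
		v.DrainageInfo = &DrainageInfo{Started: true}
	}
}

// rollbackToActive is the fix: clearing DrainageInfo ensures a later
// deactivation starts drainage again. Without the clearing line, the
// stale tracker makes the next deactivate() a no-op for drainage.
func (v *VersionState) rollbackToActive() {
	v.Active = true
	v.DrainageInfo = nil
}

func main() {
	v := &VersionState{Active: true}
	v.deactivate()       // drainage starts
	v.rollbackToActive() // fix clears the stale drainage info
	v.deactivate()       // a fresh drainage tracker is created again
	fmt.Println(v.DrainageInfo != nil)
}
```

In the real workflow, this state change must sit behind a patch (version) gate, because clearing the field produces new history calls on replay and would otherwise cause non-determinism errors in already-running workflows.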