Skip to content

Conversation

@Shivs11
Copy link
Member

@Shivs11 Shivs11 commented Jul 30, 2025

What changed?

  • When a deactivated (drained/draining) version gets rollbacked to becoming active again, the drainage information that was associated with it will get cleared.
  • This is required because when this version were to get deactivated again, we need the code to start the drainage process on it again. Right now, it was not doing that since we were not clearing the drainage information which did not start a fresh drainage tracker for this version!

NOTE: This change requires a patch to be present. This is because when this scenario is replicated in production, it will result in two new history calls (syncSummary as well as a CAN of the version workflow). This will specifically happen when the rolled-back version gets deactivated eventually, over here - note how previously, since the drainage information was not cleared, this was never executed!)

How did I verify that this requires a patch:

  • Ran the replay test script before the fix introduced in this PR and noticed:
image
  • Ran the replay test script after the fix introduced in this PR and noticed:
image

In other words, the total number of our entity workflows went up after introducing this change and thus requires a patch gate!

Why?

  • we found a bug in the versioning entity workflows and this is an effort to fix this!

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

  • yes, there is a risk of NDE's happening in production but I mitigated that risk by placing the fix against a version gate.

@Shivs11 Shivs11 requested a review from a team as a code owner July 30, 2025 16:37
@Shivs11 Shivs11 requested a review from carlydf July 30, 2025 16:39
Copy link
Contributor

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job with the repro and the fix!
I know it's annoying, but I really think that we should have solid replay tests for this patch, particularly if it's going into the OSS patch release.

I think it's worth doing the patch test ticket before merging this, so that this can be replay tested

@Shivs11 Shivs11 enabled auto-merge (squash) July 31, 2025 19:50
@Shivs11 Shivs11 merged commit 9c48ca4 into main Jul 31, 2025
89 of 91 checks passed
@Shivs11 Shivs11 deleted the ss/version-workflow-drainage-sync-bug branch July 31, 2025 20:29
yycptt pushed a commit that referenced this pull request Aug 5, 2025
- When a deactivated (drained/draining) version gets rollbacked to
becoming active again, the drainage information that was associated with
it will get cleared.
- This is required because when this version were to get deactivated
again, we need the code to start the drainage process on it again. Right
now, it was not doing that since we were not clearing the drainage
information which did not start a fresh drainage tracker for this
version!

*NOTE*: This change requires a patch to be present. This is because when
this scenario is replicated in production, it will result in two new
history calls (syncSummary as well as a CAN of the version workflow).
This will specifically happen when the rolled-back version gets
deactivated eventually, over
[here](https://github.com/temporalio/temporal/blob/ada5d0bbee45150888b63a132be8239fd5a87212/service/worker/workerdeployment/version_workflow.go#L546)
- note how previously, since the drainage information was not cleared,
[this](https://github.com/temporalio/temporal/blob/ada5d0bbee45150888b63a132be8239fd5a87212/service/worker/workerdeployment/version_workflow.go#L261)
was never executed!)

*How did I verify that this requires a patch*:

- Ran the replay test script before the fix introduced in this PR and
noticed:
<img width="793" height="79" alt="image"
src="https://github.com/user-attachments/assets/bb975d26-4ad1-4847-b477-b74c7d2f55e4"
/>

- Ran the replay test script after the fix introduced in this PR and
noticed:
<img width="911" height="99" alt="image"
src="https://github.com/user-attachments/assets/ad3ff3e1-666a-49b4-bda4-e05f3412d984"
/>

In other words, the total number of our entity workflows went up after
introducing this change and thus requires a patch gate!

- we found a bug in the versioning entity workflows and this is an
effort to fix this!

- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [x] added new functional test(s)

- yes, there is a risk of NDE's happening in production but I mitigated
that risk by placing the fix against a version gate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants