Contributor

@dnr dnr commented Jul 7, 2023

What changed?

  • If the user data fetch fails because the parent partition is running an older version, act as if it returned no data.
  • If the user data fetch fails because the parent partition has user data disabled, disable user data on this partition too.
  • Split the "load from db" and "fetch from parent partition" into separate functions for clarity.
  • Even if the LoadUserData switch is flipped off, keep the load/fetch goroutine running and checking LoadUserData periodically so that we can resume if it's flipped back on.

Why?

  • During the initial deployment of 1.21 over 1.20, matching can see Unimplemented errors if a child partition is upgraded before its parent partition.
  • Similarly, while LoadUserData is being flipped on or off, partitions of one task queue might see different values, so we should propagate the disable.

How did you test it?
existing tests, new tests

Potential risks

Is hotfix candidate?

@dnr dnr requested a review from a team as a code owner July 7, 2023 21:52
Member

@bergundy bergundy left a comment

Overall LGTM. I didn't get a chance to thoroughly review the tests, but the tests I thought of (enable/disable transitions) are there.


firstCall := true
// hasFetchedUserData is true if we have gotten a successful reply to GetTaskQueueUserData.
// It's used to control whether we do a long poll or a simple get.
Contributor

I think "long poll" is a bit confusing here, especially since it's a one-of-a-kind concept in the server. Let's just elaborate on what exactly this is. A long poll on a task queue makes sense because you're waiting for a task, but a long poll on something that seems like static data (metadata of a task queue) doesn't immediately make sense. I'd explain that we're waiting for any updates to the user data, and that we do that by sending a request with our current version that blocks until the user data on the server is updated to a higher version.

Contributor Author

"long poll" is a general concept for waiting for any sort of new data, and this does fit the pattern. It is used elsewhere in the server and referred to by that name, e.g. getting history and waiting for new events (see the code comment "// if caller decide to long poll on workflow execution", and elsewhere).

knownUserData, _, err := c.GetUserData(ctx)
if err != nil {
return err
// Start with a non-long poll after re-enabling after disable, so that we don't have to wait the
Contributor

The phrase, "non-long poll after re-enabling after disable," is pretty dense. Also, I'm not sure it's relevant. I think all we need to express here is that, if we have user data, we want to wait for changes to it; if we don't have any user data, we want to fetch the latest.

Contributor Author

Sure, that's one way to look at it. I think the only subtle part is that we may have user data, then get disabled, then get re-enabled. In that case hasFetchedUserData would be true here, so we have to explicitly set it to false. I'm explaining why we explicitly set it to false.
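
The enable, disable, re-enable sequence described in this reply can be sketched as a small per-iteration decision. Only `hasFetchedUserData` comes from the diff; the `fetchLoopState` type, the `action` values, and the method names are hypothetical, and the real loop also sleeps and rechecks the setting periodically:

```go
package main

import "fmt"

// fetchLoopState holds the flag discussed above. Everything except the
// hasFetchedUserData name is a hypothetical illustration.
type fetchLoopState struct {
	hasFetchedUserData bool
}

type action string

const (
	actionIdle      action = "idle"       // disabled: skip the fetch, check again later
	actionSimpleGet action = "simple get" // no usable data yet: fetch the latest now
	actionLongPoll  action = "long poll"  // have data: wait for a newer version
)

// next decides what one loop iteration should do.
func (s *fetchLoopState) next(loadUserData bool) action {
	if !loadUserData {
		// Reset explicitly, so that after a re-enable we start with a
		// simple get instead of waiting out a long poll.
		s.hasFetchedUserData = false
		return actionIdle
	}
	if s.hasFetchedUserData {
		return actionLongPoll
	}
	return actionSimpleGet
}

// onFetchSuccess records a successful reply from the parent.
func (s *fetchLoopState) onFetchSuccess() { s.hasFetchedUserData = true }

func main() {
	s := &fetchLoopState{}
	for _, enabled := range []bool{true, true, false, true, true} {
		a := s.next(enabled)
		if a != actionIdle {
			s.onFetchSuccess() // assume every fetch succeeds in this demo
		}
		fmt.Printf("enabled=%v -> %s\n", enabled, a)
	}
}
```

Without the reset in the disabled branch, the fourth iteration would be a long poll rather than a simple get, which is the case the code comment is meant to explain.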

initialRangeID = 1 // Id of the first range of a new task queue
stickyTaskQueueTTL = 24 * time.Hour

userDataEnabled userDataEnabledState = iota
Contributor

nit: userDataState instead of userDataEnabledState

Contributor Author

ok

Comment on lines 50 to 52
userDataEnabled userDataEnabledState = iota
userDataDisabled
userDataSpecificVersion
Contributor

@MichaelSnowden MichaelSnowden Jul 12, 2023

I think we should disambiguate these. User data could be:

  • Enabled
  • Disabled for this tqm because it manages a specific version set (so it will always be disabled)
  • Disabled due to a setting on this node or a response it received from a parent (but it could be re-enabled in the future)

Contributor Author

that's what those are. I'll add comments
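
As an illustration of what the promised comments might look like, here is a sketch of the three states with the meanings listed in this thread. It uses the `userDataState` name suggested in the earlier nit. The comment wording and the `canBeReenabled` and `String` helpers are hypothetical and not taken from the merged code:

```go
package main

import "fmt"

// userDataState describes whether user data is in use for a task queue manager.
type userDataState int

const (
	// userDataEnabled: user data is loaded or fetched and used normally.
	userDataEnabled userDataState = iota
	// userDataDisabled: disabled by a setting on this node or by a response
	// from the parent partition; it may be re-enabled in the future.
	userDataDisabled
	// userDataSpecificVersion: this manager handles a specific version set,
	// so user data is always disabled for it.
	userDataSpecificVersion
)

// canBeReenabled is a hypothetical helper: only the dynamically disabled
// state can return to enabled.
func (s userDataState) canBeReenabled() bool {
	return s == userDataDisabled
}

// String makes the states readable in logs.
func (s userDataState) String() string {
	switch s {
	case userDataEnabled:
		return "enabled"
	case userDataDisabled:
		return "disabled"
	case userDataSpecificVersion:
		return "specific-version"
	default:
		return fmt.Sprintf("userDataState(%d)", int(s))
	}
}

func main() {
	for _, s := range []userDataState{userDataEnabled, userDataDisabled, userDataSpecificVersion} {
		fmt.Printf("%s: can be re-enabled = %v\n", s, s.canBeReenabled())
	}
}
```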

3 participants