Support Dynamic Cluster IP Addresses in Failure Scenarios #495
Conversation
shawnhathaway left a comment
The node will continue to write its heartbeats during this process
Great!
res := &persistence.GetClusterMembersResponse{ActiveMembers: []*persistence.ClusterMember{seedMember}}

if firstGetClusterMemberCall {
	// The first time GetClusterMembers is invoked, we simulate returning a stale/bad heartbeat.
👍
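For context, here is a minimal sketch of the kind of test double being discussed in the excerpt above. The types are simplified stand-ins for the real persistence package, and the struct and field names are assumptions modeled on the diff, not the actual test code:

```go
package membership

// clusterMember and getClusterMembersResponse are simplified stand-ins for
// the real persistence types used in the diff above.
type clusterMember struct {
	HostID     string
	RPCAddress string
}

type getClusterMembersResponse struct {
	ActiveMembers []*clusterMember
}

// fakeMembershipStore is a hypothetical test double for the membership table.
type fakeMembershipStore struct {
	firstGetClusterMemberCall bool
	staleMember               *clusterMember // an address nothing listens on anymore
	seedMember                *clusterMember // a live, reachable node
}

func (f *fakeMembershipStore) GetClusterMembers() (*getClusterMembersResponse, error) {
	if f.firstGetClusterMemberCall {
		// First call: simulate a stale/bad heartbeat, forcing the initial
		// bootstrap attempt to fail.
		f.firstGetClusterMemberCall = false
		return &getClusterMembersResponse{ActiveMembers: []*clusterMember{f.staleMember}}, nil
	}
	// Later calls: return a reachable seed so a bootstrap retry can succeed.
	return &getClusterMembersResponse{ActiveMembers: []*clusterMember{f.seedMember}}, nil
}
```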
common/membership/ringpop.go (Outdated)
}

func (r *RingPop) bootstrap(
	initialBootstrapHostPorts []string,
Why is initialBootstrapHostPorts specifically passed in?
I think it makes the logic simpler if we just retrieve hostPorts from DB on each attempt including first attempt.
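For illustration, the suggested shape might look roughly like the sketch below. The function, its parameters, and the retry knobs are hypothetical placeholders for illustration, not the actual change:

```go
package membership

import (
	"context"
	"time"
)

// bootstrapWithFreshSeeds is a hypothetical stand-in for bootstrap(): no
// precomputed seed list is passed in; every attempt, including the first,
// reads host:ports fresh from the membership table.
func bootstrapWithFreshSeeds(
	ctx context.Context,
	fetchHostPorts func(context.Context) ([]string, error), // e.g. backed by GetClusterMembers
	join func([]string) error, // a single ringpop bootstrap attempt
	maxAttempts int,
	retryWait time.Duration,
) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		var hostPorts []string
		if hostPorts, err = fetchHostPorts(ctx); err == nil {
			if err = join(hostPorts); err == nil {
				return nil
			}
		}
		time.Sleep(retryWait)
	}
	return err
}
```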
great point. updated
Awesome. Update looks good. Much cleaner.
What changed?
Adds a retry mechanism to RingPop bootstrap: if bootstrap fails, we retry up to 5 more times before crashing the process, refreshing the bootstrap list before each retry.
We suspect (and were able to repro on onebox) that the node is unable to join a ringpop cluster if all of the supplied seed nodes are invalid.
Background: Our bootstrap logic relies on nodes in a Temporal cluster writing their Host:Ports periodically to a table. In the case of a cluster that is cold-starting, all of those written IP addresses may no longer be valid, so no node would be able to start until those heartbeats expire.
Furthermore, the node would write its own heartbeat, fail to start, immediately recycle, and potentially come back with a new IP address, meaning the heartbeat it just wrote is no longer valid. That stale heartbeat hurts other nodes (and the node itself) in the same way, so the situation may never stabilize.
This fix retries refreshing the bootstrap list and joining the RingPop cluster up to 5 additional times without recycling the process. The node continues to write its heartbeats during this time. This increases the window in which the node is discoverable by other nodes (and vice versa) and ensures that our retries use the freshest bootstrap list possible.
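Putting the pieces together, the overall flow described above could be sketched as follows. Every name here (upsertHeartbeat, fetchBootstrapList, tryBootstrap, the intervals and attempt count) is an illustrative assumption rather than the real Temporal or ringpop API:

```go
package membership

import (
	"context"
	"fmt"
	"time"
)

// bootstrapper captures just the operations this flow needs (assumed interface).
type bootstrapper interface {
	upsertHeartbeat(ctx context.Context) error                // write this node's Host:Port to the membership table
	fetchBootstrapList(ctx context.Context) ([]string, error) // read current members' Host:Ports
	tryBootstrap(hostPorts []string) error                    // a single ringpop bootstrap attempt
}

const (
	maxBootstrapAttempts   = 6 // 1 initial attempt + up to 5 retries
	bootstrapRetryInterval = 5 * time.Second
	heartbeatInterval      = 10 * time.Second
)

func startWithRetries(ctx context.Context, b bootstrapper) error {
	// Keep writing heartbeats while we retry, so this node stays discoverable
	// by its peers (and vice versa) for the whole bootstrap window.
	go func() {
		ticker := time.NewTicker(heartbeatInterval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				_ = b.upsertHeartbeat(ctx)
			}
		}
	}()

	var err error
	for attempt := 1; attempt <= maxBootstrapAttempts; attempt++ {
		// Refresh the bootstrap list on every attempt so retries always use
		// the freshest set of seed nodes.
		var hostPorts []string
		if hostPorts, err = b.fetchBootstrapList(ctx); err == nil {
			if err = b.tryBootstrap(hostPorts); err == nil {
				return nil
			}
		}
		time.Sleep(bootstrapRetryInterval)
	}
	// Only after exhausting every attempt do we give up and let the process crash.
	return fmt.Errorf("ringpop bootstrap failed after %d attempts: %w", maxBootstrapAttempts, err)
}
```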
Because this issue reproduces on onebox, we were able to write unit tests and verify locally that the retry logic works and that bootstrap can be invoked on the same ringpop object multiple times without repercussions (its internal initialization code is idempotent). We also inspected the ringpop library code to validate that 1) our understanding of the problem is correct and 2) multiple bootstrap retries would work. This has not been explicitly verified on staging, but that can be done after the merge to master given the low risk.
The risk here is low: this change addresses a situation where the cluster has already degenerated into an unstable state. It does not affect the happy path (e.g. first-time startup, single-node cluster startup, stable cluster startup). In the worst case, this fix doesn't solve the problem and the cluster is still unhealthy and fails to start.