
Conversation

@mastermanu
Member

What changed?
Adds a retry mechanism to RingPop bootstrap: if we encounter a bootstrap failure, we retry up to 5 more times before crashing the process, refreshing the bootstrap list before each retry.

We suspect (and were able to repro on onebox) that the node is unable to join a ringpop cluster if all of the supplied seed nodes are invalid.

Background: Our bootstrap logic relies on nodes in a Temporal cluster writing their Host:Ports periodically to a table. In the case of a cluster that is cold-starting, all of those written IP addresses may no longer be valid, so no node would be able to start until those heartbeats expire.

Furthermore, the node would write its own heartbeat, fail to start, immediately recycle, and potentially come back with a new IP address, meaning the heartbeat it just wrote is no longer valid. This hurts other nodes (and itself) in the same way, so the situation might never stabilize.

This fix retries refreshing the bootstrap list and joining the RingPop cluster up to 5 additional times without recycling the process. The node continues to write its heartbeats during this period, which widens the window in which this node is discoverable by other nodes (and vice versa) and ensures that our retries use the freshest bootstrap list possible.

Because this issue reproduces on onebox, we were able to write unit tests and verify locally that the retry logic works and that bootstrap can be invoked on the same ringpop object multiple times without adverse repercussions (its internal initialization code is also idempotent). We also inspected the ringpop library code to validate that 1) our understanding of the problem is correct and 2) multiple bootstrap retries would work. This has not been explicitly verified on staging, but that can be done after the merge to master given the low risk.

The risk here is quite low: this addresses a situation where the cluster has already degenerated into an unstable state. It does not affect the happy path (e.g. first-time startup, single-node cluster startup, stable cluster startup). In the worst case, this fix doesn't solve the problem and the cluster is still unhealthy and fails to start.

@CLAassistant

CLAassistant commented Jul 2, 2020

CLA assistant check
All committers have signed the CLA.

@mastermanu mastermanu linked an issue Jul 2, 2020 that may be closed by this pull request
Contributor

@shawnhathaway shawnhathaway left a comment


> The node will continue to write its heartbeats during this process

Great!

```go
res := &persistence.GetClusterMembersResponse{ActiveMembers: []*persistence.ClusterMember{seedMember}}

if firstGetClusterMemberCall {
	// The first time GetClusterMembers is invoked, we simulate returning a stale/bad heartbeat.
```
Contributor


👍
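The stale-first-heartbeat scenario that the quoted test simulates can be sketched with a small mock store. The struct shapes and names below are assumptions for illustration, not Temporal's actual persistence types:

```go
package main

import "fmt"

// Simplified stand-ins for the persistence types in the quoted test.
type ClusterMember struct{ HostPort string }
type GetClusterMembersResponse struct{ ActiveMembers []*ClusterMember }

// staleThenFreshStore mimics the scenario under test: the first
// GetClusterMembers call returns a stale heartbeat (an address nothing
// listens on), and subsequent calls return a live seed, so a retrying
// bootstrap succeeds once it refreshes the list.
type staleThenFreshStore struct{ calls int }

func (s *staleThenFreshStore) GetClusterMembers() *GetClusterMembersResponse {
	s.calls++
	member := &ClusterMember{HostPort: "127.0.0.1:7933"} // live seed
	if s.calls == 1 {
		member = &ClusterMember{HostPort: "10.255.0.9:7933"} // stale heartbeat
	}
	return &GetClusterMembersResponse{ActiveMembers: []*ClusterMember{member}}
}

func main() {
	s := &staleThenFreshStore{}
	first := s.GetClusterMembers().ActiveMembers[0].HostPort
	second := s.GetClusterMembers().ActiveMembers[0].HostPort
	fmt.Println("refreshed list differs from stale one:", first != second)
}
```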

@shawnhathaway shawnhathaway self-requested a review July 2, 2020 18:28
@mastermanu mastermanu requested a review from samarabbas July 2, 2020 19:43
```go
}

func (r *RingPop) bootstrap(
	initialBootstrapHostPorts []string,
```
Contributor


Why is initialBootstrapHostPorts passed in specifically?
I think the logic would be simpler if we just retrieved hostPorts from the DB on each attempt, including the first.

Member Author


great point. updated

Contributor


Awesome. Update looks good. Much cleaner.

@mastermanu mastermanu merged commit 1d4a36c into temporalio:master Jul 3, 2020
@shawnhathaway shawnhathaway linked an issue Jul 15, 2020 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

Support Dynamic Cluster IP Addresses in Failure Scenarios
Ringpop issues with docker swarm mode

4 participants