
Conversation

@mastermanu
Member

What changed?
Adds a retry mechanism to RingPop bootstrap: if we encounter a bootstrap failure, we retry up to 5 more times before crashing the process, refreshing the bootstrap list before each retry.

We suspect (and were able to repro on onebox) that the node is unable to join a ringpop cluster if all of the supplied seed nodes are invalid.

Background: Our bootstrap logic relies on nodes in a Temporal cluster writing their Host:Ports periodically to a table. In the case of a cluster that is cold-starting, all of those written IP addresses may no longer be valid, so no node would be able to start until those heartbeats expire.

Furthermore, the node would write its own heartbeat, fail to start, immediately recycle, and potentially come back with a new IP address, meaning the heartbeat it just wrote is no longer valid. This hurts other nodes (and itself) in the same way, so the situation might never stabilize.

This fix retries refreshing the bootstrap list and joining the RingPop cluster up to 5 additional times without recycling the process. The node continues to write its heartbeats during this period, which widens the window in which this node is discoverable by other nodes (and vice versa) and ensures that our retries use the freshest bootstrap list possible.

Because this issue reproduces on onebox, we were able to write unit tests and verify locally that the retry logic works and that bootstrap can be invoked on the same ringpop object multiple times without adverse repercussions (its internal initialization code is also idempotent). We also inspected the ringpop library code to validate that 1) our understanding of the problem is correct and 2) multiple bootstrap retries would work. This has not been explicitly verified on staging, but that can be done after the merge to master given the low risk.

The risk here is quite low: this addresses a situation where the cluster has already degenerated into an unstable state. It does not affect the happy path (e.g. first-time startup, single-node cluster startup, stable cluster startup). In the worst case, this fix doesn't solve the problem and the cluster is still unhealthy and fails to start.

@CLAassistant

CLAassistant commented Jul 2, 2020

CLA assistant check
All committers have signed the CLA.

@mastermanu mastermanu linked an issue Jul 2, 2020 that may be closed by this pull request
Contributor

@shawnhathaway shawnhathaway left a comment


> The node will continue to write its heartbeats during this process

Great!

```go
res := &persistence.GetClusterMembersResponse{ActiveMembers: []*persistence.ClusterMember{seedMember}}

if firstGetClusterMemberCall {
	// The first time GetClusterMembers is invoked, we simulate returning a stale/bad heartbeat.
```
Contributor


👍
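The stale-first-heartbeat scenario that the quoted test simulates can be sketched with a small mock store. The struct shapes and names below are assumptions for illustration, not Temporal's actual persistence types:

```go
package main

import "fmt"

// Simplified stand-ins for the persistence types in the quoted test.
type ClusterMember struct{ HostPort string }
type GetClusterMembersResponse struct{ ActiveMembers []*ClusterMember }

// staleThenFreshStore mimics the scenario under test: the first
// GetClusterMembers call returns a stale heartbeat (an address nothing
// listens on), and subsequent calls return a live seed, so a retrying
// bootstrap succeeds once it refreshes the list.
type staleThenFreshStore struct{ calls int }

func (s *staleThenFreshStore) GetClusterMembers() *GetClusterMembersResponse {
	s.calls++
	member := &ClusterMember{HostPort: "127.0.0.1:7933"} // live seed
	if s.calls == 1 {
		member = &ClusterMember{HostPort: "10.255.0.9:7933"} // stale heartbeat
	}
	return &GetClusterMembersResponse{ActiveMembers: []*ClusterMember{member}}
}

func main() {
	s := &staleThenFreshStore{}
	first := s.GetClusterMembers().ActiveMembers[0].HostPort
	second := s.GetClusterMembers().ActiveMembers[0].HostPort
	fmt.Println("refreshed list differs from stale one:", first != second)
}
```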

@shawnhathaway shawnhathaway self-requested a review July 2, 2020 18:28
@mastermanu mastermanu requested a review from samarabbas July 2, 2020 19:43
```go
}

func (r *RingPop) bootstrap(
	initialBootstrapHostPorts []string,
```
Contributor


Why is initialBootstrapHostPorts passed in specifically?
I think the logic would be simpler if we just retrieved hostPorts from the DB on each attempt, including the first.

Member Author


great point. updated

Contributor


Awesome. Update looks good. Much cleaner.

@mastermanu mastermanu merged commit 1d4a36c into temporalio:master Jul 3, 2020
@shawnhathaway shawnhathaway linked an issue Jul 15, 2020 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

Support Dynamic Cluster IP Addresses in Failure Scenarios
Ringpop issues with docker swarm mode

4 participants