Make memberlist cluster rejoin dead nodes periodically #4491
/test-all
Codecov Report
@@            Coverage Diff             @@
##             main    #4491      +/-   ##
==========================================
+ Coverage   67.68%   67.93%   +0.24%
==========================================
  Files         402      402
  Lines       57253    57283      +30
==========================================
+ Hits        38754    38917     +163
+ Misses      15805    15669     -136
- Partials     2694     2697       +3
xliuxu
left a comment
The change LGTM.
I quickly went through the dead node handling in memberlist and noticed a config option called DeadNodeReclaimTime, defined as follows:
// DeadNodeReclaimTime controls the time before a dead node's name can be
// reclaimed by one with a different address or port. By default, this is 0,
// meaning nodes cannot be reclaimed this way.
Do you think we should also set this value to something non-zero? Otherwise, a 'dead' Node that recovers with a different IP will still be unable to rejoin the cluster.
b61d680 to
cc5aa4f
The patch periodically rejoins Nodes that were removed from the member list by memberlist because they were unreachable for more than 15 seconds (the GossipToTheDeadTime we are using). Without it, once there is a network downtime lasting more than 15 seconds, the agent wouldn't try to reach any other Node and would think it's the only alive Node until it's restarted.
Signed-off-by: Quan Tian <qtian@vmware.com>
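The periodic rejoin described in the commit message can be sketched roughly as follows. This is not the code from the patch: `missingNodes`, `rejoinOnce`, and the node names are hypothetical, and the `join` callback only mirrors the shape of memberlist's `Join(existing []string) (int, error)` so the sketch stays standard-library only.

```go
package main

import (
	"fmt"
	"sort"
)

// missingNodes returns the entries of expected that are absent from alive,
// sorted for stable output. Hypothetical helper, not part of memberlist.
func missingNodes(expected, alive []string) []string {
	aliveSet := make(map[string]struct{}, len(alive))
	for _, n := range alive {
		aliveSet[n] = struct{}{}
	}
	var missing []string
	for _, n := range expected {
		if _, ok := aliveSet[n]; !ok {
			missing = append(missing, n)
		}
	}
	sort.Strings(missing)
	return missing
}

// rejoinOnce performs one rejoin round: it asks join to contact every
// expected node that is not currently alive, and skips the call entirely
// when nothing is missing.
func rejoinOnce(expected, alive []string, join func([]string) (int, error)) (int, error) {
	missing := missingNodes(expected, alive)
	if len(missing) == 0 {
		return 0, nil
	}
	return join(missing)
}

func main() {
	expected := []string{"node-a", "node-b", "node-c"}
	alive := []string{"node-a"} // what the local member list still contains

	// Stand-in for a real join call; it only reports what it was asked to do.
	fakeJoin := func(addrs []string) (int, error) {
		fmt.Println("rejoining:", addrs)
		return len(addrs), nil
	}

	n, err := rejoinOnce(expected, alive, fakeJoin)
	fmt.Println(n, err)
}
```

In a long-running agent, a round like `rejoinOnce` would presumably be driven by a `time.Ticker` loop, so that a Node isolated by a network outage keeps trying to reach its peers instead of waiting for a restart.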
cc5aa4f to
ab227da
Thanks for the reminder. I tested this scenario and saw that the node was removed from memberlist after being dead for 15 seconds (the GossipToTheDeadTime we are using), after which rejoining the same Node with a different IP works. However, setting DeadNodeReclaimTime to a smaller value could let such Nodes join sooner, so I have updated it to a very small value.
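For reference, the two settings discussed here could be configured roughly like this with hashicorp/memberlist. The 15-second GossipToTheDeadTime comes from this thread; the 10 ms DeadNodeReclaimTime is only an illustrative stand-in for "a very small value", and `newClusterConfig` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/memberlist"
)

// newClusterConfig returns a memberlist config with the dead-node settings
// discussed in this thread. Hypothetical helper for illustration only.
func newClusterConfig() *memberlist.Config {
	conf := memberlist.DefaultLANConfig()
	// Keep gossiping to a dead node for 15 seconds before dropping it.
	conf.GossipToTheDeadTime = 15 * time.Second
	// Assumed value: a non-zero duration lets a dead node's name be
	// reclaimed by a node with a different address or port.
	conf.DeadNodeReclaimTime = 10 * time.Millisecond
	return conf
}

func main() {
	conf := newClusterConfig()
	fmt.Println(conf.GossipToTheDeadTime, conf.DeadNodeReclaimTime)
}
```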
/test-all