Make memberlist cluster rejoin dead nodes periodically #4491
/test-all
Codecov Report
@@            Coverage Diff             @@
##             main    #4491      +/-   ##
==========================================
+ Coverage   67.68%   67.93%   +0.24%
==========================================
  Files         402      402
  Lines       57253    57283      +30
==========================================
+ Hits        38754    38917     +163
+ Misses      15805    15669     -136
- Partials     2694     2697       +3
The change LGTM.
I quickly went through the dead node handling in memberlist and noticed a config option called DeadNodeReclaimTime, with the following definition:
// DeadNodeReclaimTime controls the time before a dead node's name can be
// reclaimed by one with a different address or port. By default, this is 0,
// meaning nodes cannot be reclaimed this way.
Do you think we should also set this value to non-zero? Otherwise, a 'dead' Node that recovers with a different IP will still be unable to rejoin the cluster.
b61d680 to cc5aa4f
The patch periodically rejoins Nodes that memberlist removed from the member list because they were unreachable for more than 15 seconds (the GossipToTheDeadTime we are using). Without it, after a network outage lasting more than 15 seconds, the agent would not try to reach any other Node and would consider itself the only alive Node until it is restarted.
Signed-off-by: Quan Tian <qtian@vmware.com>
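The periodic rejoin described in the commit message could be sketched roughly as follows. This is not the code from this PR: `Joiner`, `JoinFunc`, `missingAddrs`, `rejoinOnce`, and `rejoinLoop` are hypothetical names, and the `Join(addrs []string) (int, error)` shape is only assumed to mirror the join call of the cluster library. The map of expected Nodes (name to address) is likewise an assumption about where the agent gets its Node list.

```go
package main

import (
	"context"
	"sort"
	"time"
)

// Joiner is a hypothetical stand-in for the cluster library's join call.
type Joiner interface {
	Join(addrs []string) (int, error)
}

// JoinFunc adapts a plain function to the Joiner interface.
type JoinFunc func(addrs []string) (int, error)

// Join calls f(addrs).
func (f JoinFunc) Join(addrs []string) (int, error) { return f(addrs) }

// missingAddrs returns the addresses of expected Nodes (name -> address)
// whose names are absent from the alive list, in sorted order.
func missingAddrs(expected map[string]string, alive []string) []string {
	aliveSet := make(map[string]struct{}, len(alive))
	for _, name := range alive {
		aliveSet[name] = struct{}{}
	}
	var missing []string
	for name, addr := range expected {
		if _, ok := aliveSet[name]; !ok {
			missing = append(missing, addr)
		}
	}
	sort.Strings(missing)
	return missing
}

// rejoinOnce asks the Joiner to contact every missing Node and returns
// how many addresses were attempted. Join errors are ignored here
// because the next tick of the loop retries anyway.
func rejoinOnce(j Joiner, expected map[string]string, alive []string) int {
	missing := missingAddrs(expected, alive)
	if len(missing) == 0 {
		return 0
	}
	_, _ = j.Join(missing)
	return len(missing)
}

// rejoinLoop runs rejoinOnce every interval until ctx is cancelled.
// expected and alive are callbacks so each tick sees fresh state.
func rejoinLoop(ctx context.Context, interval time.Duration, j Joiner,
	expected func() map[string]string, alive func() []string) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			rejoinOnce(j, expected(), alive())
		}
	}
}
```

Keeping the set difference in `missingAddrs` separate from the ticker loop makes the rejoin decision testable without waiting on timers or a real cluster.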
cc5aa4f to ab227da
Thanks for the reminder. I tested this scenario and saw that the Node was removed from memberlist after it had been dead for 15 seconds (the GossipToTheDeadTime we are using), after which rejoining the same Node with a different IP works. But setting DeadNodeReclaimTime to a small value could let such Nodes rejoin sooner, so I have updated it to a very small value.
/test-all