Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Make memberlist cluster rejoin dead nodes periodically #4491

Merged
merged 1 commit into from
Dec 18, 2022

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Dec 18, 2022

The patch periodically rejoins Nodes that were removed from the member list by memberlist because they were unreachable for more than 15 seconds (the GossipToTheDeadTime we are using). Without it, once there is a network downtime lasting more than 15 seconds, the agent wouldn't try to reach any other Node and would think it's the only alive Node until it's restarted.

Signed-off-by: Quan Tian qtian@vmware.com

@tnqn tnqn added area/transit/egress Issues or PRs related to Egress (SNAT for traffic egressing the cluster). action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Dec 18, 2022
@tnqn tnqn added this to the Antrea v1.10 release milestone Dec 18, 2022
@tnqn tnqn requested review from wenqiq and xliuxu December 18, 2022 06:14
@tnqn
Copy link
Member Author

tnqn commented Dec 18, 2022

/test-all

@codecov
Copy link

codecov bot commented Dec 18, 2022

Codecov Report

Merging #4491 (ab227da) into main (c9c1b2c) will increase coverage by 0.24%.
The diff coverage is 85.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4491      +/-   ##
==========================================
+ Coverage   67.68%   67.93%   +0.24%     
==========================================
  Files         402      402              
  Lines       57253    57283      +30     
==========================================
+ Hits        38754    38917     +163     
+ Misses      15805    15669     -136     
- Partials     2694     2697       +3     
Flag Coverage Δ
e2e-tests 38.23% <76.81%> (?)
integration-tests 34.65% <ø> (+0.01%) ⬆️
kind-e2e-tests 47.01% <76.81%> (+0.16%) ⬆️
unit-tests 56.41% <50.72%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pkg/agent/memberlist/cluster.go 78.70% <85.50%> (+2.17%) ⬆️
pkg/agent/cniserver/ipam/ipam_service.go 83.14% <0.00%> (-8.99%) ⬇️
pkg/agent/flowexporter/exporter/exporter.go 71.53% <0.00%> (-8.67%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go 70.27% <0.00%> (-6.76%) ⬇️
pkg/controller/labelidentity/controller.go 78.00% <0.00%> (-6.00%) ⬇️
pkg/controller/networkpolicy/store/addressgroup.go 88.37% <0.00%> (-3.49%) ⬇️
pkg/util/ip/ip.go 86.99% <0.00%> (-3.26%) ⬇️
pkg/agent/controller/networkpolicy/reject.go 73.39% <0.00%> (-2.96%) ⬇️
...catesigningrequest/ipsec_csr_signing_controller.go 61.65% <0.00%> (-2.46%) ⬇️
pkg/agent/openflow/packetin.go 74.19% <0.00%> (-1.62%) ⬇️
... and 31 more

xliuxu
xliuxu previously approved these changes Dec 18, 2022
Copy link
Contributor

@xliuxu xliuxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change LGTM.
I quickly went through the dead node handling in memberlist, noticed that there is a config called DeadNodeReclaimTime with the following definition.

// DeadNodeReclaimTime controls the time before a dead node's name can be
// reclaimed by one with a different address or port. By default, this is 0,
// meaning nodes cannot be reclaimed this way.

Do you think we should also config this value as non-zero? Otherwise, a 'dead' Node recovered with a different IP will still be unable to rejoin the cluster.

The patch periodically rejoins Nodes that were removed from the member
list by memberlist because they were unreachable for more than 15
seconds (the GossipToTheDeadTime we are using). Without it, once there
is a network downtime lasting more than 15 seconds, the agent wouldn't
try to reach any other Node and would think it's the only alive Node
until it's restarted.

Signed-off-by: Quan Tian <qtian@vmware.com>
@tnqn
Copy link
Member Author

tnqn commented Dec 18, 2022

Do you think we should also config this value as non-zero? Otherwise, a 'dead' Node recovered with a different IP will still be unable to rejoin the cluster.

Thanks for reminding it. I tested this scenario and saw the node was removed from memberlist after it's been dead for 15 seconds (the GossipToTheDeadTime we are using), after which rejoining the same Node with different IP works. But with setting DeadNodeReclaimTime to a smaller value could make them join sooner, I have updated it to a very small value.

@tnqn
Copy link
Member Author

tnqn commented Dec 18, 2022

/test-all

@tnqn tnqn merged commit 18011b8 into antrea-io:main Dec 18, 2022
@tnqn tnqn deleted the fix-memberlist-node branch December 18, 2022 15:24
GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Jan 27, 2023
The patch periodically rejoins Nodes that were removed from the member
list by memberlist because they were unreachable for more than 15
seconds (the GossipToTheDeadTime we are using). Without it, once there
is a network downtime lasting more than 15 seconds, the agent wouldn't
try to reach any other Node and would think it's the only alive Node
until it's restarted.

Signed-off-by: Quan Tian <qtian@vmware.com>
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/transit/egress Issues or PRs related to Egress (SNAT for traffic egressing the cluster).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants