Make memberlist cluster rejoin dead nodes periodically #4491

tnqn · 2022-12-18T06:14:18Z

The patch periodically rejoins Nodes that memberlist removed from the member list because they were unreachable for more than 15 seconds (the GossipToTheDeadTime we are using). Without it, after a network outage lasting more than 15 seconds, the agent stops trying to reach any other Node and considers itself the only alive Node until it is restarted.

Signed-off-by: Quan Tian <qtian@vmware.com>
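For illustration, the periodic rejoin described above could be sketched roughly as follows. This is a minimal stdlib-only model, not the code in pkg/agent/memberlist/cluster.go: the `joiner` interface, `fakeCluster`, `missingNodes`, `rejoinOnce`, and `rejoinLoop` are hypothetical names, and only the shape of `Join(existing []string) (int, error)` is taken from memberlist.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"time"
)

// joiner is a hypothetical interface. Its single method mirrors the shape of
// memberlist's (*Memberlist).Join so a real cluster could be plugged in.
type joiner interface {
	Join(existing []string) (int, error)
}

// missingNodes returns, in sorted order, the expected addresses that are not
// currently in the alive set.
func missingNodes(expected []string, alive map[string]bool) []string {
	var missing []string
	for _, addr := range expected {
		if !alive[addr] {
			missing = append(missing, addr)
		}
	}
	sort.Strings(missing)
	return missing
}

// rejoinOnce tries to join every expected Node that is not alive.
// It does not call Join when nothing is missing.
func rejoinOnce(j joiner, expected []string, alive map[string]bool) (int, error) {
	missing := missingNodes(expected, alive)
	if len(missing) == 0 {
		return 0, nil
	}
	return j.Join(missing)
}

// rejoinLoop runs rejoinOnce on every tick until ctx is cancelled.
// Join errors are ignored here because the next tick retries anyway.
func rejoinLoop(ctx context.Context, j joiner, interval time.Duration,
	expected func() []string, alive func() map[string]bool) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = rejoinOnce(j, expected(), alive())
		}
	}
}

// fakeCluster is a stand-in for a real member list: Join marks every
// address it is given as alive.
type fakeCluster struct{ alive map[string]bool }

func (f *fakeCluster) Join(existing []string) (int, error) {
	for _, addr := range existing {
		f.alive[addr] = true
	}
	return len(existing), nil
}

func main() {
	c := &fakeCluster{alive: map[string]bool{"10.0.0.1": true}}
	expected := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	// Runs synchronously and returns once the context times out.
	rejoinLoop(ctx, c, 10*time.Millisecond,
		func() []string { return expected },
		func() map[string]bool { return c.alive })

	fmt.Println("alive members:", len(c.alive))
}
```

Keeping the set difference in `missingNodes` separate from the ticker loop makes the rejoin decision testable without timers. In a real agent the expected list would presumably come from the Kubernetes Node informer and the alive set from the member list itself.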

tnqn · 2022-12-18T06:14:30Z

/test-all

codecov · 2022-12-18T06:33:03Z

Codecov Report

Merging #4491 (ab227da) into main (c9c1b2c) will increase coverage by 0.24%.
The diff coverage is 85.50%.

@@            Coverage Diff             @@
##             main    #4491      +/-   ##
==========================================
+ Coverage   67.68%   67.93%   +0.24%     
==========================================
  Files         402      402              
  Lines       57253    57283      +30     
==========================================
+ Hits        38754    38917     +163     
+ Misses      15805    15669     -136     
- Partials     2694     2697       +3

| Flag | Coverage Δ | |
| --- | --- | --- |
| e2e-tests | `38.23% <76.81%> (?)` | |
| integration-tests | `34.65% <ø> (+0.01%)` | ⬆️ |
| kind-e2e-tests | `47.01% <76.81%> (+0.16%)` | ⬆️ |
| unit-tests | `56.41% <50.72%> (-0.06%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| pkg/agent/memberlist/cluster.go | `78.70% <85.50%> (+2.17%)` | ⬆️ |
| pkg/agent/cniserver/ipam/ipam_service.go | `83.14% <0.00%> (-8.99%)` | ⬇️ |
| pkg/agent/flowexporter/exporter/exporter.go | `71.53% <0.00%> (-8.67%)` | ⬇️ |
| pkg/agent/controller/networkpolicy/packetin.go | `70.27% <0.00%> (-6.76%)` | ⬇️ |
| pkg/controller/labelidentity/controller.go | `78.00% <0.00%> (-6.00%)` | ⬇️ |
| pkg/controller/networkpolicy/store/addressgroup.go | `88.37% <0.00%> (-3.49%)` | ⬇️ |
| pkg/util/ip/ip.go | `86.99% <0.00%> (-3.26%)` | ⬇️ |
| pkg/agent/controller/networkpolicy/reject.go | `73.39% <0.00%> (-2.96%)` | ⬇️ |
| ...catesigningrequest/ipsec_csr_signing_controller.go | `61.65% <0.00%> (-2.46%)` | ⬇️ |
| pkg/agent/openflow/packetin.go | `74.19% <0.00%> (-1.62%)` | ⬇️ |
... and 31 more

xliuxu

The change LGTM.
I quickly went through the dead node handling in memberlist and noticed a config option called DeadNodeReclaimTime with the following definition:

// DeadNodeReclaimTime controls the time before a dead node's name can be
// reclaimed by one with a different address or port. By default, this is 0,
// meaning nodes cannot be reclaimed this way.

Do you think we should also configure this value to be non-zero? Otherwise, a 'dead' Node that recovers with a different IP will still be unable to rejoin the cluster.

tnqn · 2022-12-18T12:54:12Z

> Do you think we should also configure this value to be non-zero? Otherwise, a 'dead' Node that recovers with a different IP will still be unable to rejoin the cluster.

Thanks for the reminder. I tested this scenario and saw that the Node was removed from the member list after it had been dead for 15 seconds (the GossipToTheDeadTime we are using), after which rejoining the same Node with a different IP works. However, setting DeadNodeReclaimTime to a small value could let such Nodes rejoin sooner, so I have updated it to a very small value.

tnqn · 2022-12-18T12:55:07Z

/test-all

tnqn added the area/transit/egress, action/backport, and action/release-note labels Dec 18, 2022

tnqn added this to the Antrea v1.10 release milestone Dec 18, 2022

tnqn requested review from wenqiq and xliuxu December 18, 2022 06:14

xliuxu previously approved these changes Dec 18, 2022

tnqn dismissed xliuxu’s stale review via cc5aa4f December 18, 2022 12:37

tnqn force-pushed the fix-memberlist-node branch from b61d680 to cc5aa4f Compare December 18, 2022 12:37

tnqn force-pushed the fix-memberlist-node branch from cc5aa4f to ab227da Compare December 18, 2022 12:52

xliuxu approved these changes Dec 18, 2022

tnqn merged commit 18011b8 into antrea-io:main Dec 18, 2022

tnqn deleted the fix-memberlist-node branch December 18, 2022 15:24

tnqn mentioned this pull request Dec 18, 2022

Automated cherry pick of #4491: Make memberlist cluster rejoin dead nodes periodically #4492

Merged

This was referenced Jan 3, 2023

Automated cherry pick of #4491: Make memberlist cluster rejoin dead nodes periodically #4527

Merged

Automated cherry pick of #4491: Make memberlist cluster rejoin dead nodes periodically #4528

Merged

tnqn mentioned this pull request Feb 3, 2023

CLI command for getting memberlist state #4601

Closed
