antrea "Clustering" is not OS aware #5431

Closed
jayunit100 opened this issue Aug 23, 2023 · 2 comments · Fixed by #5434
Labels
area/transit/egress: Issues or PRs related to Egress (SNAT for traffic egressing the cluster).
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
reported-by/end-user: Issues reported by end users.

Comments

@jayunit100
Contributor

Describe the bug

Found some ominous logs today with @aroradaman while debugging a mixed-OS Linux/Windows cluster.


I0823 11:49:58.077143       1 egress_controller.go:819] Stopped watch for EgressGroup, total items received: 0
I0823 11:49:58.077172       1 egress_controller.go:795] Starting watch for EgressGroup
I0823 11:49:58.081636       1 egress_controller.go:816] Started watch for EgressGroup
I0823 11:49:58.081795       1 egress_controller.go:844] Received 0 init events for EgressGroup
I0823 11:50:15.410463       1 networkpolicy_controller.go:778] Stopped watch for AppliedToGroup, total items received: 0
I0823 11:50:15.410493       1 networkpolicy_controller.go:761] Starting watch for AppliedToGroup
I0823 11:50:15.414468       1 networkpolicy_controller.go:774] Started watch for AppliedToGroup
I0823 11:50:15.414531       1 networkpolicy_controller.go:804] Received 0 init events for AppliedToGroup
E0823 11:50:37.547900       1 cluster.go:379] "Failed to rejoin any members" err=<
        2 errors occurred:
                * Failed to join 10.221.159.239:10351: dial tcp 10.221.159.239:10351: connect: connection refused
                * Failed to join 10.221.159.223:10351: dial tcp 10.221.159.223:10351: connect: connection refused

 > members=[10.221.159.239 10.221.159.223]
E0823 11:51:37.547873       1 cluster.go:379] "Failed to rejoin any members" err=<
        2 errors occurred:
                * Failed to join 10.221.159.239:10351: dial tcp 10.221.159.239:10351: connect: connection refused
                * Failed to join 10.221.159.223:10351: dial tcp 10.221.159.223:10351: connect: connection refused
                *

To Reproduce

Create a cluster with Nodes of multiple OS types, some of which don't support Egress. Egress isn't supported on Windows, so port 10351 isn't opened on those Nodes.

Expected
The antrea agent would be more cautious about which Nodes it tries to cluster with.

Actual behavior

Some scary log messages about failures occur, which can make it difficult to debug other networking issues that might be going on in mixed-OS clusters, which are common.

@jayunit100 jayunit100 added the kind/bug Categorizes issue or PR as related to a bug. label Aug 23, 2023
@jayunit100 jayunit100 changed the title antrea "Clustering" doesnt is not OS aware antrea "Clustering" is not OS aware Aug 23, 2023
@antoninbas
Contributor

antoninbas commented Aug 23, 2023

Thanks for reporting this @jayunit100. Your analysis looks correct to me.

I think we have multiple options to fix this:

  1. Never join non-Linux Nodes to the cluster. This is very simple to do, and at the moment the membership cluster is only used for features which are only available for Linux Nodes.
  2. Join all Nodes to the cluster (i.e., we need to also run the clustering on Windows Nodes), and filter Nodes by OS when selecting a candidate Node with consistent hashing. This makes sense if we plan to have a subset of features which require clustering available on Windows in the future.

1) is clearly the easiest option right now, while 2) may be the better long-term solution.
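The two options can be sketched roughly as follows. This is not Antrea's code: `Node` is a simplified stand-in for the Kubernetes Node object, `linuxMembers` and `selectNode` are hypothetical helpers, and rendezvous hashing over FNV is used as a simple substitute for whatever consistent hashing the agent actually uses. The only real identifier is the well-known `kubernetes.io/os` Node label.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Node is a simplified, hypothetical stand-in for a Kubernetes Node.
type Node struct {
	Name   string
	IP     string
	Labels map[string]string
}

// osLabel is the well-known Kubernetes label recording a Node's OS.
const osLabel = "kubernetes.io/os"

// linuxMembers sketches option 1: only Linux Nodes are ever considered
// as cluster members, so no join is attempted against other Nodes.
func linuxMembers(nodes []Node) []string {
	members := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if n.Labels[osLabel] == "linux" {
			members = append(members, n.IP)
		}
	}
	return members
}

// selectNode sketches option 2: every Node is a member, and the OS filter
// is applied when picking a candidate for a key. The highest hash score
// wins (rendezvous hashing), so the choice is stable for a given Node set.
func selectNode(key string, nodes []Node, wantOS string) (Node, bool) {
	var best Node
	var bestScore uint32
	found := false
	for _, n := range nodes {
		if n.Labels[osLabel] != wantOS {
			continue
		}
		h := fnv.New32a()
		h.Write([]byte(n.Name + "/" + key))
		if score := h.Sum32(); !found || score > bestScore {
			best, bestScore, found = n, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Name: "linux-a", IP: "192.0.2.1", Labels: map[string]string{osLabel: "linux"}},
		{Name: "win-a", IP: "192.0.2.2", Labels: map[string]string{osLabel: "windows"}},
		{Name: "linux-b", IP: "192.0.2.3", Labels: map[string]string{osLabel: "linux"}},
	}
	fmt.Println("option 1 members:", linuxMembers(nodes))
	if n, ok := selectNode("egress-example", nodes, "linux"); ok {
		fmt.Println("option 2 candidate:", n.Name)
	}
}
```

With option 1 the Windows Node never appears in the member list; with option 2 it would be a member but is skipped at selection time.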

I'll let @tnqn comment & decide since he is more familiar with the feature.

@antoninbas antoninbas added area/transit/egress Issues or PRs related to Egress (SNAT for traffic egressing the cluster). priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 23, 2023
@antoninbas antoninbas added this to the Antrea v1.14 release milestone Aug 23, 2023
@tnqn
Member

tnqn commented Aug 24, 2023

Thanks @jayunit100 @antoninbas. Since antrea-agent's strategy is to run memberlist on demand (only when a feature needs it), I think it makes more sense to go with the 1st option. Otherwise it would be a bit complex to decide whether memberlist should be started, or we would need to change the strategy to running memberlist unconditionally, which would incur unnecessary overhead in some scenarios: PolicyOnly, ExternalNode. #5434 implements the 1st option.

@tnqn tnqn added the reported-by/end-user Issues reported by end users. label Dec 11, 2023