When using a pre-existing VPC, a pre-existing security group, and an unmanaged Windows nodeGroup, the CloudFormation Stack for the Windows NodeGroup creates a security group with the same ID as the pre-existing security group, replacing its settings #3811
Comments
I see my insanely long title and I feel no regrets. Some shame, but no regrets. |
Thanks for the detailed repro @kevin-lindsay-1, we can forgive you for the long title 😉 . Don't suppose you had ever tried this with an earlier version? No worries if not, just curious if this is a new thing which broke or if it has always been like this.
Just confirming did you mean |
I only tested it on
I think you're right. |
Initial investigation for broader reproducing steps
Using an existing vpc with subnets and a security group with the following configuration:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-overwriting
  region: us-west-2
  version: '1.20'
vpc:
  id: vpc-0af9f9bdf9623ecc5
  securityGroup: sg-0f615c936eb08b668
  subnets:
    private:
      test-overwrite-1:
        id: subnet-0bd09a3712e886998
      test-overwrite-2:
        id: subnet-0d8d04c4087d7fb09
  clusterEndpoints:
    privateAccess: true
    publicAccess: false
addons:
  - name: coredns
    version: v1.8.3-eksbuild.1
  - name: kube-proxy
    version: v1.20.4-eksbuild.2
  - name: vpc-cni
    version: v1.7.10-eksbuild.1
Trying to not care about nodegroups for a bit. This created a separate security group and added that to the cluster security group and additional security group correctly. Going for the narrower case now with unmanaged nodegroup but doesn't matter what's the OS. |
@kevin-lindsay-1 quick question... At what point did you notice that it overwrote that security group? Was it immediately or only after a couple minutes when it started to spin up the cluster? |
@Skarlso pretty sure it was effectively immediate. I was using unmanaged windows node groups, which requires special sg entries, which I think is where the problem may be coming from; I think the portion of the CFN output json/yaml for the sg entries for the windows node groups are accidentally creating a new sg with the same name as the one that exists, rather than referencing and adding rules. |
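The distinction this comment draws between creating a security group and adding rules to a referenced one can be sketched in CloudFormation terms. The fragment below is an illustration only and is not taken from the templates eksctl generates; every parameter name, logical ID, and the port number is a hypothetical placeholder.

```yaml
# Illustrative CloudFormation fragment; NOT eksctl's generated template.
# All parameter names, logical IDs, and the port are hypothetical.
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  ExistingSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
  ControlPlaneSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id

Resources:
  # Pattern A: the stack declares a security group, so the stack
  # creates it and owns its full set of inline rules.
  OwnedSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Group created and managed by this stack
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          SourceSecurityGroupId: !Ref ControlPlaneSecurityGroupId

  # Pattern B: the stack only adds a single rule to a group that
  # already exists; rules it did not declare are left alone.
  AddedIngressRule:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref ExistingSecurityGroupId
      IpProtocol: tcp
      FromPort: 443
      ToPort: 443
      SourceSecurityGroupId: !Ref ControlPlaneSecurityGroupId
```

If the hypothesis in this comment were right, the Windows nodegroup stack would be treating the pre-existing group along the lines of Pattern A, where Pattern B is what a referenced group calls for.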
Thanks for the extra info! |
@kevin-lindsay-1 Just to clarify... It didn't create a cluster security group and you didn't have your own security group as |
@kevin-lindsay-1 Hi. :) So... I tried reproducing this. My current eksctl version:
I'm using a clean build, but it's 5 versions after yours. I couldn't notice any commits which should have influenced this. This is the config I was using:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-overwriting-5
  region: us-west-2
  version: '1.20'
addons:
  - name: coredns
    version: v1.8.3-eksbuild.1
  - name: kube-proxy
    version: v1.20.4-eksbuild.2
  - name: vpc-cni
    version: v1.7.10-eksbuild.1
vpc:
  id: vpc-0af9f9bdf9623ecc5
  securityGroup: sg-0f615c936eb08b668
  subnets:
    private:
      us-west-2b:
        id: subnet-0bd09a3712e886998
      us-west-2a:
        id: subnet-0d8d04c4087d7fb09
managedNodeGroups:
  - name: linux-ng
    amiFamily: AmazonLinux2
    instanceType: t2.large
    minSize: 2
    maxSize: 3
nodeGroups:
  - name: windows-ng
    amiFamily: WindowsServer2019FullContainer
    minSize: 2
    maxSize: 3
My original security group was untouched but added as an additional security group to my cluster:
This is the command I run:
I didn't use a Would you mind trying the same thing with the latest eksctl please to see if there is something I'm missing? Or some kind of setting I'm missing maybe? Thanks! |
@kevin-lindsay-1 Okay! I actually got something. What made things do something is that I had to add this to the nodeGroup:
Because my whole deployment was not private, it didn't change anything before. So to re-iterate. You had a SG which had some routings which completely disappeared? This is what happened to mine: The rules got expanded. I added 2999 that was the pre-existing rule. And it's still there. |
I also tried with 0.52.0 and my SG retained all inbound and outbound rules and they weren't replaced. So there must be something else that's happening here. How are you contacting the cluster if VPC is set to endpoint private? Are you running from a specific machine with access I assume? |
I use an AWS VPN Client connection which is attached to the VPC. No jumpboxes, easy 1-click connection. Where are the rules for your Windows Node VPC Controllers? |
What rules? I don't have any. I'm just using a sample of the cluster create config that you provided. :) |
I think I followed the instructions and did Pretty hard to say exactly where it happened, because It's certainly possible that I just did something in/out of order, and you might be using a slightly different order of operations. |
Oh, sorry, I thought you meant that you had something custom. Yes, I did use |
It's possible that this bug would go away completely or rear its head more clearly if we "simply" had an |
You mean like this? :) |
Yep, pretty sure it's explicitly on the roadmap, too. |
Indeed. We'll get there eventually. :) |
Ok, so I've set up a VPN connection, and I have an existing VPC with subnets and a sec-group. I launched a new cluster create with the attached config and now I'm waiting for things to happen. |
Yep, just to be clear:
|
Cool, thanks. I'm doing the same thing. All tries using 0.52? Can you possibly try 0.57? |
When I checked in CloudFormation, I saw a resource for the main SG, and it was in the |
And that was for the existing SG and not the Cluster SG that is created in addition? :O Huh, that is super weird. I see the ClusterSharedNodeSecurityGroup as CREATE_COMPLETE. This is the full cluster create event list: And the node groups created their own SGs which is linked in the SharedNodeSecurityGroup. |
I'm spinning up a new one right now for test purposes; I created some test rules inbound and outbound to see if they go away. |
I'm running on |
Uh, I messed something up and my windows instances would not join the cluster. :) Do you think that matters? The SG is already edited and looks fine. |
Mine's rolling out, we should be able to see if that's relevant. |
🤞 |
I have tested this setup using the commands and same exact config as the one used in the original post, and I no longer see the Previously the This sounds like a good time to close this issue; because of how persistent this issue was in As far as I'm concerned, I'm going to /shrug and call this "can no longer replicate", which is the same thing as being fixed, right? ;) |
Aye! :D Also, I finally succeeded in creating my cluster as well, and it seemed to work fine. I'm happy that this could be resolved. It was a fantastic journey for sure! Thank you for that! :) |
Yes, just meeting my monthly quota of "give someone a hyper-specific problem with software that is in heavy flux and possibly already fixed". |
Those are the best. 😂 |
What were you trying to accomplish?
Creating a cluster with the following requirements:
eksctl --profile=<profile-name> create cluster -f <cluster-name>.yaml --install-vpc-controllers
What happened?
Most everything went off without a hitch, but when creating the Windows NodeGroup, the rules that allow communication with the control plane override the pre-existing rules in the Security Group; specifically, it behaves as a replace, not an add. This appears to be due to the iam.securityGroup parameter for some reason being treated as a desired security group, rather than a referenced security group.
What happened to me when testing this was that it completely removed all outbound rules other than the ones related to Windows communicating with the control plane, which then immediately caused the cluster to lose network access. I've tested this exact behavior about 10 times now in order to pinpoint the exact cause.
How to reproduce it?
The steps listed above should be enough to go off of; please let me know if not.
Logs
Anything else we need to know?
Running on macOS, latest.
Versions