
When using a pre-existing VPC, a pre-existing security group, and an unmanaged Windows nodeGroup, the CloudFormation Stack for the Windows NodeGroup creates a security group with the same ID as the pre-existing security group, replacing its settings #3811

Closed
kevin-lindsay-1 opened this issue Jun 4, 2021 · 33 comments

Comments

@kevin-lindsay-1

kevin-lindsay-1 commented Jun 4, 2021

What were you trying to accomplish?

Creating a cluster with the following requirements:

  1. Use a pre-existing VPC, subnets, and security groups (luckily, I happen to be testing this behavior before trying to apply this to a production VPC)
  2. Have the VPC use a private endpoint only
  3. Use an AWS VPN Client endpoint to reach the VPC and the k8s control plane
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: <cluster-name>
  region: us-east-1
  version: '1.20'

secretsEncryption:
  keyARN: <key>

vpc:
  id: <pre-existing VPC>
  subnets:
    private:
      <pre-existing subnets>
    public:
      <pre-existing subnets>
  securityGroup: <pre-existing SG ID> # not an ARN
  clusterEndpoints:
    privateAccess: true
    publicAccess: false

addons:
- name: coredns
  version: v1.8.3-eksbuild.1
- name: kube-proxy
  version: v1.20.4-eksbuild.2
- name: vpc-cni
  version: v1.7.10-eksbuild.1

managedNodeGroups:
- name: linux
  amiFamily: AmazonLinux2
  ...
  iam:
    instanceRoleARN: <node-role>

nodeGroups:
- name: windows
  amiFamily: WindowsServer2019FullContainer
  ...
  iam:
    instanceRoleARN: <node-role>

eksctl --profile=<profile-name> create cluster -f <cluster-name>.yaml --install-vpc-controllers

What happened?

Almost everything went off without a hitch, but when creating the Windows NodeGroup, the rules that allow the nodes to communicate with the control plane override the pre-existing rules in the Security Group; it behaves as a replace, not an add. This appears to be due to the iam.securityGroup parameter for some reason treating that as a desired security group, rather than a referenced security group.

In my testing, it completely removed all outbound rules other than the ones related to Windows nodes communicating with the control plane, which immediately caused the cluster to lose network access. I've reproduced this exact behavior about 10 times now in order to pinpoint the cause.
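The replace-vs-add distinction described above can be checked mechanically by diffing a security group's rule set before and after the nodegroup stack runs. A minimal sketch in pure Python — the rule dictionaries and port numbers are illustrative, not the actual EC2 API response shape:

```python
# Classify what happened to a security group's rules between two snapshots.
# A "replace" leaves none of the original rules; an "add" keeps them all.

def diff_rules(before, after):
    """Return (kept, removed, added) as sets of hashable rule tuples."""
    before_set = {tuple(sorted(r.items())) for r in before}
    after_set = {tuple(sorted(r.items())) for r in after}
    return before_set & after_set, before_set - after_set, after_set - before_set

def classify(before, after):
    kept, removed, added = diff_rules(before, after)
    if not removed and not added:
        return "unchanged"
    if removed and not kept:
        return "replace"   # original rules wiped out
    if added and not removed:
        return "add"       # original rules preserved
    return "mixed"

# Illustrative snapshots: one pre-existing egress rule, then the stack runs.
pre_existing = [{"proto": "-1", "port": "all", "cidr": "0.0.0.0/0"}]
control_plane = [{"proto": "tcp", "port": "10250", "cidr": "10.0.0.0/16"}]

print(classify(pre_existing, control_plane))                 # replace
print(classify(pre_existing, pre_existing + control_plane))  # add
```

Running a snapshot like this before and after each eksctl step would show exactly which command wipes the pre-existing rules.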

How to reproduce it?

The steps listed above should be enough to go off of; please let me know if not.

Logs

Anything else we need to know?

Running on macOS, latest.

Versions

$ eksctl version
0.52.0

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.4-eks-6b7464", GitCommit:"6b746440c04cb81db4426842b4ae65c3f7035e53", GitTreeState:"clean", BuildDate:"2021-03-19T19:33:03Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
@kevin-lindsay-1 changed the title twice on Jun 4, 2021, settling on: "When using a pre-existing VPC, a pre-existing security group, and an unmanaged Windows nodeGroup, the CloudFormation Stack for the Windows NodeGroup creates a security group with the same ID as the pre-existing security group, replacing its settings"
@kevin-lindsay-1
Author

I see my insanely long title and I feel no regrets. Some shame, but no regrets.

@Callisto13
Contributor

Thanks for the detailed repro @kevin-lindsay-1, we can forgive you for the long title 😉 .

Don't suppose you had ever tried this with an earlier version? No worries if not, just curious if this is a new thing which broke or if it has always been like this.

This appears to be due to the iam.securityGroup parameter for some reason treating that as a desired security group, rather than a referenced security group.

Just confirming did you mean vpc.securityGroup not iam?

@kevin-lindsay-1
Author

Don't suppose you had ever tried this with an earlier version?

I only tested it on 0.52.0, as I just started using it and have no other baseline. As far as I know, it's not a regression.

Just confirming did you mean vpc.securityGroup not iam?

I think you're right.

@Skarlso
Contributor

Skarlso commented Jul 8, 2021

Initial investigation for broader reproducing steps


Using an existing vpc with subnets and a security group with the following configuration:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-overwriting
  region: us-west-2
  version: '1.20'

vpc:
  id: vpc-0af9f9bdf9623ecc5
  securityGroup: sg-0f615c936eb08b668
  subnets:
    private:
      test-overwrite-1:
        id: subnet-0bd09a3712e886998
      test-overwrite-2:
        id: subnet-0d8d04c4087d7fb09
  clusterEndpoints:
    privateAccess: true
    publicAccess: false

  
addons:
- name: coredns
  version: v1.8.3-eksbuild.1
- name: kube-proxy
  version: v1.20.4-eksbuild.2
- name: vpc-cni
  version: v1.7.10-eksbuild.1

Trying to not care about nodegroups for a bit.

This created a separate security group and correctly set it up alongside the cluster security group and the additional security group. Now trying the narrower case with an unmanaged nodegroup, though the OS shouldn't matter.

@Skarlso
Contributor

Skarlso commented Jul 8, 2021

@kevin-lindsay-1 quick question... At what point did you notice that it overwrote that security group? Was it immediately or only after a couple minutes when it started to spin up the cluster?

@kevin-lindsay-1
Author

@Skarlso pretty sure it was effectively immediate.

I was using unmanaged Windows node groups, which require special SG entries; I think that's where the problem comes from. The portion of the CloudFormation output (JSON/YAML) that defines the SG entries for the Windows node groups appears to accidentally create a new SG with the same name as the existing one, rather than referencing it and adding rules.

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

Thanks for the extra info!

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

@kevin-lindsay-1 Just to clarify... it didn't create a cluster security group, and your own security group wasn't attached as an Additional Security Group? It just changed your SG right away after you hit create?

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

@kevin-lindsay-1 Hi. :)

So... I tried reproducing this. My current eksctl version:

eksctl version
0.57.0-dev+eadbb9b7.2021-07-08T12:01:58Z

I'm using a clean build, but it's 5 versions after yours. I couldn't find any commits that should have influenced this. This is the config I was using:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-overwriting-5
  region: us-west-2
  version: '1.20'

addons:
- name: coredns
  version: v1.8.3-eksbuild.1
- name: kube-proxy
  version: v1.20.4-eksbuild.2
- name: vpc-cni
  version: v1.7.10-eksbuild.1

vpc:
  id: vpc-0af9f9bdf9623ecc5
  securityGroup: sg-0f615c936eb08b668
  subnets:
    private:
      us-west-2b:
        id: subnet-0bd09a3712e886998
      us-west-2a:
        id: subnet-0d8d04c4087d7fb09

managedNodeGroups:
  - name: linux-ng
    amiFamily: AmazonLinux2
    instanceType: t2.large
    minSize: 2
    maxSize: 3

nodeGroups:
  - name: windows-ng
    amiFamily: WindowsServer2019FullContainer
    minSize: 2
    maxSize: 3

My original security group was untouched but added as an additional security group to my cluster:

[screenshot: the pre-existing security group attached as an additional security group]

This is the command I run:

 eksctl create cluster -f cluster.yaml  --install-vpc-controllers

I didn't use a --profile but that shouldn't be a problem, unless you think otherwise.

Would you mind trying the same thing with the latest eksctl please to see if there is something I'm missing? Or some kind of setting I'm missing maybe?

Thanks!

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

@kevin-lindsay-1 Okay! I actually got something. What made a difference is that I had to add this to the nodeGroup:

    privateNetworking: true

Because my whole deployment wasn't private before, nothing changed.
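For reference, a sketch of where that flag sits in the nodegroup spec — the name and sizes here are taken from the config earlier in the thread and are illustrative:

```yaml
nodeGroups:
  - name: windows-ng
    amiFamily: WindowsServer2019FullContainer
    minSize: 2
    maxSize: 3
    # Place nodes in private subnets only; without this, the Windows-specific
    # control-plane rules never touched the pre-existing security group.
    privateNetworking: true
```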

So, to reiterate: you had an SG with some rules that completely disappeared?

This is what happened to mine:

[screenshot: the security group's expanded rule list]

The rules got expanded. I had added port 2999 as the pre-existing rule, and it's still there.

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

I also tried with 0.52.0, and my SG retained all inbound and outbound rules; they weren't replaced. So there must be something else happening here. How are you contacting the cluster if the VPC endpoint is set to private? I assume you're running from a specific machine with access?

@kevin-lindsay-1
Author

kevin-lindsay-1 commented Jul 9, 2021

I use an AWS VPN Client connection which is attached to the VPC. No jumpboxes, easy 1-click connection.

Where are the rules for your Windows Node VPC Controllers?

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

Where are the rules for your Windows Node VPC Controllers?

What rules? I don't have any. I'm just using a sample of the cluster create config that you provided. :)

@kevin-lindsay-1
Author

kevin-lindsay-1 commented Jul 9, 2021

I think I followed the instructions and did --install-vpc-controllers in a follow-up command, which may have been where this happened.

It's hard to say exactly where it happened, because eksctl is currently imperative rather than declarative (no apply command yet), so replication is harder due to stateful things such as order of operations.

It's certainly possible that I just did something in/out of order, and you might be using a slightly different order of operations.

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

Oh, sorry, I thought you meant that you had something custom. Yes, I did use --install-vpc-controllers with the call. Let me try and find the controller settings.

@kevin-lindsay-1
Author

It's possible that this bug would go away completely or rear its head more clearly if we "simply" had an apply command so that replication is basically 1 command.

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

You mean like this? :)
#2774

@kevin-lindsay-1
Author

Yep, pretty sure it's explicitly on the roadmap, too.

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

Indeed. We'll get there eventually. :)

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

OK, so I've set up a VPN connection and have a VPC with existing subnets and a security group. I launched a new cluster create with the attached config and am now waiting for things to happen.

@kevin-lindsay-1
Author

kevin-lindsay-1 commented Jul 9, 2021

Oh, sorry, I thought you meant that you had something custom. Yes, I did use --install-vpc-controllers with the call. Let me try and find the controller settings.

Yep, just to be clear:

  • I called commands with --profile, if that would somehow matter
  • I used a VPC with a VPN connection connected so that the private endpoint call doesn't fail
  • I used a main sg with pre-existing rules
  • I used subnets with pre-existing route tables
  • I used windows hosts
  • I'm fairly certain that I followed the order of operations on https://eksctl.io/usage/windows-worker-nodes. I managed to internally replicate this issue like 4 or 5 times, so I'd like to think it's the exact OoO called out in the docs.

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

Cool, thanks. I'm doing the same thing. All tries using 0.52? Can you possibly try 0.57?

@kevin-lindsay-1
Author

When I checked in CloudFormation, I saw a resource for the main SG in the CREATE_COMPLETE status, which I imagine wouldn't happen if it were merely referencing the SG, because in that case I would expect the resource type to be a rule, not a security group.
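That observation matches how CloudFormation distinguishes the two cases: an AWS::EC2::SecurityGroup resource creates (and on update can replace) a whole group, while AWS::EC2::SecurityGroupIngress/Egress resources attach individual rules to an existing group by ID. A sketch of the difference — logical IDs, ports, and IDs here are illustrative, not eksctl's actual template:

```yaml
Resources:
  # Creates a brand-new security group; seeing one of these reach
  # CREATE_COMPLETE means a group was created, not referenced.
  WindowsNodeSG:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Windows node to control plane communication
      VpcId: vpc-0123456789abcdef0

  # Adds a single rule to a pre-existing group without touching its
  # other rules -- the behavior expected for a referenced SG.
  ControlPlaneIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: sg-0123456789abcdef0
      IpProtocol: tcp
      FromPort: 10250
      ToPort: 10250
      SourceSecurityGroupId: sg-0fedcba9876543210
```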

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

And that was for the existing SG and not the Cluster SG that is created in addition? :O Huh, that is super weird.

I see the ClusterSharedNodeSecurityGroup as CREATE_COMPLETE.

This is the full cluster create event list:

[screenshot: full CloudFormation cluster-create event list]

And the node groups created their own SGs, which are linked in the SharedNodeSecurityGroup.

@kevin-lindsay-1
Author

I'm spinning up a new one right now for test purposes; I created some test rules inbound and outbound to see if they go away.

@kevin-lindsay-1
Author

I'm running on 0.56.0 right now; if 0.57.0 is out, Homebrew must not have it yet.

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

Uh, I messed something up and my windows instances would not join the cluster. :)

Do you think that matters? The SG is already edited and looks fine.

@kevin-lindsay-1
Author

Mine's rolling out, we should be able to see if that's relevant.

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

🤞

@kevin-lindsay-1
Author

I have tested this setup using the commands and same exact config as the one used in the original post, and I no longer see the securityGroup being created in CloudFormation on 0.56.0. I think the SharedNodeSecurityGroup was previously being created with the wrong SG; using the main additional SG rather than the EKS cluster SG.

Previously the eksctl --profile=<profile-name> create cluster -f <cluster-name>.yaml --install-vpc-controllers command would fail due to the SG error, but now the cluster creation and Windows NodeGroups rolled out just fine.

This sounds like a good time to close this issue; because of how persistent this issue was in 0.52.0 for me with this exact same setup, I think that a reference to additional vs cluster SG was correctly flipped (or maybe set at runtime) in a commit over the last few weeks.

As far as I'm concerned, I'm going to /shrug and call this "can no longer replicate", which is the same thing as being fixed, right? ;)

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

Aye! :D Also, I finally succeeded in creating my cluster as well, and it seemed to work fine. I'm happy that this could be resolved. It was a fantastic journey for sure! Thank you for that! :)

@kevin-lindsay-1
Author

Yes, just meeting my monthly quota of "give someone a hyper-specific problem with software that is in heavy flux and possibly already fixed".

@Skarlso
Contributor

Skarlso commented Jul 9, 2021

Those are the best. 😂
