
Avoid subnets that don't have available IP Addresses #5234

Open
ellistarn opened this issue Dec 5, 2023 · 19 comments · May be fixed by #7310
Labels
feature New feature or request

Comments

@ellistarn (Contributor) commented Dec 5, 2023

Description

What problem are you trying to solve?

When a subnet is almost out of IPs, Karpenter will continue to launch nodes in it, leading to the VPC CNI failing to become ready, and the node becoming unready as well. In many cases, there's nothing we can do, but if another subnet has IP addresses, and the workload does not have scheduling constraints that prevent it from running in those zones, we should launch in the subnets with available IPs.

How important is this feature to you?

Managing the IPv4 space is hard, and anything we can do to alleviate these pains would help customers on their path to IPv6.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@ellistarn added the feature label and removed the needs-triage label Dec 5, 2023
@martinsmatthews commented:
@ellistarn if we have multiple subnets available for a zone, will Karpenter choose the least-full subnet, as described here: https://karpenter.sh/v0.32/concepts/nodeclasses/#specsubnetselectorterms ?
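For reference, subnets are handed to Karpenter through the EC2NodeClass, typically by tag; a minimal sketch (the discovery tag value is illustrative):

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # illustrative tag value; every subnet carrying it is a candidate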

@ellistarn (Contributor, Author) commented Dec 5, 2023

Correct -- this is a good point. We don't do this for subnets in different zones.

@martinsmatthews commented:
It would be really good if we could do this, as it could also naturally balance instances across zones, which would be a nice HA feature. We have seen that unless there is a topology spread constraint in our deployments, multiple instances get spun up in the same zone.

@sthapa-ping commented:
This is one of the burning issues we are currently facing with Karpenter; we discussed it at re:Invent. Checking available IPs across the subnets via the AWS API and scheduling the next instances into the subnet with the most IPs seems straightforward. Is there an approximate release timeline for when we can expect this feature to be added?

@ellistarn (Contributor, Author) commented:
Are you finding that there are 0 remaining IPs and the launch fails, or just a few IPs left? Can you share logs from when this happens? Is the failure at the node level? The pod level?

@martinsmatthews commented Feb 12, 2024:

Are you finding that there are 0 remaining IPs and the launch fails, or just a few IPs?

No, we were seeing that without pod anti-affinity to force them to spread across zones, we'd often end up with multiple nodes in one zone and none in the other two; then all the pods trying to spin up would exhaust the subnet, and we would see pods stuck in Pending because the CNI couldn't assign them an IP. Note that this was with lots of small (CPU/memory) pods, relatively small subnets (/26), and pod security groups. We weren't seeing node launches fail.

We're no longer seeing this issue, as we moved to larger subnets and added the anti-affinity, which means the nodes are spread across the zones/subnets more evenly.

I'm happy to recreate this and send some logs if that would be helpful, @ellistarn?

@ellistarn (Contributor, Author) commented:
@martinsmatthews, have you completely exhausted the IPv4 space? Is it possible to add another subnet with more IPs? Why are you so constrained?

I'm working on an idea for EKS networking and I'd love to chat more over Slack.

@martinsmatthews commented:
Hi @ellistarn, sorry, that was just an example to highlight the issue we're discussing; we don't have this problem any more, as we gave this nodepool three /25s and all is fine now.

FWIW, we run a number of different nodepools per cluster, some with quite small numbers of pods. For example, the one where we were seeing this issue runs no more than 40-60 pods at any one time, so 3x /26 was enough IPs. We are not exactly resource constrained, but at the same time we need to think about how much internal IP space we have and allocate it sensibly, as it is a finite resource.

Definitely happy to chat on Slack - will ping you.

@martinsmatthews commented:
Coming back to using topologySpreadConstraints to solve this issue and stripe pods over AZs (and thus over subnets): this doesn't work perfectly. I spun up 10 deployments with 3 replicas each (plus a large enough resource request to end up with only one pod per node) against a nodepool that had 3 subnets defined, 1 in each of 3 AZs, with a topologySpreadConstraints block like:

      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: topology-test-{{ count }}

and this gave me 30 nodes, 10 in each AZ -- bingo. But then I dropped the replica count to 2 and spun up 15 deployments, and this gave me a very uneven spread of nodes:

  • us-west-2a: 19
  • us-west-2b: 7
  • us-west-2c: 14

Obviously this is artificial, but it does again highlight the need for an option to balance nodes across AZs, even if it is not the default. And not just for subnet usage reasons: there is an HA risk for when we lose a zone. It doesn't happen often, but if it happened while we were skewed like this in a production cluster, it wouldn't be pretty.
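For anyone trying to lean on topologySpreadConstraints harder: the stricter variant below turns the zone spread into a hard scheduling requirement rather than a preference (sketch only; topology-test is the placeholder label from the experiment above):

      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # hard requirement: the scheduler will not place pods that violate the spread
          labelSelector:
            matchLabels:
              app: topology-test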

@Shadowssong commented:
Has anyone come up with a solution to work around this? We have twice run into a situation where Karpenter overloads a single AZ (with two /20 subnets) and both subnets run out of IPs. The skew was quite extreme (200 nodes in one AZ, 50 in the other three), and it seems like a burst of Spot requests may have put them all in the same AZ, but it still seems odd that Karpenter doesn't attempt any kind of load balancing across the AZs. Our current workaround is to define a nodepool per AZ and use the cpu/memory limits to faux-limit the number of nodes and prevent IP exhaustion, but this results in a lot of nodepools and EC2NodeClasses. This only works because we currently use a fairly strict set of instance types, so we know what their max pod count would be. If anyone has come up with a better solution, please share!
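A rough sketch of one of those per-AZ nodepools, with illustrative names and limits rather than our real config:

  apiVersion: karpenter.sh/v1
  kind: NodePool
  metadata:
    name: default-us-west-2a           # one NodePool per AZ
  spec:
    template:
      spec:
        nodeClassRef:
          group: karpenter.k8s.aws
          kind: EC2NodeClass
          name: default                # shared EC2NodeClass; name is illustrative
        requirements:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-west-2a"]     # pin this pool to a single AZ
    limits:
      cpu: "400"                       # rough cap sized so the AZ's subnets can't be exhausted
      memory: 1600Gi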

@snieg commented Sep 25, 2024:

We have twice run into a situation where Karpenter overloads a single AZ (with two /20 subnets) and both subnets run out of IPs.

Same issue here.

After migrating to Karpenter, it started favoring one zone (eu-west-1c), despite there being over 1000 addresses available in the other subnets, which ended with us running out of addresses in eu-west-1c.

╰─❯ kubectl get nodes -L  worker-type,group,topology.kubernetes.io/zone --sort-by=.metadata.creationTimestamp --no-headers   -l group=default | awk '{print $8}'  |sort | uniq -c
   7 eu-west-1a
   7 eu-west-1b
  25 eu-west-1c

For comparison, a different cluster still running cluster-autoscaler:

18 eu-west-1a
17 eu-west-1b
18 eu-west-1c

@maxforasteiro (Contributor) commented:
Hey, can we get some 👀 on the open PRs? @ellistarn

@AayushBangroo commented:
We just had pods in our production clusters stuck because Karpenter keeps launching nodes in a single subnet even when multiple subnets are specified. Any idea when the fix will be merged and released?

@Vacant2333 (Contributor) commented:
@AayushBangroo Based on the current logic, Karpenter will choose the subnet with the most predicted margin when creating a node. As a temporary workaround, you can create a subnet with more margin in a zone that is short on IPs.

@Vacant2333 (Contributor) commented:
Of course, I also hope that my PR can be reviewed and merged as soon as possible. We may need more approvers/maintainers, because the current maintainers are really busy at the moment.
cc @njtran @jonathan-innis

@AayushBangroo commented:
@AayushBangroo Based on the current logic, Karpenter will choose the subnet with the most predicted margin when creating a node. As a temporary workaround, you can create a subnet with more margin in a zone that is short on IPs.

Thanks for the reply. What do you mean by most predicted margin?

@Vacant2333 (Contributor) commented:
@AayushBangroo Simply put, for a given AZ, Karpenter will select one of that AZ's subnets, trying to pick the subnet with the most predicted remaining IPs. So you only need to create a bigger subnet for the AZ whose pods might otherwise go Pending.
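If the EC2NodeClass pins subnets by ID, the new, larger subnet also has to be added there; a minimal sketch with placeholder subnet IDs:

  subnetSelectorTerms:
    - id: subnet-0aaaaaaaaaaaaaaaa   # existing, nearly full subnet in the constrained AZ
    - id: subnet-0bbbbbbbbbbbbbbbb   # new, larger subnet in the same AZ; per the behavior above, it is preferred while it has more free IPs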

@dfsdevops commented Jan 6, 2025:

It seems to me like there are two separate issues:

  1. Karpenter doesn't seem to round-robin its AZ selection unless it is working around scheduling constraints on workloads.
  2. Karpenter has no way to detect whether a subnet is out of IPs.

It seems like folks are talking about fixing item no. 2, but I'm still not sure what explains the behavior of no. 1, unless I missed something in this thread. It seems to me that both issues should be addressed; one just makes the other a little more apparent, but I see them as separate issues.

EDIT: #2144 seems to explain it for some cases, but I'm having the same issue with on-demand instances. Probably some other weight is being considered. I'll see if I can find a workaround that works for me.

@Vacant2333 (Contributor) commented:
I don't think the first problem is Karpenter's problem. One of Karpenter's main purposes is to help you reduce costs as much as possible, and the price of the same instance type can differ between AZs in the same Region, sometimes by as much as 80%. To avoid this, users need to configure their workloads appropriately and provide enough subnets. The second problem does exist: Karpenter can only try to estimate the available IPs for each subnet based on the information it already knows, and at present, even if it estimates that a subnet does not have enough IPs, it will still try to create a node in that AZ. This is the aspect I want to address: when an AZ's subnets don't have enough IPs, avoid choosing that AZ in the first place, rather than having pods go Pending after the node is created.
@dfsdevops If you have any other ideas, just share them here. Thanks.
