Karpenter is taking longer time to fallback to lower weighted nodepool when ICE errors are hit #1899

Open
bparamjeet opened this issue Jan 3, 2025 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@bparamjeet

Description

Observed Behavior:
Karpenter takes too long to fall back to a lower-weighted NodePool when ICE (insufficient capacity) errors occur. After a sudden increase in pod replica count, all of the new pods stay in the Pending state for an extended period.

Expected Behavior:
Karpenter should fall back to a lower-weighted NodePool immediately when ICE errors occur.

Reproduction Steps (Please include YAML):

  • Create multiple NodePools with different weights.
  • Scale a deployment up to a much larger replica count.
  • Karpenter cannot create new nodes because of ICE errors, and pods accumulate in the Pending state.
  • Karpenter keeps reporting ICE errors for some time before it falls back.
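
For illustration only, the setup described in the steps above could look roughly like the following. This is a minimal sketch that assumes the `karpenter.sh/v1` NodePool API on AWS; the NodePool names, the weights, the instance types, and the EC2NodeClass name `default` are all placeholders and not the reporter's actual configuration.

```yaml
# Higher-weighted NodePool: Karpenter considers this one first.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: primary-c7i-c6i            # hypothetical name
spec:
  weight: 100                      # higher weight = higher priority
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # hypothetical EC2NodeClass
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c7i.8xlarge", "c6i.8xlarge"]   # assumed instance types
---
# Lower-weighted NodePool: the intended fallback when the first pool hits ICE errors.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fallback-c5                # hypothetical name
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # hypothetical EC2NodeClass
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.9xlarge", "c5.12xlarge"]    # assumed instance types
```

With manifests along these lines applied, scaling a deployment whose pods fit only these instance types should exercise the fallback path once the higher-weighted pool runs out of capacity.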

Versions:

  • Karpenter Version: 1.0.5
  • Kubernetes Version (kubectl version): v1.31
[Screenshots attached: 2025-01-03 at 2:42:55 PM, 2:43:43 PM, and 2:55:07 PM]
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@bparamjeet bparamjeet added the kind/bug Categorizes issue or PR as related to a bug. label Jan 3, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 3, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Vacant2333

Vacant2333 commented Jan 7, 2025

Is there a necessary condition for this problem to occur? Karpenter caches instance types that failed to create and will not try them again for a few minutes. @bparamjeet
In your tests, how long did it take for Karpenter to fall back to the lower-weighted node pool?

@bparamjeet
Author

In your tests, how long did it take for Karpenter to fall back to the lower-weighted node pool?

  • Karpenter did not fall back to the standby NodePools, which have lower weights.
  • We intervened partway through and adjusted the weights for c5.9x, which then allowed nodes to be created.
  • We have multiple NodePools with the same weight: c7i.8x, c7i.12x, c6i.8x, c6i.12x, c5.9x, and c5.12x. After ICE errors for c7i.8x and c6i.8x, Karpenter does not fall back to c5.9x or c5.12x. Why does it keep preferring c7i and c6i? Why does Karpenter not provision c5 instances when it gets ICE errors for c6i and c7i?
