EBS CSI Driver issue causing kubetest2 failures - IMDS metadata and Kubernetes metadata are both unavailable #1061

Open
mmerkes opened this issue Nov 25, 2024 · 6 comments
Labels
kind/failing-test, triage/accepted

Comments

@mmerkes
Contributor

mmerkes commented Nov 25, 2024

Which jobs are failing:

pull-cloud-provider-aws-e2e-kubetest2-quick
pull-cloud-provider-aws-e2e-kubetest2

Which test(s) are failing:
BeforeSuite is failing because CPI nodes aren't stabilizing.

Since when has it been failing:
This one passed on 10/31.

This one failed on 11/6, so the regression was introduced sometime between these two runs.

Testgrid link:

  1. First seen failure
  2. Failed 11/25

Reason for failure:

EBS CSI pod is not stabilizing:

2024-11-25T18:30:42.52251214Z stderr F I1125 18:30:42.522404       1 main.go:157] "Initializing metadata"
2024-11-25T18:30:47.523520821Z stderr F E1125 18:30:47.523424       1 metadata.go:51] "Retrieving IMDS metadata failed, falling back to Kubernetes metadata" err="could not get EC2 instance identity metadata: operation error ec2imds: GetInstanceIdentityDocument, canceled, context deadline exceeded"
2024-11-25T18:30:47.530862069Z stderr F E1125 18:30:47.530760       1 metadata.go:58] "Retrieving Kubernetes metadata failed" err="could not retrieve instance type from topology label"
2024-11-25T18:30:47.530928736Z stderr F E1125 18:30:47.530882       1 main.go:162] "Failed to initialize metadata when it is required" err="IMDS metadata and Kubernetes metadata are both unavailable"
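
The log lines above suggest a two-step fallback: IMDS is tried first, then Kubernetes node metadata, and startup fails only when both are unavailable. Below is a minimal sketch of that control flow, not the driver's actual code. The function names, the returned dictionary shape, and the use of the `node.kubernetes.io/instance-type` label are all assumptions made for illustration.

```python
from typing import Callable, Dict


class MetadataUnavailable(Exception):
    """Raised when neither metadata source can be used."""


def metadata_from_labels(labels: Dict[str, str]) -> Dict[str, str]:
    # Assumption: the instance type is read from this well-known node label.
    instance_type = labels.get("node.kubernetes.io/instance-type")
    if not instance_type:
        raise LookupError("could not retrieve instance type from topology label")
    return {"instance_type": instance_type}


def initialize_metadata(
    fetch_imds: Callable[[], Dict[str, str]],
    fetch_node_labels: Callable[[], Dict[str, str]],
) -> Dict[str, str]:
    # Step 1: try IMDS (hypothetical callable supplied by the caller).
    try:
        return fetch_imds()
    except Exception as imds_err:
        print(f"Retrieving IMDS metadata failed, falling back: {imds_err}")

    # Step 2: fall back to labels on the Kubernetes Node object.
    try:
        return metadata_from_labels(fetch_node_labels())
    except Exception as k8s_err:
        print(f"Retrieving Kubernetes metadata failed: {k8s_err}")

    # Both sources failed, which matches the fatal error in the log.
    raise MetadataUnavailable(
        "IMDS metadata and Kubernetes metadata are both unavailable"
    )
```

In this sketch, a node with no instance-type label and an unreachable IMDS endpoint reproduces the same three-message sequence seen in the pod log.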

Anything else we need to know:

/kind failing-test

@k8s-ci-robot added the kind/failing-test and needs-triage labels Nov 25, 2024
@mmerkes
Contributor Author

mmerkes commented Nov 25, 2024

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 25, 2024
@dims
Member

dims commented Nov 25, 2024

cc @ConnorJC3 @torredil

@mmerkes
Contributor Author

mmerkes commented Nov 25, 2024

Not sure if they're related to each other, but also see this error in kubelet:

Nov 25 18:34:03 ip-172-31-24-156 kubelet[6298]: E1125 18:34:03.425509 6298 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"aws-cloud-controller-manager\" with ImagePullBackOff: \"Back-off pulling image \\\"209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea\\\": ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image \\\"209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea\\\": failed to resolve reference \\\"209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea\\\": 209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea: not found\"" pod="kube-system/aws-cloud-controller-manager-cq6m2" podUID="b6d43d27-1967-414e-86f8-72b3e9375664"

@ConnorJC3

> Not sure if they're related to each other, but also see this error in kubelet:

Very likely related - as I believe it is the AWS CCM that adds the labels we rely on for metadata to the node.
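
One way to check this theory is to inspect the labels on an affected node, for example from the output of `kubectl get node <name> -o json`. The sketch below reports which labels are missing. The list of label names is an assumption based on the well-known Kubernetes node labels, not a confirmed list of what the EBS CSI driver reads, and `missing_metadata_labels` is a hypothetical helper.

```python
import json
from typing import List

# Assumption: these well-known labels are the ones a cloud controller
# manager would set and a metadata fallback would read.
ASSUMED_LABELS = (
    "node.kubernetes.io/instance-type",
    "topology.kubernetes.io/region",
    "topology.kubernetes.io/zone",
)


def missing_metadata_labels(node_json: str) -> List[str]:
    """Return the assumed labels that are absent or empty on a Node object."""
    node = json.loads(node_json)
    labels = node.get("metadata", {}).get("labels") or {}
    return [name for name in ASSUMED_LABELS if not labels.get(name)]
```

An empty result would suggest the labels are present and the cause lies elsewhere; a non-empty result is consistent with the CCM never having initialized the node.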

@mmerkes
Contributor Author

mmerkes commented Nov 25, 2024

> Very likely related - as I believe it is the AWS CCM that adds the labels we rely on for metadata to the node.

Sounds right. Looks like that's a red herring.

@lavalex

lavalex commented Dec 18, 2024

I'm getting this error on OpenShift. Any ideas how to solve it? Thanks.
