test: E2E configuration changes to address flakes #2451
Conversation
The increase to the AKS cluster delete timeout is in response to the fact that 9 of the last 28 periodic test runs have included this error.
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-e2e-full-main

One might rightly question why the main job fails at such a high rate compared to the v1beta1 job. I verified that the templates are identical, so if there's indeed something in capz that is the cause of slower deletion times, it might be in changes to our controller implementation.

I also don't see any of these cluster delete failures in our PR test run history: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-pr-e2e-exp-main |
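(For context on what that timeout governs: the e2e suite polls for cluster deletion until a deadline. Below is a minimal stdlib-only sketch of that wait pattern, with a hypothetical `deleted` check; it is not the actual capz test code, which drives these waits through the cluster-api test framework's configured intervals.)

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForDeletion polls the supplied check until it reports true or the
// timeout elapses. The e2e "delete timeout" under discussion corresponds to
// the timeout argument here; the poll argument is the retry period.
func waitForDeletion(ctx context.Context, timeout, poll time.Duration, deleted func(context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(poll)
	defer ticker.Stop()

	for {
		done, err := deleted(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for cluster deletion")
		case <-ticker.C:
		}
	}
}

func main() {
	// Hypothetical stand-in for "has the AKS cluster finished deleting?".
	start := time.Now()
	deleted := func(context.Context) (bool, error) {
		return time.Since(start) > 3*time.Second, nil
	}

	// Demo values; a real e2e interval would look more like 30m with a 10s poll.
	if err := waitForDeletion(context.Background(), 10*time.Second, time.Second, deleted); err != nil {
		fmt.Println("delete wait failed:", err)
		return
	}
	fmt.Println("cluster deleted")
}
```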
/test pull-cluster-api-provider-azure-e2e-optional |
Test failure:
|
/retest |
Indeed... 60 minutes to delete a single node AKS cluster seems extremely long. Can we look at the data for AKS cluster deletions that timed out recently in the test subscription to try and understand if it actually took that long?

I think the other two changes are reasonable, but I would like to dig in a bit more into the AKS deletions before we silence the signal by upping the timeout to 60 minutes. It could be telling us something is wrong in our code and we need to pay attention as we're about to release this new version of the code, especially if we're not observing the same failures in release-1.3.

Are you able to split the AKS timeout change into its own PR so we can investigate/merge it/revert it on its own? |
Very possible that the issue was introduced by #2168 since that's the most recent change to AKS deletion and I believe it isn't part of release 1.3... I'll try to dig in a bit more tomorrow |
Force-pushed from 4fe11d8 to 4d16805
@CecileRobertMichon there are no equivalent recorded AKS delete flakes from presubmit jobs (since 21 June): https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-pr-e2e-exp-main I went ahead and reverted the 60m timeout for AKS (though I kept the discrete config in place to make this easier to iterate over in the future if test flakes continue to be observed). |
/test pull-cluster-api-provider-azure-e2e-optional |
The flake above is a GPU flake, but it is not a slow-to-become-ready node. Rather:

E0706 09:32:15.862842 1 azuremachine_controller.go:278] controller/azuremachine/controllers.AzureMachineReconciler.reconcileNormal "msg"="Failed to initialize machine cache" "error"="failed to get VM SKU Standard_NV6 in compute api: reconcile error that cannot be recovered occurred: resource sku with name 'Standard_NV6' and category 'virtualMachines' not found in location 'canadacentral'. Object will not be requeued" "name"="capz-e2e-0dd399-gpu-md-0-6hntj" "namespace"="capz-e2e-0dd399" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureMachine" "x-ms-correlation-request-id"="0729fd4a-7161-4202-9c08-23312e82d42b"

I was able to easily repro the lack of SKU capacity in canadacentral. One thing that could explain different test results between main and release-1.3 is the region list each branch uses. I'll enumerate through the regions in main and look for any others that don't have the GPU SKU. And then I think we should always standardize test regions for all branches that get periodic coverage. |
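(One way to do that enumeration, as a rough sketch rather than what this PR actually does: loop over candidate regions and shell out to the Azure CLI's `az vm list-skus`, checking whether the GPU SKU is listed for each. The region list here is illustrative only.)

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Illustrative region list only; the real e2e config keeps its own list.
	regions := []string{"canadacentral", "centralus", "eastus", "westus2"}
	const sku = "Standard_NV6"

	for _, region := range regions {
		// Coarse check: does `az vm list-skus` list the SKU for this region at all?
		// A closer look at the Restrictions column would still be needed to rule
		// out subscription-level restrictions.
		out, err := exec.Command("az", "vm", "list-skus",
			"--location", region, "--size", sku, "--output", "table").CombinedOutput()
		if err != nil {
			fmt.Printf("%s: az query failed: %v\n", region, err)
			continue
		}
		if strings.Contains(string(out), sku) {
			fmt.Printf("%s: %s is listed\n", region, sku)
		} else {
			fmt.Printf("%s: %s is NOT listed\n", region, sku)
		}
	}
}
```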
Force-pushed from 4d16805 to 222cd89
My tests suggest that "centralus" and "canadacentral" are not able to allocate Standard_NV6, so I've introduced a new AZURE_LOCATION_GPU configuration variable for choosing the GPU test region. |
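(A rough sketch of how such a variable could be consumed, assuming a fallback to AZURE_LOCATION when the GPU-specific variable is unset; this is a hypothetical helper, not the actual capz code.)

```go
package main

import (
	"fmt"
	"os"
)

// gpuLocation returns the region to use for GPU tests: the dedicated
// AZURE_LOCATION_GPU value if set, otherwise the general AZURE_LOCATION.
func gpuLocation() string {
	if loc := os.Getenv("AZURE_LOCATION_GPU"); loc != "" {
		return loc
	}
	return os.Getenv("AZURE_LOCATION")
}

func main() {
	fmt.Println("GPU test region:", gpuLocation())
}
```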
I repro'd the GPU node readiness timeout failure locally, and saw this in the capz-controller-manager logs:

I0706 16:50:14.164607 1 azuremachine_controller.go:246] controller/azuremachine/controllers.AzureMachineReconciler.reconcileNormal "msg"="Error state detected, skipping reconciliation" "name"="capz-e2e-5fjgdl-gpu-md-0-cv72d" "namespace"="capz-e2e-5fjgdl" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureMachine" "x-ms-correlation-request-id"="7ec00c4c-8c9e-49ef-a68b-7b8d6455ff10"

What does that mean? |
The CAPZ controller doesn't reconcile failed AzureMachines because they are immutable. Once a machine is in "failed" state, the only way to recover it is to delete it and create a new one (or let the health check do it for you if you have one configured). The best way to know more about what happened would be to look at the |
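(The general pattern being described, as a self-contained sketch: a reconciler that short-circuits when the resource has already recorded a terminal failure. This is illustrative only; the real AzureMachineReconciler operates on Kubernetes API objects through controller-runtime.)

```go
package main

import "fmt"

// machineStatus is a stand-in for the status of an AzureMachine-like resource.
type machineStatus struct {
	FailureReason  string
	FailureMessage string
}

// reconcile skips work for machines that have recorded a terminal failure,
// mirroring the "Error state detected, skipping reconciliation" log above.
func reconcile(name string, status machineStatus) {
	if status.FailureReason != "" || status.FailureMessage != "" {
		fmt.Printf("%s: error state detected, skipping reconciliation\n", name)
		return
	}
	fmt.Printf("%s: reconciling normally\n", name)
}

func main() {
	// "VMProvisioningFailed" is an illustrative value, not a capz constant.
	reconcile("gpu-md-0-cv72d", machineStatus{FailureReason: "VMProvisioningFailed"})
	reconcile("control-plane-0", machineStatus{})
}
```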
Force-pushed from 222cd89 to 3637cb7
/test pull-cluster-api-provider-azure-e2e-optional |
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
What type of PR is this?
/kind flake
What this PR does / why we need it:
In response to observed flakes, this PR changes the E2E configuration in the following way:
- adds a new AZURE_LOCATION_GPU configuration variable

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2448
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: