
test: E2E configuration changes to address flakes #2451


Merged

Conversation

jackfrancis
Contributor

@jackfrancis jackfrancis commented Jul 5, 2022

What type of PR is this?

/kind flake

What this PR does / why we need it:

In response to observed flakes, this PR changes the E2E configuration in the following ways (a quick sketch of exercising the new knobs locally follows the list):

  1. Adds a discrete timeout configuration for enforcing GPU MachineDeployment provisioning.
  • set to 30 mins to account for the observed operational slowness of Azure GPU VM bootstrapping
  2. Adds a discrete location configuration for GPU clusters (AZURE_LOCATION_GPU).
  • removed "centralus" and "canadacentral" from this new regions array after observing Standard_NV6 allocation failures there
  3. Adds a discrete timeout configuration for enforcing HA (3 control plane nodes) control plane readiness.
  • set to 30 mins (10 mins higher than the current control plane timeout enforcement) to account for the additional 2 nodes that are provisioned serially in multi-control-plane-node scenarios
  4. Adds a discrete configuration for enforcing the AKS cluster delete timeout, to make this easier to tune in the future.
  • set to 30 mins initially to match the existing timeout enforcement
  5. Fixes the ClusterClass worker node readiness timeout enforcement to use the common 25m timeout configuration for MachineDeployments.
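For reviewers who want to exercise the new knobs locally, here is a minimal sketch; the Make target, the Ginkgo focus string, and the assumption that AZURE_LOCATION_GPU can be overridden from the environment are illustrative and may not match the repo exactly:

```bash
# Minimal sketch (assumed workflow): run only the GPU spec against a region known
# to have Standard_NV6 capacity, overriding the new GPU location knob.
export AZURE_LOCATION_GPU="eastus"                    # hypothetical override of the new config
export GINKGO_FOCUS="Creating a GPU-enabled cluster"  # assumed focus string; adjust to the actual spec name
make test-e2e                                         # assumes the e2e Make target honors these variables
```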

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2448

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 5, 2022
@k8s-ci-robot k8s-ci-robot requested review from devigned and mboersma July 5, 2022 15:01
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 5, 2022
@jackfrancis
Contributor Author

The increase to the AKS cluster delete timeout is in response to the fact that 9 of the last 28 periodic test runs have included this error.

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-e2e-full-main

One might rightly question why the main job fails at such a high rate compared to the v1beta1 (release-1.3 branch as of this writing) periodic test:

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-e2e-full-v1beta1

I verified that the templates are identical, so if something in capz is indeed causing the slower deletion times, it might be a change in our controller implementation.

I also don't see any of these cluster delete failures in our PR test run history:

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-pr-e2e-exp-main

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-exp

@jackfrancis
Contributor Author

jackfrancis commented Jul 5, 2022

Test failure:

  • private cluster scenario, timed out (25m) waiting for one worker node (no logs)

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jul 5, 2022
@jackfrancis
Contributor Author

/retest

@CecileRobertMichon
Contributor

One might rightly question why the main job fails at such a high rate compared to the v1beta1 (release-1.3 branch as of this writing) periodic test

Indeed... 60 minutes to delete a single-node AKS cluster seems extremely long. Can we look at the data for AKS cluster deletions that timed out recently in the test subscription to try to understand whether it actually took that long?
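One way to pull that data, sketched assuming the Azure CLI is logged into the test subscription (the 7-day window and the output shape are just illustrative choices):

```bash
# List recent AKS managed-cluster delete operations from the activity log and when they happened,
# to compare against the e2e timeout; adjust --offset and the query as needed.
az monitor activity-log list \
  --offset 7d \
  --query "[?contains(operationName.value, 'Microsoft.ContainerService/managedClusters/delete')].{rg:resourceGroupName, status:status.value, time:eventTimestamp}" \
  --output table
```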

I think the other two changes are reasonable but I would like to dig in a bit more into the AKS deletions before we silence the signal by upping the timeout to 60 minutes. It could be telling us something is wrong in our code and we need to pay attention as we're about to release this new version of the code, especially if we're not observing the same failures in release-1.3.

Are you able to split the AKS timeout change into its own PR so we can investigate/merge it/revert it on its own?

@CecileRobertMichon
Contributor

Very possible that the issue was introduced by #2168 since that's the most recent change to AKS deletion and I believe it isn't part of release 1.3... I'll try to dig in a bit more tomorrow

@jackfrancis jackfrancis force-pushed the capz-periodic-flakes branch 2 times, most recently from 4fe11d8 to 4d16805 Compare July 6, 2022 07:57
@jackfrancis
Contributor Author

@CecileRobertMichon there are no equivalent recorded AKS delete flakes from presubmit jobs (since 21 June):

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-pr-e2e-exp-main

I went ahead and reverted the 60m timeout for AKS (though I kept the discrete config in place to make this easier to iterate over in the future if test flakes continue to be observed).

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-exp

@jackfrancis
Contributor Author

The flake above is a GPU flake, but it is not a slow-to-become-ready node. Rather:

E0706 09:32:15.862842 1 azuremachine_controller.go:278] controller/azuremachine/controllers.AzureMachineReconciler.reconcileNormal "msg"="Failed to initialize machine cache" "error"="failed to get VM SKU Standard_NV6 in compute api: reconcile error that cannot be recovered occurred: resource sku with name 'Standard_NV6' and category 'virtualMachines' not found in location 'canadacentral'. Object will not be requeued" "name"="capz-e2e-0dd399-gpu-md-0-6hntj" "namespace"="capz-e2e-0dd399" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureMachine" "x-ms-correlation-request-id"="0729fd4a-7161-4202-9c08-23312e82d42b"

I was able to easily repro lack of SKU capacity in canadacentral. One thing that could explain different test results between main and release-1.3 is the region list:

main: local REGIONS=("canadacentral" "centralus" "eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2" "westus3")
release-1.3: local REGIONS=("eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2")

I'll enumerate the regions in main and look for any others that don't have the GPU SKU. Then I think we should standardize test regions across all branches that get periodic coverage.
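A quick way to do that enumeration with the Azure CLI (a sketch; the region list is copied from the main job above, and restricted entries still need a manual look at the output):

```bash
# Check whether Standard_NV6 is offered in each region the main job uses.
REGIONS=("canadacentral" "centralus" "eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2" "westus3")
for region in "${REGIONS[@]}"; do
  echo "== ${region} =="
  # An empty table, or a non-empty Restrictions column, means the SKU can't be allocated there.
  az vm list-skus --location "${region}" --size Standard_NV6 --output table
done
```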

@jackfrancis jackfrancis force-pushed the capz-periodic-flakes branch from 4d16805 to 222cd89 Compare July 6, 2022 14:06
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 6, 2022
@jackfrancis
Contributor Author

My tests suggest that "centralus" and "canadacentral" are not able to allocate Standard_NV6, so I've introduced a new AZURE_LOCATION_GPU configuration vector in order to maintain a GPU-specific regions array for tests.
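For illustration, a hypothetical sketch of how a GPU-specific regions array could feed AZURE_LOCATION_GPU; the helper name and script layout are made up, but the region list matches the change described above (the main list minus "centralus" and "canadacentral"):

```bash
# Hypothetical helper: pick a random GPU-capable region unless the caller pins one explicitly.
select_gpu_location() {
  local GPU_REGIONS=("eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2" "westus3")
  echo "${GPU_REGIONS[RANDOM % ${#GPU_REGIONS[@]}]}"
}

export AZURE_LOCATION_GPU="${AZURE_LOCATION_GPU:-$(select_gpu_location)}"
```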

@jackfrancis
Contributor Author

I repro'd the GPU node readiness timeout failure locally, and saw this in the capz-controller-manager logs:

I0706 16:50:14.164607       1 azuremachine_controller.go:246] controller/azuremachine/controllers.AzureMachineReconciler.reconcileNormal "msg"="Error state detected, skipping reconciliation" "name"="capz-e2e-5fjgdl-gpu-md-0-cv72d" "namespace"="capz-e2e-5fjgdl" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureMachine" "x-ms-correlation-request-id"="7ec00c4c-8c9e-49ef-a68b-7b8d6455ff10"

What does that mean?

@CecileRobertMichon
Contributor

I0706 16:50:14.164607 1 azuremachine_controller.go:246] controller/azuremachine/controllers.AzureMachineReconciler.reconcileNormal "msg"="Error state detected, skipping reconciliation" "name"="capz-e2e-5fjgdl-gpu-md-0-cv72d" "namespace"="capz-e2e-5fjgdl" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureMachine" "x-ms-correlation-request-id"="7ec00c4c-8c9e-49ef-a68b-7b8d6455ff10"

What does that mean?

The CAPZ controller doesn't reconcile failed AzureMachines because they are immutable. Once a machine is in a "failed" state, the only way to recover is to delete it and create a new one (or let the machine health check do it for you, if you have one configured). The best way to learn more about what happened is to look at the FailureReason in the failed AzureMachine's status.
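For example, assuming the standard Cluster API failure fields on the AzureMachine status, something like this (names taken from the log line above) surfaces the reason:

```bash
# Inspect why the AzureMachine was marked failed; failureReason/failureMessage are the
# standard Cluster API status fields, populated when the machine enters a terminal error state.
kubectl get azuremachine capz-e2e-5fjgdl-gpu-md-0-cv72d -n capz-e2e-5fjgdl \
  -o jsonpath='{.status.failureReason}: {.status.failureMessage}{"\n"}'
```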

@jackfrancis jackfrancis force-pushed the capz-periodic-flakes branch from 222cd89 to 3637cb7 Compare July 6, 2022 18:14
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 6, 2022
@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-exp

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 6, 2022
@k8s-ci-robot k8s-ci-robot merged commit 503acc7 into kubernetes-sigs:main Jul 6, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.4 milestone Jul 6, 2022
@jackfrancis jackfrancis deleted the capz-periodic-flakes branch December 9, 2022 22:39