
test: E2E configuration changes to address flakes #2451


Merged

Conversation

jackfrancis
Contributor

@jackfrancis jackfrancis commented Jul 5, 2022

What type of PR is this?

/kind flake

What this PR does / why we need it:

In response to observed flakes, this PR changes the E2E configuration in the following ways (a quick sketch of exercising the new knobs locally follows the list):

  1. Adds a discrete timeout configuration for enforcing GPU MachineDeployment provisioning.
  • set to 30 mins to account for the observed operational slowness of Azure GPU VM bootstrapping
  2. Adds a discrete location configuration for GPU clusters (AZURE_LOCATION_GPU).
  • removed "centralus" and "canadacentral" from this new regions array after observing Standard_NV6 allocation failures there
  3. Adds a discrete timeout configuration for enforcing HA (3 control plane nodes) control plane readiness.
  • set to 30 mins (10 mins higher than the current control plane timeout enforcement) to account for the additional 2 nodes that are provisioned serially in multi-control-plane-node scenarios
  4. Adds a discrete configuration for enforcing the AKS cluster delete timeout, to make this easier to tune in the future.
  • set to 30 mins initially to match the existing timeout enforcement
  5. Fixes the ClusterClass worker node readiness timeout enforcement to use the common 25m timeout configuration for MachineDeployments.
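For reviewers who want to exercise the new knobs locally, here is a minimal sketch; the Make target, the Ginkgo focus string, and the assumption that AZURE_LOCATION_GPU can be overridden from the environment are illustrative and may not match the repo exactly:

```bash
# Minimal sketch (assumed workflow): run only the GPU spec against a region known
# to have Standard_NV6 capacity, overriding the new GPU location knob.
export AZURE_LOCATION_GPU="eastus"                    # hypothetical override of the new config
export GINKGO_FOCUS="Creating a GPU-enabled cluster"  # assumed focus string; adjust to the actual spec name
make test-e2e                                         # assumes the e2e Make target honors these variables
```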

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2448

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 5, 2022
@k8s-ci-robot k8s-ci-robot requested review from devigned and mboersma July 5, 2022 15:01
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 5, 2022
@jackfrancis
Contributor Author

The increase to the AKS cluster delete timeout is in response to the fact that 9 of the last 28 periodic test runs have included this error.

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-e2e-full-main

One might rightly question why the main job fails at such a high rate compared to the v1beta1 (release-1.3 branch as of this writing) periodic test:

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-e2e-full-v1beta1

I verified that the templates are identical, so if something in capz is indeed causing the slower deletion times, it might be a change in our controller implementation.

I also don't see any of these cluster delete failures in our PR test run history:

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-pr-e2e-exp-main

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-exp

@jackfrancis
Contributor Author

jackfrancis commented Jul 5, 2022

Test failure:

  • private cluster scenario, timed out (25m) waiting for one worker node (no logs)

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jul 5, 2022
@jackfrancis
Contributor Author

/retest

@CecileRobertMichon
Contributor

One might rightly question why the main job fails at such a high rate compared to the v1beta1 (release-1.3 branch as of this writing) periodic test

Indeed... 60 minutes to delete a single-node AKS cluster seems extremely long. Can we look at the data for AKS cluster deletions that timed out recently in the test subscription to try to understand whether it actually took that long?
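One way to pull that data, sketched assuming the Azure CLI is logged into the test subscription (the 7-day window and the output shape are just illustrative choices):

```bash
# List recent AKS managed-cluster delete operations from the activity log and when they happened,
# to compare against the e2e timeout; adjust --offset and the query as needed.
az monitor activity-log list \
  --offset 7d \
  --query "[?contains(operationName.value, 'Microsoft.ContainerService/managedClusters/delete')].{rg:resourceGroupName, status:status.value, time:eventTimestamp}" \
  --output table
```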

I think the other two changes are reasonable but I would like to dig in a bit more into the AKS deletions before we silence the signal by upping the timeout to 60 minutes. It could be telling us something is wrong in our code and we need to pay attention as we're about to release this new version of the code, especially if we're not observing the same failures in release-1.3.

Are you able to split the AKS timeout change into its own PR so we can investigate/merge it/revert it on its own?

@CecileRobertMichon
Contributor

Very possible that the issue was introduced by #2168 since that's the most recent change to AKS deletion and I believe it isn't part of release 1.3... I'll try to dig in a bit more tomorrow

@jackfrancis jackfrancis force-pushed the capz-periodic-flakes branch 2 times, most recently from 4fe11d8 to 4d16805 Compare July 6, 2022 07:57
@jackfrancis
Contributor Author

@CecileRobertMichon there are no equivalent recorded AKS delete flakes from presubmit jobs (since 21 June):

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-pr-e2e-exp-main

I went ahead and reverted the 60m timeout for AKS (though I kept the discrete config in place to make this easier to iterate over in the future if test flakes continue to be observed).

@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-exp

@jackfrancis
Contributor Author

The flake above is a GPU flake, but it is not a slow-to-become-ready node. Rather:

E0706 09:32:15.862842 1 azuremachine_controller.go:278] controller/azuremachine/controllers.AzureMachineReconciler.reconcileNormal "msg"="Failed to initialize machine cache" "error"="failed to get VM SKU Standard_NV6 in compute api: reconcile error that cannot be recovered occurred: resource sku with name 'Standard_NV6' and category 'virtualMachines' not found in location 'canadacentral'. Object will not be requeued" "name"="capz-e2e-0dd399-gpu-md-0-6hntj" "namespace"="capz-e2e-0dd399" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureMachine" "x-ms-correlation-request-id"="0729fd4a-7161-4202-9c08-23312e82d42b"

I was able to easily repro lack of SKU capacity in canadacentral. One thing that could explain different test results between main and release-1.3 is the region list:

main: local REGIONS=("canadacentral" "centralus" "eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2" "westus3")
release-1.3: local REGIONS=("eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2")

I'll enumerate the regions in main and look for any others that don't have the GPU SKU. Then I think we should standardize test regions across all branches that get periodic coverage.
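A quick way to do that enumeration with the Azure CLI (a sketch; the region list is copied from the main job above, and restricted entries still need a manual look at the output):

```bash
# Check whether Standard_NV6 is offered in each region the main job uses.
REGIONS=("canadacentral" "centralus" "eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2" "westus3")
for region in "${REGIONS[@]}"; do
  echo "== ${region} =="
  # An empty table, or a non-empty Restrictions column, means the SKU can't be allocated there.
  az vm list-skus --location "${region}" --size Standard_NV6 --output table
done
```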

@jackfrancis jackfrancis force-pushed the capz-periodic-flakes branch from 4d16805 to 222cd89 Compare July 6, 2022 14:06
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 6, 2022
@jackfrancis
Contributor Author

My tests suggest that "centralus" and "canadacentral" are not able to allocate Standard_NV6, so I've introduced a new AZURE_LOCATION_GPU configuration vector in order to maintain a GPU-specific regions array for tests.
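For illustration, a hypothetical sketch of how a GPU-specific regions array could feed AZURE_LOCATION_GPU; the helper name and script layout are made up, but the region list matches the change described above (the main list minus "centralus" and "canadacentral"):

```bash
# Hypothetical helper: pick a random GPU-capable region unless the caller pins one explicitly.
select_gpu_location() {
  local GPU_REGIONS=("eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2" "westus3")
  echo "${GPU_REGIONS[RANDOM % ${#GPU_REGIONS[@]}]}"
}

export AZURE_LOCATION_GPU="${AZURE_LOCATION_GPU:-$(select_gpu_location)}"
```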

@jackfrancis
Contributor Author

I repro'd the GPU node readiness timeout failure locally, and saw this in the capz-controller-manager logs:

I0706 16:50:14.164607       1 azuremachine_controller.go:246] controller/azuremachine/controllers.AzureMachineReconciler.reconcileNormal "msg"="Error state detected, skipping reconciliation" "name"="capz-e2e-5fjgdl-gpu-md-0-cv72d" "namespace"="capz-e2e-5fjgdl" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureMachine" "x-ms-correlation-request-id"="7ec00c4c-8c9e-49ef-a68b-7b8d6455ff10"

What does that mean?

@CecileRobertMichon
Contributor

I0706 16:50:14.164607 1 azuremachine_controller.go:246] controller/azuremachine/controllers.AzureMachineReconciler.reconcileNormal "msg"="Error state detected, skipping reconciliation" "name"="capz-e2e-5fjgdl-gpu-md-0-cv72d" "namespace"="capz-e2e-5fjgdl" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureMachine" "x-ms-correlation-request-id"="7ec00c4c-8c9e-49ef-a68b-7b8d6455ff10"

What does that mean?

The CAPZ controller doesn't reconcile failed AzureMachines because they are immutable. Once a machine is in a "failed" state, the only way to recover is to delete it and create a new one (or let the machine health check do it for you, if you have one configured). The best way to learn more about what happened is to look at the FailureReason in the failed AzureMachine's status.
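For example, assuming the standard Cluster API failure fields on the AzureMachine status, something like this (names taken from the log line above) surfaces the reason:

```bash
# Inspect why the AzureMachine was marked failed; failureReason/failureMessage are the
# standard Cluster API status fields, populated when the machine enters a terminal error state.
kubectl get azuremachine capz-e2e-5fjgdl-gpu-md-0-cv72d -n capz-e2e-5fjgdl \
  -o jsonpath='{.status.failureReason}: {.status.failureMessage}{"\n"}'
```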

@jackfrancis jackfrancis force-pushed the capz-periodic-flakes branch from 222cd89 to 3637cb7 Compare July 6, 2022 18:14
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 6, 2022
@jackfrancis
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-exp

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 6, 2022
@k8s-ci-robot k8s-ci-robot merged commit 503acc7 into kubernetes-sigs:main Jul 6, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.4 milestone Jul 6, 2022
@jackfrancis jackfrancis deleted the capz-periodic-flakes branch December 9, 2022 22:39