Skip to content
This repository has been archived by the owner on Oct 10, 2023. It is now read-only.

Graceful upgrade of addons-manager #3229

Merged
merged 1 commit into from
Sep 20, 2022

Conversation

adduarte
Copy link

@adduarte adduarte commented Aug 31, 2022

Graceful upgrade of addons-manager

- Pauses lifecycle management of addons-manager package
- NoopDelete addons-manager packageinstall
- Deletes adoons-manager addon secret

What this PR does / why we need it

Adds necessary functions and logic to gracefully upgrade addons-manager package from the package provided by core repository to the addons-manager provided by the management repository.

logic flow:
pause lifecycle management for addons-manager
pause reconciliation for existing pkgi
pause reconciliation for existing addonsSecret
noop delete of existing addons-manger pkgi
wait for succefull deployment of management packages (inluding new addons-manager)
remove old addonsSecret for old addons-manager

Which issue(s) this PR fixes

Fixes #3265

Describe testing done for PR

  • Manual testing of PauseAddonLifecycleManagement, NoopDeletePackageInstall, and DeleteAddonSecret
  • Unit testing of PauseAddonLifecycleManagement, NoopDeletePackageInstall, and DeleteAddonSecret

Release note

Upgrade to using addons-manager package from management repository
During upgrade previous addons-managerr lifecycle is paused, and if upgrade is succesful  previous addons-manager addon secret is remove to prevent it being reinstalled by reconciliation logic. 

Additional information

Special notes for your reviewer

Testing done

Deployed 1.6 GA on to vpshere

set the following env variables

export _MANAGEMENT_PACKAGE_REPO_IMAGE=gcr.io/eminent-nation-87317/tanzu_framework/github-actions/main/management:v0.21.0
export _MANAGEMENT_PACKAGE_VERSION=0.21.0

ran "tanzu mc upgrade -y"
results:

Results:

aduarte@aduarte-a01 poc1 % kubectl get pkgi -n tkg-system tanzu-addons-manager
NAME                   PACKAGE NAME                      PACKAGE VERSION   DESCRIPTION           AGE
tanzu-addons-manager   addons-manager.tanzu.vmware.com   0.21.0            Reconcile succeeded   36m

and pods where generated with matching image

@adduarte adduarte requested review from a team as code owners August 31, 2022 18:59
@github-actions
Copy link

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220831190902/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@codecov
Copy link

codecov bot commented Aug 31, 2022

Codecov Report

Merging #3229 (458ccb0) into main (fb71679) will decrease coverage by 4.99%.
The diff coverage is n/a.

❗ Current head 458ccb0 differs from pull request most recent head 872fbf7. Consider uploading reports for the commit 872fbf7 to get more accurate results

@@            Coverage Diff             @@
##             main    #3229      +/-   ##
==========================================
- Coverage   51.63%   46.64%   -5.00%     
==========================================
  Files         122      281     +159     
  Lines       11197    29654   +18457     
==========================================
+ Hits         5782    13831    +8049     
- Misses       4938    14567    +9629     
- Partials      477     1256     +779     
Impacted Files Coverage Δ
pkg/v1/tkg/tkgctl/compatibility.go 70.00% <0.00%> (ø)
pkg/v1/builder/command/publish/local.go 0.00% <0.00%> (ø)
cmd/cli/plugin-admin/builder/publish.go 0.00% <0.00%> (ø)
pkg/v1/tkg/clusterclient/clusterclient.go 49.28% <0.00%> (ø)
pkg/v1/tkg/tkgconfigupdater/ensure.go 14.98% <0.00%> (ø)
pkg/v1/sdk/features/util/namespace_selector.go 80.00% <0.00%> (ø)
pkg/v1/tkg/cmd/update_credentials_cluster.go 25.00% <0.00%> (ø)
pkg/v1/tkg/tkgconfigbom/bom.go 49.80% <0.00%> (ø)
pkg/v1/tkg/client/get_kubernetes_versions.go 32.25% <0.00%> (ø)
pkg/v1/tkg/cmd/version.go 20.00% <0.00%> (ø)
... and 149 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@adduarte adduarte force-pushed the control_package_lifecycle branch from e76ef6d to bacd323 Compare September 1, 2022 08:18
@github-actions
Copy link

github-actions bot commented Sep 1, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220901082714/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@github-actions
Copy link

github-actions bot commented Sep 1, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220901091247/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte force-pushed the control_package_lifecycle branch from 638cdc4 to fb327da Compare September 1, 2022 09:17
@github-actions
Copy link

github-actions bot commented Sep 1, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220901094216/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@github-actions
Copy link

github-actions bot commented Sep 1, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220901103031/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte force-pushed the control_package_lifecycle branch 2 times, most recently from 48a065b to 0131e71 Compare September 2, 2022 08:27
@github-actions
Copy link

github-actions bot commented Sep 2, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220902082815/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte force-pushed the control_package_lifecycle branch from 0131e71 to 8bc9b89 Compare September 2, 2022 08:30
@github-actions
Copy link

github-actions bot commented Sep 2, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220902083407/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@github-actions
Copy link

github-actions bot commented Sep 2, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220902083702/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte force-pushed the control_package_lifecycle branch from 8bc9b89 to 5416bd0 Compare September 2, 2022 18:39
@github-actions
Copy link

github-actions bot commented Sep 2, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220902184632/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte force-pushed the control_package_lifecycle branch from 5416bd0 to 9aedbd0 Compare September 3, 2022 09:23
@adduarte adduarte changed the title WIP: Graceful upgrade of addons-manager Graceful upgrade of addons-manager Sep 3, 2022
@adduarte adduarte force-pushed the control_package_lifecycle branch from 9aedbd0 to e50ddfa Compare September 3, 2022 09:26
@github-actions
Copy link

github-actions bot commented Sep 3, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220903093350/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@github-actions
Copy link

github-actions bot commented Sep 3, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220903093409/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte force-pushed the control_package_lifecycle branch from e50ddfa to fd3db31 Compare September 5, 2022 05:47
@github-actions
Copy link

github-actions bot commented Sep 5, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220905055604/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte force-pushed the control_package_lifecycle branch from fd3db31 to 231b6ca Compare September 6, 2022 22:59
@github-actions
Copy link

github-actions bot commented Sep 6, 2022

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220906230649/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte force-pushed the control_package_lifecycle branch from b93c242 to 2986846 Compare September 15, 2022 02:53
@github-actions
Copy link

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220915030400/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte added the ok-to-merge PRs should be labelled with this before merging label Sep 15, 2022
@adduarte adduarte requested a review from anujc25 September 15, 2022 16:40
@adduarte adduarte force-pushed the control_package_lifecycle branch from 2986846 to f4c9f62 Compare September 15, 2022 23:21
@github-actions
Copy link

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220915233741/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

Copy link
Contributor

@vijaykatam vijaykatam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments. I think cluster upgrade should be repeatable in case of errors that happen midway(network timeouts, what not) so lets ensure this code is idempotent.

@adduarte adduarte added do-not-merge/hold Some fixes necessary, hold for merging and removed ok-to-merge PRs should be labelled with this before merging labels Sep 16, 2022
@adduarte adduarte force-pushed the control_package_lifecycle branch from f4c9f62 to 289b6c1 Compare September 16, 2022 05:33
@github-actions
Copy link

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220916054307/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte added ok-to-merge PRs should be labelled with this before merging and removed do-not-merge/hold Some fixes necessary, hold for merging labels Sep 16, 2022
@adduarte adduarte requested a review from tenczar September 16, 2022 21:07
@adduarte
Copy link
Author

Succesfully tested for upgrade repeatable in case of failure.

Copy link
Contributor

@vijaykatam vijaykatam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check out why capd is failing?

return err
}

err = NoopDeletePackageInstall(clusterClient, addonsManagerName, constants.TkgNamespace)
Copy link
Contributor

@vuil vuil Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: how severe is failure in any of the 3 steps (PauseAddon.. NoopDelete.., DeleteAddon...)? I wonder if there is any that is harmless enough to just log and allow larger operation (likely the MC upgrade here) to succeed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a failure here but we continue with upgrade, it will leave us with a system in an unknown state.
as best we can predict we will would end up with the wrong version of addons-manager (at best) at worst we end up with two... Its really dificult to predict thus why we though it best to stop.
If a failure occurs and we have to stop upgrade, the upgrade can be restarted. The code is meant to be idempotent, and local testing on a vsphere cluster, has so far shown that it is...i.e. I can run the upgrade over and over in case of a failure, and the code results in the correct deployment of pkgis.

I think is riskier to continue if we fail because the user would have a lot of cleanning up to do if we don't stop the upgrade.

@vuil
Copy link
Contributor

vuil commented Sep 16, 2022

posted a question.
cc @anujc25 you may want to comment on the changes to the InstallManagementComponents flow

@adduarte nit: might be good idea to also mention what got cleaned up in the "Release Notes" section

Copy link
Contributor

@tenczar tenczar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment from me.


err = pausePackageInstallReconciliation(clusterClient, pkgiName, namespace)
if err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get here and this error occurs, that means the secret reconciliation has been paused, but presumably on the package install? Are there remediation steps? what happens if someone tries the operation again? Will this fail at the pauseAddonSecret step?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we fail at this point, then the remediation step would be to retry the upgrade.
If total failure happens and user changes their mind about upgrade, they will have to unpause the addons secret reconciliation by hand. We could try to "roll back" the addons secret pause, but it is difficult to predict the state of the system with enough accuracy to ensure a succesfull "roll back" . Perhaps is best to let the user explore why the failure and unpause by hand. We do put out a log that we are pausing the addons secret. Perhaps we should put a log here saying that user needs to check system and unpause secret.

What will happen then is that the addonsecret is paused so there is no reconcilation of old addons-manager.
This would not block any new attempts to upgrade. We tested locally that scenario and the upgrade is "succesful" if you restart it after a failure at this point.

The only other scenario is if the user decides to not do the upgrade after a failure here, which means they would have to manually unpause the lifecycle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then it probably makes sense to add that log message when this second pause fails, and then filing a docs bug to make sure that the steps for recovery are documented when a user doesn't just retry the upgrade.

@adduarte adduarte force-pushed the control_package_lifecycle branch from 289b6c1 to e3f96a7 Compare September 20, 2022 00:25
@github-actions
Copy link

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220920004016/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte force-pushed the control_package_lifecycle branch from e3f96a7 to 3cd7cfe Compare September 20, 2022 01:06
@github-actions
Copy link

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220920011728/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

Before deploying the addons-manager package
checks to see if addons-manager is installed and
is moving repositories.

If so then:
- Pauses lifecycle management of addons-manager package
- NoopDelete addons-manager packageinstall
- Deletes adoons-manager addon secret
@adduarte adduarte force-pushed the control_package_lifecycle branch from 3cd7cfe to 872fbf7 Compare September 20, 2022 02:03
@github-actions
Copy link

Cluster Generation A/B Results:
https://storage.googleapis.com/tkg-clustergen/3229/20220920021454/clustergen.diff.txt
Author/reviewers:
Please review to verify that the effects on the generated cluster configurations are exactly what the PR intended, and give a thumbs-up if so.

@adduarte adduarte merged commit 142aa6f into vmware-tanzu:main Sep 20, 2022
# for free to subscribe to this conversation on GitHub. Already have an account? #.
Labels
cla-not-required ok-to-merge PRs should be labelled with this before merging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Logic and code are needed to upgrade addons-manager from core to manage repository
6 participants