Graceful upgrade of addons-manager #3229
Codecov Report
@@ Coverage Diff @@
## main #3229 +/- ##
==========================================
- Coverage 51.63% 46.64% -5.00%
==========================================
Files 122 281 +159
Lines 11197 29654 +18457
==========================================
+ Hits 5782 13831 +8049
- Misses 4938 14567 +9629
- Partials 477 1256 +779
pkg/v1/tkg/managementcomponents/management_component_install.go
Minor comments. I think cluster upgrade should be repeatable in case of errors that happen midway (network timeouts and the like), so let's ensure this code is idempotent.
Successfully tested that the upgrade is repeatable in case of failure.
Can you check out why capd is failing?
	return err
}

err = NoopDeletePackageInstall(clusterClient, addonsManagerName, constants.TkgNamespace)
Question: how severe is a failure in any of the three steps (PauseAddon..., NoopDelete..., DeleteAddon...)? I wonder if any of them is harmless enough to just log, allowing the larger operation (likely the MC upgrade here) to succeed.
If we have a failure here but continue with the upgrade, it will leave us with a system in an unknown state.
As best we can predict, we would end up with the wrong version of addons-manager at best, and with two of them at worst. It's really difficult to predict, which is why we thought it best to stop.
If a failure occurs and we have to stop the upgrade, the upgrade can be restarted. The code is meant to be idempotent, and local testing on a vSphere cluster has so far shown that it is, i.e. I can run the upgrade over and over after a failure, and the code results in the correct deployment of pkgis.
I think it is riskier to continue after a failure, because the user would have a lot of cleaning up to do if we don't stop the upgrade.
One minor comment from me.
err = pausePackageInstallReconciliation(clusterClient, pkgiName, namespace)
if err != nil {
	return err
If we get here and this error occurs, that means the secret reconciliation has been paused, but presumably not the package install? Are there remediation steps? What happens if someone tries the operation again? Will this fail at the pauseAddonSecret step?
If we fail at this point, the remediation step is to retry the upgrade.
If a total failure happens and the user changes their mind about the upgrade, they will have to unpause the addons secret reconciliation by hand. We could try to "roll back" the addons secret pause, but it is difficult to predict the state of the system accurately enough to ensure a successful rollback. Perhaps it is best to let the user explore why the failure happened and unpause by hand. We do log that we are pausing the addons secret. Perhaps we should also log here that the user needs to check the system and unpause the secret.
What happens then is that the addon secret stays paused, so there is no reconciliation of the old addons-manager.
This would not block any new attempts to upgrade. We tested that scenario locally, and the upgrade is successful if you restart it after a failure at this point.
The only other scenario is that the user decides not to do the upgrade after a failure here, which means they would have to manually unpause the lifecycle.
OK, then it probably makes sense to add that log message when this second pause fails, and then file a docs bug to make sure the recovery steps are documented for the case where a user doesn't just retry the upgrade.
Before deploying the addons-manager package, checks whether addons-manager is installed and is moving repositories. If so, then:
- Pauses lifecycle management of the addons-manager package
- NoopDeletes the addons-manager packageinstall
- Deletes the addons-manager addon secret
Graceful upgrade of addons-manager
What this PR does / why we need it
Adds necessary functions and logic to gracefully upgrade addons-manager package from the package provided by core repository to the addons-manager provided by the management repository.
Logic flow:
- pause lifecycle management for addons-manager
- pause reconciliation for the existing pkgi
- pause reconciliation for the existing addonsSecret
- noop delete of the existing addons-manager pkgi
- wait for successful deployment of the management packages (including the new addons-manager)
- remove the old addonsSecret for the old addons-manager
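The flow above could be sketched roughly like this. It is an illustration only: the `addonsOps` interface, its method names, `gracefulUpgrade`, and the `recorder` fake are hypothetical and do not reflect the actual functions in management_component_install.go. The sketch follows the order in which the steps are listed here; the review thread suggests the real code may pause the addon secret before the pkgi.

```go
package main

import "fmt"

// addonsOps lists the operations the flow needs. The interface and its
// method names are hypothetical and only mirror the steps in the description.
type addonsOps interface {
	PausePkgi() error
	PauseAddonSecret() error
	NoopDeletePkgi() error
	WaitForManagementPackages() error
	DeleteOldAddonSecret() error
}

// gracefulUpgrade runs the steps in order and returns on the first failure,
// so the old addon secret is removed only after the new packages are deployed.
func gracefulUpgrade(ops addonsOps) error {
	steps := []struct {
		name string
		run  func() error
	}{
		{"pause pkgi reconciliation", ops.PausePkgi},
		{"pause addon secret reconciliation", ops.PauseAddonSecret},
		{"noop delete old pkgi", ops.NoopDeletePkgi},
		{"wait for management packages", ops.WaitForManagementPackages},
		{"delete old addon secret", ops.DeleteOldAddonSecret},
	}
	for _, s := range steps {
		if err := s.run(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// recorder is a fake addonsOps that records calls and can fail at one named step.
type recorder struct {
	calls  []string
	failAt string
}

func (r *recorder) do(name string) error {
	r.calls = append(r.calls, name)
	if name == r.failAt {
		return fmt.Errorf("simulated failure in %s", name)
	}
	return nil
}

func (r *recorder) PausePkgi() error                 { return r.do("PausePkgi") }
func (r *recorder) PauseAddonSecret() error          { return r.do("PauseAddonSecret") }
func (r *recorder) NoopDeletePkgi() error            { return r.do("NoopDeletePkgi") }
func (r *recorder) WaitForManagementPackages() error { return r.do("WaitForManagementPackages") }
func (r *recorder) DeleteOldAddonSecret() error      { return r.do("DeleteOldAddonSecret") }
```

With a `recorder` whose `failAt` is set to `"NoopDeletePkgi"`, `gracefulUpgrade` returns after the third call, leaving the old addon secret in place.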
Which issue(s) this PR fixes
Fixes #3265
Describe testing done for PR
Release note
Additional information
Special notes for your reviewer
Testing done
Deployed 1.6 GA onto vSphere
Set the following env variables
Ran "tanzu mc upgrade -y"
Results: