This repository has been archived by the owner on Oct 10, 2023. It is now read-only.

Address OOM kills in addons-manager #3004

Merged
3 commits merged on Jul 26, 2022

@vijaykatam vijaykatam commented Jul 22, 2022

  • Fewer concurrent reconciles for packageinstallstatus_controller
  • Higher ratelimiting for packageinstallstatus_controller
  • Bubble up errors in reconcile method so that retries back off
  • Remove timestamp addition in status
  • Allow enabling pprof
  • [Fix memory leak]: Reassigning the remote tracker client variable, which is a pointer, causes the object to stay tracked in memory even after it is no longer used.
  • [Fix memory leak]: The remote tracker seems to be very sensitive to how watch is called. Change watch to only include stateless functions.

Signed-off-by: Vijay Katam <vkatam@vmware.com>

What this PR does / why we need it

Address OOM kills in addons-manager due to memory leaks

Which issue(s) this PR fixes

Fixes #2963

Describe testing done for PR

Tested on a scale test cluster with 158 clusters. The pod has been running stably for over 7 hours without an OOM kill.

k get pods -n vmware-system-tkg -o wide | grep addons
tanzu-addons-controller-manager-5b5658c9d6-sxzcz         1/1     Running   0               7h22m   10.102.0.2   420f40b588050e0ef6dd8503237a8f91   <none>           <none>

Memory stays at around 400 MB.

root@420f40b588050e0ef6dd8503237a8f91 [ ~ ]# curl http://localhost:18317/metrics | grep process_resident_memory_bytes
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.03750912e+08
100   97k    0   97k    0     0  38.9M      0 --:--:-- --:--:-- --:--:-- 47.7M

Release note

package-based-lcm: fix memory leaks in packageinstallstatus_controller

Additional information

Before fix (note packageinstallstatusreconciler):

[image: profile073]

After fix (note that packageinstallstatusreconciler no longer shows up):
[image: profile122]

Special notes for your reviewer

@vijaykatam vijaykatam requested review from a team as code owners July 22, 2022 00:32
codecov bot commented Jul 22, 2022

Codecov Report

Merging #3004 (8fc4a78) into main (6d6ca50) will increase coverage by 0.12%.
The diff coverage is 70.39%.

@@            Coverage Diff             @@
##             main    #3004      +/-   ##
==========================================
+ Coverage   44.00%   44.13%   +0.12%     
==========================================
  Files         416      416              
  Lines       41852    42108     +256     
==========================================
+ Hits        18417    18583     +166     
- Misses      21714    21799      +85     
- Partials     1721     1726       +5     
Impacted Files Coverage Δ
addons/controllers/addon_controller.go 63.30% <ø> (-0.31%) ⬇️
pkg/v1/config/clientconfig.go 34.79% <ø> (ø)
pkg/v1/config/defaults.go 42.30% <0.00%> (-0.83%) ⬇️
pkg/v1/tkg/avi/client.go 5.07% <0.00%> (-0.24%) ⬇️
pkg/v1/tkg/client/cluster.go 14.45% <ø> (+0.19%) ⬆️
pkg/v1/tkg/client/init.go 0.00% <0.00%> (ø)
pkg/v1/tkg/tkgconfigproviders/vsphere.go 49.24% <52.38%> (+0.44%) ⬆️
...ons/controllers/packageinstallstatus_controller.go 79.15% <75.51%> (-1.50%) ⬇️
pkg/v1/tkg/client/validate.go 57.99% <76.87%> (+1.74%) ⬆️
addons/controllers/packageinstallstatus_handler.go 40.00% <100.00%> (ø)
... and 14 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9ab0604...8fc4a78.

@@ -378,11 +396,17 @@ func enableWebhooks(ctx context.Context, mgr ctrl.Manager, flags *addonFlags) {
}

func enablePackageInstallStatusController(ctx context.Context, mgr ctrl.Manager, flags *addonFlags) {
// set up a ClusterCacheTracker to provide to PackageInstallStatus controller which requires a connection to remote clusters
// the informers/caches are created only for objects accessed through Get/List in the code.
@maralavi maralavi Jul 22, 2022

I think the first two lines of this comment are still valid. Why remove them?

Contributor Author

Added back with updated comment. Note that by default remote tracker excludes configmap and resource.

addons/main.go Outdated
&corev1.ConfigMap{},
&corev1.Secret{},
&kapppkg.PackageInstall{},
&kappdatapkg.Package{},
@maralavi maralavi Jul 22, 2022

Why do we add ConfigMap and Secret to ClientUncachedObjects? We never read either of those types in the PackageInstallStatus controller code.

addons/main.go Outdated
ClientUncachedObjects: []client.Object{
&corev1.ConfigMap{},
&corev1.Secret{},
&kapppkg.PackageInstall{},
Contributor

Wouldn't adding PackageInstall and Package to ClientUncachedObjects mean that objects of those types are no longer cached? Does that mean we will now read them from the API server and no longer use the caching functionality of the remote watch?

Status: pkgiCondition.Status,
Message: util.GetKappUsefulErrorMessage(pkgi.Status.UsefulErrorMessage),
Reason: pkgiCondition.Reason,
LastTransitionTime: metav1.NewTime(time.Now().UTC().Truncate(time.Second)),
@maralavi maralavi Jul 22, 2022

It's sad to see the timestamp go; it added more insight into the reconciliation status. But I can imagine it adds a lot of churn as well.

@@ -294,9 +294,6 @@ func (r *VSphereCSIConfigReconciler) reconcileVSphereCSIConfigNormal(ctx context
}
}

logger.Info(fmt.Sprintf("'%s' the secret '%s'", opResult,
Contributor Author

This was causing a lot of unnecessary logs.

}

// isPackageManaged checks if the provided PackageInstall is among the list of managed(core/additional) packages
func (r *PackageInstallStatusReconciler) isPackageManaged(clusterObjKey client.ObjectKey, pkgiName string) (bool, error) {
Contributor Author

Until we find another way to check whether a package is managed, I am removing this: it requires a client, and passing a stateful function (one that holds a reference to the reconciler) contributes to memory leaks.

@@ -338,43 +335,43 @@ func (r *PackageInstallStatusReconciler) removeConditionIfExistsForPkgName(clust
}

// watchPackageInstalls sets a remote watch on the provided cluster on the Kind resource
func (r *PackageInstallStatusReconciler) watchPackageInstalls(cluster *clusterapiv1beta1.Cluster, log logr.Logger) error {
func watchPackageInstalls(ctx context.Context, watcher remote.Watcher, tracker *remote.ClusterCacheTracker, cluster *clusterapiv1beta1.Cluster, log logr.Logger) error {
Contributor Author

I turned these into plain functions rather than methods on PackageInstallStatusReconciler because they are passed to Watch in the remote tracker, and that was contributing to memory leaks.
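
The general Go mechanism behind this change can be sketched as follows: a method value such as `r.handle` binds its receiver, so any long-lived holder of that callback keeps the whole receiver reachable, whereas a plain function retains nothing beyond its arguments. The `reconciler`, `registry`, and handler names are hypothetical; this shows the language behavior only and is not a confirmed diagnosis of what the remote tracker retains.

```go
package main

import "fmt"

// reconciler is a hypothetical stand-in for an object holding a lot of state.
type reconciler struct {
	cache []byte
}

// handle is a method; the method value r.handle captures r.
func (r *reconciler) handle(event string) string {
	return fmt.Sprintf("%s (receiver holds %d bytes)", event, len(r.cache))
}

// handleStateless depends only on its arguments.
func handleStateless(event string) string {
	return event
}

// registry is a long-lived holder of callbacks, similar in spirit to a
// watch registration.
type registry struct {
	handlers []func(string) string
}

func (g *registry) watch(h func(string) string) {
	g.handlers = append(g.handlers, h)
}

func main() {
	reg := &registry{}
	r := &reconciler{cache: make([]byte, 1<<20)}

	reg.watch(r.handle)        // the bound receiver stays reachable through reg
	reg.watch(handleStateless) // nothing extra is retained

	r = nil // dropping our own reference does not release the reconciler

	for _, h := range reg.handlers {
		fmt.Println(h("update"))
	}
}
```

The first handler can still report the size of the reconciler's cache after `r` is set to nil, which shows the registry is what keeps it alive.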

installPackage(wlcCluster.Name, "pkg.test.carvel.dev.1.0.0", wlcCluster.Namespace)
wlcClusterBootstrap := clusterBootstrapGet(client.ObjectKeyFromObject(wlcCluster))
Expect(len(wlcClusterBootstrap.Status.Conditions)).Should(Equal(0))
//By("verifying un-managed packages do not update the 'Status.Conditions' for ClusterBootstrap")
Contributor Author

We can restore this test after we figure out a better way to check managed packages.

1. Reassigning the remote tracker client variable, which is a pointer, causes
 the object to stay tracked in memory
2. The remote tracker seems to be very sensitive to how watch is called. Change
watch to only include stateless functions.

Signed-off-by: Vijay Katam <vkatam@vmware.com>
@vijaykatam vijaykatam force-pushed the addons_memory_leak branch from 2165ed0 to 796e9f4 Compare July 25, 2022 02:02
@vijaykatam vijaykatam changed the title [Draft]: Address OOM kills in addons-manager Address OOM kills in addons-manager Jul 25, 2022
@maralavi maralavi force-pushed the addons_memory_leak branch 2 times, most recently from affedbc to f178b2b Compare July 26, 2022 00:59
@maralavi maralavi force-pushed the addons_memory_leak branch from f178b2b to 8fc4a78 Compare July 26, 2022 01:01
@maralavi maralavi added the ok-to-merge PRs should be labelled with this before merging label Jul 26, 2022
@maralavi maralavi merged commit daca587 into vmware-tanzu:main Jul 26, 2022
patchHelper, err := clusterapipatchutil.NewHelper(clusterBootstrap, r.Client)
if err != nil {
errorList = append(errorList, errors.Wrap(err, "error patching ClusterBootstrapStatus"))
retErr = kerrors.NewAggregate(errorList)
Contributor

@vijaykatam This block needed to be outside of the defer; the tests were failing due to this change.

abhijit-dev82 pushed a commit to abhijit-dev82/tanzu-framework that referenced this pull request Jul 27, 2022
* Fix memory leaks in packageinstallstatus_controller

Signed-off-by: Vijay Katam <vkatam@vmware.com>

Update capabilities.go

Detect TKGS environment for any version of TKC API
Labels
cla-not-required ok-to-merge PRs should be labelled with this before merging
Development

Successfully merging this pull request may close these issues.

addons manager out of memory for ~ 200 clusters
3 participants