
Delete metrics for deleted pods #1279

Merged: 5 commits, Aug 30, 2023

Conversation

@lambdanis (Contributor) commented Jul 27, 2023

Some of the exposed metrics have a "pod" label containing the name of the
monitored pod. So far, when a pod got deleted, Tetragon kept exposing stale
metrics for it. This caused a continuous increase in memory usage, both in
the Tetragon agent and in the metrics scraper.

This PR fixes the issue. Now, if metrics and the k8s API are both enabled,
an additional pod hook gets registered that deletes the metrics associated
with a pod when the pod is deleted.

When a pod gets deleted, Tetragon stops exposing metrics associated with it.
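The deletion step can be sketched as follows. This is a minimal stdlib illustration, not Tetragon's code: the `metricStore` type and its method names are made up here, standing in for prometheus MetricVecs, where client_golang's `DeletePartialMatch` plays the role of `deleteForPod`.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// metricStore is a toy in-memory metric store keyed by a flattened
// label set, e.g. "pod=nginx-abc,verdict=allow".
type metricStore struct {
	mu     sync.Mutex
	series map[string]float64
}

func newMetricStore() *metricStore {
	return &metricStore{series: make(map[string]float64)}
}

func (s *metricStore) set(labels string, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.series[labels] = v
}

// deleteForPod removes every series whose label set mentions the pod
// and returns how many series were dropped.
func (s *metricStore) deleteForPod(pod string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for labels := range s.series {
		if strings.Contains(labels, "pod="+pod) {
			delete(s.series, labels)
			n++
		}
	}
	return n
}

func main() {
	s := newMetricStore()
	s.set("pod=nginx-abc,verdict=allow", 10)
	s.set("pod=nginx-abc,verdict=deny", 2)
	s.set("pod=redis-xyz,verdict=allow", 5)
	fmt.Println(s.deleteForPod("nginx-abc")) // 2: both nginx-abc series
	fmt.Println(len(s.series))               // 1: only redis-xyz remains
}
```

Without such a hook, every series ever created for a deleted pod stays in the store forever, which is the memory growth described above.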

@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 2 times, most recently from ab39f6c to b55de47 Compare July 31, 2023 14:34
@jrfastab (Contributor) commented Aug 2, 2023

Just an idea: would there be a way to do the delete on a timer, so the metrics are not immediately lost? Then we could pick a delay that ensures at least one poll happens before they are deleted. Alternatively, maybe mark them ready for delete and only delete them after a read?

@jrfastab (Contributor) commented Aug 2, 2023

I tend to like the idea where they get marked for delete on the next read after the pod is removed; then we don't need another timer value to tune.

@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 3 times, most recently from c77835a to e5bc30a Compare August 3, 2023 20:15
@lambdanis (Contributor, Author) commented:

Just an idea would there be a way to do the delete on a timer so the metrics are not immediately lost? Then we could program some time to ensure that when we poll the metrics we get at least one poll before they are deleted?

@jrfastab Do you have a good use case for when this last value would be really important? This is in a TODO, yes, but I'm debating how useful it is.

Maybe alternatively mark them ready for delete and only delete them after a read?

This would mean that if the scraper is not working then Tetragon is never deleting stale metrics. It's kinda against the pull design of Prometheus IMO - the idea is that Tetragon exposes a metrics endpoint and whether something is reading it or not shouldn't be its concern.

@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 3 times, most recently from 2de7d4d to abe499b Compare August 7, 2023 15:14
@lambdanis lambdanis changed the title [WIP] Delete metrics for deleted pods Delete metrics for deleted pods Aug 7, 2023
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from abe499b to e98ad7a Compare August 7, 2023 15:16
@lambdanis lambdanis added area/metrics Related to prometheus metrics release-note/minor This PR introduces a minor user-visible change labels Aug 7, 2023
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 4 times, most recently from 636200d to cdbcbb3 Compare August 8, 2023 17:56
@lambdanis (Contributor, Author) commented:

Just an idea would there be a way to do the delete on a timer so the metrics are not immediately lost? Then we could program some time to ensure that when we poll the metrics we get at least one poll before they are deleted?

Ok, I implemented this using workqueue.DelayingInterface. But I hit an import cycle, so CI is red. I'll refactor in a separate PR to resolve it.
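The client-go `workqueue.DelayingInterface` code itself isn't shown in this thread; the following is a minimal stdlib sketch of the same delay-then-delete pattern, with illustrative names (`podMetricsCleaner` is not Tetragon's API) and a 50 ms delay standing in for the PR's one minute.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// deleteDelay stands in for the PR's 1-minute delay; short here so the
// example finishes quickly.
const deleteDelay = 50 * time.Millisecond

type podMetricsCleaner struct {
	mu      sync.Mutex
	pending sync.WaitGroup
	deleted []string // pods whose metrics have been cleaned up
}

// onPodDelete is the hook run by the pod watcher. It only schedules the
// cleanup; the deletion itself fires after deleteDelay, so the scraper
// gets at least one chance to read the final metric values.
func (c *podMetricsCleaner) onPodDelete(pod string) {
	c.pending.Add(1)
	time.AfterFunc(deleteDelay, func() {
		defer c.pending.Done()
		c.mu.Lock()
		defer c.mu.Unlock()
		// The real code would delete pod-labeled series from every
		// registered metric here.
		c.deleted = append(c.deleted, pod)
	})
}

func main() {
	c := &podMetricsCleaner{}
	c.onPodDelete("nginx-abc")
	c.pending.Wait() // the agent keeps running; here we just wait it out
	fmt.Println(c.deleted)
}
```

In the PR, `workqueue.DelayingInterface` provides the same scheduling via `AddAfter`, plus a queue a worker goroutine can drain; the point is that pod deletion and metric deletion are decoupled by a fixed delay.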

@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 7 times, most recently from cbb0424 to 20e9a2c Compare August 9, 2023 11:58
@lambdanis lambdanis marked this pull request as ready for review August 9, 2023 11:58
@lambdanis lambdanis requested a review from a team as a code owner August 9, 2023 11:58
@lambdanis lambdanis marked this pull request as draft August 9, 2023 12:37
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from 20e9a2c to 8712376 Compare August 9, 2023 12:40
@lambdanis lambdanis marked this pull request as ready for review August 9, 2023 12:40
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from 8712376 to 8e8583a Compare August 9, 2023 12:53
@lambdanis lambdanis marked this pull request as draft August 9, 2023 13:03
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 2 times, most recently from 05f85a1 to 97679ce Compare August 9, 2023 14:56
netlify bot commented Aug 9, 2023
Deploy Preview for tetragon ready!

🔨 Latest commit 3fb8d8d
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/64ee2a2cc51bf4000870921e
😎 Deploy Preview https://deploy-preview-1279--tetragon.netlify.app

@lambdanis lambdanis marked this pull request as ready for review August 9, 2023 19:20
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from 97679ce to 726ce8f Compare August 9, 2023 19:45
var (
metricsWithPod []*prometheus.MetricVec
podQueue workqueue.DelayingInterface
deleteDelay = 1 * time.Minute
Review comment (Contributor): In a follow-up PR we can make this configurable.

Resolved review threads (outdated): pkg/metrics/metrics.go, pkg/podhooks/podhooks.go
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 3 times, most recently from 4a39c4a to 80df0bf Compare August 17, 2023 15:32
Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Some of the exposed metrics have a "pod" label containing the name of the
monitored pod. So far, when a pod got deleted, Tetragon kept exposing stale
metrics for it. This caused a continuous increase in memory usage, both in
the Tetragon agent and in the metrics scraper.

This commit fixes the issue. Now, if metrics and the k8s API are both
enabled, an additional pod hook gets registered that deletes the metrics
associated with a pod when the pod is deleted.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 2 times, most recently from 3fb8d8d to 82b1da6 Compare August 29, 2023 18:32
Instead of deleting metrics immediately after a pod is deleted, use a workqueue
to delay the deletion by a minute. This allows the scraper to scrape the last
values of the metrics. It's particularly useful when the cluster has
short-lived pods: with immediate deletion, the scraper could completely miss
metrics for them.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
This commit moves the InitAllMetrics function from pkg/metrics to a separate
package, pkg/metrics/config. This avoids importing every metric-defining
package in pkg/metrics. pkg/metrics should act as a library that packages
defining metrics can import, but not the other way round.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
The intention is to make it easier to register metrics that should be cleaned
up on pod deletion.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
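The registration pattern this commit describes can be sketched as below. All names here (`RegisterPodMetric`, `podDeleter`, `cleanupPod`) are hypothetical stand-ins, assuming a central registry that the pod-deletion hook walks; in Tetragon the registered entries would be prometheus MetricVecs with a "pod" label.

```go
package main

import (
	"fmt"
	"sync"
)

// podDeleter is anything that can drop the series belonging to a pod,
// returning how many series it removed.
type podDeleter interface {
	DeletePod(pod string) int
}

var (
	regMu          sync.Mutex
	metricsWithPod []podDeleter
)

// RegisterPodMetric is called by any package defining a pod-labeled
// metric; the pod-deletion hook later walks this registry.
func RegisterPodMetric(m podDeleter) {
	regMu.Lock()
	defer regMu.Unlock()
	metricsWithPod = append(metricsWithPod, m)
}

// cleanupPod runs from the pod-deletion hook and returns the total
// number of series deleted across all registered metrics.
func cleanupPod(pod string) int {
	regMu.Lock()
	defer regMu.Unlock()
	total := 0
	for _, m := range metricsWithPod {
		total += m.DeletePod(pod)
	}
	return total
}

// countPerPod is a toy metric: one counter series per pod name.
type countPerPod map[string]int

func (c countPerPod) DeletePod(pod string) int {
	if _, ok := c[pod]; !ok {
		return 0
	}
	delete(c, pod)
	return 1
}

func main() {
	events := countPerPod{"nginx-abc": 7, "redis-xyz": 3}
	RegisterPodMetric(events)
	fmt.Println(cleanupPod("nginx-abc")) // 1 series deleted
	fmt.Println(len(events))             // redis-xyz remains
}
```

The design point is that metric-defining packages only depend on the small registration API, never the other way round, which is what the pkg/metrics refactor above enables.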
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from 82b1da6 to 2e1041b Compare August 29, 2023 19:03
@lambdanis lambdanis requested a review from kkourt August 29, 2023 19:21
@kkourt (Contributor) left a review comment:

Thanks!

Labels: area/metrics (Related to prometheus metrics), release-note/minor (This PR introduces a minor user-visible change)

4 participants