
Delete metrics for deleted pods #1279

Merged: 5 commits, Aug 30, 2023

Conversation

@lambdanis (Contributor) commented Jul 27, 2023

Some of the exposed metrics have a "pod" label containing the name of the
monitored pod. So far, when a pod got deleted, Tetragon kept exposing stale
metrics for it. This caused a continuous increase in memory usage, both in
the Tetragon agent and in the metrics scraper.

This PR fixes the issue. Now, if metrics and the k8s API are both enabled,
an additional pod hook gets registered that deletes the metrics associated
with a pod when the pod is deleted.

When a pod gets deleted, Tetragon stops exposing metrics associated with it.
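The deletion step can be sketched as follows. This is a minimal stdlib illustration, not Tetragon's code: the `metricStore` type and its method names are made up here, standing in for prometheus MetricVecs, where client_golang's `DeletePartialMatch` plays the role of `deleteForPod`.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// metricStore is a toy in-memory metric store keyed by a flattened
// label set, e.g. "pod=nginx-abc,verdict=allow".
type metricStore struct {
	mu     sync.Mutex
	series map[string]float64
}

func newMetricStore() *metricStore {
	return &metricStore{series: make(map[string]float64)}
}

func (s *metricStore) set(labels string, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.series[labels] = v
}

// deleteForPod removes every series whose label set mentions the pod
// and returns how many series were dropped.
func (s *metricStore) deleteForPod(pod string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for labels := range s.series {
		if strings.Contains(labels, "pod="+pod) {
			delete(s.series, labels)
			n++
		}
	}
	return n
}

func main() {
	s := newMetricStore()
	s.set("pod=nginx-abc,verdict=allow", 10)
	s.set("pod=nginx-abc,verdict=deny", 2)
	s.set("pod=redis-xyz,verdict=allow", 5)
	fmt.Println(s.deleteForPod("nginx-abc")) // 2: both nginx-abc series
	fmt.Println(len(s.series))               // 1: only redis-xyz remains
}
```

Without such a hook, every series ever created for a deleted pod stays in the store forever, which is the memory growth described above.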

@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 2 times, most recently from ab39f6c to b55de47 Compare July 31, 2023 14:34
@jrfastab (Contributor) commented Aug 2, 2023

Just an idea: would there be a way to do the delete on a timer, so the metrics are not immediately lost? Then we could pick a delay that ensures at least one poll happens before they are deleted. Alternatively, maybe mark them ready for delete and only delete them after a read?

@jrfastab (Contributor) commented Aug 2, 2023

I tend to like the idea where they get marked for delete on the next read after the pod is removed; then we don't need another timer value to tune.

@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 3 times, most recently from c77835a to e5bc30a Compare August 3, 2023 20:15
@lambdanis (Contributor, Author) commented:

Just an idea would there be a way to do the delete on a timer so the metrics are not immediately lost? Then we could program some time to ensure that when we poll the metrics we get at least one poll before they are deleted?

@jrfastab Do you have a good use case for when this last value would be really important? This is in a TODO, yes, but I'm debating how useful it is.

Maybe alternatively mark them ready for delete and only delete them after a read?

This would mean that if the scraper is not working then Tetragon is never deleting stale metrics. It's kinda against the pull design of Prometheus IMO - the idea is that Tetragon exposes a metrics endpoint and whether something is reading it or not shouldn't be its concern.

@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 3 times, most recently from 2de7d4d to abe499b Compare August 7, 2023 15:14
@lambdanis lambdanis changed the title [WIP] Delete metrics for deleted pods Delete metrics for deleted pods Aug 7, 2023
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from abe499b to e98ad7a Compare August 7, 2023 15:16
@lambdanis lambdanis added area/metrics Related to prometheus metrics release-note/minor This PR introduces a minor user-visible change labels Aug 7, 2023
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 4 times, most recently from 636200d to cdbcbb3 Compare August 8, 2023 17:56
@lambdanis (Contributor, Author) commented:

Just an idea would there be a way to do the delete on a timer so the metrics are not immediately lost? Then we could program some time to ensure that when we poll the metrics we get at least one poll before they are deleted?

Ok, I implemented this using workqueue.DelayingInterface. But I hit an import cycle, so CI is red. I'll refactor in a separate PR to resolve it.
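The client-go `workqueue.DelayingInterface` code itself isn't shown in this thread; the following is a minimal stdlib sketch of the same delay-then-delete pattern, with illustrative names (`podMetricsCleaner` is not Tetragon's API) and a 50 ms delay standing in for the PR's one minute.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// deleteDelay stands in for the PR's 1-minute delay; short here so the
// example finishes quickly.
const deleteDelay = 50 * time.Millisecond

type podMetricsCleaner struct {
	mu      sync.Mutex
	pending sync.WaitGroup
	deleted []string // pods whose metrics have been cleaned up
}

// onPodDelete is the hook run by the pod watcher. It only schedules the
// cleanup; the deletion itself fires after deleteDelay, so the scraper
// gets at least one chance to read the final metric values.
func (c *podMetricsCleaner) onPodDelete(pod string) {
	c.pending.Add(1)
	time.AfterFunc(deleteDelay, func() {
		defer c.pending.Done()
		c.mu.Lock()
		defer c.mu.Unlock()
		// The real code would delete pod-labeled series from every
		// registered metric here.
		c.deleted = append(c.deleted, pod)
	})
}

func main() {
	c := &podMetricsCleaner{}
	c.onPodDelete("nginx-abc")
	c.pending.Wait() // the agent keeps running; here we just wait it out
	fmt.Println(c.deleted)
}
```

In the PR, `workqueue.DelayingInterface` provides the same scheduling via `AddAfter`, plus a queue a worker goroutine can drain; the point is that pod deletion and metric deletion are decoupled by a fixed delay.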

@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 7 times, most recently from cbb0424 to 20e9a2c Compare August 9, 2023 11:58
@lambdanis lambdanis marked this pull request as ready for review August 9, 2023 11:58
@lambdanis lambdanis requested a review from a team as a code owner August 9, 2023 11:58
@lambdanis lambdanis marked this pull request as draft August 9, 2023 12:37
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from 20e9a2c to 8712376 Compare August 9, 2023 12:40
@lambdanis lambdanis marked this pull request as ready for review August 9, 2023 12:40
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from 8712376 to 8e8583a Compare August 9, 2023 12:53
@lambdanis lambdanis marked this pull request as draft August 9, 2023 13:03
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 2 times, most recently from 05f85a1 to 97679ce Compare August 9, 2023 14:56
netlify bot commented Aug 9, 2023
Deploy Preview for tetragon ready!

🔨 Latest commit 3fb8d8d
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/64ee2a2cc51bf4000870921e
😎 Deploy Preview https://deploy-preview-1279--tetragon.netlify.app

@lambdanis lambdanis marked this pull request as ready for review August 9, 2023 19:20
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from 97679ce to 726ce8f Compare August 9, 2023 19:45
var (
metricsWithPod []*prometheus.MetricVec
podQueue workqueue.DelayingInterface
deleteDelay = 1 * time.Minute
Review comment (Contributor): In a follow-up PR we can make this configurable.

Resolved review threads (outdated): pkg/metrics/metrics.go, pkg/podhooks/podhooks.go
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 3 times, most recently from 4a39c4a to 80df0bf Compare August 17, 2023 15:32
Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Some of the exposed metrics have a "pod" label containing the name of the
monitored pod. So far, when a pod got deleted, Tetragon kept exposing stale
metrics for it. This caused a continuous increase in memory usage, both in
the Tetragon agent and in the metrics scraper.

This commit fixes the issue. Now, if metrics and the k8s API are both
enabled, an additional pod hook gets registered that deletes the metrics
associated with a pod when the pod is deleted.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch 2 times, most recently from 3fb8d8d to 82b1da6 Compare August 29, 2023 18:32
Instead of deleting metrics immediately after a pod is deleted, use a workqueue
to delay the deletion by a minute. This allows the scraper to scrape the last
values of the metrics. It's particularly useful when the cluster has
short-lived pods: with immediate deletion, the scraper could completely miss
metrics for them.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
This commit moves the InitAllMetrics function from pkg/metrics to a separate
package, pkg/metrics/config. This avoids importing every metric-defining
package in pkg/metrics. pkg/metrics should act as a library that packages
defining metrics can import, but not the other way round.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
The intention is to make it easier to register metrics that should be cleaned
up on pod deletion.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
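The registration pattern this commit describes can be sketched as below. All names here (`RegisterPodMetric`, `podDeleter`, `cleanupPod`) are hypothetical stand-ins, assuming a central registry that the pod-deletion hook walks; in Tetragon the registered entries would be prometheus MetricVecs with a "pod" label.

```go
package main

import (
	"fmt"
	"sync"
)

// podDeleter is anything that can drop the series belonging to a pod,
// returning how many series it removed.
type podDeleter interface {
	DeletePod(pod string) int
}

var (
	regMu          sync.Mutex
	metricsWithPod []podDeleter
)

// RegisterPodMetric is called by any package defining a pod-labeled
// metric; the pod-deletion hook later walks this registry.
func RegisterPodMetric(m podDeleter) {
	regMu.Lock()
	defer regMu.Unlock()
	metricsWithPod = append(metricsWithPod, m)
}

// cleanupPod runs from the pod-deletion hook and returns the total
// number of series deleted across all registered metrics.
func cleanupPod(pod string) int {
	regMu.Lock()
	defer regMu.Unlock()
	total := 0
	for _, m := range metricsWithPod {
		total += m.DeletePod(pod)
	}
	return total
}

// countPerPod is a toy metric: one counter series per pod name.
type countPerPod map[string]int

func (c countPerPod) DeletePod(pod string) int {
	if _, ok := c[pod]; !ok {
		return 0
	}
	delete(c, pod)
	return 1
}

func main() {
	events := countPerPod{"nginx-abc": 7, "redis-xyz": 3}
	RegisterPodMetric(events)
	fmt.Println(cleanupPod("nginx-abc")) // 1 series deleted
	fmt.Println(len(events))             // redis-xyz remains
}
```

The design point is that metric-defining packages only depend on the small registration API, never the other way round, which is what the pkg/metrics refactor above enables.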
@lambdanis lambdanis force-pushed the pr/lambdanis/zombie-metrics branch from 82b1da6 to 2e1041b Compare August 29, 2023 19:03
@lambdanis lambdanis requested a review from kkourt August 29, 2023 19:21
@kkourt (Contributor) left a review comment:

Thanks!

Labels: area/metrics (Related to prometheus metrics), release-note/minor (This PR introduces a minor user-visible change)

4 participants