
feat(alerts): KubeNodePressure and KubeNodeEviction #1014

Open
TheRealNoob wants to merge 20 commits into master

Conversation

@TheRealNoob (Contributor) commented Jan 24, 2025

This PR is (mostly) a copy of #760. As best I can tell, it was closed because they wanted to route eviction alerts to individual teams rather than to infra. I want to reopen this conversation because I believe it's more correct to route to infra, on the basis that eviction scenarios have the potential to impact the entire cluster. Furthermore, it's very possible that the evicted pod (chosen by kubelet) isn't the one applying undue pressure to the Node; in that scenario we would be alerting the team that owns the evicted pod since they're impacted, but they wouldn't be the correct team to fix the issue. There may be further discussion points, but these are the big ones in my mind.

There are some parts I'd appreciate feedback on:

  • I don't think appending the {{cluster}} label to the description is working, or at least it isn't worded as well as it could be (see the sketch after this list).
  • I read that runbook.md is supposed to be auto-generated, but I'm not seeing that behavior.
  • I think a runbook needs to be created for prometheus-operator. Can someone point me to the repo/path that needs a PR?
  • I haven't yet tested the 0.002 threshold from the previous PR; we'll see if maintainers are open to this alert before spending time on testing.
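For context on the first bullet, here's roughly what a node-pressure alert in this spirit could look like. This is only an illustrative sketch built on the kube-state-metrics kube_node_status_condition metric; the selector, for duration, severity, and annotation wording (including the cluster label, which assumes the mixin's clusterLabel config) are assumptions, not necessarily what this PR implements.

  - alert: KubeNodePressure
    expr: |
      kube_node_status_condition{job="kube-state-metrics", condition=~"MemoryPressure|DiskPressure|PIDPressure", status="true"} == 1
    for: 10m
    labels:
      severity: info
    annotations:
      summary: 'Node has an active pressure condition.'
      description: 'Node {{ $labels.node }} on cluster {{ $labels.cluster }} has the active condition {{ $labels.condition }}.'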

@skl (Collaborator) commented Jan 30, 2025

> I read that runbook.md is supposed to be auto-generated, but I'm not seeing that behavior.

It is not auto-generated as far as I'm aware and it seems outdated. I think it's been manually maintained.

@skl (Collaborator) commented Jan 30, 2025

> I think a runbook needs to be created for prometheus-operator. Can someone point me to the repo/path that needs a PR?

Check here:

@skl (Collaborator) commented Jan 30, 2025

Assuming feedback is positive on this new alert, please create a test in https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/tests.yaml - I can help here if required.

I'm going to see if I can get some more eyes on this PR for wider feedback.
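To make the ask concrete, a test for the new alert would follow the promtool rule-test format that file uses. The sketch below is purely illustrative (series values, label sets, timings, and the severity label are made up), not the test that ends up in this PR:

  rule_files:
    - prometheus_alerts.yaml
  evaluation_interval: 1m
  tests:
    - interval: 1m
      input_series:
        # a single eviction shows up partway through the test window
        - series: 'kubelet_evictions{job="kubelet", instance="10.0.2.15:10250", eviction_signal="memory.available", cluster="kubernetes"}'
          values: '0 0 0 1 1 1 1 1 1 1 1'
        - series: 'kubelet_node_name{job="kubelet", instance="10.0.2.15:10250", node="minikube", cluster="kubernetes"}'
          values: '1 1 1 1 1 1 1 1 1 1 1'
      alert_rule_test:
        - eval_time: 10m
          alertname: KubeNodeEviction
          exp_alerts:
            - exp_labels:
                cluster: kubernetes
                eviction_signal: memory.available
                instance: 10.0.2.15:10250
                node: minikube
                severity: info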

@skl added the keepalive label (Use to prevent automatic closing) on Feb 3, 2025
@TheRealNoob (Contributor, Author) commented:

Still working on this. It's surprisingly difficult to reproduce an eviction. I'm also seeing weird behavior where kubelet isn't exposing the kubelet_evictions metric, even though I know I've seen it before and it's documented. Trying to understand what's going on before proceeding.

@skl (Collaborator) commented Feb 4, 2025

Yeah, no rush, take your time! From what I can see kubelet_evictions is only exported when the value is non-zero. So if there are no evictions according to kubelet, the metric isn't exported.

I've used this helm chart in the past to get nodes into failure scenarios:

I did have to change one line in templates/deployment.yaml to get it working:

- apiVersion: apps/v1beta2
+ apiVersion: apps/v1

Maybe that would help reproduce natural evictions. Otherwise, maybe if you drain a node that would also cause evictions?

@skl (Collaborator) commented Feb 4, 2025

Oh and by the way, I've been told that kubelet_evictions doesn't work for API-initiated evictions, so I presume it only works for Node-pressure evictions.

@skl (Collaborator) commented Feb 4, 2025

Linking the Kubernetes Slack thread here for posterity.

The TL;DR is that there may be a need for two related info-severity alerts, one for evictions and another for node pressure conditions. Neither is necessarily actionable but both can help build up a picture of a node at a given time, which is especially useful when RCA'ing application errors that correlate to a pod eviction.

@TheRealNoob (Contributor, Author) commented:

So I've spent some time now working on redoing the eviction rate alert to account for kubelet_evictions only being exported once it's non-zero. If we went with an alert as simple as rate(kubelet_evictions[15m]) > 0, it wouldn't alert on the very first eviction, since the first time the metric appears it will already be at value 1, meaning there's no rate of change.

In my head so far, the ideal alert follows logic like this. It isn't valid syntax, more of a pseudo-syntax:

  rate(kubelet_evictions[15m]) by(eviction_signal) > 0
or
    kubelet_evictions =1
  and on(*)
    10 minutes ago the same timeseries (same labels) was absent

It's clear to me at this point that the only way to even possibly account for this scenario in the alert definition is to use or on() vector(0) or absent(). But then you also need to group_left(node) the output of those functions into each output of kubelet_evictions in order for any lookback to be effective. Before I go any further, let me use an example:

rate(
  (
    sum by (eviction_signal) (kubelet_evictions) or on () vector(0)
  )[10m:30s]
)

[screenshot: graph of the above query on a single-node minikube cluster, around the first eviction at 13:53, showing two timeseries]

This is a minikube cluster, so single node, and the timeframe shown is when the first eviction occurred (13:53). You can see the query yields two timeseries, one for kubelet_evictions and another for vector(0). If you want to do a lookbehind then the two have to be grouped into a single timeseries. To take it even further, you also need to check whether the metric simply didn't exist or whether the node went offline, so you should work some form of and up{job="kubelet"} in there somewhere. At least I think so -- my theory is a bit ahead of my practical testing at this point, as I'm really questioning whether it's worth this amount of work just to alert on the first eviction in the entire cluster.
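One way to express that pseudo-logic without resorting to absent() or subqueries might be to combine the rate branch with an "exists now but didn't exist a window ago" branch using unless ... offset. This is just a sketch of the idea with an illustrative selector and label set, not the query this PR ends up using:

    # evictions increased at some point during the window
    sum by (cluster, eviction_signal, instance) (rate(kubelet_evictions{job="kubelet"}[15m])) > 0
  or
    # the series exists now but was absent 15m ago, i.e. the very first eviction
    (
        sum by (cluster, eviction_signal, instance) (kubelet_evictions{job="kubelet"})
      unless
        sum by (cluster, eviction_signal, instance) (kubelet_evictions{job="kubelet"} offset 15m)
    )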

@TheRealNoob (Contributor, Author) commented Feb 5, 2025

Lastly, while still on the topic of the eviction alert, I have two thoughts. They're heavily inter-dependent, so I'm anxious to hear your thoughts.

  1. I think it still makes sense to use rate(...[15m]) just to expand the time window. It can take a few minutes for a pod to be rescheduled and go through startup, and if the window is set too low you'll get a flapping alert.
  2. I think it makes sense to set the comparison at > 0. Originally I didn't feel this way, but since this is severity=info it feels harmless, and I imagine anyone who increases the severity wants to be notified of every eviction (or they create a second alert with a higher severity and threshold). Plus, it means a 15-minute window triggers immediately. If you were to set the threshold at more than 1 pod per 15 minutes (1 pod / 900 seconds ≈ 0.0011), you would have to wait the full 15 minutes. Obviously nobody wants to wait that long, so you'd lower the window, but that increases the probability of flapping (see the sketch below).
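A rough sketch of how those two options would look as expressions; the selector and grouping labels are illustrative (borrowed from the join query that comes up later in the thread), not necessarily the PR's final form:

  # point 2: any eviction within the 15m window fires immediately
  sum(rate(kubelet_evictions{job="kubelet"}[15m])) by (cluster, eviction_signal, instance) > 0

  # a "more than 1 eviction per 15m" threshold would instead be the per-second rate 1 / (15 * 60) ≈ 0.0011
  sum(rate(kubelet_evictions{job="kubelet"}[15m])) by (cluster, eviction_signal, instance) > 0.0011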

@TheRealNoob (Contributor, Author) commented:

Ready for re-review.

@skl changed the title from "Add alert KubeEvictionRateHigh" to "feat(alerts): KubeNodePressure and KubeNodeEvictions" on Feb 7, 2025
TheRealNoob and others added 2 commits February 7, 2025 12:13
this turned out to be a good change because it made me realize there was an additional label value here that wasn't being handled.

@skl (Collaborator) commented Feb 7, 2025

FYI, the tests have moved into the tests/ directory since:

TheRealNoob and others added 3 commits February 7, 2025 12:59
@TheRealNoob (Contributor, Author) commented Feb 8, 2025

There was a CI/CD test that failed complaining about diffs in the runbook.md file. I don't understand that error; I recall you said the file is currently manually maintained. Should I remove the entries I added?

@TheRealNoob (Contributor, Author) commented Feb 18, 2025

For some reason it's not letting me reply in the conversation to your previous question about adding on ... group_left(node) kubelet_node_name to the eviction alert. I looked into it and came up with this:

  sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance) > 0
* on (cluster, instance) group_left(node)
  max by (cluster, instance, node) (
    kubelet_node_name{job="kubelet"}
  )

or, templated for the mixin:

  sum(rate(kubelet_evictions{%(kubeletSelector)s}[15m])) by(%(clusterLabel)s, eviction_signal, instance) > %(KubeNodeEvictionRateThreshold)s
* on (%(clusterLabel)s, instance) group_left(node)
  max by (%(clusterLabel)s, instance, node) (
    kubelet_node_name{%(kubeletSelector)s}
  )

but it throws the error "parse error: vector matching only allowed between instant vectors". The easy way to solve this would be to subquery the time range, like [15m:1m], but I didn't find any examples of this being done, and probably for good reason. Thoughts?

@skl (Collaborator) commented Feb 20, 2025

@TheRealNoob I think all you need to do is move the > 0 to the end of the query. As written, * binds more tightly than >, so the 0 becomes the scalar operand of the on (...) group_left(...) match, and PromQL only allows vector matching between two instant vectors:

- sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance) > 0
+ sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance)
  * on (cluster, instance) group_left(node)
  max by (cluster, instance, node) (
    kubelet_node_name{job="kubelet"}
  )
+ > 0
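Putting it together, the corrected expression would slot into a rule roughly like the sketch below; the for, severity, and annotation values are illustrative assumptions rather than the exact rule in this PR:

  - alert: KubeNodeEviction
    expr: |
      sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance)
      * on (cluster, instance) group_left(node)
      max by (cluster, instance, node) (
        kubelet_node_name{job="kubelet"}
      )
      > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: 'Node is evicting pods.'
      description: 'Node {{ $labels.node }} on cluster {{ $labels.cluster }} is evicting pods under the {{ $labels.eviction_signal }} signal.'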

update KubeNodeEviction query
update KubeNodeEviction test case
@TheRealNoob (Contributor, Author) commented:

Good call. Not sure how I missed that. Updated.

@TheRealNoob changed the title from "feat(alerts): KubeNodePressure and KubeNodeEvictions" to "feat(alerts): KubeNodePressure and KubeNodeEviction" on Feb 22, 2025
@skl (Collaborator) left a comment:


One last failing test and we should be good to go!

- exp_labels:
    eviction_signal: memory.available
    cluster: kubernetes
    node: minikube
A collaborator suggested a change:

  node: minikube
+ instance: 10.0.2.15:10250
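With that suggestion applied, the expectation block from the earlier snippet would read roughly as follows (indentation reconstructed; surrounding test fields omitted):

  exp_alerts:
    - exp_labels:
        eviction_signal: memory.available
        cluster: kubernetes
        node: minikube
        instance: 10.0.2.15:10250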
