
feat(alerts): KubeNodePressure and KubeNodeEviction #1014

Open
TheRealNoob wants to merge 20 commits into master

Conversation

@TheRealNoob (Contributor) commented Jan 24, 2025

This PR is (mostly) a copy of #760. As best I can tell, it was closed because they wanted to route eviction alerts to individual teams rather than to infra. I want to reopen this conversation because I believe it's more correct to route to infra, on the basis that eviction scenarios have the potential to impact the entire cluster. Furthermore, it's very possible that the evicted pod (chosen by kubelet) isn't the one applying undue pressure to the Node; in that scenario we would be alerting the team that owns the evicted pod since they're impacted, but they wouldn't be the correct team to fix the issue. There may be further discussion points, but these are the big ones in my mind.

There are some parts I'd appreciate feedback on:

  • I don't think appending the {{cluster}} label to the description is working, or at least it isn't worded as well as it could be (see the sketch after this list).
  • I read that runbook.md is supposed to be auto-generated, but I'm not seeing that behavior.
  • I think a runbook needs to be created for prometheus-operator. Can someone point me to the repo/path that needs a PR?
  • I haven't yet tested the 0.002 threshold from the previous PR; we'll see if maintainers are open to this alert before spending time on testing.
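For context on the first bullet, here's roughly what a node-pressure alert in this spirit could look like. This is only an illustrative sketch built on the kube-state-metrics kube_node_status_condition metric; the selector, for duration, severity, and annotation wording (including the cluster label, which assumes the mixin's clusterLabel config) are assumptions, not necessarily what this PR implements.

  - alert: KubeNodePressure
    expr: |
      kube_node_status_condition{job="kube-state-metrics", condition=~"MemoryPressure|DiskPressure|PIDPressure", status="true"} == 1
    for: 10m
    labels:
      severity: info
    annotations:
      summary: 'Node has an active pressure condition.'
      description: 'Node {{ $labels.node }} on cluster {{ $labels.cluster }} has the active condition {{ $labels.condition }}.'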

@skl (Collaborator) commented Jan 30, 2025

> I read that runbook.md is supposed to be auto-generated, but I'm not seeing that behavior.

It is not auto-generated as far as I'm aware and it seems outdated. I think it's been manually maintained.

@skl (Collaborator) commented Jan 30, 2025

> I think a runbook needs to be created for prometheus-operator. Can someone point me to the repo/path that needs a PR?

Check here:

@skl (Collaborator) commented Jan 30, 2025

Assuming feedback is positive on this new alert, please create a test in https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/tests.yaml - I can help here if required.

I'm going to see if I can get some more eyes on this PR for wider feedback.
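To make the ask concrete, a test for the new alert would follow the promtool rule-test format that file uses. The sketch below is purely illustrative (series values, label sets, timings, and the severity label are made up), not the test that ends up in this PR:

  rule_files:
    - prometheus_alerts.yaml
  evaluation_interval: 1m
  tests:
    - interval: 1m
      input_series:
        # a single eviction shows up partway through the test window
        - series: 'kubelet_evictions{job="kubelet", instance="10.0.2.15:10250", eviction_signal="memory.available", cluster="kubernetes"}'
          values: '0 0 0 1 1 1 1 1 1 1 1'
        - series: 'kubelet_node_name{job="kubelet", instance="10.0.2.15:10250", node="minikube", cluster="kubernetes"}'
          values: '1 1 1 1 1 1 1 1 1 1 1'
      alert_rule_test:
        - eval_time: 10m
          alertname: KubeNodeEviction
          exp_alerts:
            - exp_labels:
                cluster: kubernetes
                eviction_signal: memory.available
                instance: 10.0.2.15:10250
                node: minikube
                severity: info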

@skl added the keepalive label (Use to prevent automatic closing) on Feb 3, 2025
@TheRealNoob (Contributor, Author) commented:

Still working on this. It's surprisingly difficult to reproduce an eviction. I'm also seeing weird behavior where kubelet isn't exposing the kubelet_evictions metric, even though I know I've seen it before and it's documented. Trying to understand what's going on before proceeding.

@skl (Collaborator) commented Feb 4, 2025

Yeah, no rush, take your time! From what I can see kubelet_evictions is only exported when the value is non-zero. So if there are no evictions according to kubelet, the metric isn't exported.

I've used this helm chart in the past to get nodes into failure scenarios:

I did have to change one line in templates/deployment.yaml to get it working:

- apiVersion: apps/v1beta2
+ apiVersion: apps/v1

Maybe that would help reproduce natural evictions. Otherwise, maybe if you drain a node that would also cause evictions?

@skl (Collaborator) commented Feb 4, 2025

Oh and by the way, I've been told that kubelet_evictions doesn't work for API-initiated evictions, so I presume it only works for Node-pressure evictions.

@skl (Collaborator) commented Feb 4, 2025

Linking the Kubernetes Slack thread here for posterity.

The TL;DR is that there may be a need for two related info-severity alerts, one for evictions and another for node pressure conditions. Neither is necessarily actionable but both can help build up a picture of a node at a given time, which is especially useful when RCA'ing application errors that correlate to a pod eviction.

@TheRealNoob (Contributor, Author) commented:

So I've spent some time now working on redoing the eviction rate alert to account for kubelet_evictions only being exported once it's non-zero. If we went with an alert as simple as rate(kubelet_evictions[15m]) > 0, it wouldn't alert on the very first eviction, since the first time the metric appears it will already be at value 1, meaning there's no rate of change.

In my head so far, the ideal alert follows logic like this. It isn't valid syntax, more of a pseudo-syntax:

  rate(kubelet_evictions[15m]) by(eviction_signal) > 0
or
    kubelet_evictions =1
  and on(*)
    10 minutes ago the same timeseries (same labels) was absent

It's clear to me at this point that the only way to even possibly account for this scenario in the alert definition is to use or on() vector(0) or absent(). But then you also need to group_left(node) the output of those functions into each output of kubelet_evictions in order for any lookback to be effective. Before I go any further, let me use an example:

rate(
  (
    sum by (eviction_signal) (kubelet_evictions) or on () vector(0)
  )[10m:30s]
)

[screenshot: graph of the above query on a single-node minikube cluster, around the first eviction at 13:53, showing two timeseries]

This is a minikube cluster, so single node, and the timeframe shown is when the first eviction occurred (13:53). You can see the query yields two timeseries, one for kubelet_evictions and another for vector(0). If you want to do a lookbehind then the two have to be grouped into a single timeseries. To take it even further, you also need to check whether the metric simply didn't exist or whether the node went offline, so you should work some form of and up{job="kubelet"} in there somewhere. At least I think so -- my theory is a bit ahead of my practical testing at this point, as I'm really questioning whether it's worth this amount of work just to alert on the first eviction in the entire cluster.
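One way to express that pseudo-logic without resorting to absent() or subqueries might be to combine the rate branch with an "exists now but didn't exist a window ago" branch using unless ... offset. This is just a sketch of the idea with an illustrative selector and label set, not the query this PR ends up using:

    # evictions increased at some point during the window
    sum by (cluster, eviction_signal, instance) (rate(kubelet_evictions{job="kubelet"}[15m])) > 0
  or
    # the series exists now but was absent 15m ago, i.e. the very first eviction
    (
        sum by (cluster, eviction_signal, instance) (kubelet_evictions{job="kubelet"})
      unless
        sum by (cluster, eviction_signal, instance) (kubelet_evictions{job="kubelet"} offset 15m)
    )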

@TheRealNoob (Contributor, Author) commented Feb 5, 2025

Lastly, while still on the topic of the eviction alert, I have two thoughts. They're heavily inter-dependent, so I'm anxious to hear your thoughts.

  1. I think it still makes sense to use rate(...[15m]) just to expand the time window. It can take a few minutes for a pod to be rescheduled and go through startup, and if the window is set too low you'll get a flapping alert.
  2. I think it makes sense to set the comparison at > 0. Originally I didn't feel this way, but since this is severity=info it feels harmless, and I imagine anyone who increases the severity wants to be notified of every eviction (or they create a second alert with a higher severity and threshold). Plus, it means a 15-minute window triggers immediately. If you were to set the threshold at more than 1 pod per 15 minutes (1 pod / 900 seconds ≈ 0.0011), you would have to wait the full 15 minutes. Obviously nobody wants to wait that long, so you'd lower the window, but that increases the probability of flapping (see the sketch below).
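A rough sketch of how those two options would look as expressions; the selector and grouping labels are illustrative (borrowed from the join query that comes up later in the thread), not necessarily the PR's final form:

  # point 2: any eviction within the 15m window fires immediately
  sum(rate(kubelet_evictions{job="kubelet"}[15m])) by (cluster, eviction_signal, instance) > 0

  # a "more than 1 eviction per 15m" threshold would instead be the per-second rate 1 / (15 * 60) ≈ 0.0011
  sum(rate(kubelet_evictions{job="kubelet"}[15m])) by (cluster, eviction_signal, instance) > 0.0011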

@TheRealNoob (Contributor, Author) commented:

Ready for re-review.

@skl changed the title from "Add alert KubeEvictionRateHigh" to "feat(alerts): KubeNodePressure and KubeNodeEvictions" on Feb 7, 2025
TheRealNoob and others added 2 commits February 7, 2025 12:13
this turned out to be a good change because it made me realize there was an additional label value here that wasn't being handled.

@skl (Collaborator) commented Feb 7, 2025

FYI, the tests have moved into the tests/ directory since:

TheRealNoob and others added 3 commits February 7, 2025 12:59
@TheRealNoob (Contributor, Author) commented Feb 8, 2025

There was a CI/CD test that failed complaining about diffs in the runbook.md file. I don't understand that error; I recall you said the file is currently manually maintained. Should I remove the entries I added?

@TheRealNoob (Contributor, Author) commented Feb 18, 2025

For some reason it's not letting me reply in the conversation to your previous question about adding on ... group_left(node) kubelet_node_name to the eviction alert. I looked into it and came up with this:

  sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance) > 0
* on (cluster, instance) group_left(node)
  max by (cluster, instance, node) (
    kubelet_node_name{job="kubelet"}
  )

or, templated for the mixin:

  sum(rate(kubelet_evictions{%(kubeletSelector)s}[15m])) by(%(clusterLabel)s, eviction_signal, instance) > %(KubeNodeEvictionRateThreshold)s
* on (%(clusterLabel)s, instance) group_left(node)
  max by (%(clusterLabel)s, instance, node) (
    kubelet_node_name{%(kubeletSelector)s}
  )

but it throws the error "parse error: vector matching only allowed between instant vectors". The easy way to solve this would be to subquery the time range, like [15m:1m], but I didn't find any examples of this being done, and probably for good reason. Thoughts?

@skl (Collaborator) commented Feb 20, 2025

@TheRealNoob I think all you need to do is move the > 0 to the end of the query. As written, * binds more tightly than >, so the 0 becomes the scalar operand of the on (...) group_left(...) match, and PromQL only allows vector matching between two instant vectors:

- sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance) > 0
+ sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance)
  * on (cluster, instance) group_left(node)
  max by (cluster, instance, node) (
    kubelet_node_name{job="kubelet"}
  )
+ > 0
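Putting it together, the corrected expression would slot into a rule roughly like the sketch below; the for, severity, and annotation values are illustrative assumptions rather than the exact rule in this PR:

  - alert: KubeNodeEviction
    expr: |
      sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance)
      * on (cluster, instance) group_left(node)
      max by (cluster, instance, node) (
        kubelet_node_name{job="kubelet"}
      )
      > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: 'Node is evicting pods.'
      description: 'Node {{ $labels.node }} on cluster {{ $labels.cluster }} is evicting pods under the {{ $labels.eviction_signal }} signal.'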

update KubeNodeEviction query
update KubeNodeEviction test case
@TheRealNoob (Contributor, Author) commented:

Good call. Not sure how I missed that. Updated.

@TheRealNoob changed the title from "feat(alerts): KubeNodePressure and KubeNodeEvictions" to "feat(alerts): KubeNodePressure and KubeNodeEviction" on Feb 22, 2025
@skl (Collaborator) left a comment:


One last failing test and we should be good to go!

- exp_labels:
    eviction_signal: memory.available
    cluster: kubernetes
    node: minikube
A collaborator suggested a change:

  node: minikube
+ instance: 10.0.2.15:10250
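With that suggestion applied, the expectation block from the earlier snippet would read roughly as follows (indentation reconstructed; surrounding test fields omitted):

  exp_alerts:
    - exp_labels:
        eviction_signal: memory.available
        cluster: kubernetes
        node: minikube
        instance: 10.0.2.15:10250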
