feat(alerts): KubeNodePressure and KubeNodeEviction #1014
base: master
Conversation
Signed-off-by: TheRealNoob <mike1118@live.com>
Signed-off-by: TheRealNoob <mike1118@live.com>
Signed-off-by: TheRealNoob <mike1118@live.com>
It is not auto-generated as far as I'm aware, and it seems outdated. I think it's been manually maintained.
Check here:
Assuming feedback is positive on this new alert, please create a test in https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/tests.yaml - I can help here if required. I'm going to see if I can get some more eyes on this PR for wider feedback.
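For anyone picking this up, here's a minimal sketch of what such a unit test could look like in the promtool alert-test format that tests.yaml uses; the series values, timings, and label set are placeholders rather than the final test:

```yaml
# Sketch only: assumes the alert is named KubeNodeEviction and fires when the
# per-signal eviction rate is non-zero. Adjust series, values, and labels to
# match the final rule (severity and annotations omitted for brevity).
- interval: 1m
  input_series:
    - series: 'kubelet_evictions{job="kubelet", cluster="kubernetes", instance="10.0.2.15:10250", eviction_signal="memory.available"}'
      values: '0x10 1+1x10'   # evictions start after 10 minutes
    - series: 'kubelet_node_name{job="kubelet", cluster="kubernetes", instance="10.0.2.15:10250", node="minikube"}'
      values: '1x20'
  alert_rule_test:
    - eval_time: 15m
      alertname: KubeNodeEviction
      exp_alerts:
        - exp_labels:
            eviction_signal: memory.available
            cluster: kubernetes
            instance: 10.0.2.15:10250
            node: minikube
```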
Co-authored-by: Stephen Lang <skl@users.noreply.github.com>
Still working on this. It's surprisingly difficult to reproduce an eviction. I'm also seeing this weird behavior where kubelet isn't exposing the metric
Yeah, no rush, take your time! I've used this helm chart in the past to get nodes into failure scenarios. I did have to change one line:

- apiVersion: apps/v1beta2
+ apiVersion: apps/v1

Maybe that would help reproduce natural evictions. Otherwise, maybe draining a node would also cause evictions?
Oh, and by the way, I've been told that
Linking the Kubernetes Slack thread here for posterity. The TL;DR is that there may be a need for two related info-severity alerts: one for evictions and another for node pressure conditions. Neither is necessarily actionable on its own, but both can help build up a picture of a node at a given time, which is especially useful when RCA'ing application errors that correlate with a pod eviction.
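The rule text itself isn't quoted in the thread, but here's a minimal sketch of an info-severity node-pressure alert along those lines, assuming the standard kube-state-metrics node condition metric (the alert name matches the PR title; the for duration and wording are placeholders):

```yaml
# Sketch only, not the PR's final rule: fire while any node pressure condition is true.
- alert: KubeNodePressure
  expr: |
    kube_node_status_condition{job="kube-state-metrics", condition=~"MemoryPressure|DiskPressure|PIDPressure", status="true"} == 1
  for: 10m
  labels:
    severity: info
  annotations:
    summary: Node has an active pressure condition.
    description: '{{ $labels.node }} has condition {{ $labels.condition }} active.'
```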
So I've spent some time now working on redoing the eviction rate alert to account for the non-zero case. In my head so far, the ideal alert follows logic like so. This isn't valid syntax, more like pseudo-syntax.
It's clear to me at this point that the only way to even possibly account for this scenario in the alert definition is to use the
This is a minikube cluster, so single node, and the timeframe shown is when the first eviction occurred (13:53). You can see the query yields two timeseries, one for
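For context, the query being graphed here is presumably the per-signal eviction rate, something along the lines of the expression that comes up later in the thread:

```
sum(rate(kubelet_evictions{job="kubelet"}[15m])) by (cluster, eviction_signal, instance)
```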
Lastly, while still on the topic of the eviction alert, I have two thoughts, and they're heavily inter-dependent, so I'm anxious to hear yours.
Signed-off-by: TheRealNoob <mike1118@live.com>
Signed-off-by: TheRealNoob <mike1118@live.com>
Ready for re-review.
This turned out to be a good change because it made me realize there was an additional label value here that wasn't being handled.
Signed-off-by: TheRealNoob <mike1118@live.com>
Co-authored-by: Stephen Lang <skl@users.noreply.github.com>
FYI the tests have moved into
Co-authored-by: Stephen Lang <skl@users.noreply.github.com>
Signed-off-by: TheRealNoob <mike1118@live.com>
There was a CI/CD test that failed complaining about diffs in the runbook.md file. I don't understand that error. I recall you said the file is currently manually maintained; should I remove the entries I added?
For some reason it's not letting me reply in a conversation to your previous question about adding
but it throws the error
Signed-off-by: TheRealNoob <mike1118@live.com>
Signed-off-by: TheRealNoob <mike1118@live.com>
@TheRealNoob I think all you need to do is move the > 0 to the end of the expression:

- sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance) > 0
+ sum(rate(kubelet_evictions{job="kubelet"}[15m])) by(cluster, eviction_signal, instance)
    * on (cluster, instance) group_left(node)
      max by (cluster, instance, node) (
        kubelet_node_name{job="kubelet"}
      )
+ > 0
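Put together, the expression after that change reads:

```
sum(rate(kubelet_evictions{job="kubelet"}[15m])) by (cluster, eviction_signal, instance)
* on (cluster, instance) group_left(node)
  max by (cluster, instance, node) (
    kubelet_node_name{job="kubelet"}
  )
> 0
```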
update KubeNodeEviction query
update KubeNodeEviction test case
Good call. Not sure how I missed that. Updated.
One last failing test and we should be good to go!
- exp_labels:
    eviction_signal: memory.available
    cluster: kubernetes
    node: minikube
  node: minikube
+ instance: 10.0.2.15:10250
This PR is (mostly) a copy of #760. As best I can tell, it was closed because they wanted to route eviction alerts to individual teams rather than to infra. I want to reopen this conversation because I believe it's more correct to route to infra: eviction scenarios have the potential to impact the entire cluster. Furthermore, it's very possible that the evicted pod (chosen by the kubelet) isn't the one applying undue pressure to the node; in that scenario we would be alerting the team that owns the evicted pod since they're impacted, but they wouldn't be the right team to fix the issue. There may be further discussion points, but these are the big ones in my mind.
There are some parts that I'd appreciate feedback on:
- I'm not sure whether putting the {{cluster}} label onto the description is working, or whether it's worded the best it could be (see the sketch after this list).
- As for the 0.002 threshold from the previous PR, we'll see if maintainers are open to this alert before spending time on testing.
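On the {{cluster}} question, here's a minimal sketch of the usual templating approach, assuming the expression keeps the cluster label (as the by (cluster, eviction_signal, instance) aggregation above does); the wording is illustrative, not the PR's exact text:

```yaml
# Sketch only: illustrative annotation wording, not the final rule text.
annotations:
  summary: Node is evicting pods.
  description: 'Node {{ $labels.node }} in cluster {{ $labels.cluster }} is evicting pods due to {{ $labels.eviction_signal }} pressure.'
```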