The default metrics available in kubernetes are not sufficient to monitor and set alerts on cronjobs. We will add custom metric to achieve this.
Legend uses custom prometheus rule to set alerting and monitoring on kubernetes cronjobs. We have followed this document by Tristan Colgate-McFarlane
. The document explains the need to add custom metric and how it is done. We have made some minor changes in metric query to make it work for latest kube-state-metrics.
All the metrics plotted per component are part of the metrics library which lives within legend at legend/metrics_library/metrics
.
Each component has an associated metrics file in the metrics library in the format
<component>_metrics.yaml
. The metrics file is a Jinja2 template which is rendered to
a yaml file.
You can apply below kubernetes manifest on your cluster to add custom metric. You must have this CRD - monitoring.coreos.com/v1
available on your cluster to apply this rule.
To use this yaml copy and paste this in a file. Then use kubectl to apply in cluster.
kubectl apply -f <file-name.yaml> -n <namespace>
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: cronjob-prometheus-metric
labels:
app: kube-prometheus-stack
release: prom-op
spec:
groups:
- name: ./cronjob.rules
rules:
- record: job:kube_job_status_start_time:max
expr: |
label_replace(
label_replace(
max(
kube_job_status_start_time
* ON(job_name,namespace) GROUP_RIGHT()
kube_job_owner{owner_name!=""}
)
BY (job_name, owner_name, namespace)
== ON(owner_name) GROUP_LEFT()
max(
kube_job_status_start_time
* ON(job_name,namespace) GROUP_RIGHT()
kube_job_owner{owner_name!=""}
)
BY (owner_name),
"job", "$1", "job_name", "(.+)"),
"cronjob", "$1", "owner_name", "(.+)")
- record: job:kube_job_status_failed:sum
expr: |
clamp_max(
job:kube_job_status_start_time:max,1)
* ON(job,namespace) GROUP_LEFT()
label_replace(
label_replace(
(kube_job_status_failed != 0),
"job", "$1", "job_name", "(.+)"),
"cronjob", "$1", "owner_name", "(.+)")