Monitoring data node_name does not exist, GPU power usage is not correct in Grafana Dashboard #498
After attempting to deploy this dashboard myself, I encountered similar issues. By comparing the original metrics, I noticed the following:
Hope this helps resolve the issues you're experiencing!
@jiangsanyin I am using […] Finally, if you're unable to use […]
Thank you for your prompt reply! The problems in this issue happened to me in the course of my work; I'll read your suggestion carefully next Monday. ^_^
The HAMi metrics on ports 31992 and 31993 have no label related to node_name, so I added a node_name label for selecting GPU node metrics. The node_name label is populated from __meta_kubernetes_pod_node_name.
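For reference, expressed as a Prometheus scrape job this relabeling might look like the following. This is a minimal sketch, assuming pod discovery via kubernetes_sd_configs; the job name is an illustrative placeholder, not the exact config used here:

scrape_configs:
  - job_name: hami                # placeholder job name
    kubernetes_sd_configs:
      - role: pod                 # pod discovery exposes __meta_kubernetes_pod_node_name
    relabel_configs:
      # Copy the discovered node name into a node_name label so the
      # dashboard's ${node_name} variable has values to select from.
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node_name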
@Nimbus318 this part helps me. Now ${Hostname} gives me the node name of the k8s cluster. However, there are problems in the four charts below, because no entry related to "Device_memory_desc_of_container" has been collected into my Prometheus, and the other three charts have no data for the same reason.
@jiangsanyin The reason you don't see Device_memory_desc_of_container in your Prometheus metrics is that this metric is exposed by the hami-device-plugin. However, Prometheus does not have a scrape rule configured to collect these metrics. Based on your previous response, it looks like you can use a ServiceMonitor. You can try applying the following YAML configuration:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - "kube-system"
  endpoints:
    - path: /metrics
      port: monitorport
      interval: "15s"
      honorLabels: false

Based on this requirement, I think we can add a configuration similar to the following in the HAMi chart:

devicePlugin:
  serviceMonitor:
    enabled: true
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []

This configuration will allow users to decide whether to enable the ServiceMonitor for the devicePlugin. I might discuss with the community whether this configuration is necessary.
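If that were adopted, a hypothetical chart template wiring it up could look roughly like the sketch below. This is not the actual HAMi chart code; the template file layout and use of .Release.Namespace are assumptions:

# templates/device-plugin/servicemonitor.yaml (hypothetical)
{{- if .Values.devicePlugin.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: {{ .Release.Namespace }}
  {{- with .Values.devicePlugin.serviceMonitor.additionalLabels }}
  labels:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - {{ .Release.Namespace }}
  endpoints:
    - path: /metrics
      port: monitorport
      interval: {{ .Values.devicePlugin.serviceMonitor.interval }}
      honorLabels: {{ .Values.devicePlugin.serviceMonitor.honorLabels }}
      {{- with .Values.devicePlugin.serviceMonitor.relabelings }}
      relabelings:
        {{- toYaml . | nindent 8 }}
      {{- end }}
{{- end }}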
Thanks, your reply works for me. Awesome!
Environment:
Please provide an in-depth description of the question you have:

(1) I installed HAMi successfully, and it works well when running vGPU tasks. From port 31993, I can get monitoring information as follows:
(2) I deployed dcgm-exporter by running “kubectl -n monitoring create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml”, and changed the type of svc/dcgm-exporter from ClusterIP to NodePort:
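The Service change itself isn't reproduced here; a hypothetical manifest illustrating it is below. The port follows dcgm-exporter's default (9400), but the selector labels and the nodePort value are assumptions — verify them against your deployment:

apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  type: NodePort                # changed from ClusterIP
  selector:
    app.kubernetes.io/name: dcgm-exporter   # check against your deployment's labels
  ports:
    - name: metrics
      port: 9400                # dcgm-exporter's default metrics port
      targetPort: 9400
      nodePort: 30400           # example NodePort; any free port in range works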

(3) I deployed Prometheus 2.36.1 in binary mode and made the following configurations:
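The configuration screenshots don't survive here; as a hedged sketch of the kind of scrape job involved (the node IP and job name are placeholders, not the actual config), it might look like this. Note that static targets carry no Kubernetes metadata, so they can't get node_name via __meta_kubernetes_pod_node_name — that relabeling requires the pod-discovery approach shown earlier:

scrape_configs:
  - job_name: hami-nodeport     # placeholder job name
    static_configs:
      - targets:
          - "<node-ip>:31992"   # HAMi metrics NodePort
          - "<node-ip>:31993"   # HAMi metrics NodePort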


The Targets page of Prometheus shows that Prometheus has already collected the monitoring data:
(4) I deployed Grafana v8.5.5 in a k8s-1.23.10 cluster, and created a data source named ALL in Grafana:
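The data-source screenshot isn't reproduced; for reference, an equivalent definition as a Grafana datasource provisioning file might look like the sketch below (the file path and Prometheus URL are placeholders, not the actual setup):

# e.g. /etc/grafana/provisioning/datasources/all.yaml (hypothetical path)
apiVersion: 1
datasources:
  - name: ALL
    type: prometheus
    access: proxy
    url: http://<prometheus-host>:9090   # placeholder Prometheus address
    isDefault: true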

(5) I imported a dashboard (https://github.com/Project-HAMi/HAMi/blob/master/docs/gpu-dashboard.json), but some of the data presented was inaccurate or missing.

For example, "nodename" in the upper left corner has no data, and the value of "GPU power usage" is not accurate (my GPU is an NVIDIA A10, whose rated power is 150W).
What do you think about this question?:
(1) A friend named "凤" from the Internet shared this dashboard with me: https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard/, but the problems mentioned above still exist.
(2) I found that ${node_name} was used in the two dashboards mentioned above, but ${node_name} is NULL. I don't know what's going wrong; please help.