fix grafana dashboard and clarify dashboard usage more clearly. #543

Merged
merged 1 commit into from
Dec 19, 2024

Conversation

jiangsanyin
Contributor

Signed-off-by: jiangsanyin 1327212357@qq.com

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fix the Grafana dashboard and clarify how to use it. Thanks to fangfenghuang (https://github.com/fangfenghuang) for the help.

Which issue(s) this PR fixes:
Fixes #498 #468

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

@wawa0210
Member

@fangfenghuang Can you help review this pr?


codecov bot commented Oct 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Flag Coverage Δ
unittests 27.09% <ø> (ø)



@fangfenghuang fangfenghuang left a comment


fix some http url

@Nimbus318
Contributor

@jiangsanyin
I have followed the installation instructions as described in the documentation, but encountered a minor issue, which I also mentioned previously in Issue #498. By default, the dcgm-exporter only includes the Hostname label. To match the current Grafana dashboard configuration, it's necessary to add a node_name relabeling configuration when installing dcgm-exporter

https://github.com/NVIDIA/dcgm-exporter/blob/b97b7633e3f39f7a537bd77561cc0ec0c2dca3f5/deployment/values.yaml#L117C3-L117C18

This relabeling should be consistent with the configurations for hami-device-plugin-svc-monitor and hami-scheduler-svc-monitor

It would be helpful to include this information in the documentation, as users unfamiliar with the Prometheus stack may struggle to configure everything correctly on the first attempt

@jiangsanyin
Contributor Author

jiangsanyin commented Nov 25, 2024

Have you created and applied the ServiceMonitor as described in dashboard.md or dashboard_cn.md? The node_name label is added once this is done.
# Create the file hami-device-plugin-svc-monitor.yaml
root@controller01:~# cat hami-device-plugin-svc-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - "kube-system"
  endpoints:
    - path: /metrics
      port: monitorport
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace

# Apply the file hami-device-plugin-svc-monitor.yaml
root@controller01:~# kubectl apply -f hami-device-plugin-svc-monitor.yaml

@Nimbus318
Contributor

@jiangsanyin
Both are correct. What I meant is that you might have forgotten to include the explanation for the relabel configuration of dcgm-exporter. By default, dcgm-exporter only includes the Hostname label

It’s important to document this configuration to ensure it aligns with the relabeling setup for hami-device-plugin-svc-monitor. Without this explanation, users may miss adding the necessary node_name relabeling when setting up dcgm-exporter
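For readers following along, the dcgm-exporter relabeling discussed here can be sketched as a Helm values override. This is a minimal sketch, not the text that was merged into dashboard.md: it assumes the chart exposes `serviceMonitor.relabelings` (the key the values.yaml link earlier in this thread points at), the file name `dcgm-values.yaml` is hypothetical, and the rule simply mirrors the node_name relabeling from hami-device-plugin-svc-monitor.

```shell
#!/bin/sh
# Sketch only: write a Helm values override that adds node_name to dcgm-exporter metrics.
# Assumption: the dcgm-exporter chart reads serviceMonitor.relabelings from its values.
# The file name dcgm-values.yaml is hypothetical.
# The heredoc delimiter is quoted so the shell leaves ${1} untouched.
cat > dcgm-values.yaml <<'EOF'
serviceMonitor:
  enabled: true
  relabelings:
    # Copy the pod's node name into node_name, matching the HAMi ServiceMonitors.
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      regex: (.*)
      targetLabel: node_name
      replacement: ${1}
      action: replace
EOF

# Confirm the override carries the relabeling before handing it to Helm.
grep -q 'targetLabel: node_name' dcgm-values.yaml && echo "node_name relabeling present"
```

The override file would then be passed to `helm install` or `helm upgrade` with `-f dcgm-values.yaml` when deploying dcgm-exporter. Keeping the rule identical to the HAMi one means the dashboard can select both metric sources by the same node_name label.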

…he image display problem in document; Change deployment/values.yaml before deploying dcgm-exporter.

Signed-off-by: jiangsanyin <1327212357@qq.com>
@jiangsanyin
Contributor Author

jiangsanyin commented Nov 28, 2024

OK, thanks for your review. The relabeling configuration in the ServiceMonitor for dcgm-exporter has been added to dashboard_cn.md and dashboard.md; please check!

@Nimbus318
Contributor

/lgtm

@wawa0210
Member

/lgtm

@wawa0210 wawa0210 merged commit 3c220fc into Project-HAMi:master Dec 19, 2024
5 checks passed
Successfully merging this pull request may close these issues.

monitoring data node_name not exists ,GPU power usage is not correct in Grafana Dashboard
5 participants