Add new sysfs class for Amazon Elastic Fabric Adapter #515

Open
perifaws wants to merge 1 commit into master from feature/amazon-efa-sysfs

@perifaws perifaws commented May 8, 2023

This change adds a new sysfs class that reads metrics from the Amazon Elastic Fabric Adapter (EFA). It is based on the existing InfiniBand class.

EFA is supported on a variety of Amazon EC2 instances (list here) and is relevant for HPC & distributed training (ML) applications in the same fashion as Infiniband.

There is an associated node_exporter collector that I used to validate this change. I'm happy to provide sample output on request. Thanks!

Related to the Prometheus Google Groups thread: https://groups.google.com/g/prometheus-developers/c/MEal59mDebs/m/ZQBU1f0hCAAJ

@perifaws perifaws force-pushed the feature/amazon-efa-sysfs branch 3 times, most recently from f09883d to c4ad75e Compare May 8, 2023 17:48
Signed-off-by: Pierre-Yves Aquilanti <pierreya@amazon.com>
@perifaws perifaws force-pushed the feature/amazon-efa-sysfs branch from c4ad75e to 4b4cb05 Compare May 8, 2023 17:53
@matthiasr

Can you please add some unit tests with examples of what the /sys structure looks like? Otherwise this code will be impossible to maintain with confidence.

@dcbw
Contributor

dcbw commented May 17, 2023

What's EFA specific about the collector? I can't see anywhere that it checks the PCI device ID or something like that for an Amazon VID/PID. Looks like it just looks in the normal infiniband directories?

E.g., if I have a random Mellanox IB device, will this collector ignore it?

Member

@SuperQ SuperQ left a comment

Please add tests.


4 participants