LDMS connector
The Lightweight Distributed Metric Service (LDMS) is a health monitoring system for monitoring the performance of an HPC system, where an HPC system is defined as a set of independent applications running on a particular supercomputer or supercomputing platform. LDMS is actively developed at Sandia National Laboratories and is used across Leadership Computing Facilities, each of which is associated with a U.S. Department of Energy laboratory, e.g., OLCF.
Because very large amounts of data are gathered, often tens of TB per day, the LDMS Kokkos Tools connector should be used together with the sampler utility of Kokkos Tools to extract samples of profiling data from a Kokkos application.
- Collected LDMS data is already on the node and can be queried with little to no overhead.
- LDMS can be cloned and installed from: https://github.com/ovis-hpc/ovis
- Information and a quickstart guide can be found here: https://ovis-hpcreadthedocs.readthedocs.io/en/latest/
- The following environment variables can be adjusted by users in their run scripts, for example:
    export KOKKOS_TOOLS_SAMPLER_SKIP=4
    export KOKKOS_TOOLS_SAMPLER_PROB=20.6
    export KOKKOS_LDMS_VERBOSE=0
- The tool's environment variable KOKKOS_TOOLS_SAMPLER_PROB sets the sampling rate of kernel function calls. It is associated with the sampler utility to be used in conjunction with the LDMS connector. The default is set to 1%. Currently, one can use either the sampler skip rate or the sampler probability.
- The tool's environment variable KOKKOS_LDMS_VERBOSE prints all Kokkos messages that are sent to LDMS to an output file when set to a non-zero integer.
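
The settings above could be collected in a run script along the following lines. This is only a minimal sketch: the helper function `configure_sampling` and the variable `SAMPLING_MODE` are hypothetical names introduced here for illustration, and the script reads the note about using "either the sampler skip rate or the sampler probability" as meaning that only one of the two is set per run, which is an assumption. Loading the sampler and the LDMS connector and launching the application are deliberately left out.

```shell
#!/bin/sh
# Sketch of a run-script fragment. configure_sampling and SAMPLING_MODE are
# hypothetical helpers, not part of Kokkos Tools or LDMS.

configure_sampling() {
  # Clear both sampler settings so that only one sampling mode is active
  # (assumption: skip rate and probability are not meant to be combined).
  unset KOKKOS_TOOLS_SAMPLER_SKIP KOKKOS_TOOLS_SAMPLER_PROB

  case "$1" in
    prob)
      # Sampling probability; the value is taken from the example above.
      export KOKKOS_TOOLS_SAMPLER_PROB=20.6
      ;;
    skip)
      # Sampler skip rate; the value is taken from the example above.
      export KOKKOS_TOOLS_SAMPLER_SKIP=4
      ;;
    *)
      echo "unknown sampling mode: $1 (expected 'prob' or 'skip')" >&2
      return 1
      ;;
  esac
}

# Default to probability-based sampling unless SAMPLING_MODE says otherwise.
configure_sampling "${SAMPLING_MODE:-prob}"

# 0 keeps the connector quiet; a non-zero integer writes the messages
# sent to LDMS to an output file.
export KOKKOS_LDMS_VERBOSE=0

# Print the resulting settings before the application is launched.
env | grep -E '^KOKKOS_(TOOLS_SAMPLER|LDMS)_' | sort
```

Running the fragment with `SAMPLING_MODE=skip` in the environment would select the skip rate instead of the probability; the final `env` line is just a quick check of which variables the application will see.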
All data collected by LDMS are stored in the DSOS storage system that is built as part of the LDMS setup tutorials linked above.
Data can be visualized using Grafana. Information about setting up and using Grafana can be found here: https://ovis-hpcreadthedocs.readthedocs.io/en/latest/grafanapanel.html.
SAND2017-3786