User‐Mode

Karl W. Schulz edited this page Mar 17, 2024 · 8 revisions

This page shows how to enable omniwatch monitoring in user space on ORNL's Frontier or Crusher systems. The example SLURM job below demonstrates three basic steps:

  1. enable omniwatch data collection before running your application
  2. run your application as usual
  3. query the job summary and tear down the data collection

#!/bin/bash
#SBATCH -J omniwatch
#SBATCH -o rochpl.128nodes.%j.out
#SBATCH -N 128
#SBATCH -n 1024
#SBATCH -t 0:45:00
#SBATCH -A ven114
#SBATCH --cpu-freq=high
#SBATCH -S 0

# (1a) Setup Omniwatch environment
ml use /autofs/nccs-svm1_sw/crusher/amdsw/modules
ml omniwatch/0.1.0

# (1b) Enable data collectors and polling (60 sec interval)
${OMNIWATCH_DIR}/omni_util.py --startexporters --use_pdsh
${OMNIWATCH_DIR}/omni_util.py --startserver --interval 60

# (2) Run your desired application(s) as normal
srun ./a.out

# (3) Query the job summary, then tear down data collection
${OMNIWATCH_DIR}/query.py --job ${SLURM_JOB_ID}
${OMNIWATCH_DIR}/omni_util.py --stop
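
A batch script exits with the status of its last command, so the job above reports the status of the teardown step rather than that of the application. The following is a minimal sketch of one way to preserve the application's exit status while still running step (3). The helper names `run_then_teardown` and `omniwatch_teardown` are hypothetical and not part of omniwatch; only commands already used in the job script are invoked.

```shell
# Hypothetical helper: step (3) from the job script, wrapped in a function.
# Assumes OMNIWATCH_DIR is set by the omniwatch module and SLURM_JOB_ID by SLURM.
omniwatch_teardown() {
    "${OMNIWATCH_DIR}/query.py" --job "${SLURM_JOB_ID}"
    "${OMNIWATCH_DIR}/omni_util.py" --stop
}

# Hypothetical helper: run a command, always run the teardown function
# afterwards, and return the command's own exit status.
#   $1      name of the teardown function
#   $2...   the application command line
run_then_teardown() {
    local teardown_fn="$1"; shift
    local status=0
    "$@" || status=$?
    "$teardown_fn" || echo "warning: teardown reported a failure" >&2
    return "$status"
}

# Possible use in place of steps (2) and (3):
#   run_then_teardown omniwatch_teardown srun ./a.out
#   exit $?
```

With this arrangement a failed `srun` still triggers the query and the shutdown of the collectors, and the job's exit code reflects the application rather than the teardown.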

If the job succeeds, its stdout will include additional omniwatch output from the startup and teardown steps. An example of the summary report is shown below:

----------------------------------------
HPC Report Card for Job # 453814
----------------------------------------

Job Overview (Num Nodes = 128, Machine = HPC Fund)
 --> Start time = 2024-03-17 16:58:54
 --> End   time = 2024-03-17 17:06:55

GPU Statistics:

           | Utilization (%)  |  Memory Use (%)  | Temperature (C)  |    Power (W)     |
     GPU # |    Max     Mean  |    Max     Mean  |    Max     Mean  |    Max     Mean  |
    ------------------------------------------------------------------------------------
         0 |  100.00   56.14  |   50.75   50.75  |   56.00   47.86  |  535.00  408.57  |
         1 |  100.00   52.71  |   50.75   50.75  |   59.00   51.43  |  535.00  408.57  |
         2 |  100.00   52.00  |   50.75   50.75  |   51.00   44.43  |  538.00  410.29  |
         3 |  100.00   52.29  |   50.75   50.75  |   54.00   47.57  |  538.00  410.29  |
         4 |  100.00   36.57  |   50.75   50.75  |   53.00   46.29  |  539.00  408.57  |
         5 |  100.00   51.14  |   50.75   50.75  |   56.00   49.29  |  539.00  408.57  |
         6 |  100.00   52.00  |   50.75   50.75  |   47.00   40.29  |  526.00  401.57  |
         7 |  100.00   58.71  |   50.75   50.75  |   52.00   45.43  |  526.00  401.57  |

--
Query execution time = 0.5 secs
Version = 5828415
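
The per-GPU rows of this report can be extracted from the job output for further processing. The sketch below assumes the table layout shown above: a GPU index followed by Max/Mean pairs for utilization, memory use, temperature, and power, separated by `|`. The function `gpu_mean_summary` is a hypothetical helper, not an omniwatch tool, and would need adjusting if the report format changes.

```shell
# Hypothetical helper: print "<gpu> <mean utilization %> <mean power W>"
# for every per-GPU row of the report. Reads the files given as arguments,
# or stdin when none are given.
gpu_mean_summary() {
    # A data row starts with an integer GPU index followed by "|".
    # With whitespace splitting, field 4 is the mean utilization and
    # field 13 is the mean power.
    awk '$1 ~ /^[0-9]+$/ && $2 == "|" { printf "%s %s %s\n", $1, $4, $13 }' "$@"
}

# Possible use, with <jobid> standing in for the SLURM job ID:
#   gpu_mean_summary rochpl.128nodes.<jobid>.out
```

Header, separator, and timing lines are skipped because their first field is not a bare integer followed by `|`.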

In addition, the collected Prometheus data is saved to the following location on the Lustre file system:

/lustre/orion/<project>/scratch/<user>/omniwatch
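
As a small convenience, the path can be assembled from its two placeholders. This is only a sketch: `omniwatch_data_dir` is a hypothetical helper, and the assumption that `<user>` defaults to the current `$USER` is not confirmed by the source.

```shell
# Hypothetical helper: build the omniwatch data path from the template above.
#   $1  project name (the <project> placeholder)
#   $2  user name (the <user> placeholder); assumed to default to $USER
omniwatch_data_dir() {
    local project="${1:-}"
    local user="${2:-${USER:-}}"
    if [ -z "$project" ] || [ -z "$user" ]; then
        echo "usage: omniwatch_data_dir <project> [user]" >&2
        return 1
    fi
    printf '/lustre/orion/%s/scratch/%s/omniwatch\n' "$project" "$user"
}

# Possible use, with <project> replaced by your own project name:
#   ls -l "$(omniwatch_data_dir <project>)"
```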
