User‐Mode

Karl W. Schulz edited this page Apr 5, 2024 · 8 revisions

This page shows how to enable Omniwatch monitoring in user space on ORNL's Frontier or Crusher systems. The example SLURM job below walks through four basic steps:

  1. enable Omniwatch data collection before running your application
  2. run your application as usual
  3. summarize the results (generate a local PDF highlighting the measurements)
  4. tear down the data collection

#!/bin/bash
#SBATCH -J omniwatch
#SBATCH -o rochpl.128nodes.%j.out 
#SBATCH -N 128
#SBATCH -n 1024
#SBATCH -t 0:45:00      
#SBATCH -A ven114
#SBATCH --cpu-freq=high
#SBATCH -S 0

# (1a) Setup Omniwatch environment
ml use /autofs/nccs-svm1_sw/crusher/amdsw/modules
ml omniwatch/0.1.0

# (1b) Enable data collectors and polling (60 sec interval)
${OMNIWATCH_DIR}/omni_util.py --start --interval 60 --use_pdsh

# (2) Run your desired application(s) as normal
srun ./a.out

# (3) Summarize data collection results
${OMNIWATCH_DIR}/query.py --job ${SLURM_JOB_ID} --interval 60 --pdf omniwatch.${SLURM_JOB_ID}.pdf

# (4) Tear down data collection
${OMNIWATCH_DIR}/omni_util.py --stop

If the job succeeds, its stdout will include additional startup and teardown information, along with a summary report. An example summary is shown below:

----------------------------------------
HPC Report Card for Job # 461334
----------------------------------------

Job Overview (Num Nodes = 8, Machine = HPC Fund)
 --> Start time = 2024-04-05 13:42:00
 --> End   time = 2024-04-05 13:47:00

GPU Statistics:

           | Utilization (%)  |  Memory Use (%)  | Temperature (C)  |    Power (W)     |
     GPU # |    Max     Mean  |    Max     Mean  |    Max     Mean  |    Max     Mean  |
    ------------------------------------------------------------------------------------
         0 |  100.00   60.92  |   99.54   82.08  |   59.00   47.70  |  551.00  413.48  |
         1 |  100.00   65.00  |   99.54   82.08  |   62.00   49.55  |  551.00  413.48  |
         2 |  100.00   72.70  |   99.54   82.09  |   50.00   40.70  |  559.00  412.90  |
         3 |  100.00   71.40  |   99.54   82.09  |   55.00   42.92  |  559.00  412.90  |
         4 |  100.00   70.12  |   99.54   82.09  |   58.00   45.55  |  558.00  413.55  |
         5 |  100.00   69.85  |   99.54   82.09  |   59.00   48.38  |  558.00  413.55  |
         6 |  100.00   70.50  |   98.99   81.63  |   49.00   41.05  |  556.00  412.30  |
         7 |  100.00   72.55  |   98.99   81.63  |   53.00   40.90  |  556.00  412.30  |
--
Query execution time = 0.6 secs
Version = e8e0a76
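
Because the report card is plain text, individual values can be pulled out of the job's stdout with standard tools. The following is a minimal sketch, assuming the GPU table keeps the layout shown above; `gpu_mean_util` is a hypothetical helper and not part of Omniwatch.

```shell
# Hypothetical helper (not part of Omniwatch): read a report on stdin and
# print "<gpu index> <mean utilization %>" for each row of the GPU table.
gpu_mean_util() {
    awk -F'|' '
        # Data rows are the only lines whose first |-separated field is a bare integer.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            split($2, util, " ")   # util[1] = Max, util[2] = Mean
            print $1 + 0, util[2]
        }
    '
}

# Example: header and two rows copied from the sample report above.
gpu_mean_util <<'EOF'
     GPU # |    Max     Mean  |    Max     Mean  |    Max     Mean  |    Max     Mean  |
         0 |  100.00   60.92  |   99.54   82.08  |   59.00   47.70  |  551.00  413.48  |
         1 |  100.00   65.00  |   99.54   82.08  |   62.00   49.55  |  551.00  413.48  |
EOF
```

For the two rows above this prints `0 60.92` and `1 65.00`. In practice the function would read the job's stdout file instead of a here-document.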

In addition, the Prometheus data will be saved in the following location on the Lustre file system:

/lustre/orion/<project>/scratch/<user>/omniwatch

The output directory can be overridden with an environment variable:

export OMNIWATCH_PROMSERVER_DATADIR=<desired-data-dir>
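
As an illustration, the override might be placed in the job script like this. This is a sketch under two assumptions that the page does not confirm: that the variable has to be exported before data collection is started in step (1b), and that creating the directory ahead of time is harmless. The directory name is a hypothetical choice.

```shell
# Hypothetical per-job data directory under the submission directory.
# SLURM_JOB_ID is set by SLURM inside a job; "nojob" is a fallback for testing.
export OMNIWATCH_PROMSERVER_DATADIR="${PWD}/omniwatch-data.${SLURM_JOB_ID:-nojob}"

# Create the directory up front (assumption: Omniwatch may not create it itself).
mkdir -p "${OMNIWATCH_PROMSERVER_DATADIR}"

echo "Prometheus data directory: ${OMNIWATCH_PROMSERVER_DATADIR}"
```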