User‐Mode
Karl W. Schulz edited this page Apr 5, 2024
The following shows how to enable omniwatch monitoring in user space on ORNL's Frontier or Crusher systems. The example SLURM job demonstrates these basic steps:
- enable omniwatch data collection prior to running your desired app
- run your application(s) as usual
- summarize results (generate a local pdf highlighting measurements)
- tear down the data collection
#!/bin/bash
#SBATCH -J omniwatch
#SBATCH -o rochpl.128nodes.%j.out
#SBATCH -N 128
#SBATCH -n 1024
#SBATCH -t 0:45:00
#SBATCH -A ven114
#SBATCH --cpu-freq=high
#SBATCH -S 0
# (1a) Setup Omniwatch environment
ml use /autofs/nccs-svm1_sw/crusher/amdsw/modules
ml omniwatch/0.1.0
# (1b) Enable data collectors and polling (60 sec interval)
${OMNIWATCH_DIR}/omni_util.py --start --interval 60 --use_pdsh
# (2) Run your desired application(s) as normal
srun ./a.out
# (3) Summarize data collection results
${OMNIWATCH_DIR}/query.py --job ${SLURM_JOB_ID} --interval 60 --pdf omniwatch.${SLURM_JOB_ID}.pdf
# (4) Tear down data collection
${OMNIWATCH_DIR}/omni_util.py --stop
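If the application in step (2) fails partway through, the job script may exit before reaching step (4). One possible safeguard, sketched here as an assumption rather than a documented omniwatch practice, is to register the teardown with a bash `EXIT` trap near the top of the script. The `OMNI_UTIL` variable and the `teardown` function are hypothetical names introduced for this illustration; only `omni_util.py --stop` comes from the job script above.

```shell
#!/bin/bash
# Sketch only: run the step (4) teardown whenever the job script exits.
# OMNI_UTIL and teardown are illustrative names, not part of omniwatch.
OMNI_UTIL="${OMNIWATCH_DIR:-}/omni_util.py"

teardown() {
    # Stop the collectors only if the utility is actually available.
    if [ -x "${OMNI_UTIL}" ]; then
        "${OMNI_UTIL}" --stop
    else
        echo "omni_util.py not found; skipping teardown" >&2
    fi
}

# Register teardown so it runs when the script exits,
# whether it finishes normally or stops early on an error.
trap teardown EXIT

# Steps (1b), (2) and (3) from the job script would follow here.
```

With the trap in place, the explicit `--stop` call at the end of the script would no longer be needed, since the trap issues it on exit.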
If successful, the job's stdout will include additional startup and teardown information. An example of the summary report is shown below:
----------------------------------------
HPC Report Card for Job # 461334
----------------------------------------
Job Overview (Num Nodes = 8, Machine = HPC Fund)
--> Start time = 2024-04-05 13:42:00
--> End time = 2024-04-05 13:47:00
GPU Statistics:
       | Utilization (%) | Memory Use (%)  | Temperature (C) |    Power (W)    |
 GPU # |    Max     Mean |    Max     Mean |    Max     Mean |    Max     Mean |
--------------------------------------------------------------------------------
     0 | 100.00    60.92 |  99.54    82.08 |  59.00    47.70 | 551.00   413.48 |
     1 | 100.00    65.00 |  99.54    82.08 |  62.00    49.55 | 551.00   413.48 |
     2 | 100.00    72.70 |  99.54    82.09 |  50.00    40.70 | 559.00   412.90 |
     3 | 100.00    71.40 |  99.54    82.09 |  55.00    42.92 | 559.00   412.90 |
     4 | 100.00    70.12 |  99.54    82.09 |  58.00    45.55 | 558.00   413.55 |
     5 | 100.00    69.85 |  99.54    82.09 |  59.00    48.38 | 558.00   413.55 |
     6 | 100.00    70.50 |  98.99    81.63 |  49.00    41.05 | 556.00   412.30 |
     7 | 100.00    72.55 |  98.99    81.63 |  53.00    40.90 | 556.00   412.30 |
--
Query execution time = 0.6 secs
Version = e8e0a76
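Because the per-GPU rows in the report are pipe-delimited, they can be post-processed with standard tools. The sketch below is an illustration, not an omniwatch feature: it averages the "Mean" utilization column across GPUs with `awk`. The function name `mean_gpu_utilization` is hypothetical, and the parsing assumes the row layout matches the example report; the sample rows are copied from that example.

```shell
#!/bin/bash
# Sketch only: average the per-GPU mean utilization from a report on stdin.
# mean_gpu_utilization is an illustrative helper, not part of omniwatch.
mean_gpu_utilization() {
    awk -F'|' '
        # Data rows are the ones whose first field is just a GPU index.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            split($2, util, " ")   # util[1] = max, util[2] = mean
            sum += util[2]
            n++
        }
        END { if (n > 0) printf "%.2f\n", sum / n }
    '
}

# Rows copied from the example report.
sample_report='GPU # | Max Mean | Max Mean | Max Mean | Max Mean |
0 | 100.00 60.92 | 99.54 82.08 | 59.00 47.70 | 551.00 413.48 |
1 | 100.00 65.00 | 99.54 82.08 | 62.00 49.55 | 551.00 413.48 |
2 | 100.00 72.70 | 99.54 82.09 | 50.00 40.70 | 559.00 412.90 |
3 | 100.00 71.40 | 99.54 82.09 | 55.00 42.92 | 559.00 412.90 |
4 | 100.00 70.12 | 99.54 82.09 | 58.00 45.55 | 558.00 413.55 |
5 | 100.00 69.85 | 99.54 82.09 | 59.00 48.38 | 558.00 413.55 |
6 | 100.00 70.50 | 98.99 81.63 | 49.00 41.05 | 556.00 412.30 |
7 | 100.00 72.55 | 98.99 81.63 | 53.00 40.90 | 556.00 412.30 |'

printf '%s\n' "${sample_report}" | mean_gpu_utilization   # prints 69.13
```

In a real job, the function would read the job's stdout file instead of the embedded sample, for example `mean_gpu_utilization < rochpl.128nodes.<jobid>.out`.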
In addition, Prometheus data will be saved in the following location on the Lustre file system:
/lustre/orion/<project>/scratch/<user>/omniwatch
The output directory can be overridden with an environment variable:
export OMNIWATCH_PROMSERVER_DATADIR=<desired-data-dir>
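As a minimal sketch of the override, the lines below select a separate data directory per job. Only `OMNIWATCH_PROMSERVER_DATADIR` comes from this page; `OMNIWATCH_DATA_ROOT`, the per-job layout, and the fallback to the current directory are assumptions made for illustration. The export presumably needs to happen before the collectors are started in step (1b).

```shell
#!/bin/bash
# Sketch only: one Prometheus data directory per job.
# OMNIWATCH_DATA_ROOT is an illustrative variable, not an omniwatch setting;
# the fallback to the current directory is a placeholder.
data_root="${OMNIWATCH_DATA_ROOT:-${PWD}/omniwatch-data}"
job_tag="${SLURM_JOB_ID:-interactive}"

export OMNIWATCH_PROMSERVER_DATADIR="${data_root}/${job_tag}"

# Create the directory up front so a permissions problem surfaces early.
mkdir -p "${OMNIWATCH_PROMSERVER_DATADIR}"
echo "Prometheus data directory: ${OMNIWATCH_PROMSERVER_DATADIR}"
```

Keeping each job's data in its own directory makes it easier to archive or delete results for a single run.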