User‐Mode
Karl W. Schulz edited this page Mar 17, 2024
The following shows how to enable Omniwatch monitoring in user space on ORNL's Frontier or Crusher systems. The example SLURM job below follows three basic steps:
- enable Omniwatch data collection before running your application
- run your application(s) as usual
- tear down the data collection
#!/bin/bash
#SBATCH -J omniwatch
#SBATCH -o rochpl.128nodes.%j.out
#SBATCH -N 128
#SBATCH -n 1024
#SBATCH -t 0:45:00
#SBATCH -A ven114
#SBATCH --cpu-freq=high
#SBATCH -S 0
# (1a) Setup Omniwatch environment
ml use /autofs/nccs-svm1_sw/crusher/amdsw/modules
ml omniwatch/0.1.0
# (1b) Enable data collectors and polling (60 sec interval)
${OMNIWATCH_DIR}/omni_util.py --startexporters --use_pdsh
${OMNIWATCH_DIR}/omni_util.py --startserver --interval 60
# (2) Run your desired application(s) as normal
srun ./a.out
# (3) Tear down data collection
${OMNIWATCH_DIR}/query.py --job ${SLURM_JOB_ID}
${OMNIWATCH_DIR}/omni_util.py --stop
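As a possible variation on step (3), the two teardown commands can be wrapped in a shell function and registered with `trap`, so that they still run if the job script exits early (for example under `set -e`). This is only a sketch: the function name `omniwatch_teardown` is hypothetical and not part of Omniwatch, and it assumes `OMNIWATCH_DIR` and `SLURM_JOB_ID` are set as in the job script above.

```shell
# Sketch only: omniwatch_teardown is a hypothetical helper, not an Omniwatch command.
# It wraps the two teardown commands from step (3) of the example job.
omniwatch_teardown() {
    # Print the job summary; keep going even if the query fails
    "${OMNIWATCH_DIR}/query.py" --job "${SLURM_JOB_ID}" \
        || echo "omniwatch query failed" >&2
    # Stop the data collection
    "${OMNIWATCH_DIR}/omni_util.py" --stop
}

# In the job script, register the helper before step (2):
#   trap omniwatch_teardown EXIT
#   srun ./a.out
```

With the trap registered, the explicit step (3) commands at the end of the script would be dropped so that teardown runs only once.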
If the job succeeds, its stdout will include additional output from the Omniwatch startup and teardown steps. An example of the summary information is shown below:
----------------------------------------
HPC Report Card for Job # 453814
----------------------------------------
Job Overview (Num Nodes = 128, Machine = HPC Fund)
--> Start time = 2024-03-17 16:58:54
--> End time = 2024-03-17 17:06:55
GPU Statistics:
      | Utilization (%) | Memory Use (%)  | Temperature (C) |    Power (W)    |
GPU # |     Max    Mean |     Max    Mean |     Max    Mean |     Max    Mean |
-------------------------------------------------------------------------------
    0 |  100.00   56.14 |   50.75   50.75 |   56.00   47.86 |  535.00  408.57 |
    1 |  100.00   52.71 |   50.75   50.75 |   59.00   51.43 |  535.00  408.57 |
    2 |  100.00   52.00 |   50.75   50.75 |   51.00   44.43 |  538.00  410.29 |
    3 |  100.00   52.29 |   50.75   50.75 |   54.00   47.57 |  538.00  410.29 |
    4 |  100.00   36.57 |   50.75   50.75 |   53.00   46.29 |  539.00  408.57 |
    5 |  100.00   51.14 |   50.75   50.75 |   56.00   49.29 |  539.00  408.57 |
    6 |  100.00   52.00 |   50.75   50.75 |   47.00   40.29 |  526.00  401.57 |
    7 |  100.00   58.71 |   50.75   50.75 |   52.00   45.43 |  526.00  401.57 |
--
Query execution time = 0.5 secs
Version = 5828415
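To post-process the report card, the per-GPU rows can be pulled out of the job stdout with `awk`. The sketch below prints the GPU index and its mean utilization. It assumes the row layout shown in the example above (a GPU index, then `|`-separated pairs of max and mean values), which may differ in other Omniwatch versions. The function name `gpu_mean_utilization` is hypothetical.

```shell
# Sketch only: gpu_mean_utilization is a hypothetical helper name.
# Reads report-card text from the given file(s), or from stdin if none are given,
# and prints "<gpu index> <mean utilization>" for each GPU row.
gpu_mean_utilization() {
    awk -F'|' '
        # GPU rows have an integer index in the first column
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ && NF >= 5 {
            gsub(/[ \t]/, "", $1)    # strip padding around the index
            split($2, util, " ")     # util[1] = max, util[2] = mean
            printf "%s %s\n", $1, util[2]
        }
    ' "$@"
}
```

For example, `gpu_mean_utilization rochpl.128nodes.<jobid>.out` would read the output file named by the `-o` directive in the job script.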
In addition, the Prometheus data will be saved in the following location on the Lustre file system:
/lustre/orion/<project>/scratch/<user>/omniwatch
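As a small convenience, the path above can be assembled from a project ID and user name. This is a sketch under assumptions: the function name `omniwatch_data_dir` is hypothetical, and it is assumed that `<project>` is the project ID as it appears under `/lustre/orion` and that `<user>` defaults to the current `$USER`.

```shell
# Sketch only: omniwatch_data_dir is a hypothetical helper name.
# Usage: omniwatch_data_dir <project> [<user>]
omniwatch_data_dir() {
    local project="$1"
    local user="${2:-$USER}"    # default to the current user
    printf '/lustre/orion/%s/scratch/%s/omniwatch\n' "$project" "$user"
}

# Example: list the saved data for a project (directory must already exist)
#   ls "$(omniwatch_data_dir <project>)"
```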