VSC ReFrame meeting 2022-02-10

Sam Moors edited this page Feb 10, 2022 · 1 revision

Attendees

  • Sam Moors (VUB)
  • Kenneth Hoste (UGent)
  • Franky Backeljauw (Antwerp)
  • Michele Pugno (Antwerp)
  • Robin Verschoren (Antwerp)
  • Maxime Van den Bossche (Leuven)
  • Steven Vandenbrande (Leuven)

Agenda

Notes

  • currently separate ReFrame systems are defined for Hydra, Hortense, ...
    • could also have a common vsc system
    • could also be tags instead?
      • vsc tag for tests that should work anywhere
      • site-specific tags for tests that don't work everywhere (yet): ugent, vub, kul, ...
      • mpi, single_node, gpu?
  • current tests work for:
    • VUB (Sam)
    • Hortense (Kenneth)
    • KUL (Steven)
      • default launcher in ReFrame config: mpirun
      • the KUL tests assume this launcher
    • UAntwerpen (Michele)
  • launcher: site-specific, or the same one everywhere?
    • mpirun (user-focused/Torque) vs srun (Slurm)
    • allow a site-specific launcher (selected via a regular expression on vsc:* tags?)
      • vsc:torque, vsc:slurm
  • identify site via $VSC_INSTITUTE environment variable
  • use a common version of ReFrame: 3.10.1 (developers are reckless)
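
The site, launcher, and tag ideas above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the `$VSC_INSTITUTE` values used as dictionary keys (`brussel`, `leuven`, `gent`), the helper names, and the default launcher are hypothetical; only the launcher and tag names themselves come from these notes. UAntwerpen is left out of the mapping because its launcher differs per cluster.

```python
import os

# Assumed $VSC_INSTITUTE values mapped to the launchers mentioned in the notes.
SITE_LAUNCHERS = {"brussel": "srun", "leuven": "mpirun", "gent": "mympirun"}

# Assumed mapping from $VSC_INSTITUTE values to the site tags proposed above.
SITE_TAGS = {"brussel": "vub", "leuven": "kul", "gent": "ugent"}


def detect_site(env=None):
    """Return the site name from $VSC_INSTITUTE, lower-cased."""
    env = os.environ if env is None else env
    site = env.get("VSC_INSTITUTE")
    if not site:
        raise RuntimeError("VSC_INSTITUTE is not set")
    return site.lower()


def launcher_for(site, default="mpirun"):
    """Pick the MPI launcher for a site, falling back to a default."""
    return SITE_LAUNCHERS.get(site, default)


def is_selected(test_tags, site):
    """A test is selected if it carries the common 'vsc' tag
    or the tag of the site we are running on."""
    site_tag = SITE_TAGS.get(site)
    return "vsc" in test_tags or (site_tag is not None and site_tag in test_tags)
```

For example, `is_selected({"vsc", "mpi"}, "leuven")` is true because the test carries the common tag, while a test tagged only `ugent` would be skipped on the Leuven systems.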

Conformity checks

  • agreed in CUE (see list at ...)
    • presence and validity of VSC_ environment variables
    • presence of system tools like singularity, ... + version
    • availability + path of different shared filesystems (home/data)
      • important for e.g. Globus
    • local testing (Ghent VSC account on Ghent system) vs cross-site testing
  • how do users check storage quota (tools?)
    • UAntwerpen uses myquota command
    • UGent Tier-2: via the account page
    • Tier-1 Hortense scratch:
  • ideas
    • availability of common software modules?
      • ReFrame module
      • EESSI stack?
      • toolchains
    • MPI launcher?
      • srun (VUB, UA @ Vaughan)
      • mpirun (KUL, UA @ Leibniz)
      • mympirun (UGent)
  • testing the VSC network (connecting to other VSC sites, ...)
    • perfSONAR project (VSC project @ UA), iperf
      • should not be part of ReFrame test suite
      • continuous performance monitoring
      • connectivity + performance
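
A minimal sketch of the first two conformity checks (environment variables and system tools), written with the standard library only so it can be tried outside ReFrame. The function names are hypothetical, and the variable names in the usage note (`VSC_HOME`, `VSC_DATA`, `VSC_SCRATCH`) are assumptions; the agreed list lives in the CUE document.

```python
import os
import re
import shutil


def check_env_vars(env, required, path_vars=()):
    """Return a list of problems: required variables that are unset or empty,
    and path variables whose value does not exist on this system."""
    problems = []
    for name in required:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in path_vars:
        value = env.get(name)
        if value and not os.path.exists(value):
            problems.append(f"{name} points to missing path {value}")
    return problems


def parse_version(text):
    """Extract the first dotted version number from a tool's version output."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def version_at_least(text, minimum):
    """Compare versions numerically, so that 1.10 is newer than 1.2."""
    return parse_version(text) >= parse_version(minimum)


def tool_available(command):
    """True if the command is found on $PATH."""
    return shutil.which(command) is not None
```

A call such as `check_env_vars(os.environ, ["VSC_INSTITUTE", "VSC_HOME"], ["VSC_HOME", "VSC_DATA", "VSC_SCRATCH"])` returns an empty list when everything is in order, which maps naturally onto a pass/fail sanity check.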

Cluster health checks

  • submitting simple jobs to different partitions, queues (in order of importance)
    • single-core, multi-core, multi-node?, single-GPU?
      • different schedulers might be a problem, e.g. Torque vs Slurm
      • A job that tests itself and the env variables of the executing instance
      • test the node file or equivalent env variable
    • tests verify that recommendations in docs work (and keep working)
    • also list jobs, delete jobs, ...
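
The "job that tests itself" idea could look roughly like the sketch below, run from inside the job. It assumes the usual scheduler variables are present (`PBS_JOBID` and `PBS_NODEFILE` under Torque, `SLURM_JOB_ID` and `SLURM_JOB_NUM_NODES` under Slurm); the function names are hypothetical.

```python
def detect_scheduler(env):
    """Guess the scheduler from the job's environment variables."""
    if "SLURM_JOB_ID" in env:
        return "slurm"
    if "PBS_JOBID" in env:
        return "torque"
    return None


def count_unique_hosts(lines):
    """Count distinct host names; a Torque node file repeats a host per core."""
    return len({line.strip() for line in lines if line.strip()})


def node_count(env):
    """Number of nodes allocated to the running job."""
    scheduler = detect_scheduler(env)
    if scheduler == "slurm":
        return int(env["SLURM_JOB_NUM_NODES"])
    if scheduler == "torque":
        with open(env["PBS_NODEFILE"]) as handle:
            return count_unique_hosts(handle)
    raise RuntimeError("not running inside a Torque or Slurm job")
```

A health check can then compare `node_count(os.environ)` with the number of nodes that was requested at submission time.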

Cluster performance tests

  • CPU tests
    • HPL (LINPACK)
    • c-ray (ray tracing)
    • BLAS-Tester
  • memory tests
    • STREAM
  • shared storage tests
    • IOR
  • network tests
    • OSU Microbenchmarks (latency, bandwidth)
    • basic MPI tests (hello world, ring, ...)
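
For the performance tests, the core step is extracting a figure of merit from the benchmark output and comparing it with a reference. The sketch below does this for STREAM's Triad rate. The sample output is made up for illustration, the reference value and tolerance are assumptions, and the helper names are hypothetical; inside ReFrame the same idea would be expressed with its own sanity and performance functions.

```python
import re

# Illustrative STREAM-style output; the numbers are invented.
SAMPLE_OUTPUT = """\
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           11234.5     0.014300     0.014242     0.014400
Triad:          12345.6     0.019500     0.019440     0.019600
"""


def triad_bandwidth(output):
    """Return the best Triad rate (MB/s) reported in STREAM output."""
    match = re.search(r"^Triad:\s+(\d+(?:\.\d+)?)", output, re.MULTILINE)
    if match is None:
        raise ValueError("no Triad line found")
    return float(match.group(1))


def within_reference(value, reference, lower=-0.1):
    """True if value is no more than |lower| (as a fraction) below reference."""
    return value >= reference * (1.0 + lower)
```

With these helpers, `within_reference(triad_bandwidth(SAMPLE_OUTPUT), 12000.0)` accepts the sample run, while a run reporting 10000 MB/s would fall outside a 10% tolerance.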

Application tests (commonly used software)

  • CP2K
  • GROMACS
  • Python, numpy
  • R, Bioconductor
  • TensorFlow
  • OpenFOAM
  • collect data about:
    • functionality, verification of results, performance

Security issues

Links

How to run/check the tests?

  • common repo: https://github.com/vscentrum/vsc-test-suite
  • run from any site, automatically spawn to all clusters?
    • dedicated credits account on Leuven systems + Hortense?
  • how to collect/present the data?
    • currently ReFrame only logs perf data
    • send logs to ELK stack?
      • Graylog + Grafana?
      • push it back to the GitHub repo? (easier way)
      • otherwise, how do we access a running server with a log manager?
  • run weekly/monthly?
    • Difference between large and small tests?
  • dealing with different scheduler frontends (Torque, Slurm)
    • using tags, create system partitions with a common prefix

Goals by next meeting

  • next meeting: Thu 10 Mar 2022 - 14:00

  • folder structure (reframe -R -r), create a tests folder with: (Michele)

    • run.sh --tags xxx --site yyy
    • tests
      • common.py
      • constants.py
        • UGENT = 'ugent'
      • cue
        • common.py
        • env.py
      • micro
        • mpi
          • common.py
          • hello.py
      • apps
        • python
          • numpy.py
        • openfoam
          • motorbike.py
  • CUE tests

    • env (Kenneth, Franky)
      • see CUE list
      • is it defined?
      • is value correct?
      • do path variables point to existing paths? (Different test class in the implementation)
    • tools (Sam, Michele)
      • is command available
      • check version (range, greater than or equal to)
    • shared FSs (Robin, Sam)
      • /home
      • /data/<site>/<account>
      • /scratch/...
  • MPI hello world (Steven, Kenneth)

  • Franky: share list of env + tools CUE

  • Michele's script to run all the tests:

export BIN_DIR=/apps/antwerpen/reframe/versions/current/bin
export TESTS_DIR=/apps/antwerpen/reframe/testsuite
export RFM_CONFIG_FILE=$TESTS_DIR/config/settings.py

$BIN_DIR/reframe -v --prefix $TESTS_DIR --perflogdir $TESTS_DIR/perflogs -s $TESTS_DIR/stage -o $TESTS_DIR/output -c $TESTS_DIR -R -r --performance-report
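
As a possible starting point for the planned `run.sh --tags xxx --site yyy` wrapper, the command above can be assembled programmatically. This is a sketch, not the agreed script: the function name is hypothetical, and mapping `--site` onto ReFrame's `--system` option is an assumption. The ReFrame options themselves are the ones used in the command above, plus `--tag` and `--system`.

```python
import os
import shlex


def build_reframe_command(bin_dir, tests_dir, tags=(), system=None):
    """Build the reframe command line used above, optionally restricted
    to the given tags and to one configured system."""
    cmd = [
        os.path.join(bin_dir, "reframe"), "-v",
        "--prefix", tests_dir,
        "--perflogdir", os.path.join(tests_dir, "perflogs"),
        "-s", os.path.join(tests_dir, "stage"),
        "-o", os.path.join(tests_dir, "output"),
        "-c", tests_dir, "-R", "-r", "--performance-report",
    ]
    for tag in tags:
        cmd += ["--tag", tag]
    if system:
        cmd += ["--system", system]
    return cmd


if __name__ == "__main__":
    command = build_reframe_command(
        "/apps/antwerpen/reframe/versions/current/bin",
        "/apps/antwerpen/reframe/testsuite",
        tags=["vsc"],
    )
    # Print instead of executing, so the command can be inspected first.
    print(shlex.join(command))
```

Keep in mind that, as far as the ReFrame documentation describes it, repeating `--tag` narrows the selection to tests that match all given tags, so selecting "vsc or site-specific" tests may need separate runs or a tag pattern.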