Usage on Various Compute Clusters
This document presents step-by-step instructions for installing and training Saber on various compute clusters.
These instructions are written for the Béluga cluster in particular, but usage across all Compute Canada (CC) clusters should be nearly identical.
Start by SSH'ing into a login node, e.g.
$ ssh <username>@beluga.computecanada.ca
Then clone the repo to your PROJECT folder:
# "def-someuser" will be the group you belong to
$ PROJECT_DIR=~/projects/<def-someuser>/<username>
$ cd $PROJECT_DIR
$ git clone https://github.com/BaderLab/saber.git
$ cd saber
Next, we will create a virtual environment and install Saber and all of its dependencies. Note that you only need to do this once.
# Path to where the environment will be created
$ ENVDIR=~/saber
# Create a virtual environment
$ module load python/3.7 cuda/10.0
$ virtualenv --no-download $ENVDIR
$ source $ENVDIR/bin/activate
(saber) $ pip install --upgrade pip
# Packages available in the CC wheelhouse
(saber) $ pip install scikit-learn torch pytorch_transformers Keras-Preprocessing spacy nltk neuralcoref --no-index
# Install Saber
(saber) $ git checkout development
(saber) $ pip install -e .
# Download and install a spaCy model (OPTIONAL; not required for training)
(saber) $ python -m spacy download en_core_web_md
# Install seqeval fork (TEMPORARY)
(saber) $ pip install git+https://github.com/JohnGiorgi/seqeval.git
# Install Apex (OPTIONAL)
(saber) $ module load gcc/7.3.0
(saber) $ git clone https://github.com/NVIDIA/apex
(saber) $ cd apex
(saber) $ python setup.py install --cpp_ext --cuda_ext
# Keep track of all requirements in this env so it can be recreated (OPTIONAL)
(saber) $ pip freeze > cc_requirements.txt
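If you ever need to recreate the environment, you can reinstall from this file. A minimal sketch, assuming cc_requirements.txt was generated as above and using a hypothetical new location ~/saber2 (packages installed from GitHub, such as Saber itself, the seqeval fork and Apex, may still need to be reinstalled from source):
# Recreate the environment from the saved requirements file
$ module load python/3.7 cuda/10.0
$ virtualenv --no-download ~/saber2
$ source ~/saber2/bin/activate
(saber2) $ pip install --upgrade pip
(saber2) $ pip install -r cc_requirements.txt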
Make a directory to store your datasets, e.g.
(saber) $ mkdir $PROJECT_DIR/saber/datasets
Place any datasets you would like to train on in this folder.
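For example, to upload a dataset from your local machine (run this on your local machine; my_corpus is a hypothetical dataset directory):
$ scp -r my_corpus <username>@beluga.computecanada.ca:projects/<def-someuser>/<username>/saber/datasets/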
Because the compute nodes do not have internet access, you will need to download a BERT model on a login node. Note that you only have to do this once.
- If you want to use the default BERT model (BioBERT v1.1; recommended), simply call a training session and cancel it as soon as training begins:
(saber) $ python -m saber.cli.train --dataset_folder path/to/dataset
- If you want to use one of the BERT models from pytorch-transformers (see here for a list of pre-trained BERT models), first set saber.constants.PRETRAINED_BERT_MODEL to your model name and run a training session, cancelling it as soon as training begins (as above). A rough sketch of this edit is shown after the list.
- If you want to supply your own model, simply set saber.constants.PRETRAINED_BERT_MODEL to your model's path on disk. There is no need to run a training session.
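For the last two options, the edit would look something like the following (an illustrative snippet; the exact contents of saber/constants.py may differ):
# In saber/constants.py
# To use a pytorch-transformers model by name:
PRETRAINED_BERT_MODEL = 'bert-base-cased'
# Or, to use a model you have already downloaded to disk:
# PRETRAINED_BERT_MODEL = '/path/to/your/bert/model'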
To train the model, you will need to create a train.sh script. For example:
#!/bin/bash
#SBATCH --account=def-someuser
# Requested resources
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=10
# Wall time and job details
#SBATCH --time=1:00:00
#SBATCH --job-name=example
#SBATCH --output=./output/%j.txt
# Emails me when job starts, ends or fails
#SBATCH --mail-user=example@gmail.com
#SBATCH --mail-type=ALL
# Use this command to run the same job interactively
# salloc --account=def-someuser --nodes=1 --mem=0 --gres=gpu:1 --cpus-per-task=10 --time=0:30:00
# Load required modules and activate the environment
ENVDIR=~/saber
WORKDIR=/home/johnmg/projects/def-gbader/johnmg/saber
module load python/3.7 cuda/10.0
source $ENVDIR/bin/activate
cd $WORKDIR
# Train the model
python -m saber.cli.train --dataset_folder path/to/dataset
Submit this job to the queue with sbatch train.sh. To run the same job interactively, use:
salloc --account=def-someuser --nodes=1 --mem=0 --gres=gpu:1 --cpus-per-task=10 --time=0:30:00
Note that on Béluga, you should request a maximum of 10 CPUs per GPU.
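Once the job has been submitted, you can check on it with the standard Slurm commands, e.g.
# Show your queued and running jobs
$ squeue -u $USER
# Summarize a job after it finishes (replace <jobid> with the ID printed by sbatch)
$ sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS
# Follow the training log written by the --output directive in train.sh
$ tail -f ./output/<jobid>.txt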