# microPIPE on Zeus/Topaz @ Pawsey


## Accessing tool/workflow

The workflow can be downloaded from the GitHub page https://github.com/BeatsonLab-MicrobialGenomics/micropipe using the command:

```
git clone https://github.com/BeatsonLab-MicrobialGenomics/micropipe.git
```
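
After cloning, the pipeline is run from the repository directory, which contains the main.nf script and the nextflow.config file used in the Slurm script further below:

```bash
# Move into the cloned repository; main.nf and nextflow.config sit at its top level
cd micropipe
ls main.nf nextflow.config
```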

## Installation

* **Nextflow**: a modified version of Nextflow, capable of submitting jobs to Zeus, Topaz and Magnus, has been installed as a system module and can be loaded with `module load nextflow/20.07.1-multi` (a quick check of the module environment is shown after this list).
* **Singularity**: Singularity has been installed as a system module and can be loaded with `module load singularity/3.6.4`.
* **Guppy** (3.6.1 was the latest working version): due to the Oxford Nanopore Technologies terms and conditions, we are not allowed to redistribute the Guppy software in either binary or packaged form, e.g. as Docker or Singularity images. Users will therefore have to install Guppy themselves, provide a container image, or start the pipeline from basecalled fastq files. See the Usage section of the GitHub README for instructions.
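
Before submitting any jobs, it is worth confirming on a Zeus login node that the modules listed above resolve and report the expected versions (module names are taken from this page; adjust the versions if Pawsey has since updated them):

```bash
# Load the system modules required by the pipeline (versions as documented above)
module load nextflow/20.07.1-multi
module load singularity/3.6.4

# Confirm both tools are on the PATH and report their versions
nextflow -version
singularity --version
```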

## Quickstart tutorial

A tutorial is available on the GitHub page: https://github.com/BeatsonLab-MicrobialGenomics/micropipe#usage. The steps are summarised below, including the specific instructions required to run the pipeline at Pawsey Zeus.

**1. Prepare the Nextflow configuration file (nextflow.config)**

Use the configuration file provided here to run microPIPE at Pawsey Zeus.
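
As an optional sanity check (illustrative only, run from the repository directory), you can confirm that the configuration file defines the zeus profile and the slurm_account parameter used in the commands below:

```bash
# Look for the Zeus profile and the Slurm account parameter in the config file
grep -n "zeus" nextflow.config
grep -n "slurm_account" nextflow.config
```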

**2. Prepare the samplesheet file (csv)**

See the instructions at the microPIPE GitHub page, section "2. Prepare the samplesheet file".

**3. Prepare the Slurm script (e.g. nextflow_batch_template.sh)**

The pipeline is launched using a Slurm script submitted to Zeus. This script loads the required modules, defines the input/output directories and files, and includes the nextflow command line with optional parameters. Note that the configuration profile for the Zeus cluster must be selected when launching the pipeline by using the `-profile zeus` command line option, and that the Slurm account allocation must be specified with the `--slurm_account='director2172'` option (replace 'director2172' with your own account identifier).

```bash
#!/bin/bash

#SBATCH --job-name=micropipe
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --output=s%A.micropipe_guppy3.6.1_gpu_12samples.out
#SBATCH --error=s%A.micropipe_guppy3.6.1_gpu_12samples.err
#SBATCH --time=24:00:00

module load nextflow/20.07.1-multi
module load singularity/3.6.4

# Directory containing the nextflow.config file and the main.nf script
dir=/scratch/director2172/vmurigneux/micropipe
cd ${dir}
datadir=${dir}/Illumina
out_dir=${dir}/results_3.6.1_gpu

# Run option A, B, C or D depending on whether you are starting from ONT fast5 files (A or B) or fastq files (C or D)

# A) Workflow including basecalling, demultiplexing and assembly
fast5_dir=${dir}/fast5_pass
csv=${dir}/test_data/samples_all_basecalling.csv
nextflow main.nf --gpu true --basecalling -profile zeus --slurm_account='director2172' --demultiplexing --samplesheet ${csv} --outdir ${out_dir} --fast5 ${fast5_dir} --datadir ${datadir}
#nextflow main.nf --gpu false --basecalling --guppy_num_callers 16 -profile zeus --slurm_account='director2172' --demultiplexing --samplesheet ${csv} --outdir ${out_dir} --fast5 ${fast5_dir} --datadir ${datadir}

# B) Workflow including basecalling and assembly (skipping the demultiplexing step)
#fast5_dir=${dir}/fast5_pass
#csv=${dir}/test_data/samples_1_basecalling_single_isolate.csv
#nextflow main.nf --basecalling --samplesheet ${csv} --outdir ${out_dir} --fast5 ${fast5_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'

# C) Workflow including demultiplexing and assembly
#fastq_dir=${dir}/fastq
#csv=${dir}/test_data/samples_1_basecalling.csv
#nextflow main.nf --demultiplexing --samplesheet ${csv} --outdir ${out_dir} --fastq ${fastq_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'

# D) Assembly-only workflow (skipping the basecalling and demultiplexing steps)
#csv=${dir}/test_data/samples_1.csv
#nextflow main.nf --samplesheet ${csv} --outdir ${out_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'

# To restart the pipeline after a failure, correct the issue and rerun with the -resume flag
#nextflow main.nf -resume --samplesheet ${csv} --outdir ${out_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'
```

**4. Run the pipeline by submitting a job at Pawsey Zeus**

```
sbatch nextflow_batch_template.sh
```
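
Once the job is submitted, its progress can be followed with standard Slurm commands and by tailing the log files named in the script's #SBATCH directives (file names follow the template above; replace <jobid> with the Slurm job ID reported by sbatch):

```bash
# Check the queue for your jobs on Zeus
squeue -u $USER

# Follow the pipeline log as it runs
tail -f s<jobid>.micropipe_guppy3.6.1_gpu_12samples.out
```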

## Optimisation required

microPIPE was originally developed on a cluster where jobs could be submitted to both CPU nodes and a GPU node. At Pawsey, the CPU and GPU nodes are accessed from different clusters, i.e. Zeus (CPU) and Topaz (GPU). Therefore, a modified version of Nextflow, capable of submitting jobs to Zeus, Topaz and Magnus, has been installed as a system module.

* The modified Nextflow module should be loaded prior to running the main nextflow command, using `module load nextflow/20.07.1-multi`.
* The microPIPE pipeline is launched using a Slurm script submitted to Zeus.
* Nextflow then automatically submits the GPU tasks to Topaz and the CPU tasks to Zeus.

A template script for patching Nextflow to support multiple clusters is available (thanks to @marcodelapierre):
https://github.com/marcodelapierre/toy-gpu-nf/blob/master/extra/install-nextflow-hack-slurm-multi-cluster.sh
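
For context, a conceptual sketch of the mechanism, assuming standard Slurm multi-cluster support; the exact behaviour of the patched Nextflow build is defined by the script linked above, not by this example:

```bash
# Slurm can route a job to another cluster with the --clusters (-M) option, e.g.
# submitting a GPU task to Topaz from the Zeus environment (the partition name
# 'gpuq' is an assumption here; check the Pawsey documentation for your system)
sbatch --clusters=topaz --partition=gpuq --gres=gpu:1 gpu_task.sh

# The patched Nextflow module applies this kind of per-process routing, so that
# GPU processes (e.g. Guppy basecalling) land on Topaz while CPU processes stay on Zeus.
```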


## Infrastructure usage and benchmarking

### Summary

#### Exemplar 1: Assembly of 12 *E. coli* ST131 samples using GPU and CPU resources


| Strain | Chromosome/plasmid | Size (bp) | Circularised? |
|--------|--------------------|-----------|---------------|
| S24EC | Chromosome | 5078304 | Yes |
|       | Plasmid A | 114708 | Yes |
| S34EC | Chromosome | 5050427 | Yes |
|       | Plasmid A | 153321 | Yes |
|       | Plasmid B | 108135 | Yes |
| S37EC | Chromosome | 4981928 | Yes |
|       | Plasmid A | 157642 | Yes |
|       | Plasmid B | 61072 | Yes |
| S39EC | Chromosome | 5054402 | Yes |
|       | Plasmid A | 141007 | Yes |
|       | Plasmid B | 94979 | Yes |
|       | Plasmid C | 68049 | Yes |
|       | Plasmid D | 62085 | Yes |
|       | Plasmid E | 2070 | Yes |
|       | Plasmid F | 1846 | Yes |
| S65EC | Chromosome | 5205011 | Yes |
|       | Plasmid A | 147412 | Yes |
| S96EC | Chromosome | 5069496 | Yes |
|       | Plasmid A | 164355 | Yes |
|       | Plasmid B | 115965 | Yes |
|       | Plasmid C | 14479 | Yes |
|       | Plasmid D | 4184 | Yes |
| S97EC | Chromosome | 5178868 | Yes |
|       | Plasmid A | 166099 | Yes |
|       | Plasmid B | 96788 | Yes |
|       | Plasmid C | 4092 | Yes |
|       | Plasmid D | 3209 | Yes |
| S112EC | Chromosome | 5020013 | Yes |
|        | Plasmid A | 161028 | Yes |
|        | Plasmid B | 68847 | Yes |
|        | Plasmid C | 5338 | Yes |
|        | Plasmid D | 4136 | Yes |
| S116EC | Chromosome | 4989207 | Yes |
|        | Plasmid A | 66792 | Yes |
|        | Plasmid B | 5263 | Yes |
|        | Plasmid C | 4257 | Yes |
|        | Plasmid D | 4104 | Yes |
| S129EC | Chromosome | 5193964 | Yes |
|        | Plasmid A | 163681 | Yes |
|        | Plasmid B | 93505 | Yes |
|        | Plasmid C | 33344 | Yes |
|        | Plasmid D | 4087 | Yes |
|        | Plasmid E | 2401 | Yes |
|        | Plasmid F | 2121 | Yes |
|        | Plasmid G | 1571 | Yes |
| EC958 | Chromosome | 5126816 | Yes |
|       | Plasmid A | 136157 | Yes |
|       | Plasmid B | 4145 | Yes |
|       | Plasmid C | 1830 | Yes |
| HVM2044 | Chromosome | 5003288 | Yes |
|         | Plasmid A | 142959 | Yes |
|         | Plasmid B | 18716 | Yes |
|         | Plasmid C | 18345 | Yes |

#### Exemplar 2: Assembly of 12 *E. coli* ST131 samples using CPU resources


## Acknowledgements / citations / credits

* The deployment of the workflow at the Pawsey Supercomputing Centre was supported by the Australian BioCommons via funding from Bioplatforms Australia, the Australian Research Data Commons (https://doi.org/10.47486/PL105) and the Queensland Government RICF programme. Bioplatforms Australia and the Australian Research Data Commons are funded by the National Collaborative Research Infrastructure Strategy (NCRIS).
* Marco de la Pierre (Pawsey Supercomputing Centre) @marcodelapierre
* Johan Gustafsson (Australian BioCommons) @supernord