# microPIPE on Zeus/Topaz @ Pawsey


## Accessing tool/workflow

The workflow can be downloaded from the GitHub page https://github.com/BeatsonLab-MicrobialGenomics/micropipe using the command:

```
git clone https://github.com/BeatsonLab-MicrobialGenomics/micropipe.git
```
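
After cloning, the pipeline is run from the repository directory, which contains the main.nf script and the nextflow.config file used in the Slurm script further below:

```bash
# Move into the cloned repository; main.nf and nextflow.config sit at its top level
cd micropipe
ls main.nf nextflow.config
```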

## Installation

* **Nextflow**: a modified version of Nextflow, capable of submitting jobs to Zeus, Topaz and Magnus, has been installed as a system module and can be loaded with `module load nextflow/20.07.1-multi` (a quick check of the module environment is shown after this list).
* **Singularity**: Singularity has been installed as a system module and can be loaded with `module load singularity/3.6.4`.
* **Guppy** (3.6.1 was the latest working version): due to the Oxford Nanopore Technologies terms and conditions, we are not allowed to redistribute the Guppy software in either binary or packaged form, e.g. as Docker or Singularity images. Users will therefore have to install Guppy themselves, provide a container image, or start the pipeline from basecalled fastq files. See the Usage section of the GitHub README for instructions.
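
Before submitting any jobs, it is worth confirming on a Zeus login node that the modules listed above resolve and report the expected versions (module names are taken from this page; adjust the versions if Pawsey has since updated them):

```bash
# Load the system modules required by the pipeline (versions as documented above)
module load nextflow/20.07.1-multi
module load singularity/3.6.4

# Confirm both tools are on the PATH and report their versions
nextflow -version
singularity --version
```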

## Quickstart tutorial

A tutorial is available on the GitHub page: https://github.com/BeatsonLab-MicrobialGenomics/micropipe#usage. The steps are summarised below, including the specific instructions required to run the pipeline at Pawsey Zeus.

**1. Prepare the Nextflow configuration file (nextflow.config)**

Use the configuration file provided here to run microPIPE at Pawsey Zeus.
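
As an optional sanity check (illustrative only, run from the repository directory), you can confirm that the configuration file defines the zeus profile and the slurm_account parameter used in the commands below:

```bash
# Look for the Zeus profile and the Slurm account parameter in the config file
grep -n "zeus" nextflow.config
grep -n "slurm_account" nextflow.config
```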

**2. Prepare the samplesheet file (csv)**

See the instructions at the microPIPE GitHub page, section "2. Prepare the samplesheet file".

**3. Prepare the Slurm script (e.g. nextflow_batch_template.sh)**

The pipeline is launched using a Slurm script submitted to Zeus. This script loads the required modules, defines the input/output directories and files, and includes the nextflow command line with optional parameters. Note that the configuration profile for the Zeus cluster must be selected when launching the pipeline by using the `-profile zeus` command line option, and that the Slurm account allocation must be specified with the `--slurm_account='director2172'` option (replace 'director2172' with your own account identifier).

```bash
#!/bin/bash

#SBATCH --job-name=micropipe
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --output=s%A.micropipe_guppy3.6.1_gpu_12samples.out
#SBATCH --error=s%A.micropipe_guppy3.6.1_gpu_12samples.err
#SBATCH --time=24:00:00

module load nextflow/20.07.1-multi
module load singularity/3.6.4

# Directory containing the nextflow.config file and the main.nf script
dir=/scratch/director2172/vmurigneux/micropipe
cd ${dir}
datadir=${dir}/Illumina
out_dir=${dir}/results_3.6.1_gpu

# Run option A, B, C or D depending on whether you are starting from ONT fast5 files (A or B) or fastq files (C or D)

# A) Workflow including basecalling, demultiplexing and assembly
fast5_dir=${dir}/fast5_pass
csv=${dir}/test_data/samples_all_basecalling.csv
nextflow main.nf --gpu true --basecalling -profile zeus --slurm_account='director2172' --demultiplexing --samplesheet ${csv} --outdir ${out_dir} --fast5 ${fast5_dir} --datadir ${datadir}
#nextflow main.nf --gpu false --basecalling --guppy_num_callers 16 -profile zeus --slurm_account='director2172' --demultiplexing --samplesheet ${csv} --outdir ${out_dir} --fast5 ${fast5_dir} --datadir ${datadir}

# B) Workflow including basecalling and assembly (skipping the demultiplexing step)
#fast5_dir=${dir}/fast5_pass
#csv=${dir}/test_data/samples_1_basecalling_single_isolate.csv
#nextflow main.nf --basecalling --samplesheet ${csv} --outdir ${out_dir} --fast5 ${fast5_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'

# C) Workflow including demultiplexing and assembly
#fastq_dir=${dir}/fastq
#csv=${dir}/test_data/samples_1_basecalling.csv
#nextflow main.nf --demultiplexing --samplesheet ${csv} --outdir ${out_dir} --fastq ${fastq_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'

# D) Assembly-only workflow (skipping the basecalling and demultiplexing steps)
#csv=${dir}/test_data/samples_1.csv
#nextflow main.nf --samplesheet ${csv} --outdir ${out_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'

# To restart the pipeline after a failure, correct the issue and rerun with the -resume flag
#nextflow main.nf -resume --samplesheet ${csv} --outdir ${out_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'
```

**4. Run the pipeline by submitting a job at Pawsey Zeus**

```
sbatch nextflow_batch_template.sh
```
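
Once the job is submitted, its progress can be followed with standard Slurm commands and by tailing the log files named in the script's #SBATCH directives (file names follow the template above; replace <jobid> with the Slurm job ID reported by sbatch):

```bash
# Check the queue for your jobs on Zeus
squeue -u $USER

# Follow the pipeline log as it runs
tail -f s<jobid>.micropipe_guppy3.6.1_gpu_12samples.out
```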

## Optimisation required

microPIPE was originally developed on a cluster where jobs could be submitted to both CPU nodes and a GPU node. At Pawsey, the CPU and GPU nodes are accessed from different clusters, i.e. Zeus (CPU) and Topaz (GPU). Therefore, a modified version of Nextflow, capable of submitting jobs to Zeus, Topaz and Magnus, has been installed as a system module.

* The modified Nextflow module should be loaded prior to running the main nextflow command, using `module load nextflow/20.07.1-multi`.
* The microPIPE pipeline is launched using a Slurm script submitted to Zeus.
* Nextflow then automatically submits the GPU tasks to Topaz and the CPU tasks to Zeus.

A template script for patching Nextflow to support multiple clusters is available (thanks to @marcodelapierre):
https://github.com/marcodelapierre/toy-gpu-nf/blob/master/extra/install-nextflow-hack-slurm-multi-cluster.sh
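
For context, a conceptual sketch of the mechanism, assuming standard Slurm multi-cluster support; the exact behaviour of the patched Nextflow build is defined by the script linked above, not by this example:

```bash
# Slurm can route a job to another cluster with the --clusters (-M) option, e.g.
# submitting a GPU task to Topaz from the Zeus environment (the partition name
# 'gpuq' is an assumption here; check the Pawsey documentation for your system)
sbatch --clusters=topaz --partition=gpuq --gres=gpu:1 gpu_task.sh

# The patched Nextflow module applies this kind of per-process routing, so that
# GPU processes (e.g. Guppy basecalling) land on Topaz while CPU processes stay on Zeus.
```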


## Infrastructure usage and benchmarking

### Summary

#### Exemplar 1: Assembly of 12 *E. coli* ST131 samples using GPU and CPU resources


| Strain | Chromosome/plasmid | Size (bp) | Circularised? |
|--------|--------------------|-----------|---------------|
| S24EC | Chromosome | 5078304 | Yes |
|       | Plasmid A | 114708 | Yes |
| S34EC | Chromosome | 5050427 | Yes |
|       | Plasmid A | 153321 | Yes |
|       | Plasmid B | 108135 | Yes |
| S37EC | Chromosome | 4981928 | Yes |
|       | Plasmid A | 157642 | Yes |
|       | Plasmid B | 61072 | Yes |
| S39EC | Chromosome | 5054402 | Yes |
|       | Plasmid A | 141007 | Yes |
|       | Plasmid B | 94979 | Yes |
|       | Plasmid C | 68049 | Yes |
|       | Plasmid D | 62085 | Yes |
|       | Plasmid E | 2070 | Yes |
|       | Plasmid F | 1846 | Yes |
| S65EC | Chromosome | 5205011 | Yes |
|       | Plasmid A | 147412 | Yes |
| S96EC | Chromosome | 5069496 | Yes |
|       | Plasmid A | 164355 | Yes |
|       | Plasmid B | 115965 | Yes |
|       | Plasmid C | 14479 | Yes |
|       | Plasmid D | 4184 | Yes |
| S97EC | Chromosome | 5178868 | Yes |
|       | Plasmid A | 166099 | Yes |
|       | Plasmid B | 96788 | Yes |
|       | Plasmid C | 4092 | Yes |
|       | Plasmid D | 3209 | Yes |
| S112EC | Chromosome | 5020013 | Yes |
|        | Plasmid A | 161028 | Yes |
|        | Plasmid B | 68847 | Yes |
|        | Plasmid C | 5338 | Yes |
|        | Plasmid D | 4136 | Yes |
| S116EC | Chromosome | 4989207 | Yes |
|        | Plasmid A | 66792 | Yes |
|        | Plasmid B | 5263 | Yes |
|        | Plasmid C | 4257 | Yes |
|        | Plasmid D | 4104 | Yes |
| S129EC | Chromosome | 5193964 | Yes |
|        | Plasmid A | 163681 | Yes |
|        | Plasmid B | 93505 | Yes |
|        | Plasmid C | 33344 | Yes |
|        | Plasmid D | 4087 | Yes |
|        | Plasmid E | 2401 | Yes |
|        | Plasmid F | 2121 | Yes |
|        | Plasmid G | 1571 | Yes |
| EC958 | Chromosome | 5126816 | Yes |
|       | Plasmid A | 136157 | Yes |
|       | Plasmid B | 4145 | Yes |
|       | Plasmid C | 1830 | Yes |
| HVM2044 | Chromosome | 5003288 | Yes |
|         | Plasmid A | 142959 | Yes |
|         | Plasmid B | 18716 | Yes |
|         | Plasmid C | 18345 | Yes |

#### Exemplar 2: Assembly of 12 *E. coli* ST131 samples using CPU resources


## Acknowledgements / citations / credits

* The deployment of the workflow at the Pawsey Supercomputing Centre was supported by the Australian BioCommons via funding from Bioplatforms Australia, the Australian Research Data Commons (https://doi.org/10.47486/PL105) and the Queensland Government RICF programme. Bioplatforms Australia and the Australian Research Data Commons are funded by the National Collaborative Research Infrastructure Strategy (NCRIS).
* Marco de la Pierre (Pawsey Supercomputing Centre) @marcodelapierre
* Johan Gustafsson (Australian BioCommons) @supernord