The workflow can be downloaded from the GitHub page https://github.com/BeatsonLab-MicrobialGenomics/micropipe using the command:
git clone https://github.com/BeatsonLab-MicrobialGenomics/micropipe.git
- Nextflow
A modified version of Nextflow, capable of submitting jobs to Zeus, Topaz and Magnus, has been installed as a system module and can be accessed with the command:
module load nextflow/20.07.1-multi
- Singularity
Singularity has been installed as a system module and can be accessed with the command:
module load singularity/3.6.4
- Guppy (3.6.1 was the latest working version)
Due to the Oxford Nanopore Technologies terms and conditions, we are not allowed to redistribute the Guppy software, either in its binary form or in packaged form, e.g. Docker or Singularity images. Therefore, users will have to either install Guppy, provide a container image, or start the pipeline from the basecalled fastq files. See the Usage section for instructions.
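If you choose to provide your own container image, one option is to wrap a locally downloaded Guppy release in a Singularity image. The definition file below is a minimal sketch, not an official recipe: the installer filename, the CUDA base image and the install path are placeholders for whatever you download from the Oxford Nanopore community site.
Bootstrap: docker
From: nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04

%files
    # Placeholder filename: use the archive you downloaded from the ONT community site
    ont-guppy_3.6.1_linux64.tar.gz /opt/

%post
    cd /opt
    tar -xzf ont-guppy_3.6.1_linux64.tar.gz
    rm ont-guppy_3.6.1_linux64.tar.gz

%environment
    export PATH=/opt/ont-guppy/bin:$PATH
Build it on a machine where you have root access, e.g. with sudo singularity build guppy-3.6.1.sif guppy.def, then transfer the resulting .sif file to Pawsey.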
A tutorial is available on the GitHub page: https://github.com/BeatsonLab-MicrobialGenomics/micropipe#usage. The steps are summarised below, including the specific instructions required to run the pipeline at Pawsey Zeus.
1. Prepare the Nextflow configuration file (nextflow.config)
Use the configuration file provided here to run microPIPE at Pawsey Zeus.
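For orientation, the relevant parts of such a configuration typically define a zeus profile that runs processes through Slurm with Singularity enabled. The snippet below is a minimal sketch of that idea, not the actual file: the partition name is an assumption, and ${params.slurm_account} corresponds to the --slurm_account option used later.
profiles {
    zeus {
        process {
            executor = 'slurm'
            queue = 'workq'   // hypothetical partition name
            clusterOptions = "--account=${params.slurm_account}"
        }
        singularity {
            enabled = true
            autoMounts = true
        }
    }
}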
2. Prepare the samplesheet file (csv)
See the instructions at the microPIPE GitHub page, section "2. Prepare the samplesheet file".
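To give a flavour of the format, a samplesheet for a basecalling-plus-demultiplexing run pairs each isolate with its ONT barcode and Illumina reads, one row per sample. The header below is illustrative only (check the microPIPE README for the exact column names expected by each entry point), and the barcode and file names are placeholders:
sample_id,barcode_id,fastq_1,fastq_2
S24EC,barcode01,S24EC_R1.fastq.gz,S24EC_R2.fastq.gz
S34EC,barcode02,S34EC_R1.fastq.gz,S34EC_R2.fastq.gz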
3. Prepare the slurm script (e.g. nextflow_batch_template.sh)
The pipeline will be launched using a Slurm script submitted to Zeus. This script loads the required modules, defines the input/output directories and files, and includes the nextflow command line with optional parameters. Note that the Zeus configuration profile must be selected when launching the pipeline by using the "-profile zeus" command line option, and the Slurm account allocation must be provided by using the "--slurm_account='director2172'" option (replace 'director2172' with your account identifier).
#!/bin/bash
#SBATCH --job-name=micropipe
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --output=s%A.micropipe_guppy3.6.1_gpu_12samples.out
#SBATCH --error=s%A.micropipe_guppy3.6.1_gpu_12samples.err
#SBATCH --time=24:00:00
module load nextflow/20.07.1-multi
module load singularity/3.6.4
#directory containing the nextflow.config file and the main.nf script
dir=/scratch/director2172/vmurigneux/micropipe
cd ${dir}
datadir=${dir}/Illumina
out_dir=${dir}/results_3.6.1_gpu
#Run A, B, C or D depending on whether you are starting from ONT fast5 files (A or B) or fastq files (C or D)
#A) Workflow including basecalling, demultiplexing and assembly
fast5_dir=${dir}/fast5_pass
csv=${dir}/test_data/samples_all_basecalling.csv
nextflow main.nf --gpu true --basecalling -profile zeus --slurm_account='director2172' --demultiplexing --samplesheet ${csv} --outdir ${out_dir} --fast5 ${fast5_dir} --datadir ${datadir}
#nextflow main.nf --gpu false --basecalling --guppy_num_callers 16 -profile zeus --slurm_account='director2172' --demultiplexing --samplesheet ${csv} --outdir ${out_dir} --fast5 ${fast5_dir} --datadir ${datadir}
#B) Workflow including basecalling and assembly (skip demultiplexing step)
#fast5_dir=${dir}/fast5_pass
#csv=${dir}/test_data/samples_1_basecalling_single_isolate.csv
#nextflow main.nf --basecalling --samplesheet ${csv} --outdir ${out_dir} --fast5 ${fast5_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'
#C) Workflow including demultiplexing and assembly
#fastq_dir=${dir}/fastq
#csv=${dir}/test_data/samples_1_basecalling.csv
#nextflow main.nf --demultiplexing --samplesheet ${csv} --outdir ${out_dir} --fastq ${fastq_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'
#D) Assembly workflow (skip basecalling and demultiplexing step)
#csv=${dir}/test_data/samples_1.csv
#nextflow main.nf --samplesheet ${csv} --outdir ${out_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'
#to restart the pipeline if something failed, use the -resume flag after correcting the issue
#nextflow main.nf -resume --samplesheet ${csv} --outdir ${out_dir} --datadir ${datadir} -profile zeus --slurm_account='director2172'
4. Run the pipeline by submitting a job at Pawsey Zeus
sbatch nextflow_batch_template.sh
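Once submitted, progress can be followed with standard Slurm commands and by tailing the job's output file (named after the --output pattern in the script above; replace 1234567 with your Slurm job ID):
squeue -u $USER
tail -f s1234567.micropipe_guppy3.6.1_gpu_12samples.out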
microPIPE was originally developed on a cluster where jobs could be submitted to both CPU nodes and a GPU node. At Pawsey, the CPU and GPU nodes belong to different clusters, i.e. Zeus (CPU) and Topaz (GPU). Therefore, a modified version of Nextflow, capable of submitting jobs to Zeus, Topaz and Magnus, has been installed as a system module.
- The modified Nextflow module should be loaded prior to running the main nextflow command by using:
module load nextflow/20.07.1-multi
- The microPIPE pipeline will be launched using a Slurm script submitted to Zeus.
- As a result, Nextflow will automatically submit the GPU tasks to Topaz and the CPU tasks to Zeus.
Here is a template script to hack Nextflow for multiple clusters (thanks to @marcodelapierre):
https://github.com/marcodelapierre/toy-gpu-nf/blob/master/extra/install-nextflow-hack-slurm-multi-cluster.sh
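The practical effect of the hack is that a process's clusterOptions can carry the Slurm -M/--clusters flag, so GPU and CPU tasks land on different machines. The snippet below sketches what such routing could look like in a profile; the process label, partition names and GPU resource string are assumptions for illustration, not the actual microPIPE configuration.
process {
    // CPU tasks: submitted to Zeus (partition name is an assumption)
    clusterOptions = "--clusters=zeus --partition=workq --account=${params.slurm_account}"
    withLabel: gpu {
        // GPU tasks: submitted to Topaz (label and resources are illustrative)
        clusterOptions = "--clusters=topaz --partition=gpuq --gres=gpu:1 --account=${params.slurm_account}"
    }
}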
- We used the E. coli data from the microPIPE publication, available from the NCBI SRA BioProject PRJNA679678 (Oxford Nanopore) and BioProject PRJEB2968 (Illumina).
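For reference, the run accessions of a BioProject can be listed and downloaded with NCBI Entrez Direct and the SRA Toolkit; both tools are assumptions here (they are not part of microPIPE), and the output directory name is a placeholder:
# list run accessions for the BioProject (skip the CSV header)
esearch -db sra -query PRJNA679678 | efetch -format runinfo | awk -F',' 'NR>1 && $1!="" {print $1}' > ont_runs.txt
# download each run and convert it to fastq
prefetch --option-file ont_runs.txt
while read acc; do fasterq-dump "$acc" -O fastq; done < ont_runs.txt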
- See the Nextflow configuration file used here and the Slurm submission script here.
- See the Nextflow HTML execution report, trace report and HTML processes execution timeline.
- The table below summarises the assembly results for each strain.
Strain | Chromosome/plasmid | Size (bp) | Circularised? |
---|---|---|---|
S24EC | Chromosome | 5078304 | Yes |
S24EC | Plasmid A | 114708 | Yes |
S34EC | Chromosome | 5050427 | Yes |
S34EC | Plasmid A | 153321 | Yes |
S34EC | Plasmid B | 108135 | Yes |
S37EC | Chromosome | 4981928 | Yes |
S37EC | Plasmid A | 157642 | Yes |
S37EC | Plasmid B | 61072 | Yes |
S39EC | Chromosome | 5054402 | Yes |
S39EC | Plasmid A | 141007 | Yes |
S39EC | Plasmid B | 94979 | Yes |
S39EC | Plasmid C | 68049 | Yes |
S39EC | Plasmid D | 62085 | Yes |
S39EC | Plasmid E | 2070 | Yes |
S39EC | Plasmid F | 1846 | Yes |
S65EC | Chromosome | 5205011 | Yes |
S65EC | Plasmid A | 147412 | Yes |
S96EC | Chromosome | 5069496 | Yes |
S96EC | Plasmid A | 164355 | Yes |
S96EC | Plasmid B | 115965 | Yes |
S96EC | Plasmid C | 14479 | Yes |
S96EC | Plasmid D | 4184 | Yes |
S97EC | Chromosome | 5178868 | Yes |
S97EC | Plasmid A | 166099 | Yes |
S97EC | Plasmid B | 96788 | Yes |
S97EC | Plasmid C | 4092 | Yes |
S97EC | Plasmid D | 3209 | Yes |
S112EC | Chromosome | 5020013 | Yes |
S112EC | Plasmid A | 161028 | Yes |
S112EC | Plasmid B | 68847 | Yes |
S112EC | Plasmid C | 5338 | Yes |
S112EC | Plasmid D | 4136 | Yes |
S116EC | Chromosome | 4989207 | Yes |
S116EC | Plasmid A | 66792 | Yes |
S116EC | Plasmid B | 5263 | Yes |
S116EC | Plasmid C | 4257 | Yes |
S116EC | Plasmid D | 4104 | Yes |
S129EC | Chromosome | 5193964 | Yes |
S129EC | Plasmid A | 163681 | Yes |
S129EC | Plasmid B | 93505 | Yes |
S129EC | Plasmid C | 33344 | Yes |
S129EC | Plasmid D | 4087 | Yes |
S129EC | Plasmid E | 2401 | Yes |
S129EC | Plasmid F | 2121 | Yes |
S129EC | Plasmid G | 1571 | Yes |
EC958 | Chromosome | 5126816 | Yes |
EC958 | Plasmid A | 136157 | Yes |
EC958 | Plasmid B | 4145 | Yes |
EC958 | Plasmid C | 1830 | Yes |
HVM2044 | Chromosome | 5003288 | Yes |
HVM2044 | Plasmid A | 142959 | Yes |
HVM2044 | Plasmid B | 18716 | Yes |
HVM2044 | Plasmid C | 18345 | Yes |
- See the Nextflow configuration file used here and the Slurm submission script here.
- See the Nextflow HTML execution report, trace report and HTML processes execution timeline.
- The deployment of the workflow at the Pawsey Supercomputing Centre was supported by the Australian BioCommons via funding from Bioplatforms Australia, the Australian Research Data Commons (https://doi.org/10.47486/PL105) and the Queensland Government RICF programme. Bioplatforms Australia and the Australian Research Data Commons are funded by the National Collaborative Research Infrastructure Strategy (NCRIS).
- Marco de la Pierre (Pawsey Supercomputing Centre) @marcodelapierre
- Johan Gustafsson (Australian BioCommons) @supernord