Additional pipeline features and details
The `hybpiper.nf` Nextflow script supports all command-line parameters available for the native HybPiper command `hybpiper assemble`. To view these command-line parameters, please see the HybPiper wiki here.
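For example, `hybpiper assemble` options such as `--bwa` can be appended to the usual run command. The sketch below is illustrative only: `--bwa` is a genuine `hybpiper assemble` option, but the read-directory and target-file parameter names shown are assumptions, so check the pipeline's help output or README for the exact names used by your version.

```bash
# Sketch only: --illumina_reads_directory and --targetfile_dna are assumed
# pipeline parameter names; --bwa is passed through to 'hybpiper assemble'.
nextflow run hybpiper.nf \
  -c hybpiper.config \
  --illumina_reads_directory reads \
  --targetfile_dna targets.fasta \
  --bwa
```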
When using HybPiper (and hence the `hybpiper-nf` pipeline), there are several important issues to consider when trying to maximise both computing efficiency and locus recovery. Please see the relevant HybPiper wiki page here.
If your samples have been run across multiple Illumina lanes, you'll likely want to combine the read files for each sample before processing. To do this, use the pipeline flag `--combine_read_files`. Your read files will be grouped and concatenated via a common prefix; the default is all text preceding the first underscore (`_`) in read filenames. For example, the read files:
```
79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R1.fastq
79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R2.fastq
79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R1.fastq
79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R2.fastq
79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R1.fastq
79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R2.fastq
79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R1.fastq
79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R2.fastq
```
...will be grouped and combined by the prefixes `79678` and `79679`, producing the files:
```
79678_combinedLanes_R1.fastq
79678_combinedLanes_R2.fastq
79679_combinedLanes_R1.fastq
79679_combinedLanes_R2.fastq
```
These combined read files will be used as input to Trimmomatic (optional) and `hybpiper-nf`.
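A minimal sketch of enabling lane combining (the read-directory and target-file parameter names are assumptions, as above):

```bash
# Sketch only: --illumina_reads_directory and --targetfile_dna are assumed names.
nextflow run hybpiper.nf \
  -c hybpiper.config \
  --illumina_reads_directory reads \
  --targetfile_dna targets.fasta \
  --combine_read_files
```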
You can also specify the number of common prefix fields (as delimited by underscores) to use for read file grouping/concatenation using the parameter `--combine_read_files_num_fields <int>`. This is useful if your read files otherwise begin with a non-unique prefix, such as a genus name. For example, if providing the read files:
```
genus_species1_L001_R1.fastq
genus_species1_L001_R2.fastq
genus_species1_L002_R1.fastq
genus_species1_L002_R2.fastq
genus_species2_L001_R1.fastq
genus_species2_L001_R2.fastq
genus_species2_L002_R1.fastq
genus_species2_L002_R2.fastq
```
...you should use the options `--combine_read_files` and `--combine_read_files_num_fields 2`. This will result in read files grouped and combined by the prefixes `genus_species1` and `genus_species2`, rather than both species being lumped together via the default prefix `genus`.
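For instance (again, with assumed input parameter names):

```bash
# Sketch only: group/concatenate reads by the first two underscore-delimited fields.
nextflow run hybpiper.nf \
  -c hybpiper.config \
  --illumina_reads_directory reads \
  --targetfile_dna targets.fasta \
  --combine_read_files \
  --combine_read_files_num_fields 2
```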
If supplying a folder of either paired-end OR single-end reads, users can optionally choose to trim their reads using the software Trimmomatic, by using the flag `--use_trimmomatic`.
- At this stage, reads will be trimmed using the TruSeq3 primers as provided with the Trimmomatic download, i.e. those in the file `TruSeq3-PE-2.fa` (paired-end reads) or `TruSeq3-SE.fa` (single-end reads). Let me know if additional primer sets would be useful.
- If the flag `--use_trimmomatic` is used while providing paired-end reads, SPAdes assemblies will be run with forward and reverse trimmed reads, as well as a concatenated file of single-end orphaned reads (the latter referring to reads whose mate did not pass the Trimmomatic filtering).
- The default parameters for the Trimmomatic run are `ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36` for paired-end reads, or `ILLUMINACLIP:TruSeq3-SE.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36` for single-end reads.
- These parameters can be changed using the `hybpiper.nf` pipeline options `--trimmomatic_leading_quality <int>`, `--trimmomatic_trailing_quality <int>`, `--trimmomatic_min_length <int>`, `--trimmomatic_sliding_window_size <int>`, and `--trimmomatic_sliding_window_quality <int>` (see the example invocation after this list).
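For example, a run with trimming enabled and adjusted thresholds might look like the sketch below (the read-directory and target-file parameter names are assumptions, and the threshold values are purely illustrative):

```bash
# Sketch only: --illumina_reads_directory and --targetfile_dna are assumed names;
# the Trimmomatic thresholds below are illustrative, not recommendations.
nextflow run hybpiper.nf \
  -c hybpiper.config \
  --illumina_reads_directory reads \
  --targetfile_dna targets.fasta \
  --use_trimmomatic \
  --trimmomatic_min_length 50 \
  --trimmomatic_sliding_window_quality 25
```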
The config file `hybpiper.config` can be edited to suit the computing resources available on your local machine, and it can also be used to specify resource requests when running the pipeline and submitting jobs via a scheduling system such as SLURM. Note that this is a feature of Nextflow and applies to all Nextflow scripts, as described here. This is achieved by defining `profiles` in the config file; the `hybpiper.config` file provided has the profiles `standard` and `slurm`.
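A profile is selected at run time with Nextflow's `-profile` option (note the single dash, as this is a Nextflow option rather than a pipeline parameter). A sketch, again with assumed input parameter names:

```bash
# Sketch only: run under the 'slurm' profile defined in hybpiper.config.
nextflow run hybpiper.nf \
  -c hybpiper.config \
  -profile slurm \
  --illumina_reads_directory reads \
  --targetfile_dna targets.fasta
```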
By default, if you run the `hybpiper.nf` script without specifying a profile (i.e. without supplying e.g. `-profile slurm`), Nextflow will use the profile `standard`. If you open `hybpiper.config` in a text editor and find the definition of the `standard` profile (the line beginning with `standard {`), you'll see that it's possible to specify resources for each Nextflow `process` defined in the `hybpiper.nf` script. For example:
```
standard {
    process {
        withName: ASSEMBLE_PAIRED_AND_SINGLE_END {
            cpus = { 2 * task.attempt }
            memory = { 2.GB * task.attempt }
            errorStrategy = { task.exitStatus in 137..141 ? 'retry' : 'terminate' }
            maxRetries = 3
        }

        // ...(some processes not shown)...

        withName: SUMMARY_STATS {
            cpus = { 1 * task.attempt }
            memory = { 1.GB * task.attempt }
            errorStrategy = { task.exitStatus in 137..143 ? 'retry' : 'terminate' }
            maxRetries = 3
        }

        // ...etc
```
Here, you can see that specific resources are allocated to the processes `ASSEMBLE_PAIRED_AND_SINGLE_END` and `SUMMARY_STATS`. As you might expect (and can view directly by opening the `hybpiper.nf` script in a text editor), these processes execute the HybPiper commands `hybpiper assemble` and `hybpiper stats`, respectively. If you are editing the `standard` profile to better suit the resources on your local machine, the main values to change will be the number of CPUs (`cpus`) and the memory (RAM).
If you look at the `slurm` profile:
```
slurm {
    process {
        withName: ASSEMBLE_PAIRED_AND_SINGLE_END {
            cpus = { 30 * task.attempt }
            memory = { 30.GB * task.attempt }
            errorStrategy = { task.exitStatus in 137..141 ? 'retry' : 'terminate' }
            maxRetries = 3
            time = '24h'
        }

        // ...etc
```
...you'll see there's an extra important parameter: `time`. Most HPC scheduling systems require the user to specify a desired wall-time; you might need a bit of trial and error to work out appropriate time requests for your given dataset and the wall-time limits of your HPC. Other options can also be specified, as described in the Nextflow documentation here.
By default, Nextflow will try to run as many processes in parallel as possible, equal to the number of available CPU cores minus 1. For example, if you provide forward and reverse read files for 10 samples, the process `ASSEMBLE_PAIRED_AND_SINGLE_END` will be launched as ten parallel tasks (one for each sample) if computing resources allow. If you want to limit this behaviour (e.g. to keep some computing resources free for other tasks), use the `hybpiper.nf` parameter `--num_forks <int>`. For example, if you provide the parameter `--num_forks 2`, only two instances of `ASSEMBLE_PAIRED_AND_SINGLE_END` will run in parallel (each instance using 2 CPUs and 2 GB RAM if you're using the unaltered `standard` profile).
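A sketch of limiting parallelism in this way (input parameter names assumed, as in the earlier examples):

```bash
# Sketch only: cap the pipeline at two concurrent task instances per process.
nextflow run hybpiper.nf \
  -c hybpiper.config \
  --illumina_reads_directory reads \
  --targetfile_dna targets.fasta \
  --num_forks 2
```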