Additional pipeline features and details

General assembly commands

The hybpiper.nf Nextflow script supports all command line parameters available for the native HybPiper command hybpiper assemble. To view these command line parameters, please see the HybPiper wiki here.

Troubleshooting, common issues, and recommendations

When using HybPiper (and hence the hybpiper-nf pipeline), there are several important issues to consider when trying to maximise both computing efficiency and locus recovery. Please see the relevant HybPiper wiki here.

Combining read files for samples run across multiple lanes

If your samples have been run across multiple Illumina lanes, you'll likely want to combine the read files for each sample before processing. To do this, use the pipeline flag --combine_read_files. Your read files will be grouped and concatenated via a common prefix; the default is all text preceding the first underscore (_) in read filenames. For example, the read files:

  79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R1.fastq
  79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R2.fastq
  79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R1.fastq
  79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R2.fastq
  79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R1.fastq
  79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R2.fastq
  79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R1.fastq
  79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R2.fastq

...will be grouped and combined by the prefixes 79678 and 79679, producing the files:

  79678_combinedLanes_R1.fastq
  79678_combinedLanes_R2.fastq
  79679_combinedLanes_R1.fastq
  79679_combinedLanes_R2.fastq

These combined read files will be used as input to Trimmomatic (optional) and hybpiper-nf.

You can also specify the number of common prefix fields (as delimited by underscores) to use for read file grouping/concatenation using the parameter --combine_read_files_num_fields <int>. This is useful if your read files otherwise begin with a non-unique prefix, such as a genus name. For example, if providing the read files:

  genus_species1_L001_R1.fastq
  genus_species1_L001_R2.fastq
  genus_species1_L002_R1.fastq
  genus_species1_L002_R2.fastq
  genus_species2_L001_R1.fastq
  genus_species2_L001_R2.fastq
  genus_species2_L002_R1.fastq
  genus_species2_L002_R2.fastq

...you should use the options --combine_read_files and --combine_read_files_num_fields 2. This will result in read files grouped and combined by the prefixes genus_species1 and genus_species2, rather than both species being lumped together via the default prefix genus.
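
For example, the read files above could be grouped by their two-field prefixes with a command along the lines of the following sketch. The final placeholder stands for the input parameters (reads directory, target file, etc.) that you would normally pass to hybpiper.nf; only the read-combining options are shown explicitly.

  nextflow run hybpiper.nf \
    --combine_read_files \
    --combine_read_files_num_fields 2 \
    <your usual input parameters>

Omitting --combine_read_files_num_fields reverts to the default behaviour of grouping on all text before the first underscore.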

Trimming input reads using Trimmomatic

If supplying a folder of either paired-end OR single-end reads, users can optionally choose to trim their reads using the software Trimmomatic, by using the flag --use_trimmomatic.

  • At this stage, reads will be trimmed using the TruSeq3 adapter sequences provided with the Trimmomatic download, i.e. those in the file TruSeq3-PE-2.fa (paired-end reads) or TruSeq3-SE.fa (single-end reads). Let me know if additional adapter sets would be useful.

  • If the flag --use_trimmomatic is used while providing paired-end reads, SPAdes assemblies will be run with the trimmed forward and reverse reads, as well as a concatenated file of single-end orphaned reads (the latter referring to reads whose mate did not pass the Trimmomatic filtering).

  • The default parameters for the Trimmomatic run are:

    ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
    

    ...for paired-end reads OR

    ILLUMINACLIP:TruSeq3-SE.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36


    ...for single-end reads.

  • These parameters can be changed using the hybpiper.nf pipeline options below (an example command follows this list):

    --trimmomatic_leading_quality <int>
    --trimmomatic_trailing_quality <int>
    --trimmomatic_min_length <int>
    --trimmomatic_sliding_window_size <int>
    --trimmomatic_sliding_window_quality <int>
    

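A sketch of a run with Trimmomatic enabled and two of the defaults overridden is shown below. The quality and length values are illustrative only, and the final placeholder stands for the input parameters you would normally pass to hybpiper.nf.

  nextflow run hybpiper.nf \
    --use_trimmomatic \
    --trimmomatic_sliding_window_quality 25 \
    --trimmomatic_min_length 50 \
    <your usual input parameters>
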
Managing computing resources

The config file hybpiper.config can be edited to suit the computing resources available on your local machine, and it can also be used to specify resource requests when running the pipeline and submitting jobs via a scheduling system such as SLURM. Note that this is a feature of Nextflow and applies to all Nextflow scripts, as described here. Resources are specified by defining profiles in the config file; the hybpiper.config file provided defines the profiles standard and slurm.

If you run the hybpiper.nf script without specifying a profile (i.e. without providing a parameter such as -profile slurm), Nextflow will use the profile standard by default. If you open hybpiper.config in a text editor and find the definition of the standard profile (the line beginning with standard {), you'll see that it's possible to specify resources for each Nextflow process defined in the hybpiper.nf script. For example:

 standard {
         process {
             withName: ASSEMBLE_PAIRED_AND_SINGLE_END{
                 cpus = { 2 * task.attempt }
                 memory = { 2.GB * task.attempt }
                 errorStrategy  = { task.exitStatus in 137..141 ? 'retry' : 'terminate' }
                 maxRetries = 3
             }

             ...(some processes not shown)...

             withName: SUMMARY_STATS {
                 cpus = { 1 * task.attempt }
                 memory = { 1.GB * task.attempt }
                 errorStrategy  = { task.exitStatus in 137..143 ? 'retry' : 'terminate' }
                 maxRetries = 3
             ...etc

Here, you can see that there are specific resources allocated to the processes ASSEMBLE_PAIRED_AND_SINGLE_END and SUMMARY_STATS. As you might expect (and can directly view by opening the hybpiper.nf script in a text editor), these processes execute the HybPiper commands hybpiper assemble and hybpiper stats, respectively. If you are editing the standard profile to better suit the resources on your local machine, the main values to change will be the number of cpus and the memory (RAM).

If you look at the slurm profile:

 slurm {
    process {
        withName: ASSEMBLE_PAIRED_AND_SINGLE_END{
            cpus = { 30 * task.attempt }
            memory = { 30.GB * task.attempt }
            errorStrategy  = { task.exitStatus in 137..141 ? 'retry' : 'terminate' }
            maxRetries = 3
            time = '24h'
        }
 ...etc

...you'll see there's an extra important parameter: time. Most HPC scheduling systems require the user to specify a desired wall-time; you might need a bit of trial and error to work out appropriate time requests for your given dataset and the wall-time limits of your HPC. Other options can also be specified, as described in the Nextflow documentation here.

By default, Nextflow will try to run as many processes in parallel as possible, equal to the number of CPU cores available minus 1. For example, if you provide forward and reverse read files for 10 samples, the process ASSEMBLE_PAIRED_AND_SINGLE_END will be launched as ten parallel tasks (one for each sample) if computing resources allow. If you want to limit this behaviour (e.g. to keep some computing resources free for other tasks), use the hybpiper.nf parameter --num_forks <int>. For example, if you provide the parameter --num_forks 2, only two instances of the ASSEMBLE_PAIRED_AND_SINGLE_END process will run in parallel (each instance using 2 cpus and 2 GB of RAM if you're using the unaltered standard profile).
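
As a sketch, the following command would run the pipeline with the standard profile while capping each process at two parallel task instances; as in the examples above, the final placeholder stands for the input parameters you would normally pass to hybpiper.nf.

  nextflow run hybpiper.nf \
    -profile standard \
    --num_forks 2 \
    <your usual input parameters>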