Additional pipeline features and details

Combining read files for samples run across multiple lanes

If your samples have been run across multiple Illumina lanes, you'll likely want to combine the read files for each sample before processing. To do this, use the pipeline flag --combine_read_files. Your read files will be grouped and concatenated via a common prefix; the default is all text preceding the first underscore (_) in read filenames. For example, the read files:

  79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R1.fastq
  79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R2.fastq
  79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R1.fastq
  79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R2.fastq
  79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R1.fastq
  79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R2.fastq
  79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R1.fastq
  79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R2.fastq

...will be grouped and combined by the prefixes 79678 and 79679, producing the files:

  79678_combinedLanes_R1.fastq
  79678_combinedLanes_R2.fastq
  79679_combinedLanes_R1.fastq
  79679_combinedLanes_R2.fastq

These combined read files will be used as input to Trimmomatic (optional) and HybPiper-RBGV.
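
As a minimal sketch (not a complete command), combining reads is enabled like this; <your usual input options> is a placeholder for the mandatory input parameters you normally supply to the pipeline:

  # Placeholder command: substitute your normal mandatory inputs
  # (reads directory, target file, etc.) for <your usual input options>
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      --combine_read_files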

You can also specify the number of common prefix fields (as delimited by underscores) to use for read file grouping/concatenation using the parameter --combine_read_files_num_fields <int>. This is useful if your read files otherwise begin with a non-unique prefix, such as a genus name. For example, if providing the read files:

  genus_species1_L001_R1.fastq
  genus_species1_L001_R2.fastq
  genus_species1_L002_R1.fastq
  genus_species1_L002_R2.fastq
  genus_species2_L001_R1.fastq
  genus_species2_L001_R2.fastq
  genus_species2_L002_R1.fastq
  genus_species2_L002_R2.fastq

...you should use the options --combine_read_files and --combine_read_files_num_fields 2. This will result in read files grouped and combined by the prefixes genus_species1 and genus_species2, rather than both species being lumped together via the default prefix genus.
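
For the genus_species example above, the sketch becomes (again, <your usual input options> is just a placeholder for your normal mandatory inputs):

  # Group and concatenate reads by the first two underscore-delimited fields
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      --combine_read_files \
      --combine_read_files_num_fields 2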

Trimming input reads using Trimmomatic

If supplying a folder of either paired-end OR single-end reads, users can optionally trim their reads with Trimmomatic by supplying the flag --use_trimmomatic.

  • At this stage, reads will be trimmed using the TruSeq3 adapter sequences provided with the Trimmomatic download, i.e. those in the file TruSeq3-PE-2.fa (paired-end reads) or TruSeq3-SE.fa (single-end reads). Let me know if additional adapter sets would be useful.

  • If the flag --use_trimmomatic is used while providing paired-end reads, SPAdes assemblies will be run with the forward and reverse trimmed reads, as well as a concatenated file of single-end orphaned reads (the latter referring to reads whose mate did not pass the Trimmomatic filtering).

  • The default parameters for the Trimmomatic run are:

    ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
    

    ...for paired-end reads OR

    ILLUMINACLIP:TruSeq3-SE.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
    

    ...for single-end reads.

  • These parameters can be changed using the hybpiper-rbgv-pipeline.nf pipeline options (see the example invocation after this list):

    --trimmomatic_leading_quality <int>
    --trimmomatic_trailing_quality <int>
    --trimmomatic_min_length <int>
    --trimmomatic_sliding_window_size <int>
    --trimmomatic_sliding_window_quality <int>
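
For example, a sketch of an invocation that enables trimming and overrides a few of the defaults above (the values shown are illustrative, not recommendations; <your usual input options> is a placeholder for your normal mandatory inputs):

  # Enable Trimmomatic and override some default trimming parameters
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      --use_trimmomatic \
      --trimmomatic_sliding_window_size 4 \
      --trimmomatic_sliding_window_quality 25 \
      --trimmomatic_min_length 50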
    

Using BLASTx [RECOMMENDED] to map reads rather than BWA

Describe increased sensitivity, increased contig recovery (presence and/or length).

Describe flags --use_blastx, --translate_target_file_for_blastx and --blastx_evalue <value>.
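
Pending a fuller write-up, here is a minimal sketch of how these flags might be combined; the e-value shown is illustrative only, and the note about --translate_target_file_for_blastx is an assumption rather than documented behaviour:

  # Map reads with BLASTx instead of BWA
  # (assumption: --translate_target_file_for_blastx is only needed if the
  # target file contains nucleotide rather than protein sequences)
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      --use_blastx \
      --translate_target_file_for_blastx \
      --blastx_evalue 1e-4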

Reducing the coverage cut-off for SPAdes assembly

Describe --cov_cutoff <int> flag, and pros and cons of altering the default of 8.
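
Pending that write-up, a sketch of lowering the SPAdes coverage cut-off from the default of 8 (the value shown is illustrative only):

  # Lower the SPAdes coverage cut-off (default is 8)
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      --cov_cutoff 4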

Running the pipeline with the --nosupercontigs flag

By default, in cases where a single SPAdes contig covering most of a target locus isn't recovered for a given sample/gene, HybPiper will attempt to reconstruct as much sequence as possible by stitching together multiple contigs corresponding to different regions of the locus. During this process it creates 'supercontigs'; see the figure on the mossmatters Wiki for a general overview.

This approach is not without its downsides, particularly for samples with high levels of paralogy. In such cases, creating 'supercontigs' can stitch together contigs from different paralogs, producing 'chimeric' locus sequences and, potentially, reduced phylogenetic resolution or misleading results. The hybpiper-rbgv-pipeline.nf pipeline provides the optional flag --nosupercontigs. When used, 'supercontigs' will not be constructed and only the single longest contig will be returned for each sample/gene. Note: in cases where your target file corresponds to multi-exon genes with long introns, running the pipeline with the --nosupercontigs flag can dramatically reduce the total locus sequence length recovered.
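
A sketch of an invocation with supercontig construction disabled:

  # Return only the single longest contig per sample/gene
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      --nosupercontigs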

Detection of putative chimeric contigs

As described above, HybPiper can occasionally construct chimeric sequences for a given locus by stitching together contigs corresponding to different paralogs.

The hybpiper-rbgv-pipeline.nf pipeline implements a (currently very rough and not very sensitive) method to detect such putative chimeras. Details, including the effect of pipeline parameters --discordant_reads_edit_distance <int> and --discordant_reads_cutoff <int>, are described here.

After running the pipeline and recovering putative paralog sequences, all 'main' locus sequences (whether or not they've been flagged as putative chimeras), together with putative paralogs, are available as per-target .fasta files in the results subfolder 11_paralogs.

The results subfolder 12_paralogs_noChimeras contains the same data as folder 11_paralogs, except that putative chimeric sequences have been removed.
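
A sketch of adjusting the chimera-detection thresholds; the values below are illustrative only (see the linked page above for the defaults and what each parameter means):

  # Illustrative values only; consult the linked documentation for defaults
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      --discordant_reads_edit_distance 5 \
      --discordant_reads_cutoff 5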

Adjusting paralog detection settings

HybPiper performs a rough test for the presence of paralogs in a SPAdes assembly for a given sample and gene. It does this by searching for more than one contig matching a given target file reference, where each contig is greater than 75% of the reference length, as described here (note that the Wiki states 85%, but this value is hard-coded as 75% in the default HybPiper code; see exonerate_hits.py here).

The hybpiper-rbgv-pipeline.nf pipeline allows this percentage cut-off to be specified using the --paralog_warning_min_len_percent <decimal> parameter. The default value is 0.75.
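
For example, to require matching contigs to cover at least 85% of the reference length before a paralog warning is raised (a sketch; 0.85 is just an example value):

  # Raise the paralog-warning length threshold from 0.75 to 0.85
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      --paralog_warning_min_len_percent 0.85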

Merging reads prior to SPAdes assembly

When paired-end sequencing is generated using libraries with short insert sizes, the forward and reverse reads of a given pair can overlap. In such cases, target locus recovery can be improved when SPAdes assemblies are performed with merged reads*, that is, when overlapping read pairs are merged into a single longer read where possible. This can be done by providing the hybpiper-rbgv-pipeline.nf pipeline with the flag --merged. If used, paired-end reads will be merged with bbmerge.sh using default settings, and SPAdes assemblies will be carried out with both the merged and unmerged output.

*CJJ: benchmarking is required here; this is currently word-of-mouth although it makes intuitive sense...
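
A sketch of enabling read merging (only meaningful for paired-end input):

  # Merge overlapping read pairs with bbmerge.sh before SPAdes assembly
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      --merged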

Managing computing resources

The config file hybpiper-rbgv.config can be edited to suit the computing resources available on your local machine, and it can also be used to specify resource requests when running the pipeline and submitting jobs via a scheduling system such as SLURM. Note that this is a feature of Nextflow and applies to all Nextflow scripts as described here. This can be achieved by defining profiles in the config file; the hybpiper-rbgv.config file provided has the profiles standard and slurm.

By default, if you run the hybpiper-rbgv-pipeline.nf script without specifying a profile (e.g. via the parameter -profile slurm), Nextflow will use the profile standard. If you open hybpiper-rbgv.config in a text editor and find the definition of the standard profile (the line beginning with standard {), you'll see that it's possible to specify resources for each Nextflow process defined in the hybpiper-rbgv-pipeline.nf script. For example:

 standard {
         process {
             withName: reads_first_with_unpaired{
                 cpus = { 2 * task.attempt }
                 memory = { 2.GB * task.attempt }
                 errorStrategy  = { task.exitStatus in 137..141 ? 'retry' : 'terminate' }
                 maxRetries = 3
             }

             ...(some processes not shown)...

             withName: summary_stats {
                 cpus = { 1 * task.attempt }
                 memory = { 1.GB * task.attempt }
                 errorStrategy  = { task.exitStatus in 137..143 ? 'retry' : 'terminate' }
                 maxRetries = 3
             ...etc

Here, you can see that there are specific resources allocated to the processes reads_first_with_unpaired and summary_stats. As you might expect (and can directly view by opening the hybpiper-rbgv-pipeline.nf script in a text editor), these processes execute the HybPiper scripts reads_first.py and hybpiper_stats.py, respectively. If you are editing the standard profile to better suit the resources on your local machine, the main values to change will be the number of cpus and the memory (RAM).

If you look at the slurm profile:

 slurm {
    process {
        withName: reads_first_with_unpaired{
            cpus = { 30 * task.attempt }
            memory = { 30.GB * task.attempt }
            errorStrategy  = { task.exitStatus in 137..141 ? 'retry' : 'terminate' }
            maxRetries = 3
            time = '24h'
        }
 ...etc

...you'll see there's an extra important parameter: time. Most HPC scheduling systems require the user to specify a desired wall-time; you might need a bit of trial and error to work out appropriate time requests for your given dataset and the wall-time limits of your HPC. Other options can also be specified as described in the Nextflow documentation here.

By default, Nextflow will try to run as many processes in parallel as possible, equal to the number of CPU cores available minus 1. For example, if you provide forward and reverse read files for 10 samples, the process reads_first_with_unpaired will be launched as ten parallel tasks (one for each sample) if computing resources allow. If you want to limit this behaviour (e.g. to keep some computing resources free for other tasks), use the hybpiper-rbgv-pipeline.nf parameter --num_forks <int>. For example, if you provide the parameter --num_forks 2, only two process 'instances' of reads_first_with_unpaired will be run in parallel (each instance using 2 cpus and 2 GB RAM if you're using the unaltered standard profile). See the sketch below.
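
Putting the two together, a sketch of running under the slurm profile while limiting parallel process instances (note the single dash on -profile, which is a Nextflow option, versus the double dash on the pipeline's own --num_forks parameter):

  # Run under the slurm profile, limiting each process to two parallel instances
  nextflow run hybpiper-rbgv-pipeline.nf \
      <your usual input options> \
      -profile slurm \
      --num_forks 2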