
Output folders and files

Chris Jackson edited this page Jul 31, 2023 · 55 revisions

Documentation update in progress 28 November 2022

Your top-level results folder will be called results, unless you provided an alternative name via the --outdir <directory_name> option. It will contain a number of subfolders, which are described below.

  • 01_namelist

    This folder contains the text file namelist.txt, which lists your sample names. Each name is the prefix preceding the first underscore (_) in the corresponding read file names. For example, if your read files are named:

    79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R1.fastq
    79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R2.fastq
    79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R1.fastq
    79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R2.fastq
    79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R1.fastq
    79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R2.fastq
    79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R1.fastq
    79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R2.fastq
    

    ...then namelist.txt will contain:

    79678
    79679
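
The naming rule above can be sketched in a few lines of Python. This is only an illustration of "take the text before the first underscore"; the function names `sample_name` and `build_namelist` are hypothetical and are not part of the pipeline.

```python
def sample_name(filename: str) -> str:
    """Return the text preceding the first underscore in a read file name."""
    return filename.split("_", 1)[0]


def build_namelist(filenames):
    """Return the sorted, de-duplicated sample names for a set of read files."""
    return sorted({sample_name(name) for name in filenames})


# The read files from the example above.
read_files = [
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R1.fastq",
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R2.fastq",
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R1.fastq",
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R2.fastq",
    "79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R1.fastq",
    "79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R2.fastq",
    "79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R1.fastq",
    "79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R2.fastq",
]

print("\n".join(build_namelist(read_files)))
# prints:
# 79678
# 79679
```

Running a check like this on your own file names before starting the pipeline is a quick way to confirm that each sample resolves to the name you expect.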
    
  • 02_reads_combined_lanes

    This folder will only be present if you've run the pipeline with the --combine_read_files flag. It contains read files that have been grouped and concatenated via a common file name prefix, defaulting to the text preceding the first underscore (_) in file names. This is useful if your samples have been run across multiple lanes, and you need to combine all forward and reverse files (separately, of course) for each sample prior to analysis. For example, given the reads set above, this folder will contain the files:

    79678_combinedLanes_R1.fastq
    79678_combinedLanes_R2.fastq
    79679_combinedLanes_R1.fastq
    79679_combinedLanes_R2.fastq
    

See here for more details.
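
As a rough illustration of the grouping described above, the following Python sketch groups lane files by sample prefix and read direction, then concatenates each group. It is not the pipeline's own code: the helper names `group_lanes` and `concatenate` are hypothetical, and the sketch assumes that read direction appears in file names as `_R1` or `_R2`, as in the example files.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path


def group_lanes(filenames):
    """Map each combined output name to the lane files that feed it."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        prefix = name.split("_", 1)[0]  # text before the first underscore
        # Assumption: direction is marked as _R1 or _R2 before "_" or ".".
        match = re.search(r"_(R[12])(?=[_.])", name)
        if match is None:
            continue  # skip files without a recognisable read direction
        combined = f"{prefix}_combinedLanes_{match.group(1)}.fastq"
        groups[combined].append(name)
    return dict(groups)


def concatenate(groups, input_dir, output_dir):
    """Write each group of lane files, in order, to one combined file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for combined_name, lane_files in groups.items():
        with open(out / combined_name, "wb") as dst:
            for lane_file in lane_files:
                with open(Path(input_dir) / lane_file, "rb") as src:
                    shutil.copyfileobj(src, dst)


lanes = [
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R1.fastq",
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R2.fastq",
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R1.fastq",
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R2.fastq",
]
for combined, members in group_lanes(lanes).items():
    print(combined, len(members))
```

Forward and reverse files are kept in separate groups, and lane files are sorted first so that both directions are concatenated in the same lane order, which keeps read pairs in step.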

  • 03a_trimmomatic_logs

    This folder will only be present if the pipeline was run with the --use_trimmomatic flag. It contains a Trimmomatic *.log file for each sample. Each log file records the adapter sequences and run parameters used by Trimmomatic, as well as the number of reads passing QC. For example:

    TrimmomaticPE: Started with arguments:
    -phred33 -threads 10 79678_combinedLanes_R1.fastq 79678_combinedLanes_R2.fastq 79678_combinedLanes_R1_forward_paired.fq 79678_combinedLanes_R1_forward_unpaired.fq 79678_combinedLanes_R2_reverse_paired.fq 79678_combinedLanes_R2_reverse_unpaired.fq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
    Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT>PE1TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
    Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA>PE2GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
    Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA'
    ILLUMINACLIP: Using 1 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
    Input Read Pairs: 336367 Both Surviving: 297073 (88.32%) Forward Only Surviving: 24179 (7.19%) Reverse Only Surviving: 6362 (1.89%) Dropped: 8753 (2.60%)
    TrimmomaticPE: Completed successfully
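
If you want to collect the read counts from many log files, the summary line can be parsed with a regular expression. The sketch below is one possible approach, assuming the paired-end summary line has the layout shown in the example log; `parse_summary` is a hypothetical helper name.

```python
import re

# Matches the paired-end summary line shown in the example log above.
SUMMARY = re.compile(
    r"Input Read Pairs: (?P<input>\d+) "
    r"Both Surviving: (?P<both>\d+) \([\d.]+%\) "
    r"Forward Only Surviving: (?P<forward_only>\d+) \([\d.]+%\) "
    r"Reverse Only Surviving: (?P<reverse_only>\d+) \([\d.]+%\) "
    r"Dropped: (?P<dropped>\d+) \([\d.]+%\)"
)


def parse_summary(log_text):
    """Return the read-pair counts as integers, or None if no summary line is found."""
    match = SUMMARY.search(log_text)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


line = (
    "Input Read Pairs: 336367 Both Surviving: 297073 (88.32%) "
    "Forward Only Surviving: 24179 (7.19%) "
    "Reverse Only Surviving: 6362 (1.89%) Dropped: 8753 (2.60%)"
)
counts = parse_summary(line)
print(counts["both"], counts["dropped"])
# prints: 297073 8753
```

The four outcome counts sum to the number of input read pairs, which makes a convenient sanity check when collating logs.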
    
  • 03b_trimmomatic_paired_and_single_reads

    This folder will only be present if the pipeline was run with the --use_trimmomatic flag. If a directory of paired-end read files was used as pipeline input, this folder will contain, for each sample, a forward and a reverse file of paired reads passing the Trimmomatic QC, and a file of orphaned reads (i.e. reads whose mate didn't pass QC). For example:

    79678_combinedLanes_R1_forward_paired.fq
    79678_combinedLanes_R2_reverse_paired.fq
    79678_combinedLanes_R1-R2_unpaired.fq
    
  • 03c_trimmomatic_single_reads

    This folder will only be present if the pipeline was run with the --use_trimmomatic flag. If a directory of single-end read files was used as pipeline input, this folder will contain a file of reads passing the Trimmomatic QC for each sample. For example:

    79678_trimmed_single.fastq
    
  • 04_processed_gene_directories

    This folder contains a subfolder for each of your samples, corresponding to the output of running the original HybPiper command hybpiper assemble. The contents of these folders are described in the HybPiper wiki here.

  • 05_visualise

    This folder contains:

    • recovery_heatmap.png. A heatmap image in .png format showing gene recovery and length per sample, output by the HybPiper command hybpiper recovery_heatmap.
    • paralog_heatmap.png. A heatmap image in .png format depicting the number of long sequences recovered for each gene and sample, output by the HybPiper command hybpiper paralog_retriever.
  • 06_summary_stats

    This folder contains:

    • hybpiper_stats.tsv. A table in tab-separated-values format, containing statistics on the HybPiper run, output from the HybPiper command hybpiper stats. Column details are described in the HybPiper tutorial here.

    • seq_lengths.tsv. A table in tab-separated-values format, containing the lengths of each recovered gene sequence for each sample, along with the mean sequence length for each gene within the target file, output by the HybPiper command hybpiper stats.

    • A folder for each sample e.g. 79678 containing:

      • 79678_genes_stitched_contig.csv. A comma-separated-values file collated from per-gene stitched contig reports. See above for a more complete description.
      • 79678_genes_derived_from_putative_chimeric_stitched_contig.csv. A comma-separated-values file collated from per-gene putative chimeric stitched contig reports. See above for a more complete description.
      • 79678_genes_with_long_paralog_warnings.txt. A text file listing genes that had long paralog warnings.
      • 79678_genes_with_paralog_warnings_by_contig_depth.csv. A comma-separated-values file listing genes that had paralog warnings due to contig depth.
  • 07_sequences_dna

    This folder contains a nucleotide .fasta file for each gene in your target file, here with the suffix *.FNA. Each file contains the 'main' nucleotide sequence selected by HybPiper for this gene for all samples (where available). These sequences are not aligned.

  • 08_sequences_aa

    This folder contains an amino-acid .fasta file for each gene in your target file, here with the suffix *.FAA. Each file contains a protein translation of the 'main' nucleotide sequence selected by HybPiper for this gene for all samples (where available). These sequences are not aligned.

  • 09_sequences_intron

    If the pipeline was run with the --run_intronerate flag, this folder contains *_intron.fasta files for each gene in your target file. Each file contains putative introns from each sample, as recovered by the HybPiper command hybpiper assemble --run_intronerate (see here) followed by the command hybpiper retrieve_sequences with the option intron. Manually review these sequences and use with caution.

  • 10_sequences_supercontig

    If the pipeline was run with the --run_intronerate flag, this folder contains *_supercontig.fasta files for each gene in your target file. Each file contains a 'supercontig' from each sample, as recovered by the HybPiper command hybpiper assemble --run_intronerate (see here) followed by the command hybpiper retrieve_sequences with the option supercontig. Manually review these sequences and use with caution.

  • 11_paralogs

    This folder contains a nucleotide *_paralogs.fasta file for each gene in your target file. Each file contains the sequence** found in folder 07_sequences_dna, along with any additional sequences that HybPiper has flagged as putative paralogs. If such putative paralogs are present for a given sample, the .fasta header of the 'main' sequence selected by HybPiper will have the suffix .main, whereas the headers of the paralog sequences will have the suffixes .0, .1, etc. For example:

     >79679.0 NODE_1_length_1884_cov_18.097952,UMUL-7324,0,231,91.34,(+),422,1471
     ATGATGATGTGAGGTATGAGTGTGAAGAACTTTGATCCAGTCCGACATGCTGGAAGA...
     >79679.main NODE_2_length_1613_cov_13.093099,UMUL-7324,0,231,93.07,(+),171,1229
     ATGATGATGAGAGGTATGAGTGTGAAGAACTTTGATCCAGTCCGATATTCTGGGAGA...
     >79678.main NODE_1_length_1937_cov_18.187097,UMUL-7324,0,231,92.21,(+),372,1431
     ATGATGATGAGAGGTATGAGTGTGAAGAACTTTGATCCAGTCCGATATTCTGGGAGA...
     >79678.1 NODE_2_length_1636_cov_16.651700,UMUL-7324,13,231,88.07,(-),1400,453
     GTCCGATATGCTGGAAGATGGTTCGAGGTAGCTTCCCTTAAACGTGGGTTTGCTGGT...
    

    Note that the text following the space in the fasta headers (e.g. NODE_1_length_1884_cov_18.097952,UMUL-7324,0,231,91.34,(+),422,1471) corresponds to the sequence description, and is not part of the sequence name. It includes the SPAdes contig name, and details from Exonerate searches conducted as part of the HybPiper pipeline. You can ignore this information in this case.

    • logs. A directory containing the files:

      • paralog_report.tsv. A table in tab-separated-values format, containing the number of long sequences recovered for each gene and sample (i.e. potential paralogs if > 1).
      • paralogs_above_threshold_report.txt. A text file that lists 1) The number and names of genes with paralogs in a minimum percentage of samples; 2) The number and names of samples that have paralogs in a minimum percentage of genes. By default, this percentage is set to zero, so all genes and samples with paralogs (i.e. > 1 sequence) will be reported.

    ** NOTE: if paralogs are detected for a given gene/sample, the sequence in the paralog file with the suffix *.main will not necessarily be identical to the corresponding *.FNA sequence. This is because each paralog sequence is recovered from a single SPAdes contig only, whereas the *.FNA sequence could be derived from a stitched contig (comprising sequence from more than one SPAdes contig).
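
To work with the header convention described for the *_paralogs.fasta files, a small Python sketch like the one below can split each header into sample name, suffix, and description, then tally 'main' versus paralog sequences per sample. The function names `split_header` and `summarise` are hypothetical, and the sketch assumes the suffix is the text after the last dot in the sequence name.

```python
from collections import defaultdict


def split_header(header):
    """Split a FASTA header into (sample, suffix, description)."""
    # The sequence name ends at the first space; the rest is the description.
    name, _, description = header.lstrip(">").partition(" ")
    # Assumption: the suffix (.main, .0, .1, ...) follows the last dot in the name.
    sample, _, suffix = name.rpartition(".")
    return sample, suffix, description


def summarise(headers):
    """Count 'main' and paralog sequences for each sample."""
    summary = defaultdict(lambda: {"main": 0, "paralogs": 0})
    for header in headers:
        sample, suffix, _ = split_header(header)
        key = "main" if suffix == "main" else "paralogs"
        summary[sample][key] += 1
    return dict(summary)


# Headers taken from the example file above.
headers = [
    ">79679.0 NODE_1_length_1884_cov_18.097952,UMUL-7324,0,231,91.34,(+),422,1471",
    ">79679.main NODE_2_length_1613_cov_13.093099,UMUL-7324,0,231,93.07,(+),171,1229",
    ">79678.main NODE_1_length_1937_cov_18.187097,UMUL-7324,0,231,92.21,(+),372,1431",
    ">79678.1 NODE_2_length_1636_cov_16.651700,UMUL-7324,13,231,88.07,(-),1400,453",
]
print(summarise(headers))
```

Splitting on the first space before looking for the suffix matters here, because the description itself contains dots (for example in the SPAdes coverage value).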

  • 12_paralogs_noChimeras

    As described here, this folder contains the same files/sequences as found in folder 11_paralogs, but with putative chimeric sequences removed.