Output folders and files
Documentation update in progress 28 November 2022
Your top-level results folder will be called results, unless you provided an alternative name via the --outdir <directory_name> option. It will contain a number of subfolders, which are described below.
-
01_namelist
This folder contains the text file namelist.txt, containing a list of sample names. These names are derived from the common prefix preceding the first underscore (_) in your read files. For example, if your read files are named:
79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R1.fastq
79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R2.fastq
79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R1.fastq
79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R2.fastq
79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R1.fastq
79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R2.fastq
79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R1.fastq
79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R2.fastq
...then namelist.txt will contain:
79678
79679
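The naming rule above can be sketched in a few lines of Python. This is only an illustration of the "prefix before the first underscore" rule, not the pipeline's own code; the function name sample_names is hypothetical.

```python
from pathlib import Path


def sample_names(read_files, separator="_"):
    """Return the sorted, unique prefixes preceding the first separator.

    Hypothetical helper: it mirrors the rule described for namelist.txt,
    but it is not taken from the pipeline itself.
    """
    return sorted({Path(f).name.split(separator, 1)[0] for f in read_files})


reads = [
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R1.fastq",
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L001_R2.fastq",
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R1.fastq",
    "79678_LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT_L002_R2.fastq",
    "79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R1.fastq",
    "79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L001_R2.fastq",
    "79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R1.fastq",
    "79679_LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA_L002_R2.fastq",
]

# One sample name per line, as in namelist.txt
print("\n".join(sample_names(reads)))
```

Running a check like this on your own file names before starting the pipeline is a quick way to confirm that the expected sample names will be derived.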
-
02_reads_combined_lanes
This folder will only be present if you've run the pipeline with the --combine_read_files flag. It contains read files that have been grouped and concatenated via a common file name prefix, defaulting to the text preceding the first underscore (_) in file names. This is useful if your samples have been run across multiple lanes, and you need to combine all forward and reverse files (separately, of course) for each sample prior to analysis. For example, given the reads set above, this folder will contain the files:
79678_combinedLanes_R1.fastq
79678_combinedLanes_R2.fastq
79679_combinedLanes_R1.fastq
79679_combinedLanes_R2.fastq
See here for more details.
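As a rough illustration of the grouping described above, the following Python sketch groups lane files by sample prefix and read direction, then names the combined output. It is an assumption-laden sketch, not the pipeline's implementation: the helpers group_lanes, combined_name and concatenate are hypothetical, and the R1/R2 detection assumes file names like those in the example.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path


def group_lanes(read_files):
    """Map (sample_prefix, read_direction) to the lane files, in sorted order."""
    groups = defaultdict(list)
    for f in sorted(read_files):
        name = Path(f).name
        prefix = name.split("_", 1)[0]  # text before the first underscore
        match = re.search(r"_(R[12])(?=[._])", name)  # assumed R1/R2 tag
        if match is None:
            continue  # skip files without a recognisable direction tag
        groups[(prefix, match.group(1))].append(f)
    return dict(groups)


def combined_name(prefix, direction):
    """Build an output name following the pattern shown in the example."""
    return f"{prefix}_combinedLanes_{direction}.fastq"


def concatenate(files, destination):
    """Append each input file to destination, keeping forward/reverse separate."""
    with open(destination, "wb") as out:
        for f in files:
            with open(f, "rb") as src:
                shutil.copyfileobj(src, out)


samples = {
    "79678": "LibID81729_HF7CKAFX2_TGAATGCC-TGTCTAGT",
    "79679": "LibID81730_HF7CKAFX2_GCAACTAT-TCGTTGAA",
}
reads = [
    f"{sample}_{library}_{lane}_{direction}.fastq"
    for sample, library in samples.items()
    for lane in ("L001", "L002")
    for direction in ("R1", "R2")
]

# Show the planned grouping without touching the file system
for (prefix, direction), files in sorted(group_lanes(reads).items()):
    print(combined_name(prefix, direction), "<-", files)
```

Sorting the lane files before concatenation keeps the R1 and R2 outputs in the same lane order, which matters because paired reads must stay in matching positions.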
-
03a_trimmomatic_logs
This folder will only be present if the pipeline was run with the --use_trimmomatic flag. It contains a Trimmomatic *.log file for each sample. These log files record the Trimmomatic adapter sequences and run parameters, as well as the number of reads passing QC. For example:
TrimmomaticPE: Started with arguments:
-phred33 -threads 10 79678_combinedLanes_R1.fastq 79678_combinedLanes_R2.fastq 79678_combinedLanes_R1_forward_paired.fq 79678_combinedLanes_R1_forward_unpaired.fq 79678_combinedLanes_R2_reverse_paired.fq 79678_combinedLanes_R2_reverse_unpaired.fq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT>PE1TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA>PE2GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA'
ILLUMINACLIP: Using 1 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 336367 Both Surviving: 297073 (88.32%) Forward Only Surviving: 24179 (7.19%) Reverse Only Surviving: 6362 (1.89%) Dropped: 8753 (2.60%)
TrimmomaticPE: Completed successfully
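If you want to collect the read counts from many of these logs, the summary line can be parsed with a regular expression. The sketch below is a minimal example that assumes the paired-end summary line has exactly the wording shown in the example log; parse_summary is a hypothetical helper, not part of Trimmomatic or the pipeline.

```python
import re

# Pattern for the paired-end summary line, as worded in the example log
SUMMARY = re.compile(
    r"Input Read Pairs: (?P<input>\d+) "
    r"Both Surviving: (?P<both>\d+) \([\d.]+%\) "
    r"Forward Only Surviving: (?P<forward_only>\d+) \([\d.]+%\) "
    r"Reverse Only Surviving: (?P<reverse_only>\d+) \([\d.]+%\) "
    r"Dropped: (?P<dropped>\d+) \([\d.]+%\)"
)


def parse_summary(log_text):
    """Return the read-pair counts as integers, or None if no summary is found."""
    match = SUMMARY.search(log_text)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


line = (
    "Input Read Pairs: 336367 Both Surviving: 297073 (88.32%) "
    "Forward Only Surviving: 24179 (7.19%) "
    "Reverse Only Surviving: 6362 (1.89%) Dropped: 8753 (2.60%)"
)
counts = parse_summary(line)
print(counts)

# Percentage of read pairs where both mates passed QC
print(f"{100 * counts['both'] / counts['input']:.2f}% of pairs survived intact")
```

The four outcome counts should add up to the number of input read pairs, which makes a convenient sanity check when parsing logs in bulk.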
-
03b_trimmomatic_paired_and_single_reads
This folder will only be present if the pipeline was run with the --use_trimmomatic flag. If a directory of paired-end read files was used as pipeline input, this folder will contain a forward and a reverse file for paired reads passing the Trimmomatic QC, and a file of orphaned reads (i.e. those reads with a mate that didn't pass QC). For example:
79678_combinedLanes_R1_forward_paired.fq
79678_combinedLanes_R2_reverse_paired.fq
79678_combinedLanes_R1-R2_unpaired.fq
-
03c_trimmomatic_single_reads
This folder will only be present if the pipeline was run with the --use_trimmomatic flag. If a directory of single-end read files was used as pipeline input, this folder will contain a file of reads passing the Trimmomatic QC. For example:
79678_trimmed_single.fastq
-
04_processed_gene_directories
This folder contains a subfolder for each of your samples, corresponding to the output of running the original HybPiper command hybpiper assemble. The contents of these folders are described in the HybPiper wiki here.
-
05_visualise
This folder contains:
-
recovery_heatmap.png. A heatmap image in .png format showing gene recovery and length per sample, output by the HybPiper command hybpiper recovery_heatmap.
-
paralog_heatmap.png. A heatmap image in .png format depicting the number of long sequences recovered for each gene and sample, output by the HybPiper command hybpiper paralog_retriever.
-
-
06_summary_stats
This folder contains:
-
hybpiper_stats.tsv. A table in tab-separated-values format, containing statistics on the HybPiper run, output from the HybPiper command hybpiper stats. Column details are described in the HybPiper tutorial here.
-
seq_lengths.tsv. A table in tab-separated-values format, containing the lengths of each recovered gene sequence for each sample, along with the mean sequence length for each gene within the target file, output by the HybPiper command hybpiper stats.
-
A folder for each sample, e.g. 79678, containing:
-
79678_genes_stitched_contig.csv. A comma-separated-values file collated from per-gene stitched contig reports. See above for a more complete description.
-
79678_genes_derived_from_putative_chimeric_stitched_contig.csv. A comma-separated-values file collated from per-gene putative chimeric stitched contig reports. See above for a more complete description.
-
79678_genes_with_long_paralog_warnings.txt. A text file listing genes that had long paralog warnings.
-
79678_genes_with_paralog_warnings_by_contig_depth.csv. A text file listing genes that had paralog warnings due to contig depth.
-
-
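As an example of working with seq_lengths.tsv, the sketch below expresses each recovered sequence length as a fraction of the mean target length for that gene. The table layout is an assumption: a header row with one column per gene, one row per sample, and a row labelled MeanLength holding the target means. Check your own file before relying on it. The function length_fractions and the gene names and values in the demo table are invented for illustration.

```python
import csv
import io


def length_fractions(tsv_text, mean_row_label="MeanLength"):
    """Return {sample: {gene: recovered_length / mean_target_length}}.

    Assumes the first column holds row labels and that one row, labelled
    mean_row_label, holds the mean target length for each gene.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    genes = rows[0][1:]
    means = None
    samples = {}
    for row in rows[1:]:
        if not row:
            continue  # tolerate blank lines
        values = [float(v) for v in row[1:]]
        if row[0] == mean_row_label:
            means = values
        else:
            samples[row[0]] = values
    if means is None:
        raise ValueError(f"no row labelled {mean_row_label!r} found")
    return {
        sample: {
            gene: (length / mean if mean else 0.0)
            for gene, length, mean in zip(genes, lengths, means)
        }
        for sample, lengths in samples.items()
    }


# Invented demo table; real files will have your own genes and samples
demo = (
    "Species\tgeneA\tgeneB\n"
    "MeanLength\t1000\t500\n"
    "79678\t900\t0\n"
    "79679\t1000\t250\n"
)
for sample, fractions in length_fractions(demo).items():
    print(sample, fractions)
```

Fractions well below 1 point to partial recovery for that gene and sample, and a value of 0 means no sequence was recovered.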
-
07_sequences_dna
This folder contains a nucleotide .fasta file for each gene in your target file, here with the suffix *.FNA. Each file contains the 'main' nucleotide sequence selected by HybPiper for this gene for all samples (where available). These sequences are not aligned.
-
08_sequences_aa
This folder contains an amino-acid .fasta file for each gene in your target file, here with the suffix *.FAA. Each file contains a protein translation of the 'main' nucleotide sequence selected by HybPiper for this gene for all samples (where available). These sequences are not aligned.
-
09_sequences_intron
If the pipeline was run with the --run_intronerate flag, this folder contains *_intron.fasta files for each gene in your target file. Each file contains putative introns from each sample, as recovered by the HybPiper command hybpiper assemble --run_intronerate (see here) followed by the command hybpiper retrieve_sequences with the option intron. Manually review these sequences and use with caution.
-
10_sequences_supercontig
If the pipeline was run with the --run_intronerate flag, this folder contains *_supercontig.fasta files for each gene in your target file. Each file contains a 'supercontig' from each sample, as recovered by the HybPiper command hybpiper assemble --run_intronerate (see here) followed by the command hybpiper retrieve_sequences with the option supercontig. Manually review these sequences and use with caution.
-
11_paralogs
This folder contains a nucleotide *_paralogs.fasta file for each gene in your target file. Each file contains the sequence** found in folder 07_sequences_dna, along with any additional sequences that HybPiper has flagged as putative paralogs. If such putative paralogs are present for a given sample, the .fasta header of the 'main' sequence selected by HybPiper will contain the suffix .main, whereas paralog sequence headers will have the suffix .0, .1 etc. For example:
>79679.0 NODE_1_length_1884_cov_18.097952,UMUL-7324,0,231,91.34,(+),422,1471
ATGATGATGTGAGGTATGAGTGTGAAGAACTTTGATCCAGTCCGACATGCTGGAAGA...
>79679.main NODE_2_length_1613_cov_13.093099,UMUL-7324,0,231,93.07,(+),171,1229
ATGATGATGAGAGGTATGAGTGTGAAGAACTTTGATCCAGTCCGATATTCTGGGAGA...
>79678.main NODE_1_length_1937_cov_18.187097,UMUL-7324,0,231,92.21,(+),372,1431
ATGATGATGAGAGGTATGAGTGTGAAGAACTTTGATCCAGTCCGATATTCTGGGAGA...
>79678.1 NODE_2_length_1636_cov_16.651700,UMUL-7324,13,231,88.07,(-),1400,453
GTCCGATATGCTGGAAGATGGTTCGAGGTAGCTTCCCTTAAACGTGGGTTTGCTGGT...
Note that the text following the space in the fasta headers (e.g. NODE_1_length_1884_cov_18.097952,UMUL-7324,0,231,91.34,(+),422,1471) corresponds to the sequence description, and is not part of the sequence name. It includes the SPAdes contig name, and details from Exonerate searches conducted as part of the HybPiper pipeline. You can ignore this information in this case.
-
logs. A directory containing the files:
-
paralog_report.tsv. A table in tab-separated-values format, containing the number of long sequences recovered for each gene and sample (i.e. potential paralogs if > 1).
-
paralogs_above_threshold_report.txt. A text file that lists 1) The number and names of genes with paralogs in a minimum percentage of samples; 2) The number and names of samples that have paralogs in a minimum percentage of genes. By default, this percentage is set to zero, so all genes and samples with paralogs (i.e. > 1 sequence) will be reported.
-
** NOTE: if paralogs are detected for a given gene/sample, the sequence in the paralog file with the suffix *.main will not necessarily be identical to the corresponding *.FNA sequence. This is because each paralog sequence is recovered from a single SPAdes contig only, whereas the *.FNA sequence could be derived from a stitched contig (comprising sequence from more than one SPAdes contig).
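The header convention described for this folder (sequence name before the first space, with a .main or numeric suffix when paralogs are present) can be tallied per sample with a short Python sketch. This is a minimal illustration, not HybPiper code: paralog_counts is a hypothetical helper, and counting a header with no recognised suffix as a single 'main' sequence is an assumption.

```python
from collections import defaultdict


def paralog_counts(fasta_text):
    """Count 'main' and paralog sequences per sample from fasta headers."""
    counts = defaultdict(lambda: {"main": 0, "paralogs": 0})
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines are ignored
        # The sequence name is the text before the first space
        name = line[1:].split(None, 1)[0]
        sample, _, suffix = name.rpartition(".")
        if sample and suffix == "main":
            counts[sample]["main"] += 1
        elif sample and suffix.isdigit():
            # Caveat: a sample name that itself ends in ".<digits>"
            # would be miscounted here
            counts[sample]["paralogs"] += 1
        else:
            # Assumption: no recognised suffix means the sample has a
            # single sequence for this gene
            counts[name]["main"] += 1
    return {sample: dict(tally) for sample, tally in counts.items()}


# Headers from the example above; sequences shortened for brevity
example = """\
>79679.0 NODE_1_length_1884_cov_18.097952,UMUL-7324,0,231,91.34,(+),422,1471
ATGATGATGTGAGG
>79679.main NODE_2_length_1613_cov_13.093099,UMUL-7324,0,231,93.07,(+),171,1229
ATGATGATGAGAGG
>79678.main NODE_1_length_1937_cov_18.187097,UMUL-7324,0,231,92.21,(+),372,1431
ATGATGATGAGAGG
>79678.1 NODE_2_length_1636_cov_16.651700,UMUL-7324,13,231,88.07,(-),1400,453
GTCCGATATGCTGG
"""
for sample, tally in sorted(paralog_counts(example).items()):
    print(sample, tally)
```

For whole-run summaries, paralog_report.tsv in the logs directory already provides these counts per gene and sample; a script like this is mainly useful for spot-checking individual *_paralogs.fasta files.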
-
-
12_paralogs_noChimeras
As described here, this folder contains the same files/sequences as found in folder 11_paralogs, but with putative chimeric sequences removed.