Results and output files
Documentation current for HybPiper version 2.3.1
- 1.0 hybpiper assemble
  - Output Directory
  - Base Directory
  - Base Directory -> Gene Directory
  - Base Directory -> Gene Directory -> SPAdes Directory
  - Base Directory -> Gene Directory -> Exonerate Directory
  - Base Directory -> Gene Directory -> Exonerate Directory -> Sequences Directory
  - Base Directory -> Gene Directory -> Exonerate Directory -> Paralogs Directory
  - Base Directory -> Gene Directory -> Exonerate Directory -> Intronerate Directory
- 2.0 hybpiper stats
- 3.0 hybpiper retrieve_sequences
- 4.0 hybpiper filter_by_length
- 5.0 hybpiper paralog_retriever
- 6.0 hybpiper recovery_heatmap
- 7.0 hybpiper check_dependencies
- 8.0 hybpiper check_targetfile
- 9.0 hybpiper fix_targetfile

1.0 hybpiper assemble

Output Directory

Optional. The parent output directory, if supplied using the parameter `--hybpiper_output` or `-o`.

Base Directory

The name of the base directory is specified by supplying the parameter `--prefix` to the `hybpiper assemble` command. If `--prefix` is not provided, it is generated from the read file names.

- The master target file (e.g. `target_file.fasta`).
- `translated_target_file.fasta`. A fasta file with amino-acid sequences, translated from a nucleotide target file. Note that this is only present if a nucleotide target file was supplied, but the flag `--bwa` was not used.
- `check_targetfile_report-<target_file_name>.txt`. A text report file summarising details of the target file check performed. Note that this is only present if the flag `--skip_targetfile_checks` is not used.
- A BLAST (`<target_file_name>.psq`, etc.), DIAMOND (`<target_file_name>.dmnd`) or BWA (`<target_file_name>.amb`, etc.) database.
- A BLAST/DIAMOND (`<prefix>.blastx`) or BWA (`<prefix>.bam`) mapping results file.
- A directory for every gene with BLAST/DIAMOND or BWA hits, e.g. `gene001`, `gene002`, etc.
- `target_tallies.txt`. A text file summarizing the chosen target reference sequences for the sample run.
- `spades_initial_commands.txt`. A text file listing the `spades.py` commands used to assemble reads from each gene.
- `gnu_parallel_log.txt`. A text log file produced by GNU `parallel` when running SPAdes gene assemblies.
- `spades_genelist.txt`. A text file listing all genes with mapped reads.
- `exonerate_genelist.txt`. A text file listing all genes with assembled SPAdes contigs. Note that this file is called `exonerate_genelist.txt` even if BLAST was used to extract sequences (i.e. the option `--not_protein_coding` was used).
- `genes_with_seqs.txt`. A text file listing all genes for which a coding sequence was extracted via Exonerate.
- `<prefix>_chimera_check_performed.txt`. A text file containing 'True' or 'False' depending on whether the option `--skip_chimeric_genes` was provided to the command `hybpiper assemble`. Used by `hybpiper retrieve_sequences` and `hybpiper paralog_retriever`.
- `<prefix>_genes_with_non_terminal_stop_codons.txt`. A text log file containing gene names for any gene with an output sequence containing one or more internal (i.e. non-terminal) stop codons.
- `<prefix>_genes_with_long_paralog_warnings.txt`. A text file listing all genes which had multiple long-length sequences from different SPAdes contigs (putative paralogs).
- `<prefix>_genes_with_paralog_warnings_by_contig_depth.csv`. A comma-separated-values file listing all genes that had a SPAdes contig depth >1 for at least 75% (default) of the length of the reference target file sequence.
- `<prefix>_genes_with_stitched_contig.csv`. A comma-separated-values file with details on whether a stitched contig was created for a given gene.
- `<prefix>_genes_derived_from_putative_chimera_stitched_contig.csv`. A comma-separated-values file listing all genes that might be derived from a chimeric stitched contig (i.e. comprising multiple paralogs).
- `<prefix>_hybpiper_assemble_<date_time>.log`. A text log file containing details of the pipeline run for the sample.
- `spades.log`. A text log file containing the concatenated output of the SPAdes assembler for initial SPAdes assemblies for all genes.
- `failed_spades.txt`. A text file listing all genes that had a failed initial SPAdes assembly.
- `redo_spades_commands.txt`. A text file containing commands to re-run SPAdes for genes with a failed initial assembly.
- `spades_redo.log`. A text log file containing the concatenated output of the SPAdes assembler for SPAdes re-runs.
- `spades_duds.txt`. A text file listing all genes with failed SPAdes re-runs.
- `total_input_reads_paired.txt`. A text file containing the number of paired-end reads (if supplied) in the input read files.
- `total_input_reads_single.txt`. A text file containing the number of single-end reads (if supplied) in the input read files.
- `total_input_reads_unpaired.txt`. A text file containing the number of unpaired reads (if supplied) in the input read files.
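
The three gene list files above record how far each gene progressed through the pipeline for a sample. As a rough sketch of how they could be summarised, the following Python counts the genes at each stage. It assumes each file holds one gene per line with the gene name in the first tab-separated field; that layout is an assumption rather than something stated on this page, and the function names are hypothetical.

```python
from pathlib import Path


def read_gene_list(path):
    """Return gene names from a list file, or [] if the file is absent.

    Assumption: one gene per line, name in the first tab-separated field.
    """
    path = Path(path)
    if not path.exists():
        return []
    genes = []
    for line in path.read_text().splitlines():
        line = line.strip()
        if line:
            genes.append(line.split("\t")[0])
    return genes


def recovery_funnel(base_dir):
    """Count unique genes at each stage recorded in a sample's base directory."""
    base_dir = Path(base_dir)
    stages = {
        "mapped_reads": "spades_genelist.txt",
        "assembled_contigs": "exonerate_genelist.txt",
        "extracted_sequence": "genes_with_seqs.txt",
    }
    return {
        stage: len(set(read_gene_list(base_dir / filename)))
        for stage, filename in stages.items()
    }
```

Calling `recovery_funnel("sampleA")` on a base directory named `sampleA` (a hypothetical sample name) would return a dictionary with one count per stage, which makes it easy to spot samples where many genes had mapped reads but few yielded sequences.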

Base Directory -> Gene Directory

The gene directories will be named according to the unique gene names present in the target file used for the run.

- `<gene_name>_interleaved.fasta`. A fasta file containing all reads provided using the `--readfiles` or `-r` parameter that mapped to any target sequence for this gene. In cases where only one read of a read pair mapped, both R1 and R2 reads are included in this file. If paired-end read files were used as input, this fasta file is in interleaved format; note that this file will have the suffix `interleaved.fasta` even if you provide single-end reads.
- `<gene_name>_merged.fastq`. A fastq file of merged reads from paired-end input. This file will only be present if the flag `--merged` is used with the `hybpiper assemble` command and paired-end reads are provided.
- `<gene_name>_unmerged.fastq`. A fastq file of paired-end reads that could not be merged, in interleaved format. This file will only be present if the flag `--merged` is used with the `hybpiper assemble` command and paired-end reads are provided.
- `<gene_name>_unpaired.fasta`. A fasta file containing all reads provided using the `--unpaired` parameter that mapped to any target sequence for this gene.
- `<gene_name>_contigs.fasta`. The contigs assembled from the input reads using SPAdes.
- `<gene_name>_target.fasta`. A fasta file with the amino-acid sequence of the 'best' reference target for the given gene/sample.
- `<gene_name>_<date_time>.log`. The log file produced by the `exonerate_hits.py` module for the given gene/sample. This will only be present if the flag `--keep_intermediate_files` was provided to the command `hybpiper assemble`; default behaviour is to delete the log file after it has been re-logged to the main sample log file in the base directory.
- `<sample_name>`. A directory of Exonerate results; the directory has the same name as the sample. See below for details.
- `<gene_name>_spades`. The directory produced by the SPAdes assembler for the given gene/sample. See below for details.
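
Because `<gene_name>_interleaved.fasta` stores R1 and R2 reads in alternating order when paired-end input was used, it can be split back into two read sets. The sketch below is a minimal illustration under the assumption that records strictly alternate R1, R2; the function names and the read headers in the usage note are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_interleaved(records):
    """Split interleaved records into (R1 list, R2 list).

    Assumption: records strictly alternate R1, R2, R1, R2, ...
    """
    if len(records) % 2 != 0:
        raise ValueError("odd number of records; not interleaved pairs")
    return records[0::2], records[1::2]
```

For example, parsing a file with the records `read1/1`, `read1/2`, `read2/1`, `read2/2` and passing the result to `split_interleaved` yields one list holding the two `/1` reads and another holding the two `/2` reads. This does not apply when single-end reads were supplied, since the file then carries the `interleaved.fasta` suffix without containing pairs.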

Base Directory -> Gene Directory -> SPAdes Directory

The SPAdes assembly directory is produced by the SPAdes assembler; in this case it will have a prefix corresponding to the given gene name, i.e. `<gene_name>_spades`. This directory will only be present if the flag `--keep_intermediate_files` was provided to the command `hybpiper assemble`; default behaviour is to delete the directory after processing. It contains standard SPAdes output files and folders, as described in the SPAdes documentation.

Base Directory -> Gene Directory -> Exonerate Directory

The Exonerate directory will have the same name as the base directory (i.e. the sample name), and contains output files and folders produced by the `exonerate_hits.py` module.

- `exonerate_results.fasta`. The output of the initial Exonerate search of the target protein against the SPAdes contigs. This file contains both Exonerate alignments and the fasta sequence for the extracted coding region.
- `exonerate_stats.tsv`. A table in tab-separated-values format, containing information on SPAdes contigs with Exonerate hits against the 'best' reference target sequence, if they passed the initial global similarity filter set by `--thresh`.
- `exonerate_hits_trimmed.FAA`. A fasta file containing amino-acid sequences of one or more Exonerate hits used to create the output gene sequence.
- `exonerate_hits_trimmed.FNA`. A fasta file containing nucleotide sequences of one or more Exonerate hits used to create the output gene sequence.
- `genes_with_stitched_contig.csv`. A file in comma-separated-values format, providing details on whether the given gene/sample sequence was derived from a stitched contig.
- `paralog_warning_long.txt`. A text file produced if the given gene/sample had 'long' paralog warnings, listing the corresponding SPAdes contigs along with Exonerate hit details.
- `paralog_warning_by_contig_depth.txt`. A text file detailing whether the given gene/sample has a paralog warning produced by sequence depth across the reference target sequence after Exonerate searches.
- `chimera_test_stitched_contig.fasta`. A fasta file containing a stitched contig nucleotide sequence, used for read mapping during the chimera test.
- `chimera_test_stitched_contig.sam`. A mapping file in Sequence Alignment Map (SAM) format, produced by mapping paired-end reads against the `chimera_test_stitched_contig.fasta` sequence.
- `putative_chimeric_stitched_contig.csv`. A file in comma-separated-values format, produced if a stitched contig for the given gene/sample appears to be chimeric. Lists the sample name, gene name, and chimera warning details.
- `chimera_test_diagnostic_reads.sam`. A headless mapping file in Sequence Alignment Map (SAM) format, produced by filtering the `chimera_test_stitched_contig.sam` file to retain read pairs diagnostic for a chimeric stitched contig.
- `sequences`. A directory containing subdirectories with recovered sequences. See below for details.
- `intronerate`. A directory containing intron and supercontig processing results. See below for details.
- `paralogs`. A directory containing paralog sequence results, if present. See below for details.

If the option `--not_protein_coding` is used, this directory will contain BLASTn output files rather than Exonerate output, as follows:

- `blastn_results.xml`. The output of the BLASTn search of the target sequence against the SPAdes contigs, in `*.xml` format (blastn `-outfmt 5`).
- `blast_stats.tsv`. A table in tab-separated-values format, containing information on SPAdes contigs with BLASTn hits against the 'best' reference target sequence, if they passed the initial global similarity filter set by `--thresh`.
- `blast_hits_trimmed.FNA`. A fasta file containing nucleotide sequences of one or more BLASTn hits used to create the output sequence.
- `genes_with_stitched_contig.csv`. A file in comma-separated-values format, providing details on whether the given locus/sample sequence was derived from a stitched contig.
- `paralog_warning_long.txt`. A text file produced if the given locus/sample had 'long' paralog warnings, listing the corresponding SPAdes contigs along with BLASTn hit details.
- `paralog_warning_by_contig_depth.txt`. A text file detailing whether the given locus/sample has a paralog warning produced by sequence depth across the reference target sequence after BLASTn searches.
- `chimera_test_stitched_contig.fasta`. A fasta file containing a stitched contig nucleotide sequence, used for read mapping during the chimera test.
- `chimera_test_stitched_contig.sam`. A mapping file in Sequence Alignment Map (SAM) format, produced by mapping paired-end reads against the `chimera_test_stitched_contig.fasta` sequence.
- `putative_chimeric_stitched_contig.csv`. A file in comma-separated-values format, produced if a stitched contig for the given locus/sample appears to be chimeric. Lists the sample name, gene name, and chimera warning details.
- `chimera_test_diagnostic_reads.sam`. A headless mapping file in Sequence Alignment Map (SAM) format, produced by filtering the `chimera_test_stitched_contig.sam` file to retain read pairs diagnostic for a chimeric stitched contig.
- `sequences`. A directory containing subdirectories with recovered sequences. See below for details.
- `paralogs`. A directory containing paralog sequence results, if present. See below for details.

Base Directory -> Gene Directory -> Exonerate Directory -> Sequences Directory

The directory `sequences` contains subdirectories containing fasta files with recovered sequences, as follows:

- `FAA`. A directory containing the fasta file `<gene_name>.FAA` with the recovered gene sequence in amino-acids.
- `FNA`. A directory containing the fasta file `<gene_name>.FNA` with the recovered gene sequence in nucleotides.
- `intron`. A directory containing the fasta files `<gene_name>_introns.fasta` and `<gene_name>_supercontig.fasta`. These files contain the recovered intron sequence and the recovered supercontig sequence (the latter containing both introns and exons), if recovered for the gene/sample. This directory will only be present if the flag `--run_intronerate` was provided to the command `hybpiper assemble`.

If the option `--not_protein_coding` is used:

- `FNA`. A directory containing the fasta file `<gene_name>.FNA` with the recovered locus sequence in nucleotides.
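
Putting the directory levels together, the location of a recovered sequence follows from the sample name and gene name alone. The sketch below builds that path; it assumes the base directory is named after the sample (as described for the Exonerate directory above), and the function name and the example names in the usage note are hypothetical.

```python
from pathlib import PurePosixPath


def recovered_sequence_path(parent_dir, sample, gene, seq_type="FNA"):
    """Build the expected path of a recovered gene sequence.

    Layout: <parent>/<sample>/<gene>/<sample>/sequences/<FNA|FAA>/<gene>.<FNA|FAA>
    Assumption: the base directory carries the sample name.
    """
    if seq_type not in ("FNA", "FAA"):
        raise ValueError("seq_type must be 'FNA' or 'FAA'")
    return (
        PurePosixPath(parent_dir)
        / sample              # base directory
        / gene                # gene directory
        / sample              # Exonerate directory, named after the sample
        / "sequences"
        / seq_type
        / f"{gene}.{seq_type}"
    )
```

For a sample `sampleA` and gene `gene001` under a parent directory `out`, this gives `out/sampleA/gene001/sampleA/sequences/FNA/gene001.FNA`.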

Base Directory -> Gene Directory -> Exonerate Directory -> Paralogs Directory

The directory `paralogs` contains the fasta file `<gene_name>_paralogs.fasta` with paralog sequences, if recovered for the gene/sample.

Base Directory -> Gene Directory -> Exonerate Directory -> Intronerate Directory

The directory `intronerate` will only be present if the flag `--run_intronerate` was provided to the command `hybpiper assemble` and `--not_protein_coding` was not used. It contains output files produced by Intronerate (the process used to recover introns and supercontigs, if present for the gene/sample).

- `intronerate_query_stripped.fasta`. A fasta file containing the recovered gene sequence in amino-acid format, with any 'X' characters removed. Used as a query in Exonerate searches to generate a gff file.
- `<gene_name>_supercontig_without_Ns.fasta`. A fasta file containing a supercontig (i.e. exons and introns) for the given gene/sample. Used as a target in Exonerate searches to generate a gff file.
- `<gene_name>_intronerate_supercontig_individual_contig_hits.fasta`. A fasta file containing the individual SPAdes contigs used to create the supercontig sequence.
- `<gene_name>_intronerate_fasta_and_gff.txt`. A text file containing both Exonerate search alignment and gff details.
- `intronerate.gff`. The gff details only, extracted from the `<gene_name>_intronerate_fasta_and_gff.txt` file.

2.0 hybpiper stats

The parent directory contains one or more base directories corresponding to the output of `hybpiper assemble` for each sample. The descriptions below assume that the command `hybpiper stats` has been run from the parent directory.

- `seq_lengths.tsv`. A table in tab-separated-values format, containing the lengths of each recovered gene sequence for each sample, along with the mean sequence length for each gene within the target file. The name of this file can be changed using the parameter `--seq_lengths_filename <filename>`.
- `hybpiper_stats.tsv`. A table in tab-separated-values format, containing statistics on the HybPiper run. The name of this file can be changed using the parameter `--stats_filename <filename>`.
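
Both files are plain tab-separated tables, so they can be loaded with the standard library. The exact column names are not listed on this page, so the sketch below simply takes whatever header row the file has; the function name is hypothetical.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated table into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def read_tsv_file(path):
    """Convenience wrapper that opens a file path and reads it with read_tsv."""
    with open(path, newline="") as handle:
        return read_tsv(handle)
```

A call such as `read_tsv_file("hybpiper_stats.tsv")` returns one dictionary per data row. All values are returned as strings, so numeric columns need converting before any arithmetic.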

3.0 hybpiper retrieve_sequences

The parent directory contains one or more base directories corresponding to the output of `hybpiper assemble` for each sample. The descriptions below assume that the command `hybpiper retrieve_sequences` has been run from the parent directory.

- `<gene_name>.FNA`. A fasta file containing the recovered gene sequence from each sample in nucleotides (if the parameter `dna` was supplied). A fasta file will be produced for each gene.
- `<gene_name>.FAA`. A fasta file containing the recovered gene sequence from each sample in amino-acids (if the parameter `aa` was supplied). A fasta file will be produced for each gene.
- `<gene_name>_introns.fasta`. A fasta file containing the recovered gene intron sequence from each sample in nucleotides (if the parameter `intron` was supplied). A fasta file will be produced for each gene.
- `<gene_name>_supercontig.fasta`. A fasta file containing the recovered gene supercontig sequence (exons and introns) from each sample in nucleotides (if the parameter `supercontig` was supplied). A fasta file will be produced for each gene.

If the parameter `--fasta_dir <directory_name>` is provided, the directory will be created and the fasta files described above will be placed within it, rather than in the parent directory.
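
The naming scheme above can be written down as a small lookup from sequence type to file suffix. This is only a sketch of the pattern described in the list; the function name and the example directory name are hypothetical.

```python
from pathlib import PurePosixPath

# Suffix appended to the gene name for each sequence type listed above.
SUFFIX_BY_TYPE = {
    "dna": ".FNA",
    "aa": ".FAA",
    "intron": "_introns.fasta",
    "supercontig": "_supercontig.fasta",
}


def retrieved_fasta_path(gene, seq_type, fasta_dir=None):
    """Return the expected per-gene output path for a given sequence type.

    If fasta_dir is given, the file is expected inside that directory;
    otherwise it is expected in the directory the command was run from.
    """
    name = gene + SUFFIX_BY_TYPE[seq_type]
    if fasta_dir:
        return PurePosixPath(fasta_dir) / name
    return PurePosixPath(name)
```

For instance, `retrieved_fasta_path("gene001", "intron", fasta_dir="seqs")` gives `seqs/gene001_introns.fasta`.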

4.0 hybpiper filter_by_length

The parent directory contains one or more base directories corresponding to the output of `hybpiper assemble` for each sample. The descriptions below assume that the command `hybpiper filter_by_length` has been run from the parent directory.

- `<gene_name>.filtered.FNA`. A fasta file containing the recovered gene sequence from each sample in nucleotides (if the parameter `dna` was supplied), filtered according to the length filtering options provided. A fasta file will be produced for each gene.
- `<gene_name>.filtered.FAA`. A fasta file containing the recovered gene sequence from each sample in amino-acids (if the parameter `aa` was supplied), filtered according to the length filtering options provided. A fasta file will be produced for each gene.
- `<gene_name>_introns.filtered.fasta`. A fasta file containing the recovered gene intron sequence from each sample in nucleotides (if the parameter `intron` was supplied), filtered according to the length filtering options provided. A fasta file will be produced for each gene.
- `<gene_name>_supercontig.filtered.fasta`. A fasta file containing the recovered gene supercontig sequence (exons and introns) from each sample in nucleotides (if the parameter `supercontig` was supplied), filtered according to the length filtering options provided. A fasta file will be produced for each gene.

If the parameter `--filtered_dir <directory_name>` is provided, the directory will be created and the fasta files described above will be placed within it, rather than in the parent directory.
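
In each case the filtered file name is the unfiltered name with `.filtered` inserted before the final extension. A minimal sketch of that mapping, with a hypothetical function name:

```python
from pathlib import PurePosixPath


def filtered_name(fasta_name):
    """Insert '.filtered' before the final extension of a fasta file name."""
    path = PurePosixPath(fasta_name)
    return path.with_name(f"{path.stem}.filtered{path.suffix}")
```

So `gene001.FNA` maps to `gene001.filtered.FNA`, and `gene001_introns.fasta` maps to `gene001_introns.filtered.fasta`. Any leading directory in the input is preserved.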

5.0 hybpiper paralog_retriever

The parent directory contains one or more base directories corresponding to the output of `hybpiper assemble` for each sample. The descriptions below assume that the command `hybpiper paralog_retriever` has been run from the parent directory.

- `paralog_report.tsv`. A table in tab-separated-values format, containing the number of long sequences recovered for each gene and sample (i.e. potential paralogs if > 1).
- `paralog_heatmap.png`. A heatmap image file in `*.png` format, depicting the number of long sequences recovered for each gene and sample. The name of this file can be changed using the parameter `--heatmap_filename <filename>`. The format of the file can be changed using the parameter `--heatmap_filetype {png,pdf,eps,tiff,svg}`.
- `paralogs_above_threshold_report.txt`. A text file that lists 1) the number and names of genes with paralogs in a minimum percentage of samples; 2) the number and names of samples that have paralogs in a minimum percentage of genes. By default, this percentage is set to zero, so all genes and samples with paralogs will be reported.
- `paralogs_all`. A directory containing a `*.fasta` file for each sample/gene, containing paralog sequences if present, or the `*.FNA` sequence recovered by HybPiper if no paralogs were detected.
- `paralogs_no_chimeras`. A directory containing a `*.fasta` file for each sample/gene as above, but with any putative chimeric `*.FNA` sequences removed. This folder will only be present if at least one of your samples had a chimera check performed during `hybpiper assemble` (i.e. the option `--chimeric_stitched_contig_check` was provided).

6.0 hybpiper recovery_heatmap

The parent directory contains one or more base directories corresponding to the output of `hybpiper assemble` for each sample. The descriptions below assume that the command `hybpiper recovery_heatmap` has been run from the parent directory.

- `recovery_heatmap.png`. A heatmap image file in `*.png` format, depicting the length of the recovered sequence for each gene and each sample, relative to the mean length of the gene sequence references in the target file. The name of this file can be changed using the parameter `--heatmap_filename <filename>`. The format of the file can be changed using the parameter `--heatmap_filetype {png,pdf,eps,tiff,svg}`.

7.0 hybpiper check_dependencies

No output files are produced by this command. Results are printed to the terminal screen.

8.0 hybpiper check_targetfile

In addition to results printed to the terminal screen, the following file is produced:

- `fix_targetfile_<date_time>.ctl`. A control file in text format, logging parameters of the `hybpiper check_targetfile` run, as well as a list of target file sequence names for sequences with low-complexity regions. This `*.ctl` file is required as input for the `hybpiper fix_targetfile` command (see below).
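
Since each `hybpiper check_targetfile` run writes a new time-stamped control file, a directory can accumulate several of them. The sketch below picks the most recently modified one to pass on to `hybpiper fix_targetfile`. It relies on file modification time rather than parsing the `<date_time>` part of the name, because the timestamp format is not described on this page; the function name is hypothetical.

```python
from pathlib import Path


def latest_ctl_file(directory="."):
    """Return the most recently modified fix_targetfile_*.ctl file, or None."""
    candidates = sorted(
        Path(directory).glob("fix_targetfile_*.ctl"),
        key=lambda path: path.stat().st_mtime,  # oldest first, newest last
    )
    return candidates[-1] if candidates else None
```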

9.0 hybpiper fix_targetfile

In addition to results printed to the terminal screen, the following files are produced:

- `<targetfile_name>_fixed.fasta`. A fasta file containing filtered and/or fixed target sequences.
- `fix_targetfile_report.tsv`. A table in tab-separated-values format, containing a list of sequences that were removed from the input target file, and a corresponding reason. Note that this list can include multiple frames for a single input sequence (suffix `_frame_1`, `_frame_2`, etc.).
- `fix_targetfile_<date_time>.log`. A text log file containing details of the `hybpiper fix_targetfile` run.

The directory `fix_targetfile_alignments` will only be present if the flag `--alignments` was provided to the command `hybpiper fix_targetfile`. It contains directories with per-gene unaligned and aligned fasta files, from the trimmed/filtered target file. By default, this directory will not be created.

- `translated_gene_seqs_unaligned`. A directory containing unaligned fasta files `<gene_name>_unaligned.fasta` with translated, unaligned, per-gene fixed target file sequences. Only present if the input target file contains nucleotide sequences.
- `translated_gene_seqs_aligned`. A directory containing aligned fasta files `<gene_name>_aligned.fasta` with translated, aligned, per-gene fixed target file sequences. Only present if the input target file contains nucleotide sequences.
- `protein_gene_seqs_unaligned`. A directory containing fasta files `<gene_name>_unaligned.fasta` with unaligned per-gene fixed target file sequences. Only present if the input target file contains protein sequences.
- `protein_gene_seqs_aligned`. A directory containing fasta files `<gene_name>_aligned.fasta` with aligned per-gene fixed target file sequences. Only present if the input target file contains protein sequences.

The directory `fix_targetfile_additional_sequence_files` will only be present if the flag `--write_all_fasta_files` was provided to the command `hybpiper fix_targetfile`. It contains fasta files for sequences removed from the fixed target file, grouped according to filtering categories (length threshold, low-complexity regions, etc.). By default, these files will not be written.

- `<targetfile_name>_low_complexity_regions.fasta`. A fasta file containing all sequences listed as having low-complexity regions in the `*.ctl` file, regardless of whether they were removed from the fixed target file or not.
- `<targetfile_name>_short_sequences.fasta`. A fasta file containing sequences shorter than the `--filter_by_length_percentage` threshold, when compared to the longest representative gene sequence.
- `<targetfile_name>_stop_codons_all_frames.fasta`. A fasta file containing sequences with unexpected stop codons in all forward frames.
- `<targetfile_name>_undetermined_frame.fasta`. A fasta file containing sequences with multiple candidate forward reading frames, but no reference sequence to select the 'correct' candidate. Each candidate frame is present as a unique sequence.
- `<targetfile_name>_exceeding_maximum_distance_frames_multi.fasta`. A fasta file containing sequences with multiple candidate forward reading frames and a corresponding reference sequence, but all frames exceeded the maximum allowed distance threshold from the reference.
- `<targetfile_name>_exceeding_maximum_distance_frames_single.fasta`. A fasta file containing sequences with a single candidate forward reading frame and a corresponding reference sequence, but the frame exceeded the maximum allowed distance threshold from the reference.
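
When reviewing the contents of `fix_targetfile_additional_sequence_files`, it can help to map each file back to its filtering category by its suffix. The sketch below encodes the six suffixes listed above; the category wording is a paraphrase of those descriptions and the function name is hypothetical.

```python
# File-name suffix -> short description of the filtering category.
CATEGORY_BY_SUFFIX = {
    "_low_complexity_regions.fasta": "low-complexity regions (flagged, not necessarily removed)",
    "_short_sequences.fasta": "shorter than the length percentage threshold",
    "_stop_codons_all_frames.fasta": "unexpected stop codons in all forward frames",
    "_undetermined_frame.fasta": "multiple candidate frames, no reference to choose between them",
    "_exceeding_maximum_distance_frames_multi.fasta": "multiple candidate frames, all too distant from the reference",
    "_exceeding_maximum_distance_frames_single.fasta": "single candidate frame, too distant from the reference",
}


def filtering_category(filename):
    """Return the filtering category for a file name, or None if unrecognised."""
    for suffix, category in CATEGORY_BY_SUFFIX.items():
        if filename.endswith(suffix):
            return category
    return None
```

For a target file named `targets` (a hypothetical name), `filtering_category("targets_short_sequences.fasta")` returns the length-threshold category, while an unrelated file name returns `None`.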