-
Notifications
You must be signed in to change notification settings - Fork 45
Paralogs
##Assessing Paralogs
When running the main script, you may notice occasional warnings about paralogs. These warnings are triggered when HybPiper detects multiple contigs containing long coding sequences-- by default at least 85% of the reference sequence. HybPiper will choose among these competing long contigs by first checking whether one of the contigs has coverage depth that greatly exceeds the others (10x by default). If all competing long contigs have similar depth, the sequence with the greatest percent identity to the reference is chosen.
These criteria may not be ideal in all cases, especially in the event of gene (or genome) duplication. Choosing the appropriate gene copy to use for phylogenetics requires careful consideration, and HybPiper flags the genes that may require further attention. Note that there may be other reasons for multple long-length contigs: recent polyploidy, contamination, or even allelic variation may result in multiple reads.
HybPiper includes a post-processing script, paralog_investigator.py
to extract coding sequences from alternative contigs. To run it on the test data, use the namelist.txt
file and a while loop:
while read i
do
echo $i
python ../paralog_investigator.py $i
done < namelist.txt
The script will report the number of paralogs found for each gene. If many samples have more than one copy for many genes, it may indicate an ancient gene duplication. If one sample tends to have many copies, it may indicate it is a polyploid.
Paralog coding sequences are extracted into a file located at: prefix/gene/prefix/paralogs/gene_paralogs.fasta
Not all samples have paralog warnings, such as EG30. Some genes do not have any paralogs found, such as gene001. However, HybPiper recovered coding sequence from two contigs in the sample EG98 for gene002:
>EG98.0 NODE_4_length_1358_cov_56.9293_ID_43,Artocarpus-gene002,17,282,89.27,(-),1358,572
>EG98.main NODE_3_length_1380_cov_111.618_ID_41,Artocarpus-gene002,0,282,91.49,(+),17,869
The main
refers to the paralog that was chosen during the initial run of HybPiper. The original names of the contigs from SPAdes is also shown. The main
contig had about twice as much depth of coverage as the other paralog (111x vs 57x), and also had a higher percent identity (91.49% vs 87.27%).
Are these two sequences paralogs or alleles? The best way to check is to use multiple samples and build gene trees from the sequences. HybPiper includes the script paralog_retriever.py
for this task. paralog_retriever.py
will collect all paralogs from each sample in namelist.txt
, along with all coding sequences from samples without paralogs, and output them to "standard output." If you have a list of genes for which you want to assess paralogs, you can use GNU Parallel:
parallel "python paralog_retriever.py namelist.txt {} > {}.paralogs.fasta" ::: gene002 gene006 gene030
The output to the screen should look like this:
gene030 1 6 4 1 5 1 4 4 4
gene006 1 2 2 1 2 1 2 2 2
gene002 1 2 2 1 2 1 2 2 2
This shows the number of paralogs for each gene. The columns are in the same order as namelist.txt
.
The unaligned FASTA files generated by paralog_retriever.py
can be used in a phylogenetics pipeline to align and reconstruct a phylogeny. If you have mafft
and FastTree
installed, you can create the tree directly from paralog_retriever.py
using pipes:
python ../paralog_retriever.py namelist.txt gene074 | mafft --auto - | FastTree -nt -gtr > gene074.paralogs.tre
From the tree above (plotted using FigTree), you can see that for each species, the two sequences recovered form a clade. This means there is no evidence of an ancient duplication event for this gene. There may be another explanation for why HybPiper commonly found two competing sequences, such as alleles. For phylogenetics, choosing the main
sequence for each species is sufficient.
Other genes have more complicated histories:
python ../paralog_retriever.py namelist.txt gene002 | mafft --auto - | FastTree -nt -gtr > gene002.paralogs.tre
We see that there are two distinct clades of sequences, sister to a clade of the outgroups (NZ874, EG30, and NZ281). The main
sequences form one clade, meaning that this gene had an ancestral duplication within the outgroup.
Cases like the one above were common in the Artocarpus data on which the test dataset is based. By investigating each gene tree, it is possible to delineate which sequence is appropraite for phylogenetics. However, some samples may be missing one or both copies of the duplicated gene. Although the genes may really be lost, it is also possible that HybPiper was not able to recover both copies.
To test this possibility, one option is to create a new target file for HybPiper that includes examples of both copies. HybPiper will map reads against each copy separately, treating them as separate loci. For the Artocarpus dataset, this allowed us to nearly double the number of "loci" for phylogenetics!