Skip to content

davebx/kc-align

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

71 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Overview

Kc-Align is a fast and accurate tool for performing codon-aware multiple sequence alignments (MSA). It makes use of the very fast multiple alignment program Kalign3 to ensure maximum speed. Kc-Align is a extremely extremely versatile tool, capable of taking a variety of inputs and achieving the same result. Every other aligner requires the sequence inputs to be the coding sequences of the gene/open reading frame (ORF) to be aligned, requiring curation from the whole-genome sequences and also preventing use of assemblies that may not be properly annotated. Kc-Align solves this problem by using pairwise alignments to extract the sequence from each whole genome that is homologous to the sequence of a high quality and well annotated reference sequence. This feautre may also be bypassed for those who already have curated data and simply desire a quick and accurate codon-aware multiple aligner (see Modes below).

What Is a Codon-Alignment?

In a codon-aware alignment, coding sequences of nucleotides are converted to their amino acid sequences before being aligned and then following the alignment, the original nucleotide sequence of each sequence in the alignment is restored, converting any gap character that was inserted into three consecutive gaps to represent to inserted or deleted codon. This method of alignment prevents synonymous mutations from affecting the alignment while also preserving them for downstream analyses such as dN/dS calculations.

What Can Kc-Align be Used For?

Kc-Align can be used to produce accurate and high quality MSA that can be used for various bioinformatic analyses such as homology modeling, phylogenetic reconstruction, and evolutionary selection analysis. These downstream analyses are heavily affected by the quality of the alignment that they are given and by aligning sequences on a codon-level, alignments from Kc-Align are able to produce more accurate downstream results.

Obtaining Kc-Align

Kc-Align is availbe through PyPI (pip install kcalign) and through Bioconda (conda install kcalign). Alternatively, a GUI interface for Kc-Align is installed and available for use on Galaxy (http://usegalaxy.org).

Using Kc-Align

USAGE:

kc-align --mode genome --reference [reference sequence] --sequences [other seqs to align] --start [start coordinate] --end [end coordinate]

(or)

kc-align --mode [gene | mixed] --reference [reference sequence] --sequences [other seqs to align]

Arguments:

--mode/-m         Alignment mode (genome, gene, or mixed) (required)

--reference/-r    Reference sequences to align against (required)

--sequences/-S    Other sequences to align (required)

--start/-s        1-based start position (required in genome mode)

--end/-e          1-based end position (required in genome mode)

--compress/-c     Compress identical sequences

--parallel/-p     Enable parallelization of homology search

--table/-t        Choose an alternative translation table (See https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for values)

Modes

Kc-Align can be run in three different modes, depending on your input data. These modes are: genome, gene, and mixed.

Genome

If the reference sequences and all other sequences to be aligned are full genomes, this mode should be used. This mode requires the start and end coordinates of the gene/ORF to be aligned with regard to the reference sequence. Kc-Align will excise the appropriate subsequence from the referencce and then use pairwise alignments to find the corresponding homologous sequences from each of the other input genomes. It will then perform the MSA using the extracted sequences.

If the gene/ORF you wish to align exists in multiple distinct segments in the genome (ribosomal frameshift during transcription), Kc-Align can find each homologous segment separately for each input sequence and then concatenate them before performing the final MSA. This requires the user to enter the start and end coordinates of each segment as a comma-separated list following their respective arguments.

Example

kc-align -m genome -r reference.fasta -S sequences.fasta -s 3532,3892 -e 3894,5326

In the above example, the two segments being aligned have the coordinates 3532-3894 and 3892-5326.

Gene

If the input sequences have already been trimmed to the coding sequences of the gene/ORF of interest, then this mode can be used and Kc-Align will not perform the homology search as it does in genome mode. It instead simply immediately performs the MSA, making this mode much faster than the others.

Mixed

For the case when your reference is a coding sequence while all other sequences are whole genomes. Like gene mode, this mode does not require the start and end point position parameters but like genome mode it will perform homology searching in order to extract the sequences homologous to the reference from the other input sequences.

Outputs

Kc-Align will output two files: a FASTA format alignment and a CLUSTAL format alignment.

Compress Identical Sequences

If the --compress/-c parameter is specified, Kc-Align will compress identical sequences into a single sequence. In the FASTA output, compressed sequences will have an ID of the form MultiSeq[incremental index]_[number of sequences that were compressed] (ex: MultiSeq3_321, third compression with 321 sequences having that same sequence) while the description field is a comma-separated list of every ID that was compressed into that single sequence. The reference sequence will not be compressed.

Parallelization

If the --parallel/-p parameter is used in genome or mixed mode, the calculations for the homology search will be split between 3 cores (if possible), decreasing runtimes by up to 35%.

Kc-Align In Action

TODO

Will post results from SARS-CoV-2 analysis

Speed

TODO

Speed test results