Alignments and Alignment Trees

Robert J. Gifford edited this page Nov 27, 2024 · 3 revisions

One of the key features of GLUE is its ability to compute, store and query multiple sequence alignments, represented by Alignment objects in the data model. In GLUE, Alignments may be constrained or unconstrained. Unconstrained alignments are best for recording homology between a small number of distantly-related sequences, whereas constrained alignments are good for capturing the homology between a large number of closely-related sequences. Constrained alignments may also be assembled into alignment trees which combine homology information together in a phylogenetic data structure.

  1. Unconstrained alignments
  2. Constrained alignments
  3. Alignment trees

Unconstrained Alignments

Unconstrained Alignments are used in GLUE projects to store the results of any process aimed at identifying homologies between nucleotides. This includes automated alignment tools such as MUSCLE or MAFFT as well as manual techniques. In unconstrained Alignments the reference coordinate space is purely notional, and not based on any particular Sequence. Nucleotide position columns in this coordinate space may be added in an unrestricted way in order to accommodate any pairwise homology between member Sequences.

The example unconstrained Alignment shown below contains three AlignmentMembers. Each AlignmentMember contains multiple AlignedSegments; these map between the member sequence coordinates and the reference coordinate system. For example, a block of 7 nucleotides starting at position 5 in Member 3 is mapped to reference coordinates [1,7]. The reference coordinate space has been expanded to accommodate insertions, for example those present in Member 2 at locations [8,10] and [24,26] in the reference space, and an insertion present in Members 1 and 3 at location [16,18].
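The mapping role of an AlignedSegment can be sketched in a few lines of Python. This is an illustration only, not GLUE's implementation: the field names (`ref_start`, `ref_end`, `member_start`, `member_end`), the helper `member_to_ref`, and the assumption of 1-based inclusive coordinates with equal-length blocks are all hypothetical. The example segment encodes the Member 3 block described above (7 nucleotides starting at member position 5, mapped to reference [1,7]).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlignedSegment:
    """Hypothetical model: a block on the member mapped to a block in reference space.

    Coordinates are assumed to be 1-based and inclusive.
    """
    ref_start: int
    ref_end: int
    member_start: int
    member_end: int

    def __post_init__(self) -> None:
        # Assumption: both blocks cover the same number of nucleotides.
        if self.ref_end - self.ref_start != self.member_end - self.member_start:
            raise ValueError("reference and member blocks must be equal length")


def member_to_ref(segments: List[AlignedSegment], member_pos: int) -> Optional[int]:
    """Return the reference coordinate homologous to member_pos, or None if unaligned."""
    for seg in segments:
        if seg.member_start <= member_pos <= seg.member_end:
            return seg.ref_start + (member_pos - seg.member_start)
    return None


# Member 3 from the example: 7 nucleotides starting at position 5 map to [1,7].
member3 = [AlignedSegment(ref_start=1, ref_end=7, member_start=5, member_end=11)]

print(member_to_ref(member3, 5))   # 1
print(member_to_ref(member3, 11))  # 7
print(member_to_ref(member3, 4))   # None (position 4 is not covered by any segment)
```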


Constrained Alignments

A ReferenceSequence in GLUE provides a concrete coordinate space in which nucleotide data may be interpreted. Alignment objects may be constrained to a ReferenceSequence. This association is made at the time an Alignment is created and is immutable for the lifetime of the Alignment. The nucleotide Sequence underlying the constraining ReferenceSequence provides the reference coordinate space for the constrained Alignment. Therefore, AlignedSegment objects within a constrained Alignment propose homology between a nucleotide block on a member Sequence and an equal-length block on the constraining ReferenceSequence.

The example constrained Alignment shown below contains the same three member Sequences. In this case the Sequence underlying Member 1 has also been selected as the constraining ReferenceSequence. Therefore, nucleotide columns exist in this Alignment precisely for the nucleotide positions which exist in the Member 1 Sequence. Consequently, columns are included for insertions present in Member 1 relative to Member 2 (e.g. [19,21] in the reference space), and relative to Member 3 (e.g. [24,26]). However, this alignment does not contain columns for insertions present in Member 2 relative to Member 1 (e.g. between 13 and 14), although the Alignment does record the fact that this insertion exists.

Unconstrained Alignments have the advantage of being able to represent the full set of homologies between any pair of member sequences; however, they must use an artificial coordinate space to achieve this. Constrained Alignments use a concrete coordinate space but cannot represent homologies within nucleotide columns if those columns only exist in insertions relative to the constraining ReferenceSequence.

The AlignedSegment objects within constrained Alignment objects may be derived from those within unconstrained Alignment objects where both the member and reference sequences are present. However, where the member sequences of the constrained Alignment are known to be closely related to the ReferenceSequence, another possibility exists. In this case the constrained Alignment homologies may be computed using a simple pairwise technique between the member and reference sequence, for example based on BLAST.
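The first option, deriving constrained segments from an unconstrained Alignment that contains both sequences, can be sketched as an interval intersection. This is a minimal illustration under stated assumptions, not GLUE's own algorithm: the `Seg` type, the `constrain` function, and the coordinates in the example are invented for this sketch, and coordinates are assumed to be 1-based and inclusive. Wherever a member segment and a reference-sequence segment overlap in the notional column space, the overlap is re-expressed in the reference sequence's own coordinates.

```python
from typing import List, NamedTuple


class Seg(NamedTuple):
    """Hypothetical segment: [ref_start, ref_end] columns <-> [member_start, member_end]."""
    ref_start: int
    ref_end: int
    member_start: int
    member_end: int


def constrain(member_segs: List[Seg], reference_segs: List[Seg]) -> List[Seg]:
    """Project a member's unconstrained segments onto the reference sequence.

    Both inputs use the notional column space of the unconstrained alignment.
    In the output, ref_start/ref_end are positions on the reference sequence.
    """
    out = []
    for m in member_segs:
        for r in reference_segs:
            lo = max(m.ref_start, r.ref_start)
            hi = min(m.ref_end, r.ref_end)
            if lo > hi:
                continue  # no shared columns between these two segments
            out.append(Seg(
                ref_start=r.member_start + (lo - r.ref_start),
                ref_end=r.member_start + (hi - r.ref_start),
                member_start=m.member_start + (lo - m.ref_start),
                member_end=m.member_start + (hi - m.ref_start),
            ))
    return sorted(out)


# Invented coordinates: the reference sequence has no nucleotides in columns 8-10.
reference_segs = [Seg(1, 7, 1, 7), Seg(11, 20, 8, 17)]
member_segs = [Seg(1, 10, 5, 14), Seg(11, 15, 15, 19)]

for seg in constrain(member_segs, reference_segs):
    print(seg)
# Seg(ref_start=1, ref_end=7, member_start=5, member_end=11)
# Seg(ref_start=8, ref_end=12, member_start=15, member_end=19)
```

Member positions 12 to 14 fall in columns 8 to 10, where the reference sequence has no nucleotides, so they receive no segment in the constrained result. This mirrors the limitation described above: insertions relative to the constraining ReferenceSequence have no columns of their own.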


Alignment Trees

GLUE projects have the option of using a structure called an alignment tree. This links together alignments in an evolution-oriented way. There are often widely recognised phylogenetic clades such as genotypes within a set of virus sequences. The structure of the alignment tree reflects the phylogenetic relationships between these clades.

An alignment tree is built by creating a constrained Alignment object for each of the clades of interest. These Alignments become nodes within the tree. Where a parent-child relationship between two clades exists within the evolutionary hypothesis, a special relational link is introduced between the corresponding pair of Alignment objects.

There is a special condition called the alignment tree invariant which is enforced by GLUE when working with alignment trees: If Alignment A is a child of Alignment B, the Sequence acting as the constraining ReferenceSequence of Alignment A must also be a member sequence of Alignment B. In this way, a parent Alignment is forced to contain representative member sequences from any child Alignments. The object structure of an example alignment tree, demonstrating the invariant, is shown in the accompanying diagram.

The constrained Alignment at the root represents an entire virus species. Two child Alignments represent genotypes 3 and 4 (clades within the species). Genotype 3 is further subdivided into two subtypes, 3a and 3b. Each constrained Alignment has a constraining ReferenceSequence. Within each Alignment node there are various AlignmentMember objects, each of which records the pairwise homology between the member Sequence and the constraining ReferenceSequence. The alignment tree invariant requires, for example, that the constraining ReferenceSequence of subtype 3a is also a member of its parent, genotype 3.
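The alignment tree invariant can be expressed as a simple check over parent-child links. The sketch below is not GLUE code: GLUE enforces the invariant itself, and the `Alignment` class, the `invariant_violations` function, and all sequence identifiers here are hypothetical names chosen to mirror the species / genotype / subtype example. Including each reference sequence among its own Alignment's members is also just a modelling choice for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Alignment:
    """Hypothetical model of a constrained Alignment node in an alignment tree."""
    name: str
    ref_sequence: str                      # ID of the sequence behind the constraining reference
    members: Set[str] = field(default_factory=set)
    parent: Optional["Alignment"] = None


def invariant_violations(alignments: List[Alignment]) -> List[str]:
    """Names of child Alignments whose reference sequence is not a member of the parent."""
    return [
        a.name
        for a in alignments
        if a.parent is not None and a.ref_sequence not in a.parent.members
    ]


# Invented sequence IDs mirroring the example tree.
species = Alignment("species", "ref_species", {"ref_species", "ref_gt3", "ref_gt4"})
gt3 = Alignment("genotype_3", "ref_gt3", {"ref_gt3", "ref_3a"}, parent=species)
gt4 = Alignment("genotype_4", "ref_gt4", {"ref_gt4"}, parent=species)
sub3a = Alignment("subtype_3a", "ref_3a", {"ref_3a"}, parent=gt3)
# Deliberately broken: ref_3b has not been added to genotype_3's members.
sub3b = Alignment("subtype_3b", "ref_3b", {"ref_3b"}, parent=gt3)

print(invariant_violations([species, gt3, gt4, sub3a, sub3b]))  # ['subtype_3b']

gt3.members.add("ref_3b")  # repair: the parent now contains the child's reference
print(invariant_violations([species, gt3, gt4, sub3a, sub3b]))  # []
```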

There are some advantages to using alignment trees in a GLUE project:

  1. For a number of reasons, it's generally useful to organise virus sequences hierarchically, according to clade.
  2. Constrained alignments near the tips of the tree can accurately capture homologies between closely-related sequences.
  3. The alignment tree invariant guarantees that between any two Sequence objects, there is a path of homologies. This facilitates comparisons of distantly related sequences.
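The third point can be illustrated by treating each member-to-reference homology as an edge in a graph and searching for a route between two sequences. This is a sketch under assumptions rather than a description of how GLUE traverses the tree: the dictionary layout, the `homology_path` function, and the sequence identifiers are all hypothetical. Because every child's reference sequence is a member of its parent, the graph is connected, so a breadth-first search should find a chain of pairwise homologies.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical layout: alignment name -> (reference sequence ID, member sequence IDs)
Tree = Dict[str, Tuple[str, Set[str]]]


def homology_path(alignments: Tree, start: str, goal: str) -> Optional[List[str]]:
    """Return a chain of sequences in which each adjacent pair shares a recorded homology."""
    # Each member of a constrained alignment is linked to that alignment's reference.
    graph: Dict[str, Set[str]] = {}
    for ref, members in alignments.values():
        for m in members:
            if m != ref:
                graph.setdefault(m, set()).add(ref)
                graph.setdefault(ref, set()).add(m)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Invented IDs: seq_x belongs to subtype 3a, seq_y to genotype 4.
tree: Tree = {
    "species": ("ref_species", {"ref_gt3", "ref_gt4"}),
    "genotype_3": ("ref_gt3", {"ref_3a", "ref_3b"}),
    "genotype_4": ("ref_gt4", {"seq_y"}),
    "subtype_3a": ("ref_3a", {"seq_x"}),
}

print(homology_path(tree, "seq_x", "seq_y"))
# ['seq_x', 'ref_3a', 'ref_gt3', 'ref_species', 'ref_gt4', 'seq_y']
```

Each step in the returned list corresponds to a pairwise homology stored in one constrained Alignment, so a comparison between two distantly related sequences could in principle be assembled by composing the segment mappings along the path.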