Merging Contiguous Hits

Robert J. Gifford edited this page Jun 24, 2024 · 3 revisions

BLAST-based screens will often generate redundant, overlapping and fragmented hits. Fragmentation occurs because BLAST emphasizes local similarity, and will often split a single matching region of sequence into several contiguous hits. It may also occur because a match is genuinely discontiguous.

Overlapping or redundant BLAST hits arise when two or more probes in the query set are closely related, and therefore generate similar (though not necessarily identical) hits.

The DIGS tool can be configured to deal with these contingencies in different ways, using parameters specified in the control file.

There are three 'redundancy modes' that can be set:

  • 1: all BLAST hits are treated as distinct, regardless of whether they overlap, lie within close range of one another, or are entirely redundant. This mode is only suitable when the aim of using the DIGS tool is simply to establish the presence or absence of specific, relatively rare (i.e. non-repetitive) sequences.

  • 2: hits will be merged if they are in the same orientation, have the same value for 'assigned_gene' and are within a specified range of one another.

  • 3: hits will be merged if they are in the same orientation, have the same value for 'assigned_gene' and 'assigned_name', and are within a specified range of one another.
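The three modes can be illustrated with a minimal Python sketch. This is not the DIGS tool's own code: the `Hit` class, the `merge_hits` function, the `max_range` parameter, and the `scaffold`, `orientation`, `start` and `end` fields are hypothetical names chosen for illustration. Only 'assigned_gene' and 'assigned_name' are taken from the description above. The sketch also assumes that hits are only compared when they lie on the same target sequence.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    scaffold: str       # assumed field: target sequence the hit lies on
    orientation: str    # assumed field: '+' or '-'
    start: int          # assumed field
    end: int            # assumed field
    assigned_gene: str
    assigned_name: str


def merge_key(hit, mode):
    """Return the fields that must match before two hits may be merged."""
    key = (hit.scaffold, hit.orientation, hit.assigned_gene)
    if mode == 3:
        # Mode 3 additionally requires the same 'assigned_name'.
        key += (hit.assigned_name,)
    return key


def merge_hits(hits, mode, max_range):
    """Merge hits according to the redundancy mode (1, 2 or 3)."""
    if mode == 1:
        # Mode 1: every hit is treated as distinct.
        return list(hits)

    # Group hits that are allowed to merge with one another.
    groups = {}
    for hit in hits:
        groups.setdefault(merge_key(hit, mode), []).append(hit)

    merged = []
    for group in groups.values():
        group.sort(key=lambda h: h.start)
        current = group[0]
        for nxt in group[1:]:
            if nxt.start - current.end <= max_range:
                # Overlapping or close enough: extend the current hit.
                # In mode 2 the first hit's assigned_name is kept
                # (an arbitrary choice made for this sketch).
                current = Hit(current.scaffold, current.orientation,
                              current.start, max(current.end, nxt.end),
                              current.assigned_gene, current.assigned_name)
            else:
                merged.append(current)
                current = nxt
        merged.append(current)
    return merged


hits = [
    Hit("chr1", "+", 100, 200, "pol", "HERV-K"),
    Hit("chr1", "+", 250, 400, "pol", "HERV-W"),
    Hit("chr1", "-", 420, 500, "pol", "HERV-K"),
]
print(len(merge_hits(hits, mode=1, max_range=100)))  # 3
print(len(merge_hits(hits, mode=2, max_range=100)))  # 2
print(len(merge_hits(hits, mode=3, max_range=100)))  # 3
```

In this example the first two hits merge in mode 2 because they share orientation and 'assigned_gene' and are 50 bases apart, but stay separate in mode 3 because their 'assigned_name' values differ.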

Merging hits in the DIGS results table into larger sequences

When relying on sequence similarity as a means of recovering the sequences of related genome features, a limitation is that the sequences of many interesting genome features are only partially conserved, and large regions of sequence within these features may be rearranged or divergent.

However, when two or more conserved features occur contiguously, their relationship can be used to determine the coordinates of a more complete sequence for the genome feature of interest.

For example, integrated retroviruses ('proviruses') are composed of internal coding domains (gag, pol, env - in that order), flanked by terminal LTRs that are usually (though not always entirely) non-coding. However, endogenous retroviruses (ERVs) frequently have much more complex genome arrangements, with many being fragmentary or mosaic in structure, and large regions of the integrated provirus often being highly divergent from anything seen previously.

Accordingly, it makes sense to screen first using individual features (i.e. Gag, Pol and Env polypeptides, plus LTR nucleotide sequences) as probes and references, then to consolidate the hits to these probes into larger sequences comprising the hits plus the intervening sequences. Where merging occurs, we can also record the relationship between the component parts of the merged sequence.

The DIGS tool can be used to implement a ‘consolidation’ of this kind. Contiguous hits in the ‘digs_results’ table are merged based on whether they are within a user-defined distance of one another.
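As a rough illustration of distance-based consolidation, the Python sketch below merges neighbouring hits into loci and records the order and orientation of the component hits. It is not the DIGS tool's implementation. The `consolidate` function, the `max_gap` parameter, the dictionary keys other than 'assigned_gene', and the format of the structure string are all assumptions made for this example. The sketch also assumes that hits in either orientation may be merged into the same locus.

```python
def consolidate(hits, max_gap):
    """Merge hits lying within max_gap of one another on the same scaffold.

    Each hit is a dict with the (assumed) keys 'scaffold', 'start', 'end',
    'orientation' and 'assigned_gene'. Returns a list of locus dicts.
    """
    loci = []
    ordered = sorted(hits, key=lambda h: (h["scaffold"], h["start"]))
    for hit in ordered:
        # Hypothetical notation for one component, e.g. "gag(+)".
        part = f'{hit["assigned_gene"]}({hit["orientation"]})'
        last = loci[-1] if loci else None
        if (last is not None
                and last["scaffold"] == hit["scaffold"]
                and hit["start"] - last["end"] <= max_gap):
            # Close enough: extend the locus over the hit and the gap.
            last["end"] = max(last["end"], hit["end"])
            last["parts"].append(part)
        else:
            loci.append({"scaffold": hit["scaffold"],
                         "start": hit["start"],
                         "end": hit["end"],
                         "parts": [part]})
    for locus in loci:
        # Join the components in positional order (separator is arbitrary).
        locus["locus_structure"] = "|".join(locus.pop("parts"))
    return loci


example_hits = [
    {"scaffold": "chr1", "start": 1000, "end": 1500,
     "orientation": "+", "assigned_gene": "LTR"},
    {"scaffold": "chr1", "start": 1600, "end": 3000,
     "orientation": "+", "assigned_gene": "gag"},
    {"scaffold": "chr1", "start": 3100, "end": 5500,
     "orientation": "+", "assigned_gene": "pol"},
    {"scaffold": "chr1", "start": 20000, "end": 21000,
     "orientation": "-", "assigned_gene": "env"},
]
for locus in consolidate(example_hits, max_gap=500):
    print(locus["start"], locus["end"], locus["locus_structure"])
# 1000 5500 LTR(+)|gag(+)|pol(+)
# 20000 21000 env(-)
```

Here the LTR, gag and pol hits are joined into one locus spanning the hits and the sequence between them, while the distant env hit forms a locus of its own.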

Running the consolidation process produces a set of merged sequences, and also classifies these sequences using the same approach applied when generating the digs_results table. The results - i.e. a non-overlapping set of sequences, merged as determined by user-specified rules - are entered into the 'loci' table (see the database schema page for details). A separate reference sequence library that is appropriate for classifying the longer sequences should be used for classifying the consolidated results, and is specified by a distinct parameter (see section 4 in the set-up stages above).

The loci table contains most of the same fields as the digs_results table, but also includes a 'locus_structure' field. This field records the order and relative orientation of the individual hits from the digs_results table that were combined to create the merged hit.
