Robert J. Gifford edited this page Jun 23, 2024 · 7 revisions

Screening genomes in silico

Sequence similarity search tools, such as the basic local alignment search tool (BLAST), detect regions of local similarity between molecular sequences. These tools are close to indispensable for biological sequence analysis. They can be used to characterize a locus in detail, helping to identify the coordinates of specific sequence features at the protein or nucleic acid level (e.g. conserved protein motifs, oligonucleotide primer sites). They can also be used as a kind of ‘search engine’ for retrieving similar (and thus potentially evolutionarily related) sequences from sequence databases. This second function is especially useful for comparative, evolutionary studies, and increasingly so given the speed at which sequence data are now accumulating.

The basic functions supplied by BLAST can be elaborated into entire investigative strategies for comparative analysis of genes and genomes. This may involve using different combinations of probe sequences and target databases, or linking BLAST searches with other methods of sequence analysis (e.g. phylogenetic or statistical analysis).

BLAST-based approaches are especially useful when investigating genomic features that are not well annotated in public sequence databases, such as small RNAs, pseudogenes, transposable elements, highly duplicated gene families, and endogenous viral elements (EVEs). More broadly, BLAST searches can form the backbone of heuristic in silico investigations in which the overall strategy is loosely defined and the investigator needs to proceed through multiple iterations of trial and error, using the new information recovered in each iteration to update and refine the overall strategy.
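The iterative strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how the DIGS tool is implemented: a toy k-mer Jaccard score stands in for a real BLAST search, and the function names, sequences, threshold, and round limit are all hypothetical choices made for the example.

```python
def kmers(seq, k=4):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=4):
    """Toy stand-in for a BLAST score: Jaccard index of shared k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def iterative_screen(probes, targets, threshold=0.3, max_rounds=5):
    """Screen targets with probes, feeding each round's hits back in as probes.

    Returns a dict mapping each matched target name to the round in which
    it was first recovered.
    """
    probe_set = dict(probes)  # name -> sequence; grows as hits are found
    hits = {}                 # target name -> round number
    for round_no in range(1, max_rounds + 1):
        new_hits = {}
        for t_name, t_seq in targets.items():
            if t_name in hits:
                continue  # already recovered in an earlier round
            if any(similarity(p, t_seq) >= threshold for p in probe_set.values()):
                new_hits[t_name] = t_seq
        if not new_hits:
            break  # nothing new was recovered, so stop iterating
        for name, seq in new_hits.items():
            hits[name] = round_no
            probe_set[name] = seq  # refine the probe set for the next round
    return hits


# Hypothetical toy data: T2 resembles T1 but not the original probe.
probes = {"P": "AAAACCCCGGGG"}
targets = {
    "T1": "CCCCGGGGTTTT",
    "T2": "GGGGTTTTAAAT",
    "T3": "ATATATATATAT",
}
print(iterative_screen(probes, targets))
```

With this toy data, T1 is recovered directly by the probe, T2 is recovered only once T1 has been added to the probe set, and T3 is never matched. In a real screen, the similarity step would be replaced by a BLAST search and the results would typically be reviewed before being reused as probes.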

Although systematic BLAST ‘screens’ of genome databases are an important component of many comparative genomics investigations, efficiently implementing these procedures and integrating them into bioinformatics workflows can present a technical challenge.

The database-integrated genome-screening (DIGS) tool aims to provide a robust and extensible framework for implementing systematic, BLAST-based in silico screens of molecular sequence databases and interrogating the data they produce. To show how the DIGS tool can be applied, we validate it against a ‘gold standard’ dataset and demonstrate its application to an investigation of EVE diversity in mammalian genomes.