Robert J. Gifford edited this page Jun 23, 2024 · 7 revisions

Screening genomes in silico

Sequence similarity search tools, such as the Basic Local Alignment Search Tool (BLAST), are essential for biological sequence analysis. These tools detect regions of local similarity between molecular sequences and serve two broad purposes. First, they can be used to characterize a locus in detail, helping to identify the coordinates of specific sequence features at the protein or nucleic acid level (e.g., conserved protein motifs, oligonucleotide primer sites). Second, they serve as a 'search engine' for retrieving similar sequences from databases, where similarity may indicate an evolutionary relationship. This second function is particularly useful for comparative and evolutionary studies, especially given the rapid accumulation of sequence data.

The basic functions of BLAST can be expanded into comprehensive investigative strategies for comparative analysis of genes and genomes. This might involve using different combinations of probe sequences and target databases or integrating BLAST searches with other sequence analysis methods (e.g., phylogenetic or statistical analysis).

BLAST-based approaches are particularly useful for investigating genomic features that are poorly annotated in public databases, such as small RNAs, pseudogenes, transposable elements, highly duplicated gene families, and endogenous viral elements (EVEs). More broadly, BLAST searches can underpin heuristic in silico investigations, where the overall strategy is loosely defined and requires multiple iterations of trial and error, using new information from each iteration to refine the approach.

While systematic BLAST screens of genome databases are crucial for many comparative genomics investigations, efficiently implementing these procedures and integrating them into bioinformatics workflows can be technically challenging.

Input Data Components

  1. Target Database (TDb): A collection of whole genome sequence or transcriptome assemblies serving as the target for similarity searches.
  2. Query Sequences (Probes): Input sequences for similarity searches of the Target Database.
  3. Reference Sequence Library (RSL): Represents the genetic diversity associated with the genome feature(s) under investigation.

Database-Integrated Genome Screening (DIGS)

Similarity searches enable researchers to selectively recover similar, and thus potentially related, sequences from the vast quantity of sequences held in sequence databases. In database-integrated genome screening (DIGS), the output of similarity search-based genome 'screens' is captured in a relational database. This makes it possible to automate screens and run them on a large scale, and to interrogate and manipulate the output data using structured query language (SQL).

The DIGS tool aims to provide a robust and extensible framework for systematic, BLAST-based in silico screens of molecular sequence databases and for interrogating the resulting data. All sequence similarity searches in the DIGS tool are performed with BLAST.
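As a rough illustration of why capturing screening output in a relational database is useful, the sketch below stores a few made-up hits and summarizes them with SQL. It uses Python's built-in sqlite3 module purely so that it is self-contained. The table name, column names, and rows are hypothetical and are not the DIGS tool's actual schema or data.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the DIGS tool's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE screening_hits (
        organism      TEXT,
        scaffold      TEXT,
        extract_start INTEGER,
        extract_end   INTEGER,
        assigned_name TEXT,
        bitscore      REAL
    )
    """
)

# Made-up example rows standing in for the output of a screen.
rows = [
    ("species_a", "scaffold_1", 100, 900, "X", 512.0),
    ("species_a", "scaffold_7", 4000, 4650, "Y", 301.5),
    ("species_b", "scaffold_2", 250, 1100, "X", 498.2),
]
conn.executemany("INSERT INTO screening_hits VALUES (?, ?, ?, ?, ?, ?)", rows)


def count_hits_by_assignment(connection):
    """Count hits per organism and assigned reference name."""
    query = """
        SELECT organism, assigned_name, COUNT(*) AS n_hits
        FROM screening_hits
        GROUP BY organism, assigned_name
        ORDER BY organism, assigned_name
    """
    return connection.execute(query).fetchall()


summary = count_hits_by_assignment(conn)
for organism, name, n_hits in summary:
    print(organism, name, n_hits)
```

Once results are in tables like this, questions such as "how many hits of each type occur in each genome?" become single queries rather than custom parsing scripts.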

Features

  • Sequence Similarity Search: Integrate tools like BLAST to find related nucleotide sequences.
  • Relational Database Integration: Store and manage large sets of sequence data efficiently.
  • Data Analysis: Utilize a range of tools to analyze and visualize genomic data.
  • Comparative Genomics: Perform in-depth comparative studies across different genomes.

The DIGS tool uses a screening approach based on two rounds of BLAST, a strategy we refer to as 'paired BLAST'. In the first round, query sequences selected from the reference sequence library (the 'probes') are used to search target databases. In the second, sequence 'hits' identified by screening are extracted from genomes and assigned a genotype by BLAST comparison to the reference sequence library.
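The two rounds can be sketched in Python as follows. This is a minimal outline under stated assumptions, not the DIGS tool's implementation: it assumes the BLAST+ command-line programs and their tabular output format (-outfmt 6), and the choice of blastn, the file names, the database names, and the helper functions are all hypothetical. The commands are only constructed here, not executed.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blast_command(program, query_fasta, db_name, out_path, evalue="1e-5"):
    """Build an argument list for a BLAST+ search with tabular output."""
    return [
        program, "-query", query_fasta, "-db", db_name,
        "-evalue", evalue, "-outfmt", "6", "-out", out_path,
    ]


def parse_tabular(text):
    """Parse -outfmt 6 text into dicts with numeric coordinates and bitscore."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(OUTFMT6_FIELDS, row))
        hit["sstart"] = int(hit["sstart"])
        hit["send"] = int(hit["send"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def extraction_coordinates(hit):
    """Return (subject id, start, end, strand) for extracting a hit."""
    start, end = sorted((hit["sstart"], hit["send"]))
    # In tabular output, a reverse-strand hit has sstart greater than send.
    strand = "+" if hit["sstart"] <= hit["send"] else "-"
    return hit["sseqid"], start, end, strand


# Round 1: probes against a target genome database (hypothetical names).
round1 = blast_command("blastn", "probes.fna", "target_genome", "round1.tsv")

# Made-up round 1 output line, used to show how a hit would be located.
example_output = "probe_X\tscaffold_1\t92.5\t800\t60\t0\t1\t800\t900\t101\t1e-150\t512\n"
hits = parse_tabular(example_output)
coords = extraction_coordinates(hits[0])

# Round 2: extracted hits against the reference sequence library.
round2 = blast_command("blastn", "extracted_hits.fna", "reference_library", "round2.tsv")
```

In practice each command list could be passed to subprocess.run, with the sequences at the round 1 coordinates written to a FASTA file before round 2 is run.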

The second BLAST step is included because probe sequences can often cross-match to a wide range of homologous sequences in the initial BLAST screen. For example, consider a gene that has two paralogs, 'X' and 'Y'. Screening with a probe of type X may yield hits to both X and Y. Comparing hits to a library of representative reference sequences in the second BLAST step provides an efficient means for users to discriminate which hits are more like X and which are more like Y.
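The X and Y example can be made concrete with a small sketch. It assumes, for illustration only, that each extracted hit is assigned to the reference with the highest bitscore in the second round. The source does not state the exact rule the DIGS tool applies, and the function name and scores below are made up.

```python
def assign_hits(round2_rows):
    """Assign each extracted hit to its top-scoring reference.

    round2_rows is an iterable of (hit_id, reference_name, bitscore) tuples.
    Assumption: the highest bitscore decides the assignment.
    """
    best = {}
    for hit_id, reference, bitscore in round2_rows:
        if hit_id not in best or bitscore > best[hit_id][1]:
            best[hit_id] = (reference, bitscore)
    return {hit_id: reference for hit_id, (reference, _) in best.items()}


# Both hits were found with a probe of type X, but only one resembles X most.
rows = [
    ("hit_1", "X", 480.0),
    ("hit_1", "Y", 210.0),
    ("hit_2", "X", 150.0),
    ("hit_2", "Y", 395.0),
]
assignments = assign_hits(rows)
print(assignments)  # {'hit_1': 'X', 'hit_2': 'Y'}
```

Here hit_2 was recovered by the X probe through cross-matching, yet the comparison against both references identifies it as more similar to Y.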