Background
The broad-scale availability of virus sequence data is transforming the way that viruses are studied and monitored. With billions of bases generated from sequencing technologies, there is unprecedented potential for advancing knowledge in the field. Sequence data support critical functions, including:
- Surveillance, Control, and Reporting: Facilitating the monitoring of viral diseases and outbreaks.
- Vaccine Development and Diagnostics: Supporting the design and evaluation of vaccines and diagnostic tools.
- Understanding Pathogenesis and Transmissibility: Identifying factors influencing how viruses spread and cause disease.
Despite this potential, it remains challenging for researchers to make productive use of these data. The complexity of the underlying data, the rapid accumulation of new genome sequences, and the fast pace of technological change all create significant hurdles.
Virus databases are pivotal in enabling researchers to examine viral properties and epidemic patterns in innovative ways. These databases combine genomic data with other critical information, such as epidemiological data, to provide a more complete view of viral ecology and evolution. Sequencing data is particularly valuable as it not only captures the viral genetic code but also links closely to the evolutionary history of the virus, facilitating the tracking of viral lineages and mutation rates.
The tools associated with virus databases are essential for conducting comparative analyses of viral genomes. Key tools include:
- Comparative Analysis Tools: These tools enable researchers to perform pairwise and multiple sequence alignments, which are critical for understanding the genetic relationships between different viral strains. Alignments serve as the currency of most sequence analyses, allowing for the identification of patterns such as mutations and conserved motifs.
- Phylogenetic Reconstruction: Phylogenetic tree reconstruction tools facilitate the visualization of evolutionary relationships among viral strains, helping researchers understand how viruses evolve over time and in response to environmental pressures.
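As a rough illustration of what these tools compute, the sketch below works on a tiny alignment that is already built. The strain names and sequences are made up, and the helper functions (`differences`, `conserved_columns`, `p_distance`) are hypothetical names for this example, not part of GLUE or any alignment package. It lists mismatches against a reference row, finds fully conserved columns, and computes pairwise p-distances, the kind of distance matrix that distance-based tree-building methods take as input.

```python
from itertools import combinations

# Toy pre-aligned sequences (hypothetical data; '-' marks an alignment gap).
alignment = {
    "strainA": "ATGGCT-ACGT",
    "strainB": "ATGGCTTACGT",
    "strainC": "ATGACT-ACGA",
}


def differences(ref, other):
    """Return (1-based column, ref base, other base) for each mismatch, skipping gaps."""
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(ref, other))
        if a != b and "-" not in (a, b)
    ]


def conserved_columns(rows):
    """Return 1-based columns where every sequence carries the same non-gap base."""
    return [
        i + 1
        for i, column in enumerate(zip(*rows))
        if len(set(column)) == 1 and column[0] != "-"
    ]


def p_distance(a, b):
    """Proportion of differing sites among columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if "-" not in (x, y)]
    return sum(x != y for x, y in pairs) / len(pairs)


mutations = differences(alignment["strainA"], alignment["strainC"])
conserved = conserved_columns(list(alignment.values()))
distances = {
    (m, n): round(p_distance(alignment[m], alignment[n]), 3)
    for m, n in combinations(alignment, 2)
}
```

Here `mutations` holds the two substitutions separating strainA from strainC, and `distances` is keyed by strain pair. Real analyses would use dedicated alignment and phylogenetics software rather than hand-written helpers like these.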
The development of sequence databases has a rich history, beginning with early examples of highly sequenced viruses such as Influenza A virus and HIV-1. These viruses served as proving grounds for comparative and phylogenetic approaches, showcasing the potential of genomic analysis in understanding viral behavior.
- Influenza A Virus: Influenza sequence databases initially focused on epidemiological studies, providing insights into the rate and direction of the virus's spread and its pathways of transmission. Over time, the focus expanded to include vaccine design, demonstrating the critical role of genomic data in public health responses.
- HIV-1: The establishment of databases like HIVdb at Stanford and the Los Alamos HIV database marked significant milestones in the use of genomic data to track and respond to the HIV epidemic. These databases facilitated the analysis of drug resistance and the identification of new therapeutic targets.
Many species-focused databases have since been developed, but they are often not maintained, leaving significant gaps in resources for viruses such as measles and respiratory syncytial virus (RSV). As sequencing technologies continue to advance, the need for well-maintained databases will only grow.
Given the unique challenges posed by viruses, there is a pressing need for specialized systems like GLUE. Viruses exhibit greater diversity and higher mutation rates than other organisms, making it essential to have tailored approaches for their study. The rapid capacity for evolution among viruses not only complicates their management but also presents opportunities for real-time tracking of viral epidemics.
The complexity of viral genomes, particularly RNA virus genomes, serves as a reminder of the intricate ways these entities manipulate host cells for replication. To realize the full value of virus genome sequencing, sequence data should be processed within 'sequence-oriented resources' like GLUE. These scalable software systems encapsulate domain knowledge relevant to various analysis objectives, enabling researchers to work collaboratively across related fields of study.
One of the foundational principles of GLUE is the explicit separation of the software engine from the GLUE projects. The GLUE engine serves as the core software package, while GLUE projects encapsulate datasets and other items related to specific groups of viruses. This separation enhances modularity and allows users to interact with project data through a straightforward programmatic interface. This interface is versatile enough to be utilized not only in traditional bioinformatics pipelines but also in web resources that leverage GLUE.
GLUE employs a model-driven architecture that defines a comprehensive data schema supporting various virus sequence data resources. Key features include:
- Standardized Storage: All necessary information for sequence processing, including both data and analysis configurations, is stored within a structured relational database. This organization facilitates efficient data retrieval during computational processes.
- Simplified Logic Implementation: The use of standard database mechanisms (such as structured queries, relational joins, paging, and caching) simplifies the implementation of higher-level logic. This uniformity allows for easier validation of referential integrity, query syntax, and data exportation.
- Ease of Deployment: Deploying a GLUE-based resource on a new computer system is straightforward. Users need only to install GLUE and transfer the database contents to ensure that all required data and analysis functionalities are in place.
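The benefit of standard database mechanisms can be shown with a small sketch. The schema below is a simplified, hypothetical one invented for this example (table and column names are assumptions, and it uses SQLite from the Python standard library purely for convenience); it is not GLUE's actual data model. It demonstrates a relational join, paging, and a referential integrity check.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only; not GLUE's actual data model.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce foreign keys
conn.executescript("""
CREATE TABLE sequence (
    seq_id      TEXT PRIMARY KEY,
    country     TEXT,
    nucleotides TEXT NOT NULL
);
CREATE TABLE alignment (
    name TEXT PRIMARY KEY
);
CREATE TABLE alignment_member (
    alignment_name TEXT NOT NULL REFERENCES alignment(name),
    seq_id         TEXT NOT NULL REFERENCES sequence(seq_id),
    PRIMARY KEY (alignment_name, seq_id)
);
""")

conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", [
    ("seq1", "UK", "ATGGCTACGT"),
    ("seq2", "Kenya", "ATGGCTTACGT"),
    ("seq3", "UK", "ATGACTACGA"),
])
conn.execute("INSERT INTO alignment VALUES ('genotype_1')")
conn.executemany("INSERT INTO alignment_member VALUES (?, ?)", [
    ("genotype_1", "seq1"),
    ("genotype_1", "seq3"),
])


def members(alignment_name, country, limit=10, offset=0):
    """Join alignment membership to sequence metadata, one page at a time."""
    rows = conn.execute(
        """SELECT s.seq_id
           FROM alignment_member m
           JOIN sequence s ON s.seq_id = m.seq_id
           WHERE m.alignment_name = ? AND s.country = ?
           ORDER BY s.seq_id
           LIMIT ? OFFSET ?""",
        (alignment_name, country, limit, offset),
    )
    return [row[0] for row in rows]


uk_members = members("genotype_1", "UK")

# The foreign key rejects a member that points at a sequence that does not exist.
try:
    conn.execute("INSERT INTO alignment_member VALUES ('genotype_1', 'missing')")
    integrity_enforced = False
except sqlite3.IntegrityError:
    integrity_enforced = True
```

Because both the data and the relationships live in the database, copying the database contents carries everything the queries depend on, which is the property the deployment point above relies on.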
Alignments are critical to the analysis of virus sequence data, yet creating high-quality multiple sequence alignments (MSAs) can be resource-intensive. GLUE places a strong emphasis on MSAs by treating them as first-class data objects, distinct from sequences. This strategy helps streamline the management and analysis processes associated with MSAs.
- Efficient MSA Management: Given the high genetic variation within viral genomes, developing a robust approach to MSA management is essential. GLUE has introduced new software methods to facilitate the creation and refinement of MSAs, minimizing redundant efforts and ensuring that high-quality alignments can be achieved more efficiently.
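One way to picture an alignment as a data object distinct from the sequences it covers is to store coordinate mappings instead of gapped copies of the sequences. The sketch below assumes that design for illustration; the `Segment` and `Alignment` classes and their fields are hypothetical names for this example and should not be read as GLUE's own classes or storage format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """Maps member positions onto reference positions (1-based, inclusive)."""
    ref_start: int
    ref_end: int
    member_start: int
    member_end: int


@dataclass
class Alignment:
    """An alignment stored as coordinate mappings, separate from the sequences."""
    name: str
    ref_length: int
    members: dict = field(default_factory=dict)  # seq_id -> list of Segment

    def add_member(self, seq_id, segments):
        for s in segments:
            if s.ref_end - s.ref_start != s.member_end - s.member_start:
                raise ValueError("segment lengths must match")
        self.members[seq_id] = list(segments)

    def render_row(self, seq_id, sequences):
        """Build the gapped row for one member in reference coordinates."""
        row = ["-"] * self.ref_length
        nucleotides = sequences[seq_id]
        for s in self.members[seq_id]:
            row[s.ref_start - 1:s.ref_end] = nucleotides[s.member_start - 1:s.member_end]
        return "".join(row)


# Sequences are stored once, ungapped (hypothetical data).
sequences = {"seq1": "ATGGCTACGT"}

# seq1 lacks the base at reference position 7, so it aligns as two segments.
msa = Alignment(name="example", ref_length=11)
msa.add_member("seq1", [Segment(1, 6, 1, 6), Segment(8, 11, 7, 10)])
row = msa.render_row("seq1", sequences)
```

In this arrangement the sequence itself is never modified or duplicated; refining the alignment only changes the segments, and a gapped row such as `row` is generated on demand.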
GLUE's architecture allows for deployment within standard web servers, exposing its functionalities via web service protocols. This capability enables:
- Interactive Public Websites: Researchers can build public-facing websites that allow users to interact with virus data and analyses, promoting transparency and collaboration in scientific research.
- Programmatic Services: GLUE can be integrated into broader computational infrastructures as part of a microservices architecture, facilitating machine-to-machine interactions and supporting automated workflows in research and clinical settings.
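The machine-to-machine pattern can be sketched with a minimal JSON-over-HTTP exchange. Everything here is an assumption made for illustration: the request shape, the `CountHandler` class, the `query` helper, and the stand-in data are hypothetical and do not reflect GLUE's actual web service interface. The example starts a throwaway local server using only the Python standard library and then calls it as a client would.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in data; a real deployment would query the project database.
SEQUENCE_COUNTS = {"genotype_1": 2, "genotype_2": 5}


class CountHandler(BaseHTTPRequestHandler):
    """Answers POST requests carrying a JSON body such as {"alignment": "genotype_1"}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        name = request.get("alignment")
        if name in SEQUENCE_COUNTS:
            status, body = 200, {"alignment": name, "count": SEQUENCE_COUNTS[name]}
        else:
            status, body = 404, {"error": "unknown alignment"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the example quiet


def query(url, alignment):
    """Machine-to-machine call: send a JSON request, receive a JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"alignment": alignment}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Bypass any system proxy settings so the local call goes straight to the server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.load(response)


server = HTTPServer(("127.0.0.1", 0), CountHandler)  # port 0 picks any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    result = query(f"http://127.0.0.1:{server.server_port}/", "genotype_1")
finally:
    server.shutdown()
    server.server_close()
```

The same endpoint could serve both a public website's front end and an automated pipeline, since each is just a client exchanging JSON with the service.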
As the volume of genomic data continues to grow, the importance of robust, well-maintained databases and tools like GLUE will become increasingly evident. The potential for industrializing virus analysis as a service, alongside the monetization of these services, highlights the need for continuous innovation in bioinformatics resources.
GLUE by Robert J. Gifford Lab.
For questions, issues, or feedback, please open an issue on the GitHub repository.