Getting Started
This guide outlines the steps required to set up and run the DIGS tool, describes the input data components, and indicates what to expect in terms of setup time and computational requirements. DIGS performs similarity-based searches of whole genome sequence data.
Before running DIGS, you'll need to prepare the following key data components:
- Target Database (TDb):
  - A collection of whole genome sequence or transcriptome assemblies that will serve as the target for similarity searches.
  - The target database should contain the sequences you aim to analyze and is a critical part of the screening process.
- Query Sequences (Probes):
  - Input sequences that will be used to perform similarity searches against the Target Database.
  - These sequences should be carefully chosen to match the type of genetic features or viral elements you are investigating.
- Reference Sequence Library (RSL):
  - Represents the genetic diversity associated with the genome feature(s) under investigation.
  - The reference library is used to contextualize the results of the similarity searches and validate the detected sequences.
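Before screening, it can help to sanity-check the probe and reference files. The following is a minimal sketch, assuming these inputs are plain FASTA text; the function name `summarize_fasta` and the example records are hypothetical and are not part of DIGS. It counts records, totals the sequence length, and reports any repeated sequence IDs.

```python
from collections import Counter


def summarize_fasta(text):
    """Return (record_count, total_residues, duplicate_ids) for FASTA text."""
    ids = []
    total = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is taken as the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
        else:
            if not ids:
                raise ValueError("sequence data found before the first header")
            total += len(line)
    duplicates = sorted(name for name, n in Counter(ids).items() if n > 1)
    return len(ids), total, duplicates


if __name__ == "__main__":
    # Hypothetical probe set; 'probe_1' is deliberately repeated.
    example = ">probe_1\nMKTAYIAKQR\n>probe_2\nMSLLTEVETY\n>probe_1\nMKT\n"
    count, residues, dups = summarize_fasta(example)
    print(f"records={count} residues={residues} duplicate_ids={dups}")
```

Repeated IDs are worth resolving before a screen, since results are interpreted by matching hits back to named reference sequences.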
The time required to set up and run DIGS depends on several factors, including your platform, computational resources, and the size of your datasets. Here's an outline of what to expect at each stage of the process.
Installation
- Time Estimate: Installation should take only a few minutes, particularly for experienced bioinformaticians working on Linux/UNIX systems.
- Details:
  - The DIGS tool requires several widely used bioinformatics components, including Perl, BLAST, and MySQL.
  - Most bioinformatics servers will already have these programs installed. If installation is necessary, it should be straightforward on Linux/UNIX operating systems.
  - Mac Users: Installing DIGS on a Macintosh computer can be less predictable. Specifically, the Perl library DBD::mysql does not come pre-installed and may require additional configuration. For guidance, refer to the Mac installation instructions.
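A quick way to see whether the required components are present is to look for them on the `PATH`. The sketch below is illustrative only: the executable names `perl`, `blastn`, and `mysql` are assumptions about a typical setup (the BLAST programs DIGS actually calls may differ), and both helper functions are hypothetical. The Perl module check relies on `perl -MModule -e 1`, which exits with a non-zero status when the module cannot be loaded.

```python
import shutil
import subprocess


def check_dependencies(programs=("perl", "blastn", "mysql"), which=shutil.which):
    """Map each program name to its resolved path, or None if it is not on PATH."""
    return {name: which(name) for name in programs}


def perl_module_available(module, perl="perl"):
    """Return True if `perl -M<module> -e 1` exits successfully."""
    if shutil.which(perl) is None:
        return False
    result = subprocess.run(
        [perl, f"-M{module}", "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    for name, path in check_dependencies().items():
        print(f"{name}: {path or 'NOT FOUND'}")
    status = "ok" if perl_module_available("DBD::mysql") else "missing"
    print(f"DBD::mysql: {status}")
```

On a Mac, a "missing" result for `DBD::mysql` is the case the Mac installation instructions address.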
Data Preparation
- Time Estimate: This step may take longer than installation, depending on the complexity of your research question and the amount of data to prepare.
- Details:
  - This stage involves selecting and formatting your probes, reference sequences, and target database.
  - It is crucial to plan carefully which target genomes you will screen and what kind of sequences you are searching for.
  - Spend time framing the research question you aim to address to ensure you select the most relevant targets and queries.
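When planning which target genomes to screen, an inventory of the target files gives a rough sense of scale. The sketch below makes two assumptions that are not taken from the DIGS documentation: target files are identified by the suffixes in `FASTA_SUFFIXES`, and every probe is searched against every target file, so the product of the two counts is an upper bound on the number of searches. All names here are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed naming convention for target files; adjust to match your data.
FASTA_SUFFIXES = {".fa", ".fas", ".fasta", ".fna"}


def inventory_targets(root):
    """Return sorted (path, size_in_bytes) pairs for FASTA-like files under root."""
    files = [
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
    ]
    return sorted((p, p.stat().st_size) for p in files)


def estimate_searches(n_probes, n_target_files):
    """Upper bound on searches if every probe is run against every target file."""
    return n_probes * n_target_files


if __name__ == "__main__":
    # Demonstration on a throwaway directory with one hypothetical genome file.
    with tempfile.TemporaryDirectory() as tmp:
        species_dir = Path(tmp) / "species_a"
        species_dir.mkdir()
        (species_dir / "assembly.fa").write_text(">chr1\nACGTACGT\n")
        targets = inventory_targets(tmp)
        total_bytes = sum(size for _, size in targets)
        print(f"{len(targets)} target file(s), {total_bytes} bytes")
        print("searches for 5 probes:", estimate_searches(5, len(targets)))
```

Total target size and the number of probe-by-target combinations are the two figures most likely to drive screening time.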
Control File
- Time Estimate: Only a few minutes.
- Details:
  - The control file defines the parameters for the DIGS screening process and should be structured according to the control file specification.
  - Refer to the control file guide for detailed instructions on creating this file.
Screening
- Time Estimate: Hours to days, depending on the size of your datasets and computational resources.
- Details:
  - DIGS performs similarity search-based screening, which can be computationally intensive and time-consuming.
  - The time required to run a screen depends on the size of the target database, the abundance or scarcity of matches for the query sequences, and the computational power available.
  - For long-running screens on a server, consider running the process in the background to avoid disruptions. Refer to the instructions on running DIGS in the background for details.
  - DIGS provides real-time updates on the progress of the screen, indicating how many queries have been executed.
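As one illustration of running a long job in the background, the sketch below starts a command in its own session and redirects its output to a log file, so that progress messages can be inspected later. It is not the procedure from the DIGS instructions: `launch_in_background` is a hypothetical helper, and the command used in the demonstration is a stand-in that should be replaced with the actual DIGS invocation from the DIGS documentation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def launch_in_background(command, log_path):
    """Start command in a new session, appending its output to log_path."""
    log = open(log_path, "ab")
    try:
        process = subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            # POSIX only: the child gets its own session, so it is not tied
            # to the terminal that launched it.
            start_new_session=True,
        )
    finally:
        # The child holds its own copy of the file descriptor.
        log.close()
    return process


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "screen.log"
        # Stand-in command; a real screen would run for hours or days.
        proc = launch_in_background(
            [sys.executable, "-c", "print('screen started')"], log_file
        )
        proc.wait()
        print(log_file.read_text())
```

For a real screen you would not wait on the process; instead, note `proc.pid` and follow the log file to see the progress updates.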
Summary
- Installation: Fast and straightforward on Linux/UNIX systems; may require additional steps on Macintosh.
- Data Preparation: Requires careful planning to ensure relevant probes, references, and targets are selected.
- Control File: Quick to set up; it defines how DIGS performs its searches.
- Screening: Can range from a few hours to several days depending on data size and system capabilities.
By understanding these steps and planning your data preparation carefully, you'll be able to use the DIGS tool efficiently to uncover the viral 'fossil record' hidden within genome sequences.