5 Optimizing sensitivity and resource usage
This section outlines how to manage Vclust's run time, memory, and disk space usage while maintaining high sensitivity. These considerations are essential if your dataset contains millions of sequences or has a high level of redundancy.
The `prefilter` command generally demands more computational resources (memory, runtime, and disk space) than the `align` and `cluster` commands, primarily because it compares genomes all-versus-all. If not configured correctly, the prefilter step can consume a substantial amount of resources, particularly with large or highly redundant datasets. In addition, `prefilter` affects the runtime and memory usage of the subsequent `align` and `cluster` commands, because it determines the number of pairwise alignments to be performed. To keep resource consumption in check, adjust the following three options.
By default, the `prefilter` command processes all genomes at once, but it can also operate in smaller, equally sized batches of sequences. Batching significantly reduces memory requirements without affecting sensitivity, although it may slightly increase runtime. For example, processing 15.5 million IMG/VR contigs in batches of 2 million sequences reduced memory usage from 1 TB to 256 GB, with only a 30-minute increase in runtime.

The `--batch-size` option specifies the number of sequences to process in each batch.
# Process genomes in batches of 2 million sequences.
vclust prefilter -i genomes.fna -o fltr.txt --min-ident 0.95 --batch-size 2000000
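When choosing a batch size, it can help to know how many batches a dataset will be split into. The snippet below is a minimal sketch using plain shell arithmetic. `count_batches` is a hypothetical helper, not part of Vclust, and it assumes the sequences are simply partitioned into batches of at most `--batch-size` sequences.

```shell
# Hypothetical helper (not part of Vclust): number of batches needed
# for n_seqs sequences at a given --batch-size, using ceiling division.
count_batches() {
    n_seqs=$1
    batch_size=$2
    echo $(( (n_seqs + batch_size - 1) / batch_size ))
}

# 15.5 million contigs in batches of 2 million sequences.
count_batches 15500000 2000000
```

The number of sequences in a FASTA file can be counted with a standard tool, for example `grep -c '>' genomes.fna`.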
By default, the `prefilter` command analyzes all k-mers for each genome, but you can limit the analysis to a fraction of k-mers to reduce memory usage and runtime with minimal impact on sensitivity. For example, analyzing 20% of the k-mers from 15.5 million IMG/VR contigs recalled nearly all genome pairs with ANI ≥ 95% and AF ≥ 85% (~196 million pairs), with fewer than 100 missed pairs and false positives, while reducing memory and runtime by nearly five-fold. This option does not affect the precision of ANI values in the subsequent `align` command, as alignments are conducted on full sequences.

The `--kmers-fraction` option controls the proportion (between 0 and 1) of k-mers used in comparisons:
# Process genomes in batches and analyze 20% of k-mers in each genome sequence.
vclust prefilter -i genomes.fna -o fltr.txt --min-ident 0.95 --batch-size 2000000 \
--kmers-fraction 0.2
By default, the `prefilter` command returns all genome pairs that meet the user-defined thresholds for the minimum number of common k-mers (`--min-kmers`) and sequence identity (`--min-ident`). However, for highly redundant datasets (e.g., hundreds of thousands of nearly identical genome sequences), `prefilter` may still pass a large number of genome pairs, increasing memory usage, runtime, and disk space.
The `--max-seqs` option limits the number of target sequences reported for each query genome, reducing the overall number of genome pairs passing the prefilter step. For each query, `--max-seqs` returns up to n sequences that have passed the `--min-kmers` and `--min-ident` filters and have the highest sequence identity to the query sequence. For example, in a dataset containing 1 million nearly identical genomes, the total number of possible genome pairs is nearly 500 billion, but setting `--max-seqs 1000` reduces this to at most 1 billion pairs, significantly decreasing memory usage, runtime, and disk space.
# Limit the number of target sequences to top 1000 per query genome.
vclust prefilter -i genomes.fna -o fltr.txt --min-ident 0.95 --batch-size 100000 \
--kmers-fraction 0.2 --max-seqs 1000
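The pair counts in the 1-million-genome example can be checked with shell arithmetic. This is an illustrative sketch only. `all_pairs` and `capped_pairs` are hypothetical helpers, not part of Vclust, and the capped figure is an upper bound that assumes every query reaches the `--max-seqs` limit.

```shell
# Hypothetical helpers (not part of Vclust) for estimating pair counts.

# All-versus-all: n * (n - 1) / 2 unordered genome pairs.
all_pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

# With --max-seqs: at most max_seqs targets reported per query genome.
capped_pairs() {
    echo $(( $1 * $2 ))
}

all_pairs 1000000          # 499999500000 (nearly 500 billion)
capped_pairs 1000000 1000  # 1000000000 (1 billion)
```

The roughly 500-fold reduction in reported pairs is what drives the savings in memory, runtime, and disk space in this example.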
In summary:
Option | Memory (RAM) | Runtime | Disk space | Sensitivity |
---|---|---|---|---|
`--batch-size` | decrease | slight increase | slight increase | no effect |
`--kmers-fraction` | decrease | decrease | decrease | no or minor effect |
`--max-seqs` | decrease | decrease | decrease | decrease |