This repository provides a comprehensive Nextstrain analysis of "your virus". You can choose to perform either a shorter run with specific proteins or a full genome run.
For those unfamiliar with Nextstrain or needing installation guidance, please refer to the Nextstrain documentation.
The data for this analysis is available from NCBI Virus. Instructions for downloading sequences are provided under Sequences.
This repository includes the following directories and files:
- scripts: Custom Python scripts called by the snakefile.
- snakefile: The entire computational pipeline, managed using Snakemake. See the Snakemake documentation for details.
- ingest: Contains Python scripts and the snakefile for automatic downloading of <your_virus> sequences and metadata.
- protein_xy: Sequences and configuration files for the specific protein_xy run.
- whole_genome: Sequences and configuration files for the whole genome run.
The config, protein_xy/config, and whole_genome/config directories contain necessary configuration files:
- colors.tsv: Color scheme
- geo_regions.tsv: Geographical locations
- lat_longs.tsv: Latitude and longitude data
- dropped_strains.txt: Accessions to exclude during augur filter
- clades_genome.tsv: Clade definitions for manually labeling clades on a Nextstrain tree (see the Nextstrain documentation)
- reference_sequence.gb: Reference sequence (add manually)
- auspice_config.json: Auspice configuration file; it must be present in all data folders
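Before starting a build, it can help to confirm that the files listed above are in place. The following is a minimal sketch and not part of this repository: the function name `missing_config_files` is hypothetical, and it assumes that every listed file is expected in each of the three config directories, which may not hold for your setup.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = [
    "colors.tsv",
    "geo_regions.tsv",
    "lat_longs.tsv",
    "dropped_strains.txt",
    "clades_genome.tsv",
    "reference_sequence.gb",
    "auspice_config.json",
]

# Config directories named in this README, relative to the repository root.
CONFIG_DIRS = ["config", "protein_xy/config", "whole_genome/config"]


def missing_config_files(repo_root=".", config_dirs=CONFIG_DIRS, required=REQUIRED_FILES):
    """Return a sorted list of 'dir/file' paths that do not exist under repo_root."""
    root = Path(repo_root)
    missing = []
    for directory in config_dirs:
        for name in required:
            # Assumption: each file is required in every config directory.
            if not (root / directory / name).is_file():
                missing.append(f"{directory}/{name}")
    return sorted(missing)
```

Calling `missing_config_files()` from the repository root returns an empty list when everything is present; otherwise each entry names a file to add.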
The reference sequence used is XYZ, accession number, sampled in 19XX.
Install the Nextstrain environment by following the installation instructions in the Nextstrain documentation.
Activate the Nextstrain environment:
micromamba activate nextstrain
To perform a build, run:
snakemake --cores 9 all
For specific builds:
- protein_xy build:
snakemake auspice/<your_virus>_protein_xy.json --cores 9
- Whole genome build:
snakemake auspice/<your_virus>_whole-genome.json --cores 9
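The build targets above follow the pattern `auspice/<your_virus>_<build>.json`. Note that the whole genome target is written with a hyphen (`whole-genome`) while the directory uses an underscore (`whole_genome`). As a rough sketch of how the commands could be assembled in a wrapper script, the hypothetical helper `snakemake_command` below (not part of this repository) returns the argument list for each case; the virus name remains a placeholder that you supply.

```python
# Build suffixes exactly as they appear in the commands above.
BUILD_SUFFIXES = {"protein_xy": "protein_xy", "whole_genome": "whole-genome"}


def snakemake_command(virus, build=None, cores=9):
    """Return the snakemake argument list for one build, or for all builds if build is None."""
    if build is None:
        return ["snakemake", "--cores", str(cores), "all"]
    if build not in BUILD_SUFFIXES:
        raise ValueError(f"unknown build: {build!r}")
    # Target path pattern copied from the commands in this README.
    target = f"auspice/{virus}_{BUILD_SUFFIXES[build]}.json"
    return ["snakemake", target, "--cores", str(cores)]
```

The returned list can be passed directly to `subprocess.run(..., check=True)` once the Nextstrain environment is active.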
To visualize the build, use Auspice:
auspice view --datasetDir auspice
To run two visualizations simultaneously, you may need to set the port:
export PORT=4001
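To illustrate the port setting, here is a minimal sketch that starts Auspice from Python with a per-process port instead of exporting `PORT` in the shell. It assumes, as the `export PORT=4001` line implies, that Auspice reads the `PORT` environment variable; the helper names `auspice_env` and `launch_auspice` are hypothetical.

```python
import os
import subprocess


def auspice_env(port, base_env=None):
    """Return a copy of the environment with PORT set for one Auspice instance."""
    env = dict(os.environ if base_env is None else base_env)
    env["PORT"] = str(port)
    return env


def launch_auspice(port, dataset_dir="auspice"):
    """Start `auspice view` in the background on the given port (requires Auspice on PATH)."""
    return subprocess.Popen(
        ["auspice", "view", "--datasetDir", dataset_dir],
        env=auspice_env(port),
    )
```

Calling `launch_auspice` twice with different ports, for example 4001 and 4002, starts two independent viewers without changing the port of the current shell session.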
For more information on how to run the ingest, please refer to the README in the ingest folder.
Sequences can be downloaded manually or automatically.
- Manual Download: Visit NCBI Virus, search for <your_virus> or TaxidXXXXXX, and download the sequences.
- Automated Download: The ingest functionality, included in the main snakefile, handles automatic downloading.
The ingest pipeline is based on the Nextstrain RSV ingest workflow. Running the ingest pipeline produces data/metadata.tsv and data/sequences.fasta.
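After an ingest run, it can be useful to confirm that the two output files describe the same set of sequences. The sketch below is not part of the ingest workflow. It assumes that `data/metadata.tsv` is tab-separated with an `accession` column and that each FASTA header begins with the same identifier; both are assumptions, so adjust `id_column` to match your metadata.

```python
import csv


def fasta_ids(fasta_path):
    """Collect the first whitespace-delimited token of each FASTA header line."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.add(header.split()[0])
    return ids


def metadata_ids(metadata_path, id_column="accession"):
    """Collect identifiers from one column of a tab-separated metadata file."""
    with open(metadata_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Rows with a missing or empty identifier are skipped.
        return {row[id_column] for row in reader if row.get(id_column)}


def compare_ingest_outputs(metadata_path="data/metadata.tsv",
                           fasta_path="data/sequences.fasta",
                           id_column="accession"):
    """Return (ids only in metadata, ids only in sequences), each as a sorted list."""
    meta = metadata_ids(metadata_path, id_column)
    seqs = fasta_ids(fasta_path)
    return sorted(meta - seqs), sorted(seqs - meta)
```

Two empty lists mean every metadata record has a matching sequence and vice versa.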