Retrieving genomes from NCBI GenBank
To build a database, MetaSBT first requires a set of reference genomes, i.e., genomes whose taxonomic labels are all known from the kingdom down to the species level.
We provide a Python 3 utility, `get_ncbi_genomes.py`, under the `scripts` folder. It retrieves all the reference genomes and metagenome-assembled genomes (MAGs) under a specific superkingdom (and optionally a specific kingdom) from NCBI GenBank.
The following example retrieves all the reference genomes under the `Bacteria` superkingdom:
python ./scripts/get_ncbi_genomes.py --superkingdom Bacteria \
--type reference \
--download \
--out-dir ~/genomes \
--nproc 8
Please note that this utility is also available within the MetaSBT installation. Therefore, if you installed the framework through `pip` or `conda`, you may not need to specify the interpreter or the path to the scripts folder.
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--db-dir` | | | Path to the root folder of a MetaSBT database. Genomes that are already present in the database are skipped |
| `--download` | `False` | | Download genome files. If not specified, the script only reports the list of genomes and their taxonomic labels as they appear in NCBI |
| `--help` | | | Print the list of arguments and exit |
| `--kingdom` | | | Consider genomes whose lineage belongs to a specific kingdom. It is optional and must be used in conjunction with `--superkingdom` (e.g., `--superkingdom Eukaryota --kingdom Fungi`) |
| `--nproc` | `1` | | Retrieve genomes in parallel |
| `--out-dir` | | | Path to the output folder. If not specified, genomes are downloaded into the current folder |
| `--superkingdom` | | | Consider genomes whose lineage belongs to a specific superkingdom. Available values: `Archaea`, `Bacteria`, `Eukaryota`, and `Viruses` |
| `--taxa-level-id` | | | Consider genomes under a specific taxonomic level. Available values: `phylum`, `class`, `order`, `family`, `genus`, and `species` |
| `--taxa-level-name` | | | Name of the taxonomic level. Must be used in conjunction with `--taxa-level-id` (e.g., `--taxa-level-id species --taxa-level-name "Campylobacter coli"`) |
| `--type` | | | Retrieve reference genomes or metagenome-assembled genomes (MAGs). Available values: `reference` and `mag`. If not specified, both reference genomes and MAGs are retrieved |
| `--version` | | | Print the script version and exit |
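The option dependencies listed in the table can be checked before the script is launched. The following is a minimal sketch, not part of MetaSBT: `build_command` is a hypothetical helper that only assembles an argument list from the options documented above and enforces the two "must be used in conjunction with" rules. The default script path is taken from the example command and may differ in your installation.

```python
from typing import List, Optional

# Allowed values as listed in the options table.
SUPERKINGDOMS = ("Archaea", "Bacteria", "Eukaryota", "Viruses")
TAXA_LEVELS = ("phylum", "class", "order", "family", "genus", "species")
GENOME_TYPES = ("reference", "mag")


def build_command(
    superkingdom: Optional[str] = None,
    kingdom: Optional[str] = None,
    taxa_level_id: Optional[str] = None,
    taxa_level_name: Optional[str] = None,
    genome_type: Optional[str] = None,
    download: bool = False,
    out_dir: Optional[str] = None,
    nproc: int = 1,
    script: str = "./scripts/get_ncbi_genomes.py",  # assumed location
) -> List[str]:
    """Hypothetical helper: build the argument list for get_ncbi_genomes.py."""
    if superkingdom is not None and superkingdom not in SUPERKINGDOMS:
        raise ValueError(f"Unknown superkingdom: {superkingdom}")
    if kingdom is not None and superkingdom is None:
        raise ValueError("--kingdom must be used together with --superkingdom")
    if taxa_level_id is not None and taxa_level_id not in TAXA_LEVELS:
        raise ValueError(f"Unknown taxonomic level: {taxa_level_id}")
    if taxa_level_name is not None and taxa_level_id is None:
        raise ValueError("--taxa-level-name must be used together with --taxa-level-id")
    if genome_type is not None and genome_type not in GENOME_TYPES:
        raise ValueError(f"Unknown genome type: {genome_type}")
    if nproc < 1:
        raise ValueError("--nproc must be a positive integer")

    cmd = ["python", script]
    if superkingdom is not None:
        cmd += ["--superkingdom", superkingdom]
    if kingdom is not None:
        cmd += ["--kingdom", kingdom]
    if taxa_level_id is not None:
        cmd += ["--taxa-level-id", taxa_level_id]
    if taxa_level_name is not None:
        cmd += ["--taxa-level-name", taxa_level_name]
    if genome_type is not None:
        cmd += ["--type", genome_type]
    if download:
        cmd.append("--download")
    if out_dir is not None:
        cmd += ["--out-dir", out_dir]
    cmd += ["--nproc", str(nproc)]
    return cmd


if __name__ == "__main__":
    # Only prints the command; it does not run the script.
    print(" ".join(build_command(superkingdom="Bacteria", genome_type="reference",
                                 download=True, out_dir="~/genomes", nproc=8)))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`. Note that `~` is not expanded when a list is passed without a shell, so the output path would need `os.path.expanduser` first.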
It is worth noting that this script implements a set of rules to establish whether a genome in NCBI GenBank must be considered a reference genome or a metagenome-assembled genome.
Everything relies on the most up-to-date NCBI GenBank Assembly Summary Report table. In particular, the script first checks the `excluded_from_refseq` column to decide whether a genome is eligible for download. A genome tagged with at least one of the following tags is automatically excluded:
- abnormal gene to sequence ratio
- chimeric
- contaminated
- genome length too large
- genome length too small
- hybrid
- low gene count
- low quality sequence
- many frameshifted proteins
- metagenome
- misassembled
- mixed culture
- untrustworthy as type
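The exclusion step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used by `get_ncbi_genomes.py`: it assumes the summary table is tab-separated with column names on the last `#`-prefixed line, that multiple tags in `excluded_from_refseq` are separated by semicolons, and that `na` or an empty field means no tags. The three-column sample and its accessions are made up, and `parse_tags`, `is_excluded`, and `filter_summary` are hypothetical names.

```python
import io
from typing import Dict, Iterable, List, Optional, Set

# Exclusion tags as listed above.
EXCLUSION_TAGS = frozenset({
    "abnormal gene to sequence ratio", "chimeric", "contaminated",
    "genome length too large", "genome length too small", "hybrid",
    "low gene count", "low quality sequence", "many frameshifted proteins",
    "metagenome", "misassembled", "mixed culture", "untrustworthy as type",
})


def parse_tags(field: str) -> Set[str]:
    """Split a raw excluded_from_refseq value into a set of lowercase tags."""
    tags = {tag.strip().lower() for tag in field.split(";")}
    tags.discard("")
    tags.discard("na")  # assumed placeholder for "no tags"
    return tags


def is_excluded(tags: Iterable[str]) -> bool:
    """True if at least one exclusion tag is present."""
    return not EXCLUSION_TAGS.isdisjoint(tags)


def filter_summary(handle: Iterable[str]) -> Dict[str, Set[str]]:
    """Map accession -> tags for every row that passes the exclusion criteria."""
    header: Optional[List[str]] = None
    kept: Dict[str, Set[str]] = {}
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            # The last comment line before the data holds the column names.
            header = [name.strip() for name in line.lstrip("#").split("\t")]
            continue
        if header is None:
            raise ValueError("No header line found before the first data row")
        row = dict(zip(header, line.split("\t")))
        tags = parse_tags(row.get("excluded_from_refseq", ""))
        if not is_excluded(tags):
            kept[row["assembly_accession"]] = tags
    return kept


# Simplified, made-up sample with only three of the real table's columns.
SAMPLE = (
    "# illustrative comment line\n"
    "# assembly_accession\torganism_name\texcluded_from_refseq\n"
    "GCA_000000001.1\tExample organism A\tna\n"
    "GCA_000000002.1\tExample organism B\tcontaminated; low gene count\n"
    "GCA_000000003.1\tExample organism C\tderived from single cell\n"
)

if __name__ == "__main__":
    for accession, tags in filter_summary(io.StringIO(SAMPLE)).items():
        print(accession, sorted(tags))
```

In this sample only the second row is dropped, because it carries two exclusion tags.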
If a genome passes the exclusion criteria, it is considered a reference genome if one or more of the following tags occur under the `excluded_from_refseq` column in the same assembly summary table:
- derived from single cell
- derived from surveillance project
- assembly from type material
- assembly from synonym type material
- assembly designated as neotype
- assembly designated as reftype
- assembly from pathotype material
- assembly from proxytype material
- missing strain identifier
- genus undefined
- from large multi-isolate project
A genome is considered a metagenome-assembled genome if it passes the exclusion criteria but none of the reference tags appears under the `excluded_from_refseq` column.
Please refer to the following page on the NIH website for additional information about these tags: https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/
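The reference-versus-MAG rule described above can be illustrated as follows. This is a hedged sketch rather than the script's actual implementation: `label_passing_genome` is a hypothetical function, it assumes the genome has already passed the exclusion criteria, and it assumes tags are compared case-insensitively after trimming whitespace.

```python
from typing import Iterable

# Reference tags as listed above.
REFERENCE_TAGS = frozenset({
    "derived from single cell",
    "derived from surveillance project",
    "assembly from type material",
    "assembly from synonym type material",
    "assembly designated as neotype",
    "assembly designated as reftype",
    "assembly from pathotype material",
    "assembly from proxytype material",
    "missing strain identifier",
    "genus undefined",
    "from large multi-isolate project",
})


def label_passing_genome(tags: Iterable[str]) -> str:
    """Label a genome that already passed the exclusion criteria.

    Returns "reference" if at least one reference tag is present,
    otherwise "mag". The labels mirror the values accepted by --type.
    """
    normalized = {tag.strip().lower() for tag in tags}
    if REFERENCE_TAGS.isdisjoint(normalized):
        return "mag"
    return "reference"


if __name__ == "__main__":
    print(label_passing_genome(["assembly from type material"]))
    print(label_passing_genome([]))
```

A genome with no tags at all, or only with tags outside both lists, falls into the `mag` category under this rule.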