
Retrieving genomes from NCBI GenBank

Fabio Cumbo edited this page May 2, 2023 · 2 revisions

To build a database, MetaSBT first requires a set of reference genomes, that is, genomes whose taxonomic labels are known at every level from kingdom down to species.

We provide a Python 3 utility, get_ncbi_genomes.py, in the scripts folder. It retrieves all the reference genomes and metagenome-assembled genomes (MAGs) under a specific superkingdom (and, optionally, a specific kingdom) from NCBI GenBank.

The following example retrieves all the reference genomes under the Bacteria superkingdom:

python ./scripts/get_ncbi_genomes.py --superkingdom Bacteria \
                                     --type reference \
                                     --download \
                                     --out-dir ~/genomes \
                                     --nproc 8

Please note that this utility is also available within the MetaSBT installation. If you installed the framework through pip or conda, you may not need to specify the interpreter or the path to the scripts folder.

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --db-dir | | | Path to the root folder of a MetaSBT database. Genomes that are already present in the database are skipped |
| --download | False | | Download genome files. If not specified, the script only reports the list of genomes and their taxonomic labels as they appear in NCBI |
| --help | | | Print the list of arguments and exit |
| --kingdom | | | Consider genomes whose lineage belongs to a specific kingdom. It is optional and must be used in conjunction with --superkingdom (e.g., --superkingdom Eukaryota --kingdom Fungi) |
| --nproc | 1 | | Retrieve genomes in parallel |
| --out-dir | | | Path to the output folder. If not specified, genomes are downloaded into the current folder |
| --superkingdom | | | Consider genomes whose lineage belongs to a specific superkingdom. Available values: Archaea, Bacteria, Eukaryota, and Viruses |
| --taxa-level-id | | | Consider genomes under a specific taxonomic level. Available values: phylum, class, order, family, genus, and species |
| --taxa-level-name | | | Name of the taxonomic level. Must be used in conjunction with --taxa-level-id (e.g., --taxa-level-id species --taxa-level-name "Campylobacter coli") |
| --type | | | Retrieve reference genomes or metagenome-assembled genomes (MAGs). Available values: reference and mag. If not specified, both reference genomes and MAGs are retrieved |
| --version | | | Print the script version and exit |
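
When the utility is driven from another Python program, the options above can be assembled into an argument list before the script is launched. The following is a minimal sketch, not part of MetaSBT: build_command is a hypothetical helper, the script path is taken from the earlier example, and it is an assumption that --superkingdom is passed together with the --taxa-level-* options and that the two --taxa-level-* options are always given as a pair.

```python
import shlex


def build_command(superkingdom, taxa_level_id=None, taxa_level_name=None,
                  genome_type=None, out_dir=None, nproc=1, download=False):
    """Assemble an argument list for get_ncbi_genomes.py (hypothetical helper)."""
    # Assumption: the two --taxa-level-* options are treated as a pair.
    if (taxa_level_id is None) != (taxa_level_name is None):
        raise ValueError("--taxa-level-id and --taxa-level-name go together")

    cmd = ["python", "./scripts/get_ncbi_genomes.py",
           "--superkingdom", superkingdom]
    if taxa_level_id is not None:
        cmd += ["--taxa-level-id", taxa_level_id,
                "--taxa-level-name", taxa_level_name]
    if genome_type is not None:
        cmd += ["--type", genome_type]
    if out_dir is not None:
        cmd += ["--out-dir", out_dir]
    if download:
        cmd.append("--download")
    cmd += ["--nproc", str(nproc)]
    return cmd


# List (without downloading) the reference genomes of a single species.
cmd = build_command("Bacteria", taxa_level_id="species",
                    taxa_level_name="Campylobacter coli",
                    genome_type="reference")
print(shlex.join(cmd))  # shell-quoted form, requires Python 3.8+
```

The resulting list can be handed to subprocess.run(cmd, check=True). Since --download is left out here, the script would only report the matching genomes and their taxonomic labels.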

Genome type identification

Note that this script implements a set of rules to determine whether a genome in NCBI GenBank is treated as a reference genome or as a metagenome-assembled genome.

Everything relies on the most recent NCBI GenBank Assembly Summary Report table. The script first checks the excluded_from_refseq column: a genome is considered for download only if none of the following tags appear there. A genome tagged with at least one of them is automatically excluded:

abnormal gene to sequence ratio
chimeric
contaminated
genome length too large
genome length too small
hybrid
low gene count
low quality sequence
many frameshifted proteins
metagenome
misassembled
mixed culture
untrustworthy as type
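
The exclusion step above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script's actual code: it assumes that multiple tags in the excluded_from_refseq field are separated by semicolons, that "na" marks an empty field, and that whole tags are compared (whether the script matches whole tags or substrings is not stated here). The table and accessions below are made up, and the real summary file has more columns and comment lines.

```python
import csv
import io

# Exclusion tags as listed in this page.
EXCLUSION_TAGS = {
    "abnormal gene to sequence ratio", "chimeric", "contaminated",
    "genome length too large", "genome length too small", "hybrid",
    "low gene count", "low quality sequence", "many frameshifted proteins",
    "metagenome", "misassembled", "mixed culture", "untrustworthy as type",
}


def parse_tags(field):
    """Split a raw excluded_from_refseq value into a set of lowercase tags."""
    if field.strip().lower() in ("", "na"):  # assumption: "na" means no tags
        return set()
    return {tag.strip().lower() for tag in field.split(";") if tag.strip()}


def is_excluded(field):
    """True if at least one exclusion tag is present in the field."""
    return bool(parse_tags(field) & EXCLUSION_TAGS)


# Made-up, two-column stand-in for the assembly summary table.
SAMPLE = (
    "assembly_accession\texcluded_from_refseq\n"
    "GCA_000000001.1\tna\n"
    "GCA_000000002.1\tcontaminated; low gene count\n"
    "GCA_000000003.1\tmissing strain identifier\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
kept = [row["assembly_accession"] for row in reader
        if not is_excluded(row["excluded_from_refseq"])]
print(kept)  # ['GCA_000000001.1', 'GCA_000000003.1']
```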

A genome that passes the exclusion criteria is considered a reference genome if one or more of the following tags occur under the excluded_from_refseq column of the same assembly summary table:

derived from single cell
derived from surveillance project
assembly from type material
assembly from synonym type material
assembly designated as neotype
assembly designated as reftype
assembly from pathotype material
assembly from proxytype material
missing strain identifier
genus undefined
from large multi-isolate project

A genome is considered a metagenome-assembled genome if it passes the exclusion criteria but none of the reference tags appear under the excluded_from_refseq column.
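
Taken together, the rules amount to a three-way decision that is checked in order: excluded first, then reference, otherwise MAG. The sketch below is one possible way to express it, assuming the tags of a genome have already been split into a set. The classify function and its labels are hypothetical, and the two tag sets are abbreviated on purpose, with the full lists given in the text.

```python
def classify(tags, exclusion_tags, reference_tags):
    """Return 'excluded', 'reference', or 'mag' for a genome's set of tags."""
    tags = set(tags)
    # Rule 1: any exclusion tag removes the genome, whatever else it carries.
    if tags & set(exclusion_tags):
        return "excluded"
    # Rule 2: at least one reference tag marks a reference genome.
    if tags & set(reference_tags):
        return "reference"
    # Rule 3: everything that is left is treated as a MAG.
    return "mag"


# Abbreviated tag lists, for illustration only.
EXCLUSION = {"chimeric", "contaminated", "metagenome"}
REFERENCE = {"assembly from type material", "missing strain identifier"}

print(classify({"contaminated", "missing strain identifier"},
               EXCLUSION, REFERENCE))                                     # excluded
print(classify({"assembly from type material"}, EXCLUSION, REFERENCE))    # reference
print(classify(set(), EXCLUSION, REFERENCE))                              # mag
```

Checking the exclusion tags first mirrors the order described above: a genome that carries both an exclusion tag and a reference tag is still dropped.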

For additional information about these tags, please refer to the following page on the NIH website: https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/