Goal: retrieve all descendant organisms under an ancestor in the NCBI taxonomy
1a. Download the preprocessed data (last update: 1 Feb 2024): taxonomy_with_all_children.csv, which is the CSV you may need to analyze the NCBI taxonomy tree.
You can also use the Python scripts as follows to download the latest taxonomy from the NCBI FTP site and preprocess the data.
- Download taxdmp.zip from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/.
- Unzip taxdmp.zip and place nodes.dmp and names.dmp in this folder.
- Run nodes_to_csv.py and names_to_csv.py to get nodes.csv and names.csv respectively.
- Run concat_names_to_nodes.py to get taxonomy.csv.
- Compute the direct children of each organism (node) using get_direct_children_from_tax.py to get taxonomy_with_direct_children.csv.
- Compute all children (may take several hours) using get_all_children_from_tax.py to get taxonomy_with_all_children.csv.
- Run query.py --ancestor 8782 to retrieve all child organisms of the ancestor Aves. Replace 8782 with the tax_id of the ancestor you choose.

taxonomy_with_all_children.csv is the final CSV you may need to analyze the NCBI taxonomy tree.
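
The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in the scripts listed here: the function names are hypothetical, the sample rows are toy data, and it assumes the nodes.dmp layout in which fields are delimited by a tab, a pipe, and a tab, with tax_id first and the parent tax_id second.

```python
from collections import defaultdict, deque

def parse_nodes_dmp(lines):
    """Return {tax_id: parent_tax_id} from nodes.dmp-style lines."""
    parent_of = {}
    for line in lines:
        # Splitting on "\t|" and stripping leaves clean field values.
        fields = [field.strip() for field in line.split("\t|")]
        if len(fields) < 2 or not fields[0]:
            continue
        parent_of[int(fields[0])] = int(fields[1])
    return parent_of

def direct_children(parent_of):
    """Invert the child -> parent map into parent -> [children]."""
    children = defaultdict(list)
    for tax_id, parent in parent_of.items():
        if tax_id != parent:  # the root lists itself as its own parent
            children[parent].append(tax_id)
    return children

def all_descendants(children, ancestor):
    """Collect every node below `ancestor` with a breadth-first walk."""
    found, queue = [], deque([ancestor])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            found.append(child)
            queue.append(child)
    return found

# Toy rows (not real taxonomy data): 2 and 3 are children of 1, 4 is a child of 2.
SAMPLE_LINES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tno rank\t|\n",
    "3\t|\t1\t|\tno rank\t|\n",
    "4\t|\t2\t|\tno rank\t|\n",
]
parent_map = parse_nodes_dmp(SAMPLE_LINES)
child_map = direct_children(parent_map)
print(all_descendants(child_map, 1))
```

Walking down from a single ancestor this way only visits that ancestor's subtree; precomputing the full descendant list for every node, as the all-children CSV does, is far more work.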
Instead of get_all_children_from_tax.py, you can use create_library_index.py to generate hierarchical library indices for each node in the taxonomy. A library index is a hierarchical numbering system that encodes the parent-child relationships in the taxonomy tree. It assigns each node an index that reflects its position in the hierarchy.
Example:
2 is a child of 1: 1.2
3 is a child of 1: 1.3
4 is a child of 2: 1.2.4
Benefit: all descendants of an ancestor can be retrieved by filtering for library indices that start with the ancestor's library index.
Run python create_library_index.py to get taxonomy_with_library_index.csv.
Run query.py --ancestor 8782 --method library to use taxonomy_with_library_index.csv.
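
A minimal sketch of the library-index idea, assuming a child-to-parent mapping in which the root is its own parent. The function names are hypothetical and this is not necessarily how create_library_index.py or query.py work. The prefix match includes a trailing dot so that the index 1.2 does not also match 1.20.

```python
def build_library_index(parent_of):
    """Map each tax_id to a dotted path of tax_ids starting at the root."""
    index = {}
    for start in parent_of:
        # Walk up until reaching a node that already has an index, or the root.
        chain = []
        node = start
        while node not in index:
            chain.append(node)
            parent = parent_of[node]
            if parent == node:  # root: it is its own parent
                break
            node = parent
        # Assign indices from the top of the chain back down to `start`.
        for tax_id in reversed(chain):
            parent = parent_of[tax_id]
            if parent == tax_id:
                index[tax_id] = str(tax_id)
            else:
                index[tax_id] = f"{index[parent]}.{tax_id}"
    return index

def descendants_by_prefix(index, ancestor):
    """Return the tax_ids whose library index lies under the ancestor's."""
    prefix = index[ancestor] + "."  # trailing dot: "1.2." must not match "1.20"
    return [tax_id for tax_id, path in index.items() if path.startswith(prefix)]

# Toy tree matching the example above, plus node 20 to show the trailing-dot guard.
toy_parents = {1: 1, 2: 1, 3: 1, 4: 2, 20: 1}
library_index = build_library_index(toy_parents)
print(library_index[4])
print(descendants_by_prefix(library_index, 2))
```

Each index is built once from its parent's index, so the whole table costs roughly one pass over the nodes, and a descendant query becomes a plain string filter.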
- Get all children of any organism.
- After getting the scientific_names of all children of an organism (the ancestor), you can retrieve all SRA data for every organism under that ancestor by running the generated SQL in BigQuery.
Note: NCBI hosts SRA data in BigQuery, which is convenient for retrieving large amounts of data.
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = "Homo sapiens";
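
To cover every organism under an ancestor rather than a single name, the same query can be generated with an IN list. The helper below is a hypothetical sketch, not the generator used by query.py; it assumes the names are matched against the organism column shown above, and it escapes backslashes and double quotes so that each name stays a valid string literal.

```python
def build_sra_query(scientific_names):
    """Build a BigQuery SQL string selecting SRA metadata for the given names."""
    names = sorted(set(scientific_names))
    if not names:
        raise ValueError("at least one scientific name is required")
    quoted = []
    for name in names:
        # Escape backslashes first, then double quotes.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return (
        "SELECT *\n"
        "FROM `nih-sra-datastore.sra.metadata`\n"
        f"WHERE organism IN ({', '.join(quoted)});"
    )

# Illustrative names only; in practice these would come from the taxonomy query.
sra_sql = build_sra_query(["Homo sapiens", "Gallus gallus"])
print(sra_sql)
```

For names from an untrusted source, BigQuery query parameters are a safer choice than string formatting.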