Update GTDB taxonomy
The taxonomy for the genome tree is in a fairly constant state of flux. As such, it needs to be updated on a regular basis. Updating the database is complicated by a few factors:
- provided taxonomy files often do not cover all 7 canonical ranks and may not properly use binomial names for species
- the taxonomy of representative genomes must be propagated to all genomes clustered with the representative
- updated taxonomy files may not cover all genomes in the database as curators often focus on specific parts of the tree
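The first complication can be made concrete with a small check. The sketch below is not part of GenomeTreeTk or gtdb_validation_tk; it assumes taxonomy strings are semicolon-separated with the usual GTDB rank prefixes (d__ through s__), and the function name taxonomy_issues is hypothetical. An empty s__ is treated as an unassigned species rather than an error, which is also an assumption.

```python
from typing import List

# GTDB rank prefixes in canonical order: domain through species.
RANK_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def taxonomy_issues(taxonomy: str) -> List[str]:
    """Return a list of problems found in one semicolon-separated taxonomy string.

    Hypothetical helper for illustration only.
    """
    issues = []
    taxa = [t.strip() for t in taxonomy.split(";") if t.strip()]

    if len(taxa) != len(RANK_PREFIXES):
        issues.append(f"expected {len(RANK_PREFIXES)} ranks, found {len(taxa)}")

    # Each taxon should carry the prefix of the rank at its position.
    for taxon, prefix in zip(taxa, RANK_PREFIXES):
        if not taxon.startswith(prefix):
            issues.append(f"'{taxon}' should start with '{prefix}'")

    # Binomial check, only attempted when all 7 ranks are present and well prefixed.
    if not issues:
        genus = taxa[5][len("g__"):]
        species = taxa[6][len("s__"):]
        if species:  # assumption: an empty 's__' means "not yet assigned"
            parts = species.split(" ")
            if len(parts) != 2 or parts[0] != genus:
                issues.append(
                    f"species '{species}' is not a binomial in genus '{genus}'"
                )
    return issues
```

A string such as d__Bacteria;p__Proteobacteria would be reported as covering only 2 of the 7 ranks, while a species given as s__coli under g__Escherichia would be reported as not binomial.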
GenomeTreeTk and gtdb_validation_tk provide functionality for quickly correcting, verifying, and expanding new taxonomy strings:
- run the genometreetk fill_ranks command so all taxonomy strings cover all 7 canonical ranks
- combine the final bacterial and archaeal taxonomies
    cat gtdb_r207_bac120_curation_taxonomy.tsv gtdb_r207_ar53_curation_taxonomy.tsv >> final_taxonomy_combined.tsv
- run the gtdb_migration_tk propagate_curated_taxonomy to propagate the taxonomy from the final taxonomy file (using canonical ids) to all genomes in the species cluster.
    gtdb_migration_tk propagate_curated_taxonomy -t final_taxonomy_combined.tsv -m metadata_r207.tsv -o propagated_taxonomy.tsv
- re-run the gtdb_validation_tk check_file command to ensure the taxonomy file is properly formatted and to identify potential issues
- The final taxonomy file can then be inserted into the GTDB using the gtdb-migration-tk add_taxonomy_to_database function
    gtdb-migration-tk add_taxonomy_to_database --hostname watson.ace.uq.edu.au -u gtdb -d gtdb_pierre_r207 -p ecogenomicsgtdb --taxonomy_file propagated_taxonomy.tsv -m metadata_r207.tsv --truncate_taxonomy
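The propagation step can be pictured as a simple lookup from each genome to its representative. The sketch below only illustrates the idea and is not how gtdb_migration_tk implements it. The function name propagate_taxonomy, the two input mappings, and the genome ids are all hypothetical. How the real tool reads the metadata file is not covered here.

```python
from typing import Dict


def propagate_taxonomy(
    rep_taxonomy: Dict[str, str],
    genome_to_rep: Dict[str, str],
) -> Dict[str, str]:
    """Copy each representative's taxonomy to every genome in its cluster.

    rep_taxonomy:  representative genome id -> curated taxonomy string
    genome_to_rep: genome id -> id of the representative of its species cluster
    Genomes whose representative has no curated taxonomy are left out.
    """
    propagated = {}
    for genome_id, rep_id in genome_to_rep.items():
        taxonomy = rep_taxonomy.get(rep_id)
        if taxonomy is not None:
            propagated[genome_id] = taxonomy
    return propagated


# Hypothetical ids, for illustration only.
curated = {"rep_1": "d__Bacteria;p__P;c__C;o__O;f__F;g__G;s__G one"}
clusters = {"rep_1": "rep_1", "member_1a": "rep_1", "member_2a": "rep_2"}
result = propagate_taxonomy(curated, clusters)
```

Here member_1a inherits the taxonomy of rep_1, while member_2a is skipped because its representative is absent from the curated file. This is the situation described above, where curators update only part of the tree.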
Note: This script only updates the taxonomy of the specified genomes. If the new taxonomy covers all genomes passing QC and other criteria, it may be necessary to first set the gtdb_taxonomy, gtdb_phylum, ..., gtdb_species entries to NULL in the database:
UPDATE metadata_taxonomy SET gtdb_phylum = NULL, gtdb_class = NULL, gtdb_order = NULL, gtdb_family = NULL, gtdb_genus = NULL, gtdb_species = NULL;
The gtdb_domain field should never be set to NULL as this field is used for filtering purposes. If the taxonomy is domain-specific, be careful not to accidentally wipe out the other domain (e.g., by clearing the database and adding back only a new bacterial taxonomy).
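One way to guard against wiping out a domain is to check which domains a taxonomy file covers before clearing anything. The sketch below assumes a two-column, tab-separated file (genome id, then taxonomy string) whose first rank is the domain. The function name domain_counts is hypothetical and not part of any GTDB toolkit.

```python
from collections import Counter
from typing import Iterable


def domain_counts(rows: Iterable[str]) -> Counter:
    """Count genomes per domain in 'genome_id<TAB>taxonomy' rows."""
    counts = Counter()
    for row in rows:
        row = row.rstrip("\n")
        if not row:
            continue  # skip blank lines
        # partition never raises; a row without a tab yields an empty taxonomy
        _genome_id, _, taxonomy = row.partition("\t")
        domain = taxonomy.split(";")[0].strip()
        counts[domain] += 1
    return counts


# Hypothetical rows, for illustration only.
example_rows = [
    "genome_a\td__Bacteria;p__P;c__C;o__O;f__F;g__G;s__G one\n",
    "genome_b\td__Bacteria;p__P;c__C;o__O;f__F;g__G;s__G two\n",
    "genome_c\td__Archaea;p__Q;c__C;o__O;f__F;g__H;s__H one\n",
]
counts = domain_counts(example_rows)
```

If only d__Bacteria appears in the counts for the file about to be loaded, clearing the archaeal rows as well would leave them without a taxonomy.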
IMPORTANT: To update the taxonomy from one release to another, we only clear the GTDB ranks for RefSeq, GenBank, and UBA genomes. To clear all of these fields for bacterial genomes:
UPDATE metadata_taxonomy
SET gtdb_phylum = NULL, gtdb_class = NULL, gtdb_order = NULL, gtdb_family = NULL, gtdb_genus = NULL, gtdb_species = NULL, gtdb_taxonomy = NULL
WHERE gtdb_domain LIKE 'd__Bacteria'
  AND id IN (
    SELECT id FROM genomes ge
    WHERE genome_source_id != 1
       OR ge.id IN (SELECT genome_id FROM genome_list_contents WHERE list_id = 479)
  );
To clear all of these fields for archaeal genomes:
UPDATE metadata_taxonomy
SET gtdb_phylum = NULL, gtdb_class = NULL, gtdb_order = NULL, gtdb_family = NULL, gtdb_genus = NULL, gtdb_species = NULL, gtdb_taxonomy = NULL
WHERE gtdb_domain LIKE 'd__Archaea'
  AND id IN (
    SELECT id FROM genomes ge
    WHERE genome_source_id != 1
       OR ge.id IN (SELECT genome_id FROM genome_list_contents WHERE list_id = 479)
  );
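To see which rows the WHERE clause of these statements selects, it can help to run the same filter against a toy database. The sketch below uses an in-memory SQLite database as a stand-in for the real GTDB database. The table and column names come from the statements above, but the schemas are cut down to the minimum and every row is made up. Only gtdb_phylum is cleared to keep the example short. What genome_source_id 1 and list 479 stand for is not stated here, so the example treats them as opaque values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genomes (id INTEGER PRIMARY KEY, genome_source_id INTEGER);
    CREATE TABLE genome_list_contents (genome_id INTEGER, list_id INTEGER);
    CREATE TABLE metadata_taxonomy (
        id INTEGER PRIMARY KEY, gtdb_domain TEXT, gtdb_phylum TEXT
    );

    -- Made-up rows: genomes 1 and 3 have source 1, genomes 2 and 4 have source 2.
    INSERT INTO genomes VALUES (1, 1), (2, 2), (3, 1), (4, 2);
    -- Genome 3 is additionally a member of list 479.
    INSERT INTO genome_list_contents VALUES (3, 479);
    INSERT INTO metadata_taxonomy VALUES
        (1, 'd__Bacteria', 'p__A'),
        (2, 'd__Bacteria', 'p__B'),
        (3, 'd__Bacteria', 'p__C'),
        (4, 'd__Archaea',  'p__D');
    """
)

# Same filter as the bacterial statement, reduced to a single rank column.
conn.execute(
    """
    UPDATE metadata_taxonomy SET gtdb_phylum = NULL
    WHERE gtdb_domain LIKE 'd__Bacteria'
      AND id IN (
        SELECT id FROM genomes ge
        WHERE genome_source_id != 1
           OR ge.id IN (SELECT genome_id FROM genome_list_contents
                        WHERE list_id = 479)
      )
    """
)

# Map genome id -> phylum after the update (None means cleared).
remaining = dict(
    conn.execute("SELECT id, gtdb_phylum FROM metadata_taxonomy ORDER BY id")
)
```

In this toy data, genome 2 is cleared because its source is not 1, and genome 3 is cleared because it is in list 479. Genome 1 (source 1, not in the list) and genome 4 (archaeal) keep their values. Note that _ is a single-character wildcard in LIKE, so the pattern is slightly broader than an exact match; this makes no difference for the domain values used here.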