Updating GTDB species representatives

Donovan H. Parks edited this page Oct 30, 2024 · 125 revisions

Updating and selecting new GTDB species representatives is done using the GTDB Species Cluster Toolkit. The workflow consists of a number of steps that must be run in series and may require manual intervention to resolve challenging cases not currently covered by the implemented methodology. Such cases are indicated as warnings or errors in the output log of each step and may result in a specific output file which can be used to further understand the underlying issue.

Once the GTDB species clusters are established, a new curation tree can be inferred for the species representatives and provided to the curation team for manual curation of taxa above the rank of species. The curation team may also update the name (but not the representative genome) of species clusters based on the latest taxonomic opinion as expressed in published literature. Once manual curation is finished, a set of post-curation validation steps is run to identify common issues. It is typical for some issues to arise during validation and these must be resolved before publishing the GTDB.

GTDB Ledgers: a necessary evil

GTDB currently uses a ledger system (set of files) to indicate exceptions that would otherwise result in incorrect formation of species clusters or naming of taxa. These files are all stored on OneDrive in ./ledgers/. These ledgers add a substantial amount of complexity and data ambiguity to the GTDB, and this scheme should be re-examined at some point. A few notes on side effects and issues caused by the ledgers:

  • ledgers are filled out as accurately as possible using data available at the time. This means that ledger entries may become invalid as external data changes. For example, a genome is added to the ncbi_untrustworthy_sp_assignments ledger, but the species assignment of this genome is later updated at NCBI. Ledgers should be inspected before each release to ensure they are still valid, but in practice this can be challenging given limited resources and the manual nature of verifying the correctness of ledger entries.
  • ledgers effectively change how data is interpreted relative to what is standard in the GTDB, and this isn't reflected on the GTDB website. For example, the ncbi_env_bioprojects ledger indicates NCBI BioProjects where all genomes should be considered MAGs. However, the data in this ledger is never propagated to the GTDB database and is not reflected on the GTDB website.

Identifying "long-branch" genomes

The tree from the prior release should be manually inspected and any genomes on excessively long branches that represent clear issues should be added to the qc_exceptions ledger with the Include field set to FALSE.

Naming convention

This document uses the following two placeholders to indicate the name of paths and files:

  • <cur>: indicates the current release number, e.g. 214
  • <prev>: indicates the previous release number, e.g. 207
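As a minimal sketch of how these placeholders are meant to be read, the snippet below expands them in a command template taken from this page. The helper name fill_placeholders is hypothetical and is not part of the GTDB CLI or the GTDB Species Cluster Toolkit; the release numbers are the examples given above.

```python
def fill_placeholders(template, values):
    """Replace each <key> token in the template with values[key]."""
    for key, value in values.items():
        template = template.replace(f"<{key}>", value)
    return template


# Example release numbers from this page: current = 214, previous = 207.
releases = {"cur": "214", "prev": "207"}

cmd = fill_placeholders(
    "gtdb metadata export --format tab --output gtdb_r<cur>_metadata.tsv",
    releases,
)
print(cmd)
```

Placeholders that are not listed in the dictionary (e.g. <date>) are left untouched, which makes it easy to spot values that still need to be supplied.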

Updating GTDB CLI to new DB and pulling DB data

Metadata, genome path, and domain information files are pulled from the GTDB using the GTDB CLI. This must be updated to point to the latest GTDB database by updating the following fields in Config.py:

  • NCBI_PREFIX
  • DB_SERVERS
  • LATEST_DB

Once updated, the following TSV files can be generated using the GTDB CLI:

  • gtdb metadata export --format tab --output gtdb_r<cur>_metadata.tsv
  • gtdb power genome_paths --output gtdb_r<cur>_genome_paths.tsv
  • gtdb power domain_report --output gtdb_r<cur>_domain_report.tsv

Data files required for species clustering

The species clustering relies on a number of data files providing nomenclatural information or metadata about genome assemblies:

  • assembly_summary_genbank.txt: located in ./metadata/release<cur>/ncbi/taxonomy
  • gtdb_r<prev>_metadata.tsv: created by uncompressing and concatenating the ar53_metadata_r<prev>.tar.gz and bac120_metadata_r<prev>.tar.gz files and removing the redundant header line
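The concatenation of the two per-domain metadata files can be sketched as follows. This is an illustrative helper, not part of any GTDB tool: it assumes the archives have already been uncompressed to plain TSV files with identical header lines, and the file names, column names, and accessions in the demonstration are made-up toy values.

```python
import tempfile
from pathlib import Path


def concat_metadata(tsv_paths, out_path):
    """Concatenate TSV files, writing the shared header only once.

    Returns the number of data rows written.
    """
    header = None
    rows = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in tsv_paths:
            with open(path, encoding="utf-8") as fin:
                first = fin.readline().rstrip("\n")
                if header is None:
                    header = first
                    out.write(header + "\n")
                elif first != header:
                    # Assumption: both domains share the same columns.
                    raise ValueError(f"header mismatch in {path}")
                for line in fin:
                    if line.strip():
                        out.write(line.rstrip("\n") + "\n")
                        rows += 1
    return rows


# Toy demonstration with two tiny stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    ar = Path(tmp) / "ar53_metadata.tsv"
    bac = Path(tmp) / "bac120_metadata.tsv"
    ar.write_text("accession\tgtdb_taxonomy\nTOY_AR_1\td__Archaea\n")
    bac.write_text("accession\tgtdb_taxonomy\nTOY_BAC_1\td__Bacteria\n")
    merged = Path(tmp) / "gtdb_metadata.tsv"
    n_rows = concat_metadata([ar, bac], merged)
    merged_lines = merged.read_text().splitlines()
```

Raising on a header mismatch is a deliberate choice here: silently merging files with different columns would produce a metadata table that downstream steps misread.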

Establishing GTDB species clusters

Species clusters are updated by running all the commands beginning with a u in the order specified in the CLI help menu for the GTDB Species Cluster Toolkit. This workflow should be performed in the directory /srv/db/gtdb/metadata/release<cur>/representatives/sp_cluster_update as follows:

  1. u_new_genomes: identify new and updated genomes in current GTDB release
    • gtdb_species_clusters u_new_genomes gtdb_r<prev>_metadata.tsv gtdb_r<cur>_metadata.tsv gtdb_r<cur>_genome_paths.tsv assembly_summary_genbank.txt 1_u_new_genomes
    • good sanity checks are to ensure the number of lost GTDB representatives and genomes without GTDB metadata as indicated in the log file are both low (say, <50) and that the number of new genomes corresponds to expectations
  2. u_qc_genomes: quality check new and updated genomes
    • gtdb_species_clusters u_qc_genomes gtdb_r<prev>_metadata.tsv gtdb_r<cur>_metadata.tsv assembly_summary_genbank.txt gtdb_r<cur>_domain_report.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/qc_exceptions.tsv ../ledgers/ncbi_env_bioprojects.tsv 2_u_qc_genomes
    • the ledgers are stored on OneDrive in the ledgers directory; files should be copied from the previous release (e.g. r207) to a new directory (e.g. r214); ledger entries must be copied into TSV files for processing by the GTDB Species Cluster Toolkit (note that some ledgers have 2 sheets!)
    • the log file will report any cases where the GTDB domain report, GTDB taxonomy, and NCBI taxonomy disagree. It is critical that the GTDB taxonomy indicate the correct domain as this will be used to determine which domain level tree the genome belongs in. If there is uncertainty the genome should be added to the qc_exceptions ledger and the Include field for the genome set to FALSE.
  3. u_gtdbtk: perform initial classification of new and updated genomes using GTDB-Tk
    • gtdb_species_clusters u_gtdbtk -c 64 ./1_u_new_genomes/genomes_new_updated.tsv ./2_u_qc_genomes/qc_passed.tsv 3_u_gtdbtk
    • ToDo: this step is extremely resource intensive and takes a couple days to run even running batches in parallel across several servers. Perhaps a prefiltering step should be done that uses Mash and FastANI to identify genomes that belong to existing GTDB species clusters (Note: Pierre is currently implementing this approach into GTDB-Tk). In addition, or alternatively, perhaps genomes from common species (e.g. E. coli) can just be skipped in some way though in R207 <20% of new genomes were to highly sampled species (defined as having >1000 genomes) and >60% had no NCBI species assignment.
    • Recommendation: until this step can be made less resource intensive I suggest running it across multiple servers so it takes only a day or two to complete
    • GTDB-Tk results are used as the initial GTDB assignment for new genomes. This helps with manual curation since the majority of new genomes now have correct GTDB classifications. GTDB-Tk must be using reference data from the last GTDB release (e.g. R202 for the R207 update).
  4. u_lpsn_rna_types: identify type genomes based on type 16S rRNA sequences indicated at LPSN
    • gtdb_species_clusters u_lpsn_rna_types processed_lpsn_data.tsv gtdb_r<cur>_metadata.tsv gtdb_r<cur>_genome_paths.tsv ./2_u_qc_genomes/qc_passed.tsv assembly_summary_genbank.txt ../ledgers/gtdb_type_strains.tsv ../ledgers/gtdb_untrustworthy_type_genomes.tsv 4_u_lpsn_rna_types -c 40
    • the processed_lpsn_data.tsv file should be a symlink to the lpsn/lpsn_<date>/parse_html/all_ranks/full_parsing_parsed.tsv file
    • genomes in the output table lpsn_ssu_type_genomes.tsv where the Is GTDB type genomes field is FALSE should be added to the gtdb_type_strains ledger (both on OneDrive and in the ../ledgers directory) unless there is conflicting evidence suggesting the genome is not in fact the type strain of the species despite having a type 16S sequence. I recommend re-running this command to verify there are no missing type genomes (or that any exceptions are understood).
  5. u_resolve_types: resolve cases where a species has multiple genomes assembled from the type strain
    • gtdb_species_clusters u_resolve_types gtdb_r<cur>_metadata.tsv gtdb_r<cur>_genome_paths.tsv ./2_u_qc_genomes/qc_passed.tsv assembly_summary_genbank.txt ltp_<date>_taxonomy.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/gtdb_untrustworthy_type_genomes.tsv ../ledgers/ncbi_env_bioprojects.tsv 5_u_resolve_types -c 40

    • the ltp_<date>_taxonomy.tsv file can be found in /srv/db/silva/ltp/<date> and should be updated with each GTDB release to the latest LTP release on the LTP website

    • the file config.py needs to be updated to point to the latest LTP results; ideally, this should be read from a global config file that is specific to each release, but this does not exist yet

    • the file unresolved_type_strain_genomes.tsv lists species with multiple divergent genomes indicated as being assembled from the type strain and where the correct genome could not be automatically resolved. Ideally, this file would be manually inspected and external resources consulted to determine which of the genomes is actually assembled from the type strain of the species. In practice, this is very challenging and the <10 cases that occur have just been left unresolved. This means a random genome will be selected as the GTDB representative, which isn't ideal but is a reasonable compromise given the limited resources available to validate the correct assembly.

    • the key file produced by this step is untrustworthy_type_material.tsv which lists genomes that are annotated as being assembled from the type strain of a species, but where evidence exists indicating the genome is not type material. This occurs for any species where there are multiple, divergent genomes all claiming to be assembled from the type strain and some criteria could be used to establish the correct or at least most likely genome to actually be assembled from the type strain.

  6. u_rep_changes: identify lost, updated, reassigned, and unchanged species representatives
    • gtdb_species_clusters u_rep_changes gtdb_r<prev>_metadata.tsv gtdb_r<cur>_metadata.tsv ./1_u_new_genomes/genomes_new_updated.tsv ./2_u_qc_genomes/qc_passed.tsv ./3_u_gtdbtk/gtdbtk_classify.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ../ledgers/gtdb_disband_cluster.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/ncbi_env_bioprojects.tsv 6_u_rep_changes
    • warnings should be investigated and appropriate action taken though admittedly some of these have proven extremely hard to understand (e.g. reassignment of G000817735 in GTDB R202). One common warning is Updated GTDB representative XZY reassigned from ... which occurs when the genomic FASTA file from a genome is updated and as a result it now clusters with a new species cluster instead of with its previous version. Such warnings can be ignored as they are handled by the next step.
    • the gtdb_disband_cluster.tsv ledger indicates GTDB species clusters in the previous GTDB release that should be explicitly disbanded (i.e. no longer given any priority to form a species cluster in this release). In general, this ledger should be empty. In the past, we have used this ledger to indicate that Shigella species clusters should be disbanded since we decided to make them synonyms of E. coli (see here). This ledger was also needed when we changed the alignment fraction (AF) criterion from 0.65 to 0.5. All clusters that were impacted by this change had to be disbanded so they could form new clusters that adhered to the new criterion.
  7. u_rep_actions: perform initial actions required for modified representatives
    • gtdb_species_clusters u_rep_actions ./6_u_rep_changes/rep_change_summary.tsv gtdb_r<prev>_metadata.tsv gtdb_r<prev>_genome_paths.tsv gtdb_r<cur>_metadata.tsv gtdb_r<cur>_genome_paths.tsv ./1_u_new_genomes/genomes_new_updated.tsv ./2_u_qc_genomes/qc_passed.tsv ./3_u_gtdbtk/gtdbtk_classify.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/species_priority.tsv ../ledgers/genus_priority.tsv ../ledgers/ncbi_env_bioprojects.tsv lpsn_gss_<date>.csv 7_u_rep_actions -c 64
    • the lpsn_gss_<date>.csv file is obtained from the download section of LPSN website which requires an account and should be updated with each GTDB release
    • GTDB preferentially selects isolate genomes as representatives, including when replacing existing representatives. This can cause issues if a number of genomes that are actually MAGs are incorrectly considered to be isolates. To help mitigate this issue, BioProjects resulting in 2 or more representatives being replaced are indicated with a warning (e.g. BioProject PRJNA625506 responsible for 2 representatives being replaced with presumed isolates.). These BioProjects should be looked up on NCBI to verify they do in fact represent isolates. If a BioProject is found to be a set of MAGs it should be added to the ncbi_env_bioprojects.tsv ledger, ideally brought to the attention of NCBI, and this step rerun so it reflects the genomes being MAGs. Please update this list of BioProjects verified to contain isolate genomes.
    • the most common warning produced concerns the ambiguity of naming priority (e.g. Ambiguous priority based on publication date.). ALL such cases need to be manually resolved, added to the species_priority.tsv ledger, and this step re-run. This is admittedly time consuming, but necessary if the GTDB is to properly reflect naming priority. The list of ambiguous priorities is given in ambiguous_sp_priority.tsv. Resolving the initial list of ambiguous naming priorities often results in additional cases that need to be resolved.
    • multiple reassignment warnings (e.g. Representative G000297375 was reassigned multiple times) can generally be ignored. These are reported as warnings to allow for a quick inspection and to ensure that this remains an unusual case. The most common situation is for the second reassignment to be due to nomenclatural priority which should always trump any other reason for a reassignment.
    • other warnings should be investigated, but generally indicate challenging situations which are resolved adequately
  8. u_sel_reps: select representatives for all named species at NCBI
    • gtdb_species_clusters u_sel_reps ./7_u_rep_actions/updated_sp_clusters.tsv gtdb_r<cur>_metadata.tsv gtdb_r<cur>_genome_paths.tsv ./2_u_qc_genomes/qc_passed.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/ncbi_untrustworthy_sp_assignments.tsv ../ledgers/species_priority.tsv ../ledgers/genus_priority.tsv ../ledgers/ncbi_env_bioprojects.tsv lpsn_gss_<date>.csv 8_u_sel_reps -c 64
    • warnings should be investigated, but generally indicate challenging situations which are resolved adequately. The most common warning is regarding multiple species clusters with the same name (e.g. s__Pseudarthrobacter sulfonivorans is represented by both [('G001484605', False)] and [('G014712225', True)]). These are reported so these species can be sanity checked at the end of the species clustering workflow, but should ultimately be resolved correctly. This situation occurs since at this stage in the pipeline species clusters do not have finalized names so any new introduction of type material can result in the duplicate use of names.
    • this method currently hard codes the fact that Shigella species are considered synonyms of E. coli in GTDB as of release R207 (this should probably be moved to a ledger)
  9. u_cluster_named_reps: cluster genomes to selected GTDB representatives
    • gtdb_species_clusters u_cluster_named_reps ./8_u_sel_reps/gtdb_named_reps_final.tsv gtdb_r<cur>_metadata.tsv gtdb_r<cur>_genome_paths.tsv ./2_u_qc_genomes/qc_passed.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ./8_u_sel_reps/gtdb_rep_pairwise_ani.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/ncbi_env_bioprojects.tsv 9_u_cluster_named_reps -c 96
    • warnings regarding ANI neighbours having an ANI slightly >97% (usual threshold for merging) should be monitored (10 cases in R202; 15 cases in R207; 13 cases in R226), but are expected. A "fudge factor" of 0.1% ANI was put in place to stop previous GTDB representatives from being merged when they are extremely close to the 97% boundary in order to account for small changes to genomes and any differences between FastANI versions. If the reported ANI value is >97.1% the situation should be investigated and likely the species should be merged at this point.
    • the "Identified N genome pairs meeting ANI radius criteria, but with an AF <0.5" warning is expected and should be monitored. The AF criterion was changed from 0.65 to 0.5 starting in R207. These represent cases where perhaps genomes should be merged into a single species, but ultimately the criterion is a bit arbitrary. So long as this number remains low (<400 in R202; 20 in R207 with AF=0.5 criterion) it isn't critical, though a formal exploration of the best AF criterion to use would be interesting.
  10. u_cluster_de_novo: infer de novo species clusters and representatives for remaining genomes
    • gtdb_species_clusters u_cluster_de_novo ./9_u_cluster_named_reps/gtdb_named_rep_clusters.tsv gtdb_r<cur>_metadata.tsv gtdb_r<cur>_genome_paths.tsv ./2_u_qc_genomes/qc_passed.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ./9_u_cluster_named_reps/ani_af_nonrep_vs_rep.pkl ../ledgers/gtdb_type_strains.tsv ../ledgers/ncbi_env_bioprojects.tsv 10_u_cluster_de_novo -c 96
    • no warnings are expected
  11. u_cluster_stats: summary statistics indicating changes to GTDB species cluster membership
    • gtdb_species_clusters u_cluster_stats ./10_u_cluster_de_novo/gtdb_clusters_de_novo.tsv gtdb_r<prev>_metadata.tsv gtdb_r<cur>_metadata.tsv ./2_u_qc_genomes/qc_passed.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/ncbi_env_bioprojects.tsv 11_u_cluster_stats
    • the logging output should be consulted to verify that few genomes have migrated between species clusters (<1%) and that few species representatives have been changed (<5%), lost (<5%), or merged (<1%). It is critical that genomes do not "bounce" between different species clusters with each GTDB release and that the GTDB representative genomes are largely stable (though not 100% stable, since we replace representatives when suitably higher quality genomes become available). If these values are high it may indicate an issue with the workflow or some major change to the set of genomes at NCBI which needs to be investigated.
  12. u_ncbi_erroneous: identify genomes with erroneous NCBI species assignments under the GTDB
    • gtdb_species_clusters u_ncbi_erroneous ./10_u_cluster_de_novo/gtdb_clusters_de_novo.tsv gtdb_r<cur>_metadata.tsv gtdb_r<cur>_genome_paths.tsv ./2_u_qc_genomes/qc_passed.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/ncbi_untrustworthy_sp_assignments.tsv ../ledgers/ncbi_env_bioprojects.tsv 12_u_ncbi_erroneous
    • an unrecoverable error will be produced whenever there are multiple effective type strain genomes from the same species in multiple species clusters. In these cases, it is unclear which cluster represents the actual species and thus how erroneous NCBI classifications should be established. These should all be species that couldn't be resolved by u_resolve_types and thus appear in the unresolved_type_strain_genomes.tsv file. These cases need to be resolved at this time by adding all offending genomes to the ncbi_untrustworthy_sp_assignments ledger as these genomes represent multiple species making the NCBI assignment to a single species untrustworthy.
    • a large number of genomes are identified as misclassified at NCBI as the GTDB and NCBI have different species delineating criteria:
      • R207: 19,880 genomes from 1,738 species
      • R214: 25,736 genomes from 2,096 species
      • R220: 35,820 genomes from 2,534 species
      • R226: 43,417 genomes from 3,016 species
  13. u_synonyms: determine synonyms for validly or effectively published species
    • gtdb_species_clusters u_synonyms ./10_u_cluster_de_novo/gtdb_clusters_de_novo.tsv gtdb_r<cur>_metadata.tsv ./2_u_qc_genomes/qc_passed.tsv ./12_u_ncbi_erroneous/ncbi_misclassified_sp.gtdb_clustering.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ./9_u_cluster_named_reps/ani_af_nonrep_vs_rep.pkl ../ledgers/gtdb_type_strains.tsv ../ledgers/species_priority.tsv ../ledgers/genus_priority.tsv ../ledgers/ncbi_untrustworthy_sp_assignments.tsv ../ledgers/ncbi_env_bioprojects.tsv lpsn_gss_<date>.csv 13_u_synonyms
    • no warnings are expected
    • R214:
      • 479 GTDB representatives resulting in 571 type strain synonyms
      • 173 GTDB representatives resulting in 192 majority vote synonyms
    • R220:
      • 570 GTDB representatives resulting in 663 type strain synonyms
      • 146 GTDB representatives resulting in 165 majority vote synonyms
    • R226:
      • 652 GTDB representatives resulting in 775 type strain synonyms
      • 144 GTDB representatives resulting in 164 majority vote synonyms
  14. u_curation_trees: produce curation trees highlighting new NCBI taxa; GTDB species clusters with >= 1 genome assigned to a new NCBI taxon are highlighted
    • gtdb_species_clusters u_curation_trees ./10_u_cluster_de_novo/gtdb_clusters_de_novo.tsv gtdb_r<prev>_metadata.tsv gtdb_r<cur>_metadata.tsv ./2_u_qc_genomes/qc_passed.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/ncbi_untrustworthy_sp_assignments.tsv ../ledgers/ncbi_env_bioprojects.tsv 14_u_curation_trees --output_prefix gtdb_r<cur>
    • no warnings are expected
    • these trees need to be provided to the GTDB curators to help them identify changes between releases
  15. u_species_init: produce initial best guess at names for GTDB species clusters
    • gtdb_species_clusters u_species_init ./10_u_cluster_de_novo/gtdb_clusters_de_novo.tsv gtdb_r<prev>_metadata.tsv gtdb_r<cur>_metadata.tsv ./2_u_qc_genomes/qc_passed.tsv ./3_u_gtdbtk/gtdbtk_classify.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ./13_u_synonyms/synonyms.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/species_priority.tsv ../ledgers/genus_priority.tsv ../ledgers/ncbi_untrustworthy_sp_assignments.tsv ../ledgers/ncbi_env_bioprojects.tsv lpsn_gss_<date>.csv 15_u_species_init
    • Important: this method makes use of historical GTDB assignments to refine results and the GTDB taxonomy files for all previous releases need to be in the /srv/projects/gtdb/data/taxonomy_gtdb directory.
    • this is the most questionable part of the GTDB species cluster workflow as correctly assigning names at this stage is challenging since the curators have not yet established the final genus names. As such, this step is a best guess at the names which are then further refined in the next step which implements a set of post-curation rules. Curators need the majority of species names to be correctly established to help with curation, but genus curation needs to be done before final species names can be established. A chicken and egg situation, and the current solution works but is not ideal.
    • warnings indicating conflicts between GTDB ledgers and NCBI assignments are expected. As of R207, this was for Clavibacter species, Nanopusillus stetteri, Mycobacterium abscessus, and Agathobacter rectalis which are all correct. This should be monitored to confirm that our ledgers are correct. Masha is the best person to verify that these warnings are acceptable.
    • warnings regarding species name of type strain representatives already being in use are acceptable (5 cases in R207)
    • other warnings are also acceptable and are provided with the idea that intuition may indicate that something is off, but names will be refined in the next step
  16. u_pmc_species_names: refine species names using post-manual curation rules
    • gtdb_species_clusters u_pmc_species_names ./15_u_species_init/gtdb_ar_taxonomy.tsv manual_sp_file.tsv ./10_u_cluster_de_novo/gtdb_clusters_de_novo.tsv gtdb_r<prev>_metadata.tsv gtdb_r<cur>_metadata.tsv ./2_u_qc_genomes/qc_passed.tsv ./12_u_ncbi_erroneous/ncbi_misclassified_sp.gtdb_clustering.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ./13_u_synonyms/synonyms.tsv ./7_u_rep_actions/updated_species_reps.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/species_classification.tsv ../ledgers/species_priority.tsv ../ledgers/genus_priority.tsv ../ledgers/specific_epithet_transfer_map_ar.tsv ../ledgers/ncbi_untrustworthy_sp_assignments.tsv ../ledgers/ncbi_env_bioprojects.tsv lpsn_gss_<date>.csv 16_u_pmc_species_names_ar
    • gtdb_species_clusters u_pmc_species_names ./15_u_species_init/gtdb_bac_taxonomy.tsv manual_sp_file.tsv ./10_u_cluster_de_novo/gtdb_clusters_de_novo.tsv gtdb_r<prev>_metadata.tsv gtdb_r<cur>_metadata.tsv ./2_u_qc_genomes/qc_passed.tsv ./12_u_ncbi_erroneous/ncbi_misclassified_sp.gtdb_clustering.tsv assembly_summary_genbank.txt ./5_u_resolve_types/untrustworthy_type_material.tsv ./13_u_synonyms/synonyms.tsv ./7_u_rep_actions/updated_species_reps.tsv ../ledgers/gtdb_type_strains.tsv ../ledgers/species_classification.tsv ../ledgers/species_priority.tsv ../ledgers/genus_priority.tsv ../ledgers/specific_epithet_transfer_map_bac.tsv ../ledgers/ncbi_untrustworthy_sp_assignments.tsv ../ledgers/ncbi_env_bioprojects.tsv lpsn_gss_<date>.csv 16_u_pmc_species_names_bac
    • the manual_sp_file.tsv is used to indicate names that should be set based on manual curation and is used to resolve any warnings/errors produced by this method (see previous GTDB release to see format of this file); this file needs to exist, but can be empty if no manual curation is required at this point
    • all warnings should be understood and deemed acceptable
    • all errors should be resolved. In particular, all GTDB species clusters must have a unique name. Violations of this requirement are indicated by errors of the form Species name <name> assigned to at least 2 GTDB representatives. This can be resolved by adding genomes to the manual_sp_file.tsv. It is often helpful to look at the previous GTDB classification of these genomes in order to resolve these violations.
    • species names do not need to be 100% correct at this point since this is done through the post-manual curation (PMC) workflow that is conducted once GTDB curators have finalized genus names. However, GTDB curators use these initial species names to aid in the manual curation so a reasonable effort should be made at this point.
    • the file gtdb_sp_clusters.ncbi_sp.tsv should be passed along to the curation team
    • the file species_classification_ledger_updates.tsv should be inspected and cases resulting in a name change evaluated to see if they are still valid or if the species classification ledger needs modification
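The requirement that every GTDB species cluster has a unique name can be pre-checked with a few lines of Python before re-running u_pmc_species_names. This is only a sketch and not the toolkit's own check. It assumes the taxonomy is available as (genome ID, semicolon-separated taxonomy string) pairs ending in the s__ rank; the lineages below are truncated for brevity, the first two genome IDs are taken from the example warning in step 8, and the third row is made up.

```python
from collections import defaultdict


def duplicate_species(taxonomy_rows):
    """Return species names assigned to two or more representatives.

    taxonomy_rows: iterable of (genome_id, taxonomy_string) pairs, where
    the last semicolon-separated field is the species (s__) rank.
    """
    reps_by_species = defaultdict(list)
    for genome_id, taxonomy in taxonomy_rows:
        species = taxonomy.split(";")[-1].strip()
        if not species.startswith("s__"):
            raise ValueError(f"no species rank for {genome_id}: {taxonomy}")
        reps_by_species[species].append(genome_id)
    return {sp: ids for sp, ids in reps_by_species.items() if len(ids) > 1}


rows = [
    ("G001484605", "g__Pseudarthrobacter;s__Pseudarthrobacter sulfonivorans"),
    ("G014712225", "g__Pseudarthrobacter;s__Pseudarthrobacter sulfonivorans"),
    ("G000000000", "g__Escherichia;s__Escherichia coli"),  # made-up genome ID
]

clashes = duplicate_species(rows)
for species, genome_ids in sorted(clashes.items()):
    print(f"{species}: {', '.join(genome_ids)}")
```

Each reported species is a candidate for an entry in manual_sp_file.tsv; the format of that file should be taken from the previous release, as noted above.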

The new GTDB representative genomes and updated species clustering information should be added to the GTDB after completing the above workflow. This can be done using the GTDB Migration Toolkit:

gtdb_migration_tk update_reps_db --hostname <hostname> -u <user> -p <password> -d <database> --final_cluster_file ./10_u_cluster_de_novo/gtdb_clusters_de_novo.tsv

A new metadata file with the updated GTDB species clustering should then be dumped from the database and this metadata file used for any subsequent processing:

gtdb metadata export --format tab --output gtdb_r<cur>_metadata.updated_reps.<date>.tsv

Manual curation

Manual curation is performed after updating the GTDB species clusters. The following information should be provided to curators by creating a directory in /srv/projects/gtdb/{release}/{domain}/pre_curation/bac120/{date} which contains the following files:

  • trees produced by 14_u_curation_trees in the directory ncbi_new_taxa_trees
  • bacterial and archaeal snapshot trees and associated files: gtdb_<release>_bac120.decorated.tree, gtdb_<release>_bac120.decorated.tree-table, gtdb_<release>_bac120.decorated.tree-summary, gtdb_<release>_bac120.decorated.tree-taxonomy
  • results of phylorank outliers in the directory phylorank_outliers
  • ARB database and filter file: gtdb_r202_bac120_arb_metadata.txt, gtdb_r202_bac120_arb_filter.ift

Analogous data needs to be put together for Archaea. Inferring the archaeal tree typically takes much longer since we use IQ-TREE instead of FastTree. Trees with alternative marker sets (e.g., rp2) also need to be put together, but these are far less critical than the bac120 and ar53 trees since these are the primary trees used for curation.

Post-manual curation validation of species clusters

Any ledgers updated as part of manual curation should be updated and given the name <ledger>.post_curation.tsv (e.g. gtdb_type_strains.post_curation.tsv) and used during this final validation. Ultimately, the final GTDB taxonomy for a release should reflect any updates to ledgers.

After manual curation, GTDB species names must be finalized and validated by running all the commands beginning with a pmc in the order specified in the CLI help menu for the GTDB Species Cluster Toolkit. Validation should be performed in the directory /srv/projects/gtdb/<gtdb_release>/<domain>/post_curation/<date> as follows:

  1. gtdb_species_clusters pmc_manual_species taxonomy.pre_curation.tsv gtdb_r220_ar53_curation_scaled_20240326.tree 1_pmc_manual_species
    • taxonomy.pre_curation.tsv is the final pre-curation taxonomy file in 16_u_pmc_species_names_ar/final_taxonomy.tsv or 16_u_pmc_species_names_bac/final_taxonomy.tsv
  2. gtdb_species_clusters pmc_replace_generic ./1_pmc_manual_species/manual_species_names.tsv gtdb_r220_ar53_curation_taxonomy_20240326.tsv 2_pmc_replace_generic
  3. gtdb_species_clusters pmc_species_names gtdb_r220_ar53_curation_scaled_20240326.tree ./1_pmc_manual_species/manual_species_names.tsv ./2_pmc_replace_generic/taxonomy_updated_sp.tsv pmc_custom_species.tsv gtdb_clusters_de_novo.tsv gtdb_r214_metadata.tsv gtdb_r220_metadata.updated_reps.20231114.tsv qc_passed.tsv ncbi_misclassified_sp.gtdb_clustering.tsv assembly_summary_genbank.txt untrustworthy_type_material.tsv synonyms.tsv updated_species_reps.tsv ./ledgers/gtdb_type_strain.post_curation.tsv ./ledgers/species_classification.tsv ./ledgers/species_priority.tsv ./ledgers/genus_priority.tsv ./ledgers/specific_epithet_transfer_map_ar.tsv ./ledgers/ncbi_untrustworthy_sp_assignments.tsv ./ledgers/ncbi_env_bioprojects.tsv lpsn_gss_2023-09-12.csv 3_pmc_species_names
    • reviewing the cases flagged in ALLCAPS (e.g. INCONGRUENT_TYPE_STRAIN) has generally revealed genuine issues
    • any reported errors should be manually resolved by adding desired species name to pmc_custom_species.tsv
    • I have also been manually resolving all Finalized GTDB species name warnings by adding cases to pmc_custom_species.tsv though this may not be necessary - more investigation is needed
    • specific_epithet_map.new_cases.tsv should be sent to curation team (Masha) and results added to the specific_epithet_transfer_map ledger
    • the Species classification ledger warnings should also be brought to the attention of the curation team (Masha) to see if the species classification ledger for these cases is still valid
    • this step must be repeated with the updated specific_epithet_transfer_map ledger and after resolving any issues identified in steps 4, 5, or 6
    • the taxonomy is finalized once this step and steps 4, 5, and 6 pass without any reported errors or all such errors are considered acceptable
  4. gtdb_species_clusters pmc_check_type_species ./3_pmc_species_names/final_taxonomy.tsv gtdb_r220_metadata.updated_reps.20231114.tsv qc_passed.tsv assembly_summary_genbank.txt untrustworthy_type_material.tsv ./ledgers/gtdb_type_strain.post_curation.tsv ./ledgers/species_priority.tsv ./ledgers/genus_priority.tsv ./ledgers/ncbi_env_bioprojects.tsv lpsn_gss_2023-09-12.csv 4_pmc_check_type_species
    • amiguous_sp_priority.tsv should be empty or any reported cases manually verified
    • type_species_incongruencies.tsv should be empty or any reported cases manually verified
    • TBD: the species_classification.<date>.tsv ledger should be taken into account so that known cases of incongruence between GTDB and NCBI are not reported
    • TBD: the pmc_custom_species.tsv file should be taken into account so that cases which were manually set, and are thus presumably correct, are not reported
  5. gtdb_species_clusters pmc_check_type_strains ./3_pmc_species_names/final_taxonomy.tsv gtdb_r220_metadata.updated_reps.20231114.tsv qc_passed.tsv assembly_summary_genbank.txt untrustworthy_type_material.tsv ./ledgers/gtdb_type_strain.post_curation.tsv ./ledgers/ncbi_env_bioprojects.tsv 5_pmc_check_type_strains
    • send type_species_incongruencies.tsv from previous step and type_strains_incongruencies.tsv to curation team (Masha) for manual inspection as ideally these files should be empty
    • TBD: the species_classification.<date>.tsv ledger should be taken into account so that known cases of incongruence between GTDB and NCBI are not reported
    • TBD: the pmc_custom_species.tsv file should be taken into account so that cases which were manually set, and are thus presumably correct, are not reported
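Steps 4 and 5 both come down to checking whether certain report files are empty. A small helper along the following lines can do that check in one pass. This is a sketch under assumptions: it assumes each report is a tab-separated file with a single header line (pass has_header=False if that is not the case), and the function names are hypothetical rather than part of either toolkit.

```python
def count_data_rows(tsv_path, has_header=True):
    """Return the number of non-blank rows in a TSV file, excluding the header."""
    with open(tsv_path, encoding="utf-8") as handle:
        rows = [line for line in handle if line.strip()]
    if has_header and rows:
        rows = rows[1:]
    return len(rows)


def nonempty_reports(report_paths, has_header=True):
    """Map each report that still lists cases to its number of data rows."""
    pending = {}
    for path in report_paths:
        n_rows = count_data_rows(path, has_header)
        if n_rows:
            pending[str(path)] = n_rows
    return pending


# Example call (the paths assume each report is written to the output
# directory given on the command line of its step):
# nonempty_reports([
#     "4_pmc_check_type_species/type_species_incongruencies.tsv",
#     "5_pmc_check_type_strains/type_strains_incongruencies.tsv",
# ])
```

An empty dictionary means every listed report is empty. Anything else names the files that still need to be manually verified or sent to the curation team.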
  6. Taxonomy should be verified by running:
    • gtdb_validation_tk check_file ./3_pmc_species_names/final_taxonomy.tsv --include_species
    • gtdb_validation_tk check_generic ./3_pmc_species_names/final_taxonomy.tsv gtdb_r220_metadata.updated_reps.20231114.tsv ./ledgers/species_classification.tsv NONE ./3_pmc_species_names/check_generic
      • unsupported_generic_names.tsv indicates GTDB genera where it is unclear where the genus name originates from; ideally we would be able to account for all GTDB genera; however, there may be too many historical issues (e.g. names changing at NCBI; changing GTDB rules) for this check to be useful
    • gtdb_validation_tk check_specific ./3_pmc_species_names/final_taxonomy.tsv gtdb_r207_metadata.updated_reps.20211030.tsv ./ledgers/species_classification.tsv synonyms.tsv ./3_pmc_species_names/check_specific
      • Ideally, invalid_specific_names.tsv should be empty or only have a few exceptions due to retired genomes at NCBI, but this is still a work in progress
    • gtdb_species_clusters pmc_validate ./1_pmc_manual_species/manual_species_names.tsv ./3_pmc_species_names/final_taxonomy.tsv gtdb_r220_ar53_curation_scaled_20240326.tree pmc_custom_species.tsv gtdb_clusters_de_novo.tsv gtdb_r214_metadata.tsv gtdb_r220_metadata.updated_reps.20231114.tsv qc_passed.tsv ncbi_misclassified_sp.gtdb_clustering.tsv assembly_summary_genbank.txt untrustworthy_type_material.tsv synonyms.tsv updated_species_reps.tsv ./ledgers/gtdb_type_strain.post_curation.tsv ./ledgers/species_classification.tsv ./ledgers/species_priority.tsv ./ledgers/genus_priority.tsv ./ledgers/specific_epithet_transfer_map_ar.tsv ./ledgers/ncbi_env_bioprojects.tsv lpsn_gss_2023-09-12.csv processed_lpsn_data.tsv ground_truth_test_cases.tsv 6_pmc_validate
      • this runs many different validation tests; ideally all tests should pass, but in practice some challenging cases are legitimate exceptions. Failing cases should therefore be manually inspected to establish whether they are acceptable, and the code improved to better handle the identified exceptions for the next release
      • modified_specific_names.tsv should be consulted and any ALPHANUMERIC -> ALPHANUMERIC, ALPHANUMERIC -> SUFFIXED_LATIN, or SUFFIXED_LATIN -> ALPHANUMERIC transitions added to the pmc_custom_species.tsv file, as a placeholder name should not replace another placeholder. SUFFIXED_LATIN -> SUFFIXED_LATIN transitions are harder to verify as a species may be split up or a genome legitimately reassigned to a different species. Steps 3 to 6 should then be repeated to take these new species names into account.
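The placeholder-transition rule for modified_specific_names.tsv can be sketched as follows. The classification here is an assumption, not the toolkit's own logic: an epithet of lower-case letters only is treated as Latin (the label LATIN is hypothetical), a Latin epithet followed by an underscore and upper-case letters (e.g. coli_A) is treated as SUFFIXED_LATIN, and anything else (e.g. sp002345678) as ALPHANUMERIC. The column layout of modified_specific_names.tsv is not assumed, so the functions take species names directly.

```python
import re

# Assumed patterns; the toolkit may classify names differently.
LATIN = re.compile(r"^[a-z]+(?:-[a-z]+)*$")
SUFFIXED_LATIN = re.compile(r"^[a-z]+(?:-[a-z]+)*_[A-Z]+$")

# Transitions in which one placeholder name replaces another.
FLAGGED_TRANSITIONS = {
    ("ALPHANUMERIC", "ALPHANUMERIC"),
    ("ALPHANUMERIC", "SUFFIXED_LATIN"),
    ("SUFFIXED_LATIN", "ALPHANUMERIC"),
}


def classify_epithet(epithet):
    """Classify a specific epithet as LATIN, SUFFIXED_LATIN, or ALPHANUMERIC."""
    if LATIN.match(epithet):
        return "LATIN"
    if SUFFIXED_LATIN.match(epithet):
        return "SUFFIXED_LATIN"
    return "ALPHANUMERIC"


def needs_custom_entry(prev_species, new_species):
    """Return True if the name change is a flagged placeholder transition."""
    # The specific epithet is taken as the last whitespace-separated token,
    # e.g. 'coli_A' in 's__Escherichia coli_A'.
    prev_epithet = prev_species.split()[-1]
    new_epithet = new_species.split()[-1]
    if prev_epithet == new_epithet:
        return False
    transition = (classify_epithet(prev_epithet), classify_epithet(new_epithet))
    return transition in FLAGGED_TRANSITIONS
```

SUFFIXED_LATIN -> SUFFIXED_LATIN changes are intentionally not flagged here, since deciding whether they are legitimate requires manual inspection.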
  7. The final taxonomy and curation tree for GTDB representatives are ./3_pmc_species_names/final_taxonomy.tsv and ./3_pmc_species_names/curation_tree.final_species.tree. The final GTDB synonyms file is ./3_pmc_species_names/gtdb_synonyms_final.tsv. These files should be renamed with an appropriate prefix, sent to the curation team, and put in the directory final_data. These files are used as the basis for creating the final data files for the GTDB website (Pierre: where have we documented the next steps?). Files should be renamed as follows:
    • ./3_pmc_species_names/final_taxonomy.tsv to ./final_data/gtdb_r220_ar53_curation_taxonomy.tsv
    • ./3_pmc_species_names/curation_tree.final_species.tree to ./final_data/gtdb_r220_ar53_curation_scaled.tree
    • ./3_pmc_species_names/gtdb_synonyms_final.tsv to ./final_data/gtdb_r220_ar53_synonyms.tsv
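The renaming in step 7 can be scripted so the same mapping is applied to each domain and release. The sketch below is one possible approach, not part of the toolkit: the function name is hypothetical, and it copies rather than moves the files (an assumption) so that the step 3 output directory stays intact if steps 3 to 6 need to be rerun.

```python
import shutil
from pathlib import Path


def publish_final_files(src_dir, dest_dir, prefix):
    """Copy the three final files into dest_dir under prefixed names."""
    # Source name in the step 3 output directory -> name in final_data.
    name_map = {
        "final_taxonomy.tsv": f"{prefix}_curation_taxonomy.tsv",
        "curation_tree.final_species.tree": f"{prefix}_curation_scaled.tree",
        "gtdb_synonyms_final.tsv": f"{prefix}_synonyms.tsv",
    }
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src_name, dest_name in name_map.items():
        target = dest / dest_name
        # copy2 raises FileNotFoundError if an expected source file is missing.
        shutil.copy2(Path(src_dir) / src_name, target)
        copied.append(target)
    return copied


# Example call for the archaeal files listed above:
# publish_final_files("./3_pmc_species_names", "./final_data", "gtdb_r220_ar53")
```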
  8. We also need an unscaled tree with the finalized taxonomy which can be obtained by stripping and decorating a tree:
    • gtdb_validation_tk add_dummy NONE --taxonomy_file gtdb_r220_ar53_curation_taxonomy.tsv
    • genometreetk strip ../gtdb_r220_ar53_curation_unscaled_20240326.tree gtdb_r220_ar53_curation_unscaled.stripped.tree
    • phylorank decorate gtdb_r220_ar53_curation_unscaled.stripped.tree gtdb_r220_ar53_curation_taxonomy.dummy_nodes.tsv gtdb_r220_ar53_curation_unscaled.tree --skip_rd_refine
    • Unfortunately, ARB has a tendency to modify branch lengths by a factor of 100 if trees aren't imported and exported correctly. The mean root to leaf branch lengths should be unchanged from the pre-curation, rooted tree. A quick sanity check is to ensure the mean root to leaf branch lengths are in the [~0.5, ~2.0] range using:
      • gtdb_validation_tk branch_len gtdb_r220_ar53_curation_unscaled.tree
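For reference, the quantity checked in step 8 (the mean root-to-leaf branch length) can be computed directly from a Newick string. The sketch below is a self-contained illustration and not a replacement for gtdb_validation_tk branch_len. It assumes named leaves, single-quoted labels for decorated nodes, and no bracketed Newick comments; the function names are hypothetical. It relies on the identity that the sum of all root-to-leaf distances equals the sum, over every node, of its branch length multiplied by the number of leaves below it.

```python
def _skip_label(text, i):
    """Return the index just past an optional (possibly quoted) node label."""
    if i < len(text) and text[i] == "'":
        i += 1
        while i < len(text):
            if text[i] == "'":
                if i + 1 < len(text) and text[i + 1] == "'":
                    i += 2  # doubled quote is an escaped quote
                    continue
                return i + 1
            i += 1
        raise ValueError("unterminated quoted label")
    while i < len(text) and text[i] not in ":,();":
        i += 1
    return i


def mean_root_to_leaf(newick):
    """Mean root-to-leaf distance of a Newick tree with named leaves."""
    text = newick.strip()
    total = 0.0    # sum over nodes of branch length * leaves below the node
    n_leaves = 0
    open_nodes = []  # leaf counts of internal nodes that are still open
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "(":
            open_nodes.append(0)
            i += 1
            continue
        if ch in ",;" or ch.isspace():
            i += 1
            continue
        if ch == ")":
            below = open_nodes.pop()  # internal node just closed
            i += 1
        else:
            below = 1                 # start of a leaf
            n_leaves += 1
        i = _skip_label(text, i)
        length = 0.0
        if i < len(text) and text[i] == ":":
            j = i + 1
            while j < len(text) and text[j] not in ",();":
                j += 1
            length = float(text[i + 1:j])
            i = j
        total += length * below
        if open_nodes:
            open_nodes[-1] += below
    if n_leaves == 0:
        raise ValueError("no leaves found")
    return total / n_leaves


def in_expected_range(mean_length, low=0.5, high=2.0):
    """Rough check against the approximate [0.5, 2.0] range given above."""
    return low <= mean_length <= high
```

A tree whose branch lengths were rescaled by a factor of 100 during ARB import or export would fall well outside this range, which is what the check is meant to catch.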

Updating of ledgers to reflect final taxonomy

Any manually modified species names reported by the pmc_manual_species step likely need to be added to an official ledger so these changes can be retained moving forward. The ledgers are currently our only mechanism for ensuring a name persists between releases. Otherwise, names are always determined de novo based on NCBI assignments as, in general, we want the GTDB to reflect changing taxonomic opinion as reflected by changes at NCBI.

Suggested improvements to GTDB species workflow

  • ideally all ledgers would be tables in the GTDB database and we would provide curators with a web interface to modify these ledgers