Create SILVA SSU mapping file
Donovan Parks edited this page Nov 5, 2018 · 4 revisions
A mapping file is required to link the GTDB to results at SILVA. The link is made through a TSV file that is parsed by the folks at SILVA.
This file is generated as follows:
- Identify 16S rRNA genes in all GTDB genomes (not just the dereplicated set)
- Filter out 16S rRNA genes that are <1200 bp (<900 bp if archaeal), are on a contig <10 kb, have a length >2 kb, or have 10 or more ambiguous bases. I also filter out genomes that have a quality <50 or are comprised of >500 contigs. Overall, this removes ~50% of the ~250,000 16S rRNA genes identified.
- BLAST the remaining 16S genes against SILVA's Ref database.
- Take hits with 99% identity and 99% alignment length over the shorter of the query and subject genes. This high stringency is needed to ensure correct species assignments. Relaxing this doesn't result in appreciably more assignments.
- Filter out hits from the ~500 genomes marked as contaminated by EstCont16S (this is taken directly from the EzBioCloud website), or that fail the IDTAXA tests I have developed. This filtering reduces the number of SILVA 16S rRNA genes with a GTDB taxonomy assignment by <100.
- If a given SILVA gene has multiple hits with incongruent GTDB assignments, do a majority vote to determine the GTDB taxonomy string. This occurs in 1,047 of 22,240 cases. These are mostly cases where one hit indicates a specific GTDB species, but another hit has no species assignment (i.e., s__). There are all sorts of reasons this situation could occur, so I think a majority vote is the safest approach.
- Validate
- Add URL
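
The gene filter, hit threshold, and majority-vote steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the script used to build the mapping file: the `Gene` record, the function names, and the example taxonomy strings are hypothetical, 1 kb is taken to mean 1,000 bp, "99%" is read as "at least 99%", and the vote simply picks the most frequent taxonomy string (how ties are resolved is not specified here). The genome-level quality and contig-count filters are not shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical record describing one predicted 16S rRNA gene."""
    length: int           # gene length in bp
    contig_length: int    # length of the contig carrying the gene, in bp
    ambiguous_bases: int  # number of ambiguous bases in the gene
    archaeal: bool        # True if the genome is archaeal


def passes_gene_filter(gene: Gene) -> bool:
    """Return True if the gene survives the length/contig/ambiguity filters."""
    min_len = 900 if gene.archaeal else 1200
    return (
        min_len <= gene.length <= 2000     # drop <1200 bp (<900 archaeal) and >2 kb
        and gene.contig_length >= 10_000   # drop genes on contigs <10 kb
        and gene.ambiguous_bases < 10      # drop genes with 10+ ambiguous bases
    )


def accept_hit(perc_identity: float, aln_length: int,
               query_len: int, subject_len: int) -> bool:
    """Keep hits with >=99% identity and an alignment spanning >=99% of the
    shorter of the query and subject genes (assumed reading of the rule)."""
    shorter = min(query_len, subject_len)
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return perc_identity >= 99.0 and 100 * aln_length >= 99 * shorter


def majority_vote(taxonomies: list[str]) -> str:
    """Return the most frequent GTDB taxonomy string among the hits."""
    return Counter(taxonomies).most_common(1)[0][0]


# Illustrative (made-up) hits for a single SILVA gene.
hits = [
    "d__Bacteria;g__Escherichia;s__Escherichia coli",
    "d__Bacteria;g__Escherichia;s__Escherichia coli",
    "d__Bacteria;g__Escherichia;s__",
]
consensus = majority_vote(hits)
```

With the example hits, two of the three strings name a species and one ends in `s__`, so the vote resolves to the species-level string.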
The final mapping file should be placed in /srv/home/ftp/public/gtdb/releaseXX/silva_mapping_rXX.tsv and the link to the latest SILVA mapping file updated in /srv/home/ftp/public/gtdb/. This linked file is used by the SILVA team, i.e.: