TODO: new command to create taxdump files for MAG genome collections #56

Closed
shenwei356 opened this issue Apr 18, 2022 · 15 comments

@shenwei356
Owner

shenwei356 commented Apr 18, 2022

It's similar to gtdb_to_taxdump, but more generalized to support MGV.

The input would be:

genome-id    species-id   kingdom  phylum   class   order  family  genus  species
---------    ----------   ------------------------------------------------------
needed       optional     optional

# At least one of the species-id and lineage is needed
@shenwei356 shenwei356 changed the title TODO: new command to create fake-taxids for MAG genome collections TODO: new command to create taxdump files for MAG genome collections Apr 18, 2022
@shenwei356
Owner Author

shenwei356 commented Apr 18, 2022

Should we support stable/persistent TaxIds, so that they can be tracked across versions?

The TaxId of a genome/assembly is easily computed by hashing the genome_id to a uint32.

# GTDB
# assembly accessions are stable in NCBI
GCA_000016605.1 -> hash(GCA_000016605) -> uint32
GCF_000011005.1 -> hash(GCF_000011005) -> uint32

# MGV
MGV-GENOME-0364295  -> hash(MGV-GENOME-0364295) or hash(0364295)

# GPD
ivig_1   -> hash(ivig_1)
uvig_6   -> hash(uvig_6)

# HumGut provides id to NCBI/GTDB mapping file.
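
A minimal sketch of the idea above, not the actual taxonkit implementation: the thread does not say which hash function `hash()` is, so POSIX `cksum` (CRC-32) is used here purely as a stand-in that yields a value in the uint32 range. The function name `genome_taxid` is hypothetical.

```shell
# Sketch: derive a uint32 TaxId from a genome/assembly ID.
# Assumption: POSIX cksum (CRC-32) stands in for the unspecified hash().
genome_taxid() {
    # Drop a trailing ".<version>" so GCA_000016605.1 and GCA_000016605.2 agree.
    acc=$(printf '%s' "$1" | sed 's/\.[0-9][0-9]*$//')
    # cksum prints "<crc> <byte count>"; the CRC always fits in 32 bits.
    printf '%s' "$acc" | cksum | cut -d ' ' -f 1
}

genome_taxid GCA_000016605.1
genome_taxid MGV-GENOME-0364295
genome_taxid ivig_1
```

Because the version suffix is stripped before hashing, a new assembly version keeps the same TaxId, which is what makes the ID trackable.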

@shenwei356
Owner Author

shenwei356 commented Apr 18, 2022

For taxa at the species rank or above, however, hierarchical lineage information is needed to make their TaxIds unique and stable.

GTDB

$ zcat bac120_taxonomy_r202.tsv.gz \
    | csvtk cut -Ht -f 2 \
    | csvtk uniq -Ht \
    | head -n 5
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri
d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Klebsiella;s__Klebsiella pneumoniae
d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pneumoniae
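
One way to picture "hierarchical lineage information" is to pair every taxon with the full path that leads to it. The sketch below does that for GTDB-style lineages using plain awk; the function name `lineage_paths` is hypothetical, and whether taxonkit keys taxa this way is an assumption, not something stated in this thread.

```shell
# Sketch: list every taxon in a GTDB lineage with the full path leading to it.
# Keying a taxon by this path (not its bare name) keeps same-named taxa apart.
lineage_paths() {
    awk -F ';' '{
        path = ""
        for (i = 1; i <= NF; i++) {
            path = (i == 1) ? $i : path ";" $i
            rank = substr($i, 1, 1)     # d, p, c, o, f, g or s
            name = substr($i, 4)        # text after the "x__" prefix
            printf "%s\t%s\t%s\n", rank, name, path
        }
    }'
}

printf '%s\n' 'd__Bacteria;p__Firmicutes;c__Bacilli' | lineage_paths
```

Each output row holds the rank letter, the taxon name, and the lineage prefix up to and including that taxon.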

MGV

Things are a little complicated.

> We assigned viruses to families from the ICTV database based on alignments to genomes from NCBI GenBank and crAss-like viruses from recent studies [34,45,46] (Fig. 1d). Only 56.6% of viruses could be annotated at the family level, confirming a large knowledge gap in the taxonomy of human gut viruses.

$ csvtk cut -t -F -f ictv_* mgv_contig_info.tsv \
    | csvtk uniq -t -F -f ictv_* \
    | head -n 6 \
    | csvtk pretty -t 
ictv_order     ictv_family    ictv_genus
------------   ------------   ----------
Caudovirales   crAss-phage    NULL
Caudovirales   Siphoviridae   NULL
Caudovirales   NULL           NULL
Caudovirales   Myoviridae     NULL
Caudovirales   Myoviridae     Lilyvirus

Note that crAss-phage is not found in ICTV.

@shenwei356
Owner Author

shenwei356 commented Apr 19, 2022

Note that some species do not have a complete lineage, e.g., GCA_018897955.1 only has Kingdom, Phylum, and Species.

GB_GCA_018897955.1      d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

GCA_016192455.1 does not even have a Phylum.

GB_GCA_016192455.1      d__Bacteria;p__JACPUC01;c__JACPUC01;o__JACPUC01;f__JACPUC01;g__JACPUC01;s__JACPUC01 sp016192455

Here are two special cases in which the Class and Genus have the same name, B47-G6, while the Order and Family between them have different names.

https://gtdb.ecogenomic.org/genome?gid=GCA_003663585.1
GB_GCA_003663585.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585

https://gtdb.ecogenomic.org/genome?gid=GCA_003663565.1
GB_GCA_003663565.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663565
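
Both situations can be spotted mechanically. The following is a rough sketch, assuming that a rank repeating its parent's name is a placeholder for a missing node, and that a name reappearing after different names in between is the duplicate case. The function name `check_lineage` and the labels `inherited` and `duplicate` are made up for illustration.

```shell
# Sketch: flag suspicious ranks in GTDB lineages read from stdin.
check_lineage() {
    awk -F ';' '{
        split("", seen)                              # reset per lineage
        for (i = 1; i <= NF; i++) {
            name = substr($i, 4)                     # strip the "x__" prefix
            if (i > 1 && name == prev)
                printf "inherited\t%s\n", $i         # same name as the parent rank
            else if (name in seen)
                printf "duplicate\t%s\n", $i         # name reused, other names in between
            seen[name] = 1
            prev = name
        }
    }'
}

printf '%s\n' \
    'd__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155' \
    'd__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585' \
    | check_lineage
```

For the first lineage this reports `c__EX4484-52`, `o__EX4484-52`, and `g__LFW-46` as inherited; for the second it reports `g__B47-G6` as a duplicate.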

@shenwei356
Owner Author

shenwei356 commented Apr 19, 2022

Here's the alpha version:

Currently, merged.dmp and delnodes.dmp are empty. They need to be computed by comparing two versions of the GTDB taxdump. I'll work on it tomorrow.

gtdb.r207.taxdump.zip

Usage
$ taxonkit create-taxdump  -h
Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB

Input format: 
  0. For GTDB taxonomy file, just use --gtdb
  1. The input file should be tab-delimited
  2. At least one column is needed, please specify the filed index:
     1) Kingdom/Superkingdom/Domain,     -K/--field-kingdom
     2) Phylum,                          -P/--field-phylum
     3) Class,                           -C/--field-class
     4) Order,                           -O/--field-order
     5) Family,                          -F/--field-family
     6) Genus,                           -G/--field-genus
     7) Species (needed),                -S/--field-species
     8) Subspecies,                      -T/--field-subspecies
        For GTDB, we use the assembly accession (without version number).

Attentions:
  1. Names should be distinct in taxa of different rank.
     But for these missing some taxon nodes, using names of parent nodes is allowed:

       GB_GCA_018897955.1      d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

     It can also detect duplicate names with different ranks, e.g.,
     The Class and Genus have the same name B47-G6, and the Order and Family between them have different names.
     In this case, we reassign TaxId by increasing the TaxId until it being distinct.

       GB_GCA_003663585.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585

Usage:
  taxonkit create-taxdump [flags] 

Flags:
  -C, --field-class int        field index of class
  -F, --field-family int       field index of family
  -G, --field-genus int        field index of genus
  -K, --field-kingdom int      field index of kingdom
  -O, --field-order int        field index of order
  -P, --field-phylum int       field index of phylum
  -S, --field-species int      field index of species (needed)
  -T, --field-subspecies int   field index of subspecies
      --force                  overwrite existed output directory
      --gtdb                   input files are GTDB taxonomy file
      --gtdb-re-subs string    regular expression to extract accession as the subspecies from the
                               assembly ID (default "^\\w\\w_(.+)\\.\\d+$")
  -h, --help                   help for create-taxdump
      --line-chunk-size int    number of lines to process for each thread, and 4 threads is fast enough.
                               (default 5000)
      --null strings           null value of taxa (default [,NULL,NA])
      --out-dir string         output directory
      --rank-names strings     names of the 8 ranks, order maters (default
                               [superkingdom,phylum,class,order,family,genus,species,no rank])
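
The help text above says that a clashing TaxId is reassigned "by increasing the TaxId until it being distinct". A minimal sketch of that rule, under the assumption that candidates arrive as `<taxid><TAB><taxon key>` lines; the function name `resolve_collisions` and the `rank:name` key format are hypothetical and are not taxonkit's internals.

```shell
# Sketch: give each taxon key a distinct TaxId, stepping forward on a clash.
# Input:  "<candidate taxid><TAB><taxon key>" per line.
# Output: "<final taxid><TAB><taxon key>" per newly seen key.
resolve_collisions() {
    awk -F '\t' '{
        id = $1; key = $2
        if (key in assigned) next                        # taxon already has a TaxId
        while (id in owner)                              # TaxId taken by another taxon
            id = sprintf("%.0f", (id + 1) % 4294967296)  # step forward, wrap at 2^32
        owner[id] = key
        assigned[key] = id
        print id "\t" key
    }'
}

printf '100\tclass:B47-G6\n100\tgenus:B47-G6\n101\torder:B47-G6B\n' | resolve_collisions
```

In this toy input the genus clashes with the class at 100 and moves to 101, which in turn pushes the order to 102.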

Try it

$ echo Escherichia coli | taxonkit name2taxid --data-dir gtdb.r207.taxdump 
Escherichia coli        4093283224

$ echo 4093283224 | taxonkit lineage --data-dir gtdb.r207.taxdump -r 
4093283224      Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli    species

$ taxonkit list --data-dir gtdb.r207.taxdump --ids 4093283224 -r -n | head -n 5
4093283224 [species] Escherichia coli
  61542 [no rank] GCA_000176655
  167432 [no rank] GCF_002458255
  477667 [no rank] GCA_000193895
  502475 [no rank] GCF_910592735

Taxid changelog

Though merged.dmp and delnodes.dmp are not available right now, we can still generate taxid-changelog:

$ taxonkit taxid-changelog -i . -o gtdb-taxid-changelog.csv.gz --verbose

Let's look at an Escherichia flexneri assembly, which should have been merged into the Escherichia coli species (https://forum.gtdb.ecogenomic.org/t/announcing-gtdb-r07-rs207/264).

$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep  -f taxid -p 167432
167432,gtdb.r202.taxdump,NEW,,GCF_002458255,no rank,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia flexneri;GCF_002458255,609216830;3788559933;329474883;3160438580;2234733759;3334977531;3912920909;167432
167432,gtdb.r207.taxdump,CHANGE_LIN_TAX,,GCF_002458255,no rank,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;GCF_002458255,609216830;3788559933;329474883;3160438580;2234733759;3334977531;4093283224;167432

So here, 3912920909 is merged into 4093283224.
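
The changed TaxId can also be picked out automatically by comparing the last column (the lineage as TaxIds) of consecutive changelog records. This is only a sketch: it assumes records for a single taxid in chronological order, no commas inside any field, and lineages of equal length. The function name `lineage_taxid_changes` is hypothetical.

```shell
# Sketch: print "old -> new" for every lineage position whose TaxId changed.
# Input: changelog CSV records for one taxid, oldest first;
#        the last column is the lineage as ";"-joined TaxIds.
lineage_taxid_changes() {
    awk -F ',' '{
        n = split($NF, cur, ";")
        if (NR > 1)
            for (i = 1; i <= n; i++)
                if (cur[i] != old[i])
                    printf "%s -> %s\n", old[i], cur[i]
        split($NF, old, ";")        # remember this record for the next one
    }'
}

printf '%s\n' \
    '167432,gtdb.r202.taxdump,NEW,,GCF_002458255,no rank,x,3334977531;3912920909;167432' \
    '167432,gtdb.r207.taxdump,CHANGE_LIN_TAX,,GCF_002458255,no rank,x,3334977531;4093283224;167432' \
    | lineage_taxid_changes
```

The sample records are shortened versions of the two rows shown above (the name lineage is replaced by `x` and the TaxId lineage is truncated); the sketch prints `3912920909 -> 4093283224` for them.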

@shenwei356
Owner Author

shenwei356 commented Apr 21, 2022

@jolespin

Is it possible to create a "custom" taxdump where the taxids are strings and the taxonomy info includes just class, order, family, genus, and species?

@shenwei356
Owner Author

> taxonomy info includes just class, order, family, genus, and species

You can define whatever ranks you want.

> taxdump where the taxids are strings

Those would not be taxdump files.
It sounds like a GTDB taxonomy file?
Maybe you could replace the "taxid" with something else?

@jolespin

jolespin commented Nov 19, 2023

I've combined a bunch of different databases together but some do not have ncbi_taxid fields. Here's an example of what my table looks like:

id_source	dataset	ncbi_taxid	lineage	class	order	family	genus	species	strain	notes	resolved_lineage
Aalte1	MycoCosm	5599	d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Alternaria;s__Alternaria alternata	Dothideomycetes	Pleosporales	Pleosporaceae	Alternaria	Alternaria alternata			True
Aaoar1	MycoCosm	1450171	d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii	DothideomycetesPleosporales		Aaosphaeria	Aaosphaeria arxii			False
Abobi1	MycoCosm	137743	d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis	Agaricomycetes	Polyporales	Podoscyphaceae	Abortiporus	Abortiporus biennis			True
Abobie1	MycoCosm	137743	d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis	Agaricomycetes	Polyporales	Podoscyphaceae	Abortiporus	Abortiporus biennis			True
Abscae1	MycoCosm	90261	d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia caerulea	Mucoromycetes	Mucorales	Cunninghamellaceae	Absidia	Absidia caerulea			True
Absrep1	MycoCosm	90262	d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia repens	Mucoromycetes	Mucorales	Cunninghamellaceae	Absidia	Absidia repens			True
Acain1	MycoCosm	215250	d__Eukaryota;p__Basidiomycota;c__Exobasidiomycetes;o__Exobasidiales;f__Cryptobasidiaceae;g__Acaromyces;s__Acaromyces ingoldii	Exobasidiomycetes	Exobasidiales	Cryptobasidiaceae	Acaromyces	Acaromyces ingoldii			True
Acastr1	MycoCosm	1307806	d__Eukaryota;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora strigata	Lecanoromycetes	Acarosporales	Acarosporaceae	Acarospora	Acarospora strigata			True
Acema1	MycoCosm	886606	d__Eukaryota;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Mollisiaceae;g__Acephala;s__Acephala macrosclerotiorum	Leotiomycetes	Helotiales	Mollisiaceae	Acephala	Acephala macrosclerotiorum			True

Unfortunately, some have missing fields for different taxonomic levels.

@shenwei356
Owner Author

shenwei356 commented Nov 19, 2023

> some do not have ncbi_taxid fields

That's not a problem.

> Unfortunately, some have missing fields for different taxonomic levels.

That's OK. See the last example.

$ cat data.tsv \
    | csvtk fix -t \
    | csvtk cut -t -f class,order,family,genus,species \
    | csvtk pretty -t
    
class                         order           family               genus               species                   
---------------------------   -------------   ------------------   -----------------   --------------------------
Dothideomycetes               Pleosporales    Pleosporaceae        Alternaria          Alternaria alternata      
DothideomycetesPleosporales                   Aaosphaeria          Aaosphaeria arxii                             
Agaricomycetes                Polyporales     Podoscyphaceae       Abortiporus         Abortiporus biennis       
Agaricomycetes                Polyporales     Podoscyphaceae       Abortiporus         Abortiporus biennis       
Mucoromycetes                 Mucorales       Cunninghamellaceae   Absidia             Absidia caerulea          
Mucoromycetes                 Mucorales       Cunninghamellaceae   Absidia             Absidia repens            
Exobasidiomycetes             Exobasidiales   Cryptobasidiaceae    Acaromyces          Acaromyces ingoldii       
Lecanoromycetes               Acarosporales   Acarosporaceae       Acarospora          Acarospora strigata       
Leotiomycetes                 Helotiales      Mollisiaceae         Acephala            Acephala macrosclerotiorum


$ cat data.tsv \
    | csvtk fix -t \
    | csvtk cut -t -f species \
    | taxonkit --data-dir test name2taxid \
    | taxonkit --data-dir test lineage -r -i 2 \
    | csvtk rename -t -f 1-4 -n query,taxid,lineage,rank \
    | csvtk pretty -t -W 40 -x ';'
    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ query                      ┃ taxid      ┃ lineage                                  ┃ rank    ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Alternaria alternata       ┃ 1551135023 ┃ Dothideomycetes;Pleosporales;            ┃ species ┃
┃                            ┃            ┃ Pleosporaceae;Alternaria;                ┃         ┃
┃                            ┃            ┃ Alternaria alternata                     ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Abortiporus biennis        ┃ 1095247865 ┃ Agaricomycetes;Polyporales;              ┃ species ┃
┃                            ┃            ┃ Podoscyphaceae;Abortiporus;              ┃         ┃
┃                            ┃            ┃ Abortiporus biennis                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Abortiporus biennis        ┃ 1095247865 ┃ Agaricomycetes;Polyporales;              ┃ species ┃
┃                            ┃            ┃ Podoscyphaceae;Abortiporus;              ┃         ┃
┃                            ┃            ┃ Abortiporus biennis                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Absidia caerulea           ┃ 2044322620 ┃ Mucoromycetes;Mucorales;                 ┃ species ┃
┃                            ┃            ┃ Cunninghamellaceae;Absidia;              ┃         ┃
┃                            ┃            ┃ Absidia caerulea                         ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Absidia repens             ┃ 2093321634 ┃ Mucoromycetes;Mucorales;                 ┃ species ┃
┃                            ┃            ┃ Cunninghamellaceae;Absidia;              ┃         ┃
┃                            ┃            ┃ Absidia repens                           ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Acaromyces ingoldii        ┃ 1276596547 ┃ Exobasidiomycetes;Exobasidiales;         ┃ species ┃
┃                            ┃            ┃ Cryptobasidiaceae;Acaromyces;            ┃         ┃
┃                            ┃            ┃ Acaromyces ingoldii                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Acarospora strigata        ┃ 643580302  ┃ Lecanoromycetes;Acarosporales;           ┃ species ┃
┃                            ┃            ┃ Acarosporaceae;Acarospora;               ┃         ┃
┃                            ┃            ┃ Acarospora strigata                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Acephala macrosclerotiorum ┃ 1403990922 ┃ Leotiomycetes;Helotiales;Mollisiaceae;   ┃ species ┃
┃                            ┃            ┃ Acephala;Acephala macrosclerotiorum      ┃         ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━┛


$ echo Aaosphaeria arxii \
    | taxonkit --data-dir test name2taxid \
    | taxonkit --data-dir test reformat -I 2 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' \
    | csvtk add-header -t -n query,taxid,kingdom,phylum,class,order,family,genus,species \
    | csvtk pretty -t
    
query               taxid        kingdom   phylum   class                         order   family        genus               species
-----------------   ----------   -------   ------   ---------------------------   -----   -----------   -----------------   -------
Aaosphaeria arxii   1933378114                      DothideomycetesPleosporales           Aaosphaeria   Aaosphaeria arxii 

Cheers! 🍻

@jolespin

jolespin commented Dec 12, 2023

Am I doing this correctly?

Here is test.tsv:

gzip -d -c source_taxonomy.tsv.gz | head -n 10 > test.tsv
id_source	dataset	ncbi_taxid	lineage	class	order	family	genus	species	strain	notes	resolved_lineage
Aalte1	MycoCosm	5599	d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Alternaria;s__Alternaria alternata	Dothideomycetes	Pleosporales	Pleosporaceae	Alternaria	Alternaria alternata			True
Aaoar1	MycoCosm	1450171	d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii	Dothideomycetes	Pleosporales		Aaosphaeria	Aaosphaeria arxii			False
Abobi1	MycoCosm	137743	d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis	Agaricomycetes	Polyporales	Podoscyphaceae	Abortiporus	Abortiporus biennis			True
Abobie1	MycoCosm	137743	d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis	Agaricomycetes	Polyporales	Podoscyphaceae	Abortiporus	Abortiporus biennis			True
Abscae1	MycoCosm	90261	d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia caerulea	Mucoromycetes	Mucorales	Cunninghamellaceae	Absidia	Absidia caerulea			True
Absrep1	MycoCosm	90262	d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia repens	Mucoromycetes	Mucorales	Cunninghamellaceae	Absidia	Absidia repens			True
Acain1	MycoCosm	215250	d__Eukaryota;p__Basidiomycota;c__Exobasidiomycetes;o__Exobasidiales;f__Cryptobasidiaceae;g__Acaromyces;s__Acaromyces ingoldii	Exobasidiomycetes	Exobasidiales	Cryptobasidiaceae	Acaromyces	Acaromyces ingoldii		True
acanthamoeba_castellanii_str_neff_gca_000313135	EnsemblProtists	1257118	d__Eukaryota;p__Discosea;c__;o__Longamoebia;f__Acanthamoebidae;g__Acanthamoeba;s__Acanthamoeba castellanii		Longamoebia	Acanthamoebidae	Acanthamoeba	Acanthamoeba castellanii	Acanthamoeba castellanii str. Neff		False
Acastr1	MycoCosm	1307806	d__Eukaryota;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora strigata	Lecanoromycetes	Acarosporales	Acarosporaceae	Acarospora	Acarospora strigata			True

Now remove header, get source name and lineage, then pipe into taxonkit create-taxdump:

Aalte1	d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Alternaria;s__Alternaria alternata
Aaoar1	d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii
Abobi1	d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis
Abobie1	d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis
Abscae1	d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia caerulea
Absrep1	d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia repens
Acain1	d__Eukaryota;p__Basidiomycota;c__Exobasidiomycetes;o__Exobasidiales;f__Cryptobasidiaceae;g__Acaromyces;s__Acaromyces ingoldii
acanthamoeba_castellanii_str_neff_gca_000313135	d__Eukaryota;p__Discosea;c__;o__Longamoebia;f__Acanthamoebidae;g__Acanthamoeba;s__Acanthamoeba castellanii
Acastr1	d__Eukaryota;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora strigata

Piping the above into taxonkit:

cat test.tsv | tail -n +2 | cut -f1,4 | taxonkit create-taxdump --gtdb -O test_output/
17:37:32.975 [WARN] --gtdb-re-subs failed to extract ID for subspecies, the origninal value is used instead. e.g., Acastr1
17:37:32.978 [INFO] 9 records saved to test_output/taxid.map
17:37:32.978 [INFO] 47 records saved to test_output/nodes.dmp
17:37:32.979 [INFO] 47 records saved to test_output/names.dmp
17:37:32.979 [INFO] 0 records saved to test_output/merged.dmp
17:37:32.979 [INFO] 0 records saved to test_output/delnodes.dmp

Is this the correct usage?

When I try it with the full dataset, i get this error:

gzip -d -c source_taxonomy.tsv.gz | tail -n +2 | cut -f1,4 | taxonkit create-taxdump --gtdb -O taxdump/
17:38:53.517 [ERRO] invalid GTDB taxonomy record: MicG_I_3

@shenwei356
Owner Author

Well, it becomes a little bit complex.

It seems that the line containing MicG_I_3 is not in the format of d__xxx;p__xxx. Is it from another source?

At first glance, I thought the input data had complete information for all ranks. But it seems that the domain and phylum columns are missing.

$ csvtk headers -t t.tsv
id_source
dataset
ncbi_taxid
lineage
class
order
family
genus
species
strain
notes
resolved_lineage

For rows with a lineage in the format "d__xxx;p__xxx", domain and phylum can be extracted, so you can use the code I pasted below. If the MicG_I_3 row does not have enough lineage information, that's a problem.
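
Rows like MicG_I_3 can be found before running create-taxdump. A rough sketch, assuming two-column input (`id<TAB>lineage`) and assuming a well-formed record carries all seven GTDB rank prefixes in order, with empty names such as `f__` tolerated. This is not taxonkit's own validation rule, and the function name `find_bad_lineages` is hypothetical.

```shell
# Sketch: print rows whose lineage column is not a seven-rank GTDB string.
find_bad_lineages() {
    awk -F '\t' '$2 !~ /^d__[^;]*;p__[^;]*;c__[^;]*;o__[^;]*;f__[^;]*;g__[^;]*;s__[^;]*$/'
}

printf 'Aaoar1\td__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii\nMicG_I_3\t\n' \
    | find_bad_lineages
```

On the real data it could be fed the same two columns used earlier in this thread, e.g. `gzip -d -c source_taxonomy.tsv.gz | tail -n +2 | cut -f1,4 | find_bad_lineages`.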

@jolespin

Damn, looks like that one is missing a lineage field.

What would the command be if I had the following columns:

id_source, class, order, family, genus, species, strain (some fields might be empty here but not all fields)

gzip -d -c source_taxonomy.tsv.gz | tail -n +2 | cut -f1,5,6,7,8,9,10 | taxonkit create-taxdump

Would -A 0 and -R be the next args?

@shenwei356
Owner Author

zcat source_taxonomy.tsv.gz \
    | csvtk fix -t \
    | csvtk cut -t -f id_source,class,order,family,genus,species,strain \
    | taxonkit create-taxdump -A 1 -O taxdump

@jolespin

Do I need to specify to create-taxdump what the class, order, family, genus, species, and strain columns are or does it autodetect?

@shenwei356
Owner Author

https://bioinf.shenwei.me/taxonkit/usage/#create-taxdump

Please check the usage and example 4.

Input formats:

  2. Ranks can be given either via the first row or the flag --rank-names.

Flags:

  -R, --rank-names strings          names of all ranks, leave it empty to use the first row of input as
                                    rank names
