TODO: new command to create taxdump files for MAG genome collections #56
Comments
Supporting stable/persistent TaxIds, so that they can be tracked? The TaxId of a genome/assembly is easily computed by hashing the
|
For taxa at species rank or above, hierarchical lineage information is needed to make the TaxIds unique and stable.
GTDB
MGV
Things are a little complicated.
Where |
Note that some species do not have a complete lineage, e.g., GCA_018897955.1 only has Kingdom, Phylum, and Species,
while GCA_016192455.1 does not even have a Phylum.
Two special cases where the Class and Genus have the same name
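To illustrate why the lineage matters in cases like these, here is a minimal sketch of a lineage-based TaxId. It is an assumption-laden illustration, not taxonkit's implementation: the function name `stable_taxid`, the key format, and the use of SHA-1 are all hypothetical stand-ins, and the taxon name "Foo" is a placeholder.

```python
import hashlib


def stable_taxid(lineage):
    """Map a lineage, given as (rank, name) pairs from the top rank down to the
    taxon itself, to a deterministic non-negative integer below 2**31."""
    # Hypothetical key format: "rank:name" items joined by ";".
    key = ";".join(f"{rank}:{name}" for rank, name in lineage).lower()
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    # Keep 31 bits so the value fits in a signed 32-bit integer.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


# Placeholder names: a class and a genus that share the name "Foo".
class_foo = [("class", "Foo")]
genus_foo = [("class", "Foo"), ("genus", "Foo")]

print(stable_taxid(class_foo))
print(stable_taxid(genus_foo))
```

Because the rank and the ancestors are part of the hashed key, the same name at two ranks hashes to different keys, and the same input always gives the same value across runs.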
|
Here's the alpha version: Currently, the Usage
Try it
Taxid changelog
Though
Let's see an
So here, |
I think I did it. Please visit: https://github.com/shenwei356/gtdb-taxdump Please try the new version: https://github.com/shenwei356/taxonkit/releases/tag/v0.11.0-alpha |
Is it possible to create a "custom" taxdump where the taxids are strings and the taxonomy info includes just class, order, family, genus, and species? |
You can define whatever rank you want.
That would not be the taxdump files. |
I've combined a bunch of different databases together but some do not have
Unfortunately, some have missing fields for different taxonomic levels. |
It's not a problem.
It's OK. See the last example.
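As a rough illustration of why empty ranks need not be a problem (a hypothetical sketch, not taxonkit's code and not the example referred to above): a taxon can simply be linked to its nearest ancestor that is present. The rank list, the null values, and the function name `parent_child_pairs` are assumptions made for this sketch.

```python
RANKS = ["class", "order", "family", "genus", "species"]
NULLS = {"", "NA", "NULL"}  # values treated as "missing" in this sketch


def parent_child_pairs(row):
    """Return (parent, child) pairs of (rank, name), skipping missing ranks."""
    present = [(rank, name) for rank, name in zip(RANKS, row) if name not in NULLS]
    pairs = []
    parent = ("root", "root")
    for taxon in present:
        pairs.append((parent, taxon))
        parent = taxon
    return pairs


# Order and family are missing, so the genus hangs directly under the class.
for parent, child in parent_child_pairs(["C1", "", "", "G1", "S1"]):
    print(parent, "->", child)
```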
Cheers! 🍻 |
Am I doing this correctly? Here is
Now remove header, get source name and lineage, then pipe into
Piping the above into taxonkit:
Is this the correct usage? When I try it with the full dataset, I get this error:
|
Well, it becomes a little bit complex. It seems that the line containing At first glance, I thought the input data had complete information for all ranks, but it seems that domain and phylum are missing.
For rows with a lineage in the format "d_xxx,p_xxx", domain and phylum can be extracted, so you can use the code I pasted below. If the line of |
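A minimal sketch of that extraction step, for illustration only (this is not the code originally pasted in this comment). It assumes the lineage is a comma-separated string whose items carry a one-letter rank prefix, as in "d_xxx,p_xxx"; the function name `extract_ranks` and the prefix table are hypothetical.

```python
import re

# Hypothetical mapping from one-letter prefixes to rank names.
PREFIX_TO_RANK = {"d": "domain", "p": "phylum"}
# "_+" accepts both the "d_xxx" form and the GTDB-style "d__xxx" form.
PATTERN = re.compile(r"^([a-z])_+(.+)$")


def extract_ranks(lineage):
    """Return a dict such as {"domain": ..., "phylum": ...} for the prefixes found."""
    result = {}
    for part in lineage.split(","):
        match = PATTERN.match(part.strip())
        if match and match.group(1) in PREFIX_TO_RANK:
            result[PREFIX_TO_RANK[match.group(1)]] = match.group(2)
    return result


print(extract_ranks("d_Bacteria,p_Firmicutes"))
```

Rows without a recognizable prefix yield an empty dict, so they can be detected and handled separately.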
Damn, looks like that one is missing a lineage field. What would the command be if I had the following columns: id_source, class, order, family, genus, species, strain (some fields might be empty here but not all fields)
Would |
|
Do I need to specify to create-taxdump what the class, order, family, genus, species, and strain columns are or does it autodetect? |
https://bioinf.shenwei.me/taxonkit/usage/#create-taxdump Please check the usage and example 4.
|
It's similar to gtdb_to_taxdump, but more generalized to support MGV.
The input would be: