update readme
shenwei356 committed Apr 26, 2022
1 parent 199d465 commit fc6fa87
Showing 2 changed files with 160 additions and 1 deletion.
54 changes: 53 additions & 1 deletion README.md
@@ -20,6 +20,7 @@ to create NCBI-style taxdump files for any taxonomy dataset, including GTDB and
* [Download](#download)
* [Results](#results)
+ [Taxon history of *Escherichia coli*](#taxon-history-of-escherichia-coli)
+ [Species of the genus *Escherichia*](#species-of-the-genus-escherichia)
+ [Common manipulations](#common-manipulations)
* [Citation](#citation)
* [Contributing](#contributing)
@@ -149,7 +150,8 @@ Frequency of species
$ csvtk freq -t -nr -f species taxid.map.stats.tsv \
> taxid.map.stats.freq-species.tsv
$ head -n 21 taxid.map.stats.freq-species.tsv | csvtk pretty -t
$ head -n 21 taxid.map.stats.freq-species.tsv \
| csvtk pretty -t
species frequency
-------------------------- ---------
Escherichia coli 26859
@@ -256,6 +258,56 @@ also shows the taxonomic information of current version (R207) and the taxon his
|R207 |d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli |
|R202 |d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia flexneri|

### Species of the genus Escherichia

# set the directory of the taxdump files
export TAXONKIT_DB=gtdb-taxdump/R207

$ taxonkit list --ids 3334977531 -I "" \
| taxonkit filter -E species \
| taxonkit lineage -Lnr \
| tee Escherichia.tsv
205980079 Escherichia ruysiae species
266095079 Escherichia marmotae species
1474290498 Escherichia sp004211955 species
2519456452 Escherichia sp002965065 species
2673387089 Escherichia coli_E species
2878517233 Escherichia albertii species
3072179875 Escherichia fergusonii species
3820717297 Escherichia sp005843885 species
4093283224 Escherichia coli species
4221400829 Escherichia sp001660175 species

$ csvtk join -Ht Escherichia.tsv \
<(cut -f 1 Escherichia.tsv \
| rush 'echo -ne "{}\t$(taxonkit list --ids {} -I "" \
| taxonkit filter -L species | wc -l)\n"') \
| csvtk add-header -t -n "taxid,name,rank,#assembly" \
| csvtk sort -t -k "#assembly:nr" -k name \
| csvtk csv2md -t
|taxid |name |rank |#assembly|
|:---------|:----------------------|:------|:--------|
|4093283224|Escherichia coli |species|26859 |
|2878517233|Escherichia albertii |species|107 |
|266095079 |Escherichia marmotae |species|82 |
|3072179875|Escherichia fergusonii |species|77 |
|3820717297|Escherichia sp005843885|species|37 |
|205980079 |Escherichia ruysiae |species|36 |
|4221400829|Escherichia sp001660175|species|3 |
|1474290498|Escherichia sp004211955|species|2 |
|2673387089|Escherichia coli_E |species|1 |
|2519456452|Escherichia sp002965065|species|1 |

What is *Escherichia coli_E*? It contains only one genome: [GCF_011881725.1](https://gtdb.ecogenomic.org/genome?gid=GCF_011881725.1)

$ taxonkit list --ids 2673387089 -nr
2673387089 [species] Escherichia coli_E
1744010345 [no rank] 011881725

$ grep 1744010345 gtdb-taxdump/R207/taxid.map
GCF_011881725.1 1744010345

### Common manipulations

Besides the four taxdump files, we provide a `taxid.map` file, which maps genome accessions to TaxIds.
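
For a quick lookup of a single accession in `taxid.map`, an exact match on the first column avoids the substring hits that a plain `grep` could return. The following is only a minimal sketch: `acc2taxid` is a hypothetical helper (not part of TaxonKit), and it assumes each line holds an accession and a TaxId separated by whitespace, as in the `grep` output shown earlier.

```shell
# Hypothetical helper (not part of TaxonKit): print the TaxId mapped to an accession.
# Assumption: each line of taxid.map is "<accession><whitespace><TaxId>".
acc2taxid() {
    # $1: genome accession, $2: path to taxid.map
    awk -v acc="$1" '$1 == acc { print $2; exit }' "$2"
}

# Demo on a one-line sample copied from the example above.
map=$(mktemp)
printf 'GCF_011881725.1\t1744010345\n' > "$map"
acc2taxid GCF_011881725.1 "$map"
```

With the real data, the second argument would be `gtdb-taxdump/R207/taxid.map`. An accession that is not in the file prints nothing.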
107 changes: 107 additions & 0 deletions gtdb-taxdump/R207/ranks.txt
@@ -0,0 +1,107 @@

# This file defines taxonomic rank order for taxdump/taxonkit.
#
# Here are the rules:
# 1. Blank lines or lines starting with "#" are ignored.
# 2. Ranks are in descending order and case is ignored.
# 3. Ranks with the same order should be on one line, separated by commas (",", no space).
# 4. Ranks without an order should each be given the prefix symbol "!".
#
# Default ranks are taken from https://en.wikipedia.org/wiki/Taxonomic_rank ,
# plus some ranks from the NCBI Taxonomy database.
#

!no rank
!clade


life

domain,superkingdom,realm,empire

kingdom
subkingdom
infrakingdom
parvkingdom

superphylum,superdivision
phylum,division
subphylum,subdivision
infraphylum,infradivision
microphylum,microdivision

superclass
class
subclass
infraclass
parvclass

superlegion
legion
sublegion
infralegion

supercohort
cohort
subcohort
infracohort

gigaorder
magnorder,megaorder
grandorder,capaxorder
mirorder,hyperorder
superorder
# series
order
# parvorder
nanorder
hypoorder
minorder
suborder
infraorder
parvorder

# section
# subsection

gigafamily
megafamily
grandfamily
hyperfamily
superfamily
epifamily
# series
group
family
subfamily
infrafamily

supertribe
tribe
subtribe
infratribe

genus
subgenus
section
subsection
series
subseries


superspecies,species group
species subgroup
species

subspecies,forma specialis,pathovar

pathogroup,serogroup
biotype,serotype,genotype

variety,varietas,morph,aberration
subvariety,subvarietas,submorph,subaberration
form,forma
subform,subforma

strain
isolate
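
The four rules in the header of `ranks.txt` can be sketched as a small awk filter. This is only an illustration of the file format under those stated rules, not TaxonKit's own parser; `rank_order` is a hypothetical helper name, and printing `-` for unordered ranks is a choice made here for readability.

```shell
# Hypothetical helper: print "<order><TAB><rank>" for a file that follows the
# rules described in the ranks.txt header. Unordered ranks get "-" as order.
rank_order() {
    awk '
        /^[ \t]*$/ || /^#/ { next }              # rule 1: skip blank lines and comments
        {
            line = tolower($0)                   # rule 2: case is ignored
            if (substr(line, 1, 1) == "!") {     # rule 4: "!" marks a rank without order
                print "-\t" substr(line, 2)
                next
            }
            order++                              # rule 2: earlier lines are higher ranks
            n = split(line, ranks, ",")          # rule 3: ranks on one line share an order
            for (i = 1; i <= n; i++) print order "\t" ranks[i]
        }
    ' "$1"
}

# Demo on a small sample in the same format.
sample=$(mktemp)
printf '# comment\n\n!no rank\nlife\ndomain,superkingdom\nKingdom\n' > "$sample"
rank_order "$sample"
```

In the sample, `domain` and `superkingdom` receive the same order number because they share a line, and `Kingdom` is lowercased before it is printed.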
