When converting 'taxid' into full taxonomy from prot.accession2taxid , the program terminated after an error is reported #55

Closed
Neal050617 opened this issue Feb 15, 2022 · 10 comments

@Neal050617

Prerequisites

taxonkit v0.9.0
go version go1.17.7 linux/amd64

Describe your issue

wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
sed '1d' prot.accession2taxid | csvtk cut -t -f 2,3 | taxonkit lineage -i 3 \
    | taxonkit reformat -i 3 -f "{k};{p};{c};{o};{f};{g};{s};{t}" -F -P -S -j 24 \
    | csvtk cut -t -f 1,2,4 \
    | csvtk add-header -t -n accession,taxid,taxonomy > nr.tax

[ERRO] parse error on line 37402419, column 96: bare " in non-quoted-field

However, I could not find any quotation marks in the data.

@shenwei356
Owner

Add the flag -l (--lazy-quotes) to the csvtk commands.

Besides, taxonkit reformat accepts TaxIds as input, so there's no need to run taxonkit lineage before it.

  -I, --taxid-field int                field index of taxid. input data should be tab-separated. it overrides -i/--lineage-field

Also see example 1 on the usage page.

sed 1d prot.accession2taxid \
    | csvtk cut -l -t -f 2,3 \
    | taxonkit reformat -I 2 -f "{k};{p};{c};{o};{f};{g};{s};{t}" -F -P -S \
    | csvtk cut -l -t -f 1,2,3 \
    | csvtk add-header -l -t -n accession,taxid,taxonomy \
    > nr.tax

As for no quotation marks being detected: they may occur in some taxonomic names. Here they are:

$ taxonkit list --ids 1 -I "" | taxonkit lineage -L -n -r  | grep '"' | more
1906029 Nostoc sp. 'Peltigera sp. "hawaiensis" P1236 cyanobiont'        species
2727889 Pleurocapsales cyanobacterium 'Beach rock 4+5"' species
1920041 Expression vector "pure" split-T7P564   species
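
To locate such records in any intermediate tab-separated stream before it reaches csvtk, a plain awk scan is one option. This is only a minimal sketch: the helper name find_bare_quotes and the sample rows (including the accessions) are made up for illustration, and it reports the line and field index of each quote-containing field rather than the character column that csvtk prints.

```shell
# Hypothetical helper (not part of csvtk or taxonkit): print
# "line:field<TAB>content" for every tab-separated field containing a double quote.
find_bare_quotes() {
    awk -F'\t' '{
        for (i = 1; i <= NF; i++)
            if (index($i, "\"") > 0)
                printf "%d:%d\t%s\n", NR, i, $i
    }' "$@"
}

# Invented sample rows; the second name is taken from the listing above.
sample=$(printf 'ACC1\t9606\tHomo sapiens\nACC2\t1920041\tExpression vector "pure" split-T7P564\n')
result=$(printf '%s\n' "$sample" | find_bare_quotes)
printf '%s\n' "$result"
```

In a real pipeline the function could read the output of the taxonkit step through a pipe, or take a file name as its argument.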

@Neal050617
Author

Thanks, Dr. Shen. You wrote such good software, and I'm sorry I didn't follow your tutorial carefully.
After using the revised script, a new error occurred.
Thank you for taking the time.

Here is the new error message:
#########################################
panic: runtime error: index out of range [0] with length 0

goroutine 124244 [running]:
github.com/shenwei356/taxonkit/taxonkit/cmd.glob..func8.1({0xc03b35d398, 0xc03ada2a00})
/home/shenwei/shenwei/scripts/go/src/github.com/shenwei356/taxonkit/taxonkit/cmd/reformat.go:434 +0x1dba
github.com/shenwei356/breader.(*BufferedReader).run.func2.1({0x7265746361626f72, {0xc03b5a6800, 0x7265686373455f5f, 0x5f733b6169686369}})
/home/shenwei/shenwei/scripts/go/pkg/mod/github.com/shenwei356/breader@v0.3.1/BufferedReader.go:177 +0x1bd
created by github.com/shenwei356/breader.(*BufferedReader).run.func2
/home/shenwei/shenwei/scripts/go/pkg/mod/github.com/shenwei356/breader@v0.3.1/BufferedReader.go:169 +0xee

@shenwei356
Owner

Yes, it's a bug, but it only occurs when the input contains deleted TaxIds and the flag -F/--fill-miss-rank is used.

You may have used accession2taxid and taxdump files that do not match (different versions), with some TaxIds in the accession2taxid file having been deleted from the taxdump files.
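
One way to check for such a mismatch is to list the TaxIds in the accession table that do not appear in nodes.dmp. The following is a rough sketch under stated assumptions: the helper name missing_taxids and the file contents are invented, the TaxId is assumed to sit in column 2 of the table (as after the csvtk cut step above), and nodes.dmp is assumed to use the usual tab-pipe-tab layout with the TaxId in the first field. A TaxId missing from nodes.dmp may have been deleted or merged, which delnodes.dmp and merged.dmp in the same taxdump should tell apart.

```shell
# Hypothetical helper: print each TaxId from column 2 of a tab-separated
# table (arg 2) that is absent from a nodes.dmp-style file (arg 1).
missing_taxids() {
    awk -F'\t' '
        NR == FNR { known[$1] = 1; next }          # first file: TaxId is field 1
        !($2 in known) && !seen[$2]++ { print $2 } # second file: report once
    ' "$1" "$2"
}

tmp=$(mktemp -d)
# Tiny invented stand-ins for nodes.dmp and the accession/TaxId table.
printf '1\t|\t1\t|\tno rank\t|\n9606\t|\t9605\t|\tspecies\t|\n' > "$tmp/nodes.dmp"
printf 'ACC1.1\t9606\nACC2.1\t3\nACC3.1\t3\n' > "$tmp/acc2taxid.tsv"
missing=$(missing_taxids "$tmp/nodes.dmp" "$tmp/acc2taxid.tsv")
printf '%s\n' "$missing"
rm -r "$tmp"
```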

I've fixed it, please use the binaries below.

@Neal050617
Author

It worked! Thanks a lot.

@SergeyBaikal

SergeyBaikal commented Nov 7, 2022

Dear developers! Could you please clarify whether the following is correct? I also had an error, but it disappeared after adding -l. My goal is to count the unique ranks. In the input file, I just have a taxon column.

taxonkit lineage taxid.txt | awk '$2!=""' > lineage.txt
taxonkit reformat lineage.txt | tee lineage.txt.reformat
cut -f 1,3 lineage.txt.reformat

cat lineage.txt \
    | taxonkit reformat -I 1 -F -f "{f}" \
    | csvtk -l -H -t cut -f 1,3 \
    | csvtk -H -t sep -f 2 -s ';' -R \
    | csvtk add-header -t -n taxid,family \
    | csvtk -t csv2tab > Family.txt
	
awk '{$1=""}1' Family.txt | awk '{$1=$1}1' > Family_1col.txt           
	
cat Family_1col.txt | sort | uniq -c | sort -rn > unic_Family_all.txt

@shenwei356
Owner

Hi, I'd recommend using the commands below:

$ cat taxid.txt \
    | taxonkit reformat -I 1 -f '{f}' \
    | awk '$2!=""' \
    | csvtk freq -Ht -f 2 -nr

22:24:14.716 [WARN] taxid 123124124 not found
22:24:14.716 [WARN] taxid 3 was deleted
22:24:14.716 [WARN] taxid 92489 was merged into 796334
Akkermansiaceae 2
Bovidae 1
Comamonadaceae  1
Erwiniaceae     1
Francisellaceae 1
Hominidae       1
Retroviridae    1
Siphoviridae    1

@SergeyBaikal

Thank you! Well done. Now it is much better than it was before!

@SergeyBaikal

SergeyBaikal commented Nov 8, 2022

Why does the program find only 12 taxa out of 14? What needs to be updated?

137758
137758
64279
64279
137758
1955153
2584979
1673646
103782
137758
2093224
1408133
291286
2786748

Potyviridae 5
Dicistroviridae 2
Closteroviridae 1
Cystoviridae 1
Endornaviridae 1
Nodaviridae 1
Picobirnaviridae 1

@shenwei356
Owner

It's easy to explain: the lineages of some TaxIds have changed. See https://github.com/shenwei356/taxid-changelog/ . You can check the changes of the TaxIds above (attached: taxid.log.tsv.gz):

csvtk grep -f taxid -P taxid.txt taxid-changelog.csv.gz > taxid.log.tsv

So the result could change when using a different version of the NCBI taxdump file.
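
To see which input TaxIds dropped out of the count, it may help to list the rows whose family column came back empty before they are filtered away by awk '$2!=""'. A minimal sketch, assuming two-column tab-separated input in the shape of the taxonkit reformat -I 1 -f '{f}' output; the helper name empty_rank and the sample TaxIds and family names are invented:

```shell
# Hypothetical helper: print the TaxIds (column 1) whose second column is empty.
empty_rank() {
    awk -F'\t' '$2 == "" { print $1 }' "$@"
}

# Invented rows: TaxIds 102 and 104 have no family assigned.
rows=$(printf '101\tFamilyA\n102\t\n103\tFamilyB\n104\t\n')
unresolved=$(printf '%s\n' "$rows" | empty_rank)
printf '%s\n' "$unresolved"
```

The TaxIds printed this way are the ones worth looking up in the changelog.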

@SergeyBaikal

@shenwei356 Thanks a lot!

3 participants