Add COG assignments to annotations #85
Comments
Hi @tseemann, I use NCBI's However, there's been a COG update recently (http://www.ncbi.nlm.nih.gov/pubmed/25428365). I have not adapted my script yet to the new data structure, but there are several new files on the FTP server which might be of interest. Of course there's no description associated with these: ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data/ Useful files are prot2003-2014.tab, which associates RefSeq protein accession numbers to GI numbers, and cog2003-2014.csv, which associates the GI number with the COG number. I'm guessing that's what you're looking for.
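For anyone wanting to try the two-file join described above, here is a minimal Python sketch. The column positions are assumptions, since the files come without a description: it assumes prot2003-2014.tab is tab-separated with the GI number in the first column and the RefSeq accession in the second, and that cog2003-2014.csv is comma-separated with the GI number in the third column and the COG number in the seventh. Check a few lines of the real files and adjust the `*_col` parameters if they differ. The function names and the toy rows are made up for illustration.

```python
import csv
import io


def load_gi_to_refseq(prot_tab, gi_col=0, acc_col=1):
    """Map GI number -> RefSeq accession from a tab-separated file object.

    Column positions are assumptions; adjust them to the real file layout.
    """
    mapping = {}
    for row in csv.reader(prot_tab, delimiter="\t"):
        if len(row) <= max(gi_col, acc_col):
            continue  # skip short or empty lines
        mapping[row[gi_col].strip()] = row[acc_col].strip()
    return mapping


def refseq_to_cogs(cog_csv, gi_to_refseq, gi_col=2, cog_col=6):
    """Map RefSeq accession -> sorted list of COG ids.

    A protein can appear on several rows (one per domain), so each
    accession may end up with more than one COG.
    """
    result = {}
    for row in csv.reader(cog_csv):
        if len(row) <= max(gi_col, cog_col):
            continue
        acc = gi_to_refseq.get(row[gi_col].strip())
        if acc is None:
            continue  # GI without a known RefSeq accession
        result.setdefault(acc, set()).add(row[cog_col].strip())
    return {acc: sorted(cogs) for acc, cogs in result.items()}


# Toy data in the ASSUMED layout (made-up identifiers, not real records).
toy_prot = io.StringIO(
    "111\tYP_000001.1\tAAA00001.1\n"
    "222\tYP_000002.1\tAAA00002.1\n"
)
toy_cog = io.StringIO(
    "111,Some_genome,111,300,1,150,COG0001,0,\n"
    "111,Some_genome,111,300,151,300,COG0002,0,\n"
    "333,Some_genome,333,200,1,200,COG0003,0,\n"
)

demo = refseq_to_cogs(toy_cog, load_gi_to_refseq(toy_prot))
print(demo)
```

With the real files, pass `open("prot2003-2014.tab")` and `open("cog2003-2014.csv")` in place of the toy `StringIO` objects.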
Many years ago I built HMMs of COGs from their MSAs. It worked faster and was more sensitive than RPS-BLAST and PSSMs. They were actually used in the original BASH based Prokka system! THANK YOU for the link to the new COG !!! I thought it was on its last legs, but it lives on!
yep, thought so too, but COGs seem to be tough.
@aleimba It seems the new COG doesn't actually have any new models - it has simply just assigned COGs to all the CDS in all the new genomes in Genbank since the original COG :-(
@VidJa uproc sounds pretty good, but the database size is pretty big, I'll keep it in mind!
@tseemann Hm, bummer. Thanks for looking into it! I saw a talk from Michael Galperin a while ago, I vaguely remember this being the case, but at least they did some corrections and additions to the existing COGs. This will hopefully improve assignment to "under-sequenced" species ... @VidJa uproc is from my university, but never seen it. Thanks for posting, looks interesting.
I'm also testing uproc with EggNOG (the bacterial section). The HMM models for EggNOG are available as well. At least it includes more species although not manually curated.
@VidJa I also have the EggNOG models and was playing around with them. FigFAMs as well. I wish all these protein clustering people would get together and curate! It's hard to assess them. And it's often species dependent which is better.
@aleimba I've finally started on COG support: https://github.com/tseemann/prokka/releases/tag/v1.13
@tseemann awesome! How did you assign COGs to your reference databases? Did you build new COG HMMs? I have a long train ride tomorrow and out of coincidence was thinking about finally starting to update |
@aleimba No - I just took the COG annotation out of the swissprot .dat file. I also discovered that swissprot has LOTS of near-duplicate proteins which don't have the COG. It's very inconsistent. That's why I call it "preliminary support" :) I want to do some proper clustering and combine the new NCBI "prk" clusters and the COG annotations etc. It's a big job.
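To illustrate the idea of pulling COG ids out of a Swiss-Prot flat file, here is a rough Python sketch. It is not Prokka's actual code. It assumes the COG id shows up as a `COGnnnn` token on a `DR` (cross-reference) line, which depends on the UniProt release and on which cross-referenced database carries it, so treat the regular expression as a starting point. The function name and the toy records are hypothetical.

```python
import re

# ASSUMPTION: COG ids appear as "COG" plus four digits somewhere on DR lines.
COG_RE = re.compile(r"\bCOG\d{4}\b")


def cogs_from_dat(lines):
    """Yield (primary_accession, [COG ids]) for each flat-file record.

    Records are terminated by a line starting with "//". Only the first
    accession on the first AC line is used as the primary accession.
    """
    acc = None
    cogs = []
    for line in lines:
        if line.startswith("AC   ") and acc is None:
            acc = line[5:].split(";")[0].strip()
        elif line.startswith("DR   "):
            for cog in COG_RE.findall(line):
                if cog not in cogs:
                    cogs.append(cog)
        elif line.startswith("//"):
            if acc is not None:
                yield acc, cogs
            acc, cogs = None, []


# Toy records (made-up accessions) in the general flat-file shape.
toy_dat = """\
ID   TEST1_ECOLI             Reviewed;          10 AA.
AC   P00001; Q00001;
DR   eggNOG; COG0001; Bacteria.
//
ID   TEST2_ECOLI             Reviewed;          12 AA.
AC   P00002;
DR   PDB; 1ABC; X-ray; 2.00 A; A=1-12.
//
""".splitlines()

parsed = list(cogs_from_dat(toy_dat))
print(parsed)
```

Records that come back with an empty list are the ones without a COG cross-reference, which is one way to see how many near-duplicate proteins are missing the annotation.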
Thank you for the COG support. many thanks
I guess I might as well ask it here. What would you say would be the best way to do species-specific EC assignment? I'm currently annotating a very large number of Salmonella genomes, and I've compiled a nice collection of IPG reference proteins, and then overlapped it with the pan-genome of key strains to get the names etc. But I can't think of a good way to get the enzymatic annotation for a big set of proteins. Thanks in advance!
The
Thank you very much, this is sort of what I suspected - but it's good to know for sure.
I would love to have a nice, clean, thorough database of bacterial proteins with EC, COG, GO etc! Given time/money I could probably achieve it, but those things are not forthcoming.
@tseemann Thank you so much for getting started on COG support! many thanks
Use the COG.hmm you have, and pre-assign COG to all sprot proteins
Is there a master file of GI => COG at NCBI CDD anywhere?