Add COG assignments to annotations #85

Closed
tseemann opened this issue Mar 15, 2015 · 18 comments

@tseemann
Owner

Use the COG.hmm you have, and pre-assign COG to all sprot proteins

Is there a master file of GI => COG at NCBI CDD anywhere?

@tseemann tseemann self-assigned this Mar 15, 2015
@aleimba

aleimba commented Mar 16, 2015

Hi @tseemann,

I use NCBI's rpsblast+ to assign COGs via CDD's PSSMs, which has good sensitivity (https://github.com/aleimba/bac-genomics-scripts/tree/master/cdd2cog).
I pretty much follow JGI's IMG/ER system (e.g.: http://standardsingenomics.org/index.php/sigen/article/download/sigs.632/22). When I tested it, though, my script still gave some minor differences from IMG; no idea why. There's a file which maps the PSSM-ID to the COG number (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/cddid.tbl.gz), but I don't know of a file in CDD which maps GI to COG.
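
A minimal sketch of the PSSM-ID to COG lookup described above, not the cdd2cog implementation. It assumes `cddid.tbl` is tab-separated with the PSSM-ID in the first column and the model accession (e.g. `COG0001`) in the second, and that RPS-BLAST hits have already been reduced to `(query, PSSM-ID, e-value)` tuples. The function names and the demo rows are hypothetical.

```python
import csv
import io


def load_pssm_to_cog(handle):
    """Return {PSSM-ID: COG accession} from a cddid.tbl-style table.

    Assumed layout (tab-separated): PSSM-ID, accession, short name,
    description, PSSM length. Non-COG models (cd, pfam, ...) are skipped.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2 and row[1].startswith("COG"):
            mapping[row[0]] = row[1]
    return mapping


def best_cog_per_query(hits, pssm_to_cog):
    """Keep, for each query protein, the COG hit with the lowest e-value.

    hits: iterable of (query_id, pssm_id, evalue) tuples (assumed input shape).
    """
    best = {}
    for query, pssm_id, evalue in hits:
        cog = pssm_to_cog.get(pssm_id)
        if cog is None:
            continue  # hit to a non-COG model
        if query not in best or evalue < best[query][1]:
            best[query] = (cog, evalue)
    return {query: cog for query, (cog, _) in best.items()}


# Made-up rows for illustration only, not real CDD content.
demo_table = io.StringIO(
    "100001\tCOG0001\tHemL\tillustrative description\t432\n"
    "100002\tcd00001\tExample\tillustrative description\t151\n"
)
pssm_to_cog = load_pssm_to_cog(demo_table)
demo_hits = [
    ("prot1", "100001", 1e-50),
    ("prot1", "100002", 1e-80),
    ("prot2", "100002", 1e-5),
]
assignments = best_cog_per_query(demo_hits, pssm_to_cog)
```

Taking only the single best hit is a simplification; a protein with several domains may legitimately belong to more than one COG.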

However, there's been a COG update recently (http://www.ncbi.nlm.nih.gov/pubmed/25428365). I have not adapted my script yet to the new data structure, but there're several new files on the FTP server which might be of interest. Of course there's no description associated with these: ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data/

Useful files are prot2003-2014.tab, which associates RefSeq protein accession numbers to GI numbers, and cog2003-2014.csv, which associates the GI number with the COG number. I'm guessing that's what you're looking for.
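
A rough sketch of joining the two files to get RefSeq accession to COG, under stated assumptions. The column positions are guesses exposed as parameters (GI in the third and COG in the seventh column of the CSV, GI then RefSeq accession in the tab file), so check them against the actual files. All function names and demo values are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_gi_to_cogs(handle, gi_col=2, cog_col=6):
    """Read a cog2003-2014.csv-style file into {GI: set of COG IDs}.

    Column indices are assumptions; adjust them to the real layout.
    """
    gi_to_cogs = defaultdict(set)
    for row in csv.reader(handle):
        if len(row) > max(gi_col, cog_col):
            gi_to_cogs[row[gi_col].strip()].add(row[cog_col].strip())
    return dict(gi_to_cogs)


def load_gi_to_refseq(handle, gi_col=0, acc_col=1):
    """Read a prot2003-2014.tab-style file into {GI: RefSeq accession}."""
    gi_to_acc = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > max(gi_col, acc_col):
            gi_to_acc[row[gi_col].strip()] = row[acc_col].strip()
    return gi_to_acc


def refseq_to_cogs(gi_to_cogs, gi_to_acc):
    """Join the two tables on GI; GIs without an accession are dropped."""
    joined = defaultdict(set)
    for gi, cogs in gi_to_cogs.items():
        acc = gi_to_acc.get(gi)
        if acc:
            joined[acc].update(cogs)
    return dict(joined)


# Made-up records for illustration only.
demo_cog_csv = io.StringIO(
    "1001,Example_genome,1001,432,1,432,COG0001,0,\n"
    "1002,Example_genome,1002,300,1,150,COG0002,0,\n"
    "1002,Example_genome,1002,300,151,300,COG0003,0,\n"
)
demo_prot_tab = io.StringIO("1001\tYP_000001.1\n1002\tYP_000002.1\n")

acc_to_cogs = refseq_to_cogs(
    load_gi_to_cogs(demo_cog_csv), load_gi_to_refseq(demo_prot_tab)
)
```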

@tseemann
Owner Author

Many years ago I built HMMs of COGs from their MSAs. It worked faster and was more sensitive than RPS-BLAST and PSSMs. They were actually used in the original BASH based Prokka system!

THANK YOU for the link to the new COG !!! I thought it was on its last legs, but it lives on!

@aleimba

aleimba commented Mar 17, 2015

yep, thought so too, but COGs seem to be tough.
HMMs sound good, however you'll need to maintain them in case of future COG releases ;-).

@VidJa

VidJa commented Feb 10, 2016

I did some tests with uproc for COG annotation (based on the December 2014 dataset). It's blazing fast (a few seconds on 16 cores for an average bacterial genome), but probably not as accurate as HMM-based searches. I would like to compare them, though.

@tseemann
Owner Author

@aleimba It seems the new COG doesn't actually have any new models - it has simply assigned COGs to all the CDS in all the new genomes added to GenBank since the original COG :-(

@tseemann
Owner Author

@VidJa uproc sounds pretty good, but the database is quite large. I'll keep it in mind!

@aleimba

aleimba commented Feb 11, 2016

@tseemann Hm, bummer. Thanks for looking into it! I saw a talk by Michael Galperin a while ago, and I vaguely remember this being the case, but at least they made some corrections and additions to the existing COGs. This will hopefully improve assignment for "under-sequenced" species ...

@VidJa uproc is from my university, but I had never seen it. Thanks for posting, it looks interesting.

@VidJa

VidJa commented Feb 11, 2016

I'm also testing uproc with EggNOG (the bacterial section). The HMM models for EggNOG are available as well. At least it includes more species, although it is not manually curated.

@tseemann
Owner Author

@VidJa I also have the EggNOG models and was playing around with them. FigFAMs as well. I wish all these protein clustering people would get together and curate! It's hard to assess them, and which one is better often depends on the species.

@tseemann
Owner Author

@aleimba I've finally started on COG support: https://github.com/tseemann/prokka/releases/tag/v1.13

@aleimba

aleimba commented Mar 29, 2018

@tseemann awesome!

How did you assign COGs to your reference databases? Did you build new COG HMMs?

I have a long train ride tomorrow and out of coincidence was thinking about finally starting to update cdd2cog ...

@tseemann
Owner Author

@aleimba No - I just took the COG annotation out of the swissprot .dat file.

I also discovered that swissprot has LOTS of near-duplicate proteins which don't have the COG. It's very inconsistent. That's why I call it "preliminary support" :)

I want to do some proper clustering and combine the new NCBI "prk" clusters with the COG annotations etc. It's a big job.
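
A minimal sketch of pulling COG IDs out of a Swiss-Prot flat file, not Prokka's actual code. It assumes the COG ID appears as a cross-reference on a `DR   eggNOG; COGnnnn; ...` line, that `AC` lines carry the accession, and that `//` ends each record. The function name and sample records are hypothetical. Entries without a COG come back with an empty list, which makes the inconsistency mentioned above easy to count.

```python
import io
import re

# Assumed cross-reference format: "DR   eggNOG; COG0001; Bacteria."
COG_DR = re.compile(r"^DR   eggNOG; (COG\d+);")


def cogs_from_swissprot(handle):
    """Return {primary accession: [COG IDs]} for every record in the file."""
    result = {}
    accession, cogs = None, []
    for line in handle:
        if line.startswith("AC   ") and accession is None:
            # The first accession on the first AC line is the primary one.
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("//"):
            # End of record: store it and reset the per-record state.
            if accession is not None:
                result[accession] = cogs
            accession, cogs = None, []
        else:
            match = COG_DR.match(line)
            if match:
                cogs.append(match.group(1))
    return result


# Made-up records for illustration only.
demo_dat = io.StringIO(
    "ID   TEST1_ECOLI             Reviewed;         100 AA.\n"
    "AC   X00001; X00002;\n"
    "DR   eggNOG; COG0001; Bacteria.\n"
    "DR   eggNOG; ENOG0000001; Bacteria.\n"
    "//\n"
    "ID   TEST2_ECOLI             Reviewed;         100 AA.\n"
    "AC   X00003;\n"
    "//\n"
)
cog_by_accession = cogs_from_swissprot(demo_dat)
missing = [acc for acc, cogs in cog_by_accession.items() if not cogs]
```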

@vbonnici

Thank you for the COG support.
Is there any way of building a new database that has COG features?
I usually download assemblies from NCBI (in gbff format) to build my species-specific database,
but I can't find a way to include COGs.

many thanks

@apredeus

I guess I might as well ask it here.

What would you say would be the best way to do species-specific EC assignment? I'm currently annotating a very big number of Salmonella genomes, and I've compiled a nice collection of IPG reference proteins, and then overlapped it with pan-genome of key strains to get the names etc. But I can't think of a good way to get the enzymatic annotation for a big set of proteins.

Thanks in advance!

@tseemann
Owner Author

The /EC_number would ideally be in the NCBI PGAP annotations of the reference strains in your collection. You could use roary to cluster homologs and copy the EC assignments. The only other way is to find orthologs in swissprot/trembl and copy the EC from those records. This is a common problem and hard to solve because EC, COG, etc. are not uniformly assigned in the various databases. The KEGG protein DB would be good but it is no longer free.
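
A small sketch of the "cluster homologs and copy the EC" idea, assuming the clustering has already been done (by roary or any other tool) and loaded as cluster name to member protein IDs. It does not read roary's output files. The function name, input shapes and IDs are hypothetical. Clusters whose annotated members disagree are skipped rather than guessed.

```python
def propagate_ec(clusters, known_ec):
    """Copy EC numbers to unannotated members of each homolog cluster.

    clusters: {cluster_name: [protein IDs]}
    known_ec: {protein ID: EC number string} from reference annotations
    Returns {protein ID: EC number} for newly annotated proteins only.
    """
    inferred = {}
    for members in clusters.values():
        ecs = {known_ec[m] for m in members if m in known_ec}
        if len(ecs) != 1:
            continue  # no annotated member, or conflicting EC numbers
        ec = next(iter(ecs))
        for member in members:
            if member not in known_ec:
                inferred[member] = ec
    return inferred


# Made-up clusters and IDs for illustration only.
demo_clusters = {
    "group_1": ["ref_a", "new_1", "new_2"],   # one consistent EC
    "group_2": ["ref_b", "ref_c", "new_3"],   # conflicting ECs, skipped
    "group_3": ["new_4"],                     # nothing to copy from
}
demo_known = {"ref_a": "2.7.1.1", "ref_b": "1.1.1.1", "ref_c": "1.1.1.2"}
inferred_ec = propagate_ec(demo_clusters, demo_known)
```

Requiring a single agreed EC per cluster is a conservative choice; a majority vote would annotate more proteins at the cost of more errors.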

@apredeus

apredeus commented Oct 23, 2018

Thank you very much, this is sort of what I suspected - but it's good to know for sure.

@tseemann
Owner Author

I would love to have a nice, clean, thorough database of bacterial proteins with EC, COG, GO etc! Given time/money I could probably achieve it, but those things are not forthcoming.

@kustustrica

kustustrica commented Nov 9, 2020

@tseemann Thank you so much for getting started on COG support!
I may simply not have found how to do this, but is it possible to add arCOGs to the genome annotation?

many thanks


6 participants