Add COG assignments to annotations #85

tseemann · 2015-03-15T02:20:17Z

Use the COG.hmm you have, and pre-assign COG to all sprot proteins

Is there a master file of GI => COG at NCBI CDD anywhere?

aleimba · 2015-03-16T17:23:02Z

I use NCBI's rpsblast+ to assign COGs via CDD's PSSMs, which has good sensitivity (https://github.com/aleimba/bac-genomics-scripts/tree/master/cdd2cog).
I pretty much follow JGI's IMG/ER system (e.g.: http://standardsingenomics.org/index.php/sigen/article/download/sigs.632/22). Although when I tested it my script still resulted in some minor differences to IMG, no idea why. There's a file which correlates the PSSM-ID to the COG number (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/cddid.tbl.gz), but I don't know of a file in CDD which correlates GI with COG.
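As a rough illustration of the PSSM-ID to COG lookup mentioned above, here is a minimal Python sketch. It assumes cddid.tbl is tab-delimited with the PSSM-ID in the first column and the accession (e.g. a COG number) in the second; those column positions and the function name `load_pssm_to_cog` are assumptions for illustration, so check them against the actual file.

```python
import io


def load_pssm_to_cog(handle, pssm_col=0, acc_col=1):
    """Return {PSSM-ID: COG accession} from a tab-delimited table.

    The default column positions are assumptions, not confirmed
    against the real cddid.tbl layout.
    """
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(pssm_col, acc_col):
            continue  # skip short or malformed rows
        accession = fields[acc_col].strip()
        # CDD mixes several source databases; keep only COG entries.
        if accession.startswith("COG"):
            mapping[fields[pssm_col].strip()] = accession
    return mapping


# Made-up rows, only to show the expected shape of the input.
demo = io.StringIO(
    "1001\tCOG0001\tHemL\tmade-up description\t400\n"
    "1002\tpfam00001\t7tm_1\tmade-up description\t250\n"
)
pssm_to_cog = load_pssm_to_cog(demo)
```

For the real file, something like `gzip.open("cddid.tbl.gz", "rt")` could be passed as the handle.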

However, there's been a COG update recently (http://www.ncbi.nlm.nih.gov/pubmed/25428365). I have not adapted my script yet to the new data structure, but there're several new files on the FTP server which might be of interest. Of course there's no description associated with these: ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data/

Useful files are prot2003-2014.tab, which associates RefSeq protein accession numbers to GI numbers, and cog2003-2014.csv, which associates the GI number with the COG number. I'm guessing that's what you're looking for.
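The two files can be chained to go from a RefSeq accession to its COG(s). The sketch below is only a guess at how that join might look: it assumes prot2003-2014.tab is tab-delimited with the GI in the first column and the RefSeq accession in the second, and that cog2003-2014.csv is comma-delimited with the GI in the third column and the COG number in the seventh. All column positions and function names are assumptions, exposed as parameters so they can be adjusted once the real files are inspected.

```python
import csv
import io
from collections import defaultdict


def load_acc_to_gi(handle, gi_col=0, acc_col=1):
    """Return {RefSeq accession: GI}; column positions are assumptions."""
    acc_to_gi = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(gi_col, acc_col):
            continue  # skip short or malformed rows
        acc_to_gi[fields[acc_col].strip()] = fields[gi_col].strip()
    return acc_to_gi


def load_gi_to_cogs(handle, gi_col=2, cog_col=6):
    """Return {GI: set of COG numbers}; column positions are assumptions."""
    gi_to_cogs = defaultdict(set)
    for row in csv.reader(handle):
        if len(row) <= max(gi_col, cog_col):
            continue
        gi_to_cogs[row[gi_col].strip()].add(row[cog_col].strip())
    return gi_to_cogs


def cogs_for_accession(accession, acc_to_gi, gi_to_cogs):
    """Look up the COGs for one RefSeq accession (empty list if unknown)."""
    gi = acc_to_gi.get(accession)
    if gi is None:
        return []
    return sorted(gi_to_cogs.get(gi, set()))


# Made-up rows, only to show the expected shape of the input.
prot_demo = io.StringIO("111\tNP_000001.1\tAAA00001.1\n")
cog_demo = io.StringIO("111,Genome_a,111,300,1,300,COG0001,0\n")
acc_to_gi = load_acc_to_gi(prot_demo)
gi_to_cogs = load_gi_to_cogs(cog_demo)
demo_cogs = cogs_for_accession("NP_000001.1", acc_to_gi, gi_to_cogs)
```

A protein can belong to more than one COG, which is why the GI maps to a set rather than a single value.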

tseemann · 2015-03-17T04:25:12Z

Many years ago I built HMMs of COGs from their MSAs. It worked faster and was more sensitive than RPS-BLAST and PSSMs. They were actually used in the original BASH based Prokka system!

THANK YOU for the link to the new COG !!! I thought it was on its last legs, but it lives on!

aleimba · 2015-03-17T09:03:32Z

yep, thought so too, but COGs seem to be tough.
HMMs sound good, however you'll need to maintain them in case of future COG releases ;-).

VidJa · 2016-02-10T16:05:07Z

I did some tests with uproc for COG annotation (based on the December 2014 dataset). It's blazing fast (a few seconds @16 cores for an average bacterial genome), but probably not as accurate as HMM based searches. I would like to compare though.

tseemann · 2016-02-11T04:01:32Z

@aleimba It seems the new COG doesn't actually have any new models - it has simply just assigned COGs to all the CDS in all the new genomes in Genbank since the original COG :-(

tseemann · 2016-02-11T04:21:53Z

@VidJa uproc sounds pretty good, but the database size is pretty big. I'll keep it in mind!

aleimba · 2016-02-11T08:46:34Z

@tseemann Hm, bummer. Thanks for looking into it! I saw a talk from Michael Galperin a while ago, I vaguely remember this being the case, but at least they did some corrections and additions to the existing COGs. This will hopefully improve assignment to "under-sequenced" species ...

@VidJa uproc is from my university, but I've never seen it. Thanks for posting, looks interesting.

VidJa · 2016-02-11T09:05:32Z

I'm also testing uproc with EggNOG (the bacterial section). The HMM models for EggNOG are available as well. At least it includes more species although not manually curated.

tseemann · 2016-02-13T09:13:22Z

@VidJa I also have the EggNOG models and was playing around with them. FigFAMs as well. I wish all these protein clustering people would get together and curate! It's hard to assess them. And it's often species dependent which is better.

tseemann · 2018-03-25T06:00:16Z

@aleimba I've finally started on COG support: https://github.com/tseemann/prokka/releases/tag/v1.13

aleimba · 2018-03-29T22:47:14Z

@tseemann awesome!

How did you assign COGs to your reference databases? Did you build new COG HMMs?

I have a long train ride tomorrow and out of coincidence was thinking about finally starting to update cdd2cog ...

tseemann · 2018-03-30T05:19:11Z

@aleimba No - I just took the COG annotation out of the swissprot .dat file.
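For readers wondering what pulling COGs out of a Swiss-Prot flat file could look like, here is a minimal Python sketch. It is not Prokka's actual code. It assumes the COG IDs appear on `DR` cross-reference lines of each entry (which database name the line carries is not checked), that the first `AC` line holds the primary accession, and that `//` ends an entry. The function name `cogs_from_swissprot` is hypothetical.

```python
import re

# Matches COG identifiers such as COG0001.
COG_RE = re.compile(r"\bCOG\d{4}\b")


def cogs_from_swissprot(lines):
    """Return {primary accession: [COG IDs]} from flat-file lines.

    Assumption: COG IDs are found on DR (cross-reference) lines.
    """
    result = {}
    accession = None
    cogs = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("AC   ") and accession is None:
            # The first accession on the first AC line is the primary one.
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("DR   "):
            for cog in COG_RE.findall(line):
                if cog not in cogs:
                    cogs.append(cog)
        elif line.startswith("//"):
            # End of entry: store it and reset the per-entry state.
            if accession is not None:
                result[accession] = cogs
            accession, cogs = None, []
    return result


# Made-up entries, only to show the expected shape of the input.
demo_entries = """\
ID   DEMO1_ECOLI             Reviewed;         100 AA.
AC   P00001; Q00001;
DR   eggNOG; COG0001; Bacteria.
//
ID   DEMO2_ECOLI             Reviewed;         120 AA.
AC   P00002;
DR   PDB; 1ABC; X-ray.
//
""".splitlines()
demo_cogs = cogs_from_swissprot(demo_entries)
```

Entries without a COG cross-reference come back with an empty list, which makes it easy to count how many proteins lack an assignment.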

I also discovered that swissprot has LOTS of near-duplicate proteins which don't have the COG. It's very inconsistent. That's why I call it "preliminary support" :)

I want to do some proper clustering, combining the new NCBI "prk" clusters and the COG annotations etc. It's a big job.

vbonnici · 2018-03-30T13:06:12Z

Thank you for the COG support.
Is there any way of building a new database and having COG features?
I usually download Assemblies from NCBI (in gbff format) to build my species-specific database,
but I can't find a way for including COGs.

many thanks

apredeus · 2018-10-22T17:01:23Z

I guess I might as well ask it here.

What would you say would be the best way to do species-specific EC assignment? I'm currently annotating a very big number of Salmonella genomes, and I've compiled a nice collection of IPG reference proteins, and then overlapped it with pan-genome of key strains to get the names etc. But I can't think of a good way to get the enzymatic annotation for a big set of proteins.

Thanks in advance!

tseemann · 2018-10-23T05:12:56Z

The /EC_number ideally would be in the NCBI PGAP annotations for the reference strains in your collection? You could use roary to cluster homologs and copy the EC assignments? The only other way is to find orthologs in swissprot/trembl and copy the EC from those records. This is a common problem and hard to solve because EC, COG, etc. are not uniformly assigned in the various databases. The KEGG protein DB would be good but it is no longer free.
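The "cluster homologs and copy the EC assignments" idea can be sketched in a few lines of Python. This is a toy illustration, not a roary parser: it assumes the clusters have already been loaded into a dict of cluster ID to member protein IDs, and that EC numbers for the annotated reference proteins sit in a second dict. The function name `propagate_ec` and the majority-vote rule (skipping ties) are assumptions for the example.

```python
from collections import Counter


def propagate_ec(clusters, known_ec):
    """Copy the consensus EC number of each cluster to unannotated members.

    clusters: {cluster_id: [protein_id, ...]}
    known_ec: {protein_id: EC number} for already annotated proteins
    Returns {protein_id: EC number} for newly annotated proteins only.
    """
    inferred = {}
    for members in clusters.values():
        votes = Counter(known_ec[m] for m in members if m in known_ec)
        if not votes:
            continue  # no annotated member, nothing to copy
        ranked = votes.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # tied EC numbers: leave the cluster for manual review
        consensus = ranked[0][0]
        for member in members:
            if member not in known_ec:
                inferred[member] = consensus
    return inferred


# Made-up identifiers, only to show the expected shape of the input.
demo_clusters = {
    "group_1": ["ref_a", "ref_b", "new_1"],
    "group_2": ["ref_c", "ref_d", "new_2"],
    "group_3": ["new_3"],
}
demo_known = {
    "ref_a": "1.1.1.1",
    "ref_b": "1.1.1.1",
    "ref_c": "2.7.1.1",
    "ref_d": "3.1.1.1",
}
demo_inferred = propagate_ec(demo_clusters, demo_known)
```

Skipping tied clusters is a deliberately conservative choice, since conflicting EC numbers inside one cluster usually mean the cluster or the source annotations need a closer look.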

apredeus · 2018-10-23T12:01:36Z

Thank you very much, this is sort of what I suspected - but it's good to know for sure.

tseemann · 2018-10-24T00:40:07Z

I would love to have a nice, clean, thorough database of bacterial proteins with EC, COG, GO etc! Given time/money I could probably achieve it, but those things are not forthcoming.

kustustrica · 2020-11-09T14:08:19Z

@tseemann Thank you so much for starting on COG support!
I may have missed a way to do this, but is it possible to add arCOGs to genome annotation?

many thanks

tseemann added the enhancement label Mar 15, 2015

tseemann self-assigned this Mar 15, 2015

aleimba mentioned this issue Apr 7, 2016

Add functional data to existing CDS annotations? #63

Closed

aleimba mentioned this issue Nov 4, 2016

update compatibility to COG2014? aleimba/bac-genomics-scripts#2

Open

tseemann added this to the Prokka 1.13 milestone Mar 5, 2017

tseemann closed this as completed Mar 25, 2018
