joined gene names, a possible pitfall to cause incorrect result? #178

Open
biocyberman opened this issue Mar 13, 2016 · 2 comments

@biocyberman

Is chanjo aware of these problematic gene names, which may cause various problems for queries that are based on gene names?

➤ gawk '{print $NF}' ccds.15.grch37p13.extended.bed|grep ','|head
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
➤ gawk '{print $NF}' ccds.15.grch37p13.extended.bed|grep ','|wc -l
66188

➤ gawk '{print $NF}' ccds.15.grch37p13.extended.bed|grep ','|sort|uniq|wc -l
9290
➤ gawk '{print $NF}' ccds.15.grch37p13.extended.bed|grep ','|sort|uniq > problematic.gene.names.txt
@biocyberman
Author

A test query on NOX1 returned a result, so I guess chanjo does indeed take care of the problem. Could you, @robinandeer, explain how it does that? Pointing me to the relevant code section would be enough.

@robinandeer
Contributor

I'm not quite sure what you mean :/

The only problematic gene names I know of are the ones that exist on both the X and Y chromosomes and have to be given prefixes.

It looks like you are picking out exons that belong to multiple transcripts which all map to the same gene, so the gene name is repeated once per transcript. The input looks correct :)

Remember that it's only in the loading step that these columns matter; for annotations, only the chrom, start, and end columns matter.
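
To check whether a joined name is just one gene repeated once per transcript, as described above, something like the following rough Python sketch might help. It is not chanjo's own code. It assumes only that the last column holds comma-separated gene names; the sample rows, the `GENEA`/`GENEB` names, and both helper functions are hypothetical and made up for illustration.

```python
import io


def distinct_genes(joined):
    """Split a comma-joined gene field and return its distinct names, order kept."""
    seen = []
    for name in joined.split(","):
        name = name.strip()
        if name and name not in seen:
            seen.append(name)
    return seen


def find_mixed_gene_rows(lines):
    """Yield (line_number, genes) for rows whose last column joins different genes."""
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        genes = distinct_genes(fields[-1])
        if len(genes) > 1:
            yield number, genes


# Hypothetical rows: the column layout is assumed, only the last column is used.
SAMPLE = """\
X\t100\t200\tX-100-200\tCCDS1,CCDS2,CCDS3\tNOX1,NOX1,NOX1
X\t300\t400\tX-300-400\tCCDS1,CCDS2\tNOX1,NOX1
1\t500\t600\t1-500-600\tCCDS9,CCDS10\tGENEA,GENEB
"""

mixed = list(find_mixed_gene_rows(io.StringIO(SAMPLE)))
print(mixed)  # only the third row joins two different gene names
```

Rows such as `NOX1,NOX1,NOX1` collapse to a single gene and are not reported; only rows that join genuinely different names would need a closer look. To run it on the real file, pass an open file handle for `ccds.15.grch37p13.extended.bed` instead of the `io.StringIO` sample.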
