Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Bug fix: Disease-gene associations #158

Open
wants to merge 16 commits into
base: develop
Choose a base branch
from
Open

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented Oct 25, 2024

resolves #156

Changes

  • Bug fix: Code for adding Disease2Gene relationships. These should be added if and only if there is one gene associated with the disease, unless it is a digenic disease/phenotype. However before, we were erroneously checking the number of diseases associated with a given gene.
  • Delete: Code for adding Gene2Disease relationships
  • Add; SPARQL query and goal to list relationships
  • Added and updated comments related to gene disease associations

Resources

@joeflack4 joeflack4 force-pushed the gene-disease-comments branch 2 times, most recently from 4a33ef1 to 53a25d2 Compare October 25, 2024 20:03
@joeflack4 joeflack4 self-assigned this Oct 25, 2024
@joeflack4 joeflack4 added the documentation Improvements or additions to documentation label Oct 25, 2024
@joeflack4 joeflack4 marked this pull request as draft October 25, 2024 20:04
Base automatically changed from owl-restriction-refactor to main October 25, 2024 20:32
@joeflack4 joeflack4 changed the title Gene disease comments Bug fix: Gene-disease associations Oct 27, 2024
- Added and updated comments related to gene disease associations
- Fixing a bug where causaility and other relationships are not being defined correctly.
- Update: In this commit, undoing temporary print statements.
- Add: comments
- Add; SPARQL query and goal to list relationships
- Delete: Code for adding Gene2Disease relationships
- Bug fix: Code for adding Disease2Gene relationships. These should be added if and only if there is one gene associated with the disease, unless it is a digenic disease/phenotype. However before, we were erroneously checking the number of diseases associated with a given gene.
@joeflack4 joeflack4 marked this pull request as ready for review October 27, 2024 06:08
omim2obo/main.py Outdated
add_subclassof_restriction(graph, RO['0002525'], CHR[chr_id], OMIM[gene_mim])

# TODO: @Trish: I've removed addition of gene->disease relations, leaving only disease->gene. Is that what we want?
# Removed because extra complexity. Want to get down disease->gene code first. And Mondo doesn't need them.
Copy link
Contributor

@twhetzel twhetzel Oct 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed calculating the number of associations differently since it did not appear that if a gene was unique in the morbidmap file that it was sufficient to tag that gene as causal for a disease based on the check here, and instead that the total number of genes per phenotype mim number (ie disease) should be used to determine if a gene is causal (ie only 1 gene per disease).

  1. What's the reason/error that you saw to remove gene->disease?
  2. Is Mondo the only place where these omim release files are used?

Copy link
Contributor Author

@joeflack4 joeflack4 Oct 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Gene removal
    There are a few reasons. It's been a year since this was written, so I don't remember everything, but I keep being reminded about the disease-centric nature of things, and I wonder whether we meant to add them in the first place. And moreso, I want this code review to be easier at first; for us to get the disease-gene stuff handled first. If you look at the code, I had to add on a bit to what you wrote to complete it.
    Thinking more about it today though, I did leave part of the gene restrictions; the "cyto location" bit. And for adding the other ones back, I think the best way to handle it is to add some extra lines to this "cyto location" block. It'll actually look very similar to what was there before; I just need to drop the "n associations" conditional for the genes as that only matters for the diseases.
    TLDR: Let's make sure the Disease2Gene stuff is correct first and I can add back the Gene2Disease shortly.
  1. Yes, AFAIK only mondo-ingest consumes these outputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the gene disease relationships back.

I had to put some thought into it to make sure I felt it was correct, and to make the code flow as cleanly and understandably as I could make it.

Feel free to review here or during meeting.

omim2obo/main.py Outdated Show resolved Hide resolved
@joeflack4 joeflack4 changed the title Bug fix: Gene-disease associations Bug fix: Disease-gene associations Oct 31, 2024
- Delete: Logic for digenic genes.
- Add: Back logic for restricting to mapping key 3.
- Add: Back gene->disease rels
- Add: Filtering of non-definitive relationships (presence of [, non-diseases, {, susceptibility, and ?, provisional).
- Bug fix: Missing space in warning message
omim2obo/main.py Outdated
# Only add if (1) only 1 association (2)‚mapping key = 3, (3) 'definitive'
phenotype_genes2: Dict[str, Dict[str, str]] = {k: v[0] for k, v in phenotype_genes.items() if len(v) == 1} # 1assoc
phenotype_genes2 = {k: v for k, v in phenotype_genes2.items() if p2g_is_definitive(v['phenotype_label'])}
for p_mim, assoc in phenotype_genes2.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will still include the handful of diseases where p_label contains the word 'digenic' since there are cases where these exist with 1 gene association and a mapping key of 3 and are not prefixed with an omim special character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's true. I was going to follow up with you about that.

Basically my rationale for not filtering these out is because of the OMIM team's response. It seems to me that they haven't really insured the data quality of digenic diseases. So I suppose that could mean one of the following or both:

  1. Non digenic, but being labeled as such
  2. Digenic but they only labeled one of the associations

I had the first case at the forefront of my mind. But now that I think about it, for these cases where it is labeled diegenic but it only has one association, we have no idea which of these two cases it is.

So therefore, I do agree it is a good idea to filter them out now.

However I wonder if we want to consider outputting them into a special table for curator review. Perhaps some of them are not actually digenic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think outputting them to a file for curator review is a good idea!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!
I added review.tsv to the release files and documentation for it, including the types of cases found inside of it, in the README.md.

@twhetzel twhetzel mentioned this pull request Nov 3, 2024
- Add: Finished adding gene->disease rels back
- Add: Additional filter for diseases marked digenic even though they only have 1 association. But will output these special cases to a file.
- Add: Additional comments
- Add: New output / release file 'review.tsv', with typed dictionary class and documentation in README.md: First and currently only case we are adding: Situations where found a phenotype where we would have added a causal subclass restriction were it not for 'digenic' being in the label (possibly erroneously).
- Bug fix / updates: SPARQL query: Renamed some columns, reordered columns, updated sorting. Fixed "Direction" and renamed to "PredDirection". Added SubBiolink and ObjBiolink types columns.
- Update: Comments regrading biolink:category, etc, derived from MIM types
- Bug fix: Removed accidental "causal" G2D restrictions being added when the corresponding D2G was not found to be definitively causal.
- Update: Docs: Note about the inputs of the repo and how they don't always match omim.org/entry pages
- Fix: Docs: grammatical error
- Fix: A misstatment just added to the docs.
omim2obo/main.py Outdated Show resolved Hide resolved
@joeflack4 joeflack4 changed the base branch from main to develop November 6, 2024 03:18
- Fix: Docs
  - Fix: Grammar: erroneous line break
  - Fix: Typo
- Update: Remove filtering out of single-association cases where phenotype marked 'digenic'.
- Add: Related comment: 9606: NCBI Taxonomy ID for Homo Sapiens
- Add: Warning if MIM type of Phenotype-Gene assoc found to not be that of "Phenotype".
- Bug fix: Some Phenotype-Gene associations were not identified due to a REGEX parsing error requiring an unnecessary (but 99% of the time present, as apparently being part of the OMIM naming spec for this field) comma.
@twhetzel twhetzel self-requested a review November 17, 2024 08:34
Copy link
Contributor

@twhetzel twhetzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said on Slack, this should be merged into develop and of course then into main and a new release should be generated so the OMIM gene pipeline can be tested with these updates. This is needed ASAP.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gene association representation in omim.ttl
2 participants