Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons #534

Open
VincentGardeux opened this issue Mar 21, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@VincentGardeux
Copy link

VincentGardeux commented Mar 21, 2024

Hi,

I'm setting up pySCENIC for generating regulons from two Fly brain scRNAseq datasets we generated. The goal is to reproduce to some extent what was done in the aging Drosophila brain atlas (Davie et al, 2018). I especially want to reproduce the separation in Figure 4 based on (dati, pros, Imp, scro) regulons.

I'm running pySCENIC v0.12.1 in JupyterHub through Docker, with the latest motif and feather files (v10): mc_v10_clust
Everything seems to run ok using this (slightly tuned) tutorial.

I'm writing this issue because I think that there is an unwanted behavior in the prune2df function. And also as a guide for future users, since I struggled quite a lot to make this work.

First, I would encourage people to use the pySCENIC version (instead of the R, or the nextflow pipeline, whose maintenance is discontinued or doesn't work with the latest databases).

When running the tutorial, I got many warnings with the prune2df function. But the run still finished, and I ended up with 46 regulons... which did not seem like a lot. Especially because in the (Davie et al, 2018) paper the authors appear to have 163 regulons (Table S4). The main issue is that I didn't get the (dati, pros, scro) important regulons that I was looking for.

  • Problem: When running the cisTarget part (specifically, the prune2df function), I quickly get tons of Warnings: Less than 80% of the genes in ... could be mapped to ... which was already reported (but never really solved) in other issues: pyscenic warning: less than 80% of the genes in xxx could be mapped to ... #466 [BUG] pyscenic.transform warning/error #515 [BUG]pyscenic ctx --no_pruning #506 Empty file output from pyscenic ctx using original/modified databases #325 Empty regulons after cisTarget pruning step[results] #177 and pull requests: Changeable cutoff for number of genes found in DB for module2df #387

  • Problem: Kinda related to problem 1, but the number of regulons I get is very small as compared to what I would expect

  • Solution 1: I found out, that using the rho_mask_dropouts=True option in the step before (modules_from_adjacencies), it increases the number of regulons to 81, but still without my looked-for regulons (and still with all the warnings).

  • Solution 2: So I've set up to remove the warnings in the prune2df function. For this, I thought that the problem was coming from the *feather files, where genes were not overlapping with the genes I have in my expression matrix. So, I've extracted the genes used in the feather file, transformed them to Flybase/Ensembl IDs (btw using gene symbols instead of IDs is not a good idea I think), and updated the gene names in my expression matrix. So now, almost all of the feather genes are in my expression matrix. Then I've run again the pipeline, but was very surprised that it did not solve the warning issues... which did not make sense to me? It still worked to some extent, since got 99 regulons, but the increase was very marginal. And I did not understand why it was not solving the warning issues.

  • Solution 3: [Bug?] So I've thought that maybe the issue was coming from the fact that my expression matrix was containing extra genes that were NOT in the .feather files. So I've added this piece of code at the beginning of my script, to restrict my expression matrix to ONLY genes present in the .feather file:

import pandas as pd
# ex_matrix is my expression matrix
ranking_feather = pd.read_feather("dm6_v10_clust.genes_vs_motifs.rankings.feather")
overlap_values = ex_matrix.index[pd.Series(ex_matrix.index).isin(ranking_feather.columns)].unique()
ex_matrix = ex_matrix.loc[overlap_values, :]

This solves completely the warning issues in the pruning step, and I end up with 108 regulons. But this is not an expected behavior of the method, is it? I mean I understand the issues if some genes of the feather database are missing from the expression matrix, but it looks ok to have extra genes in your expression matrix? So why would this cause a warning? And limits the number of regulons generated?

  • Solution 4: Just to complete my post. Even with all these updates, I could not get the regulons I was looking for. I was finally able to get the regulons I wanted (total of 156) by setting the auc_threshold=0.01 parameter in the prune2df function (instead of 0.05 by default). I'm not sure though what is the real impact of this, as I could not find clear explanation of what this parameter is doing. Another way is to completely deactivate the pruning/filtering by using the filter_for_annotation=False argument in the prune2df function.

Hope this can help other users. And to the authors/maintainers please give me your opinion on point 3, as I really thought this behavior was weird/unexpected?

@VincentGardeux VincentGardeux added the bug Something isn't working label Mar 21, 2024
@caochch
Copy link

caochch commented Apr 8, 2024

Any progresses? What's the expected number of modules or regulons in the ctx step (the row of output files)?

@Flu09
Copy link

Flu09 commented Jun 24, 2024

any news?

@rrydbirk
Copy link

Can confirm that following solution 2 and 3 improves no. regulons significantly! Also, completely omitted warnings for me.

@colin893
Copy link

colin893 commented Jul 3, 2024

Hi. I have an issue while running pyscenic on Zebrafish dataset. I downloaded the Stanaka motifs database and processed my pyscenic analysis. Just to add, I successfully ran pyscenic on both mouse and human datasets.

First I got an error due to the wrong version format of the file that I converted into v2 without problem. Then I generated the adj.csv file that threw no error and is filled. The problem arises with the command pyscenic ctx - I got the usual warnings "Less than 80% of the genes in Regulon for XXXX could be mapped to v2_zf1.genes_vs_motifs.rankings. Skipping this module." but at the end, my reg.csv file is empty.

So I imported the genes from the feather file of Stanaka, and intersected these with the genes from my expression matrix, just to ensure that there is a significant overlap between gene names. For respectively 23000 and 20000 genes in both lists, I got +- 15000 genes in common. Given this, in my understanding, I would expect some regulons to survive the pruning step?
So I am wondering if it's not because of something else?

Thank you in advance for your time,

@VincentGardeux
Copy link
Author

Hi @colin893,

I'm still following the updates on this issue, but so far did not see any answer from the devs/authors. :(

For your issue, did you try all the solutions I suggested in my post? In my case they substantially increased the number of regulons.

If you still loose all your regulons at the pruning stage, you can probably remove the pruning completely using filter_for_annotation=False
I have to admit that I tend to do that now, because my "preferred regulons" are always filtered out for reasons I don't understand.

Cheers

@colin893
Copy link

colin893 commented Jul 8, 2024

Hi @VincentGardeux ,

When I posted I had only verified that there exist overlap between gene names in my expression matrix and the gene names in the databases.

Thanks for your different hints, I just launched the ctx command with --no_pruning and I got 599 regulons. I guess there might be a way to allow the pruning but, since I successfully analyzed human/mouse data, what comes to my mind is that the default values used for these are too stringent for zebrafish analyses? I'll have to do tests and modify some threshold values to ensure that hypothesis.

I'll update my response depending on what I get, thanks for your help!

@ramadatta
Copy link

@VincentGardeux I have been struggling to find a solution for this from past 3 days. Thank you for posting this issue. My preferred regulons does not show up in the output.

While --no_pruning option has increased my regulon number from ~220 to ~1260, still I miss some of my regulons.
I am directly running pyscenic ctx and do not see the option filter_for_annotation.

Am I missing something here?

@VincentGardeux
Copy link
Author

Hi @ramadatta,

I guess you are running the tool in command line and not within Python. So the equivalent method for filter_for_annotation seems to be the --no_pruning option that you already use.
When you do that, you end up with the full list of TFs that you've provided at step 1 in the corresponding file. So in your case you probably have a list of ~1260 TFs, without the ones you are looking for.

However, something else that I realized since my initial post-at least for the Drosophila pipeline that I'm using-is that the TF list that you need to provide at Step 1 (pyscenic grn, Building the GRN), and that is provided by SCENIC, is partially wrong.
For example, in my case I was using the allTFs_dmel.txt file that they provide. But after a careful examination of the TFs by collaborators, we found that this list contains lots of genes that are NOT TFs... Similarly, some of the TFs that we were looking for were not in the list either (maybe this is your issue here). So we ended up providing another list, computed from FlyBase (the main resource for Drosophila). And it provided a completely different result !

Hope this helps

NB: To updated the other users, I still don't have any news from the maintainers of the tool :(

@ramadatta
Copy link

ramadatta commented Oct 28, 2024

@VincentGardeux Thank you for the detailed answer. This could be a possible explanation of why some of the TFs are not coming up in the final analyis. It seems to be better to overlap the TFs with other studies as well to generate a curated list of markers.

On other note, I have 1881 TFs found in my object out of 1892 TFs provided from as an input with no_pruning option, I ended up with ~1260 TFs. Probably, the cutoffs as you mentioned auc_threshold, rho_mask_dropouts may also be filtering out some of the regulons.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants