[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons #534

VincentGardeux · 2024-03-21T13:36:54Z

Hi,

I'm setting up pySCENIC for generating regulons from two Fly brain scRNAseq datasets we generated. The goal is to reproduce to some extent what was done in the aging Drosophila brain atlas (Davie et al, 2018). I especially want to reproduce the separation in Figure 4 based on (dati, pros, Imp, scro) regulons.

I'm running pySCENIC v0.12.1 in JupyterHub through Docker, with the latest motif and feather files (v10): mc_v10_clust
Everything seems to run ok using this (slightly tuned) tutorial.

I'm writing this issue because I think that there is an unwanted behavior in the prune2df function. And also as a guide for future users, since I struggled quite a lot to make this work.

First, I would encourage people to use the pySCENIC version (instead of the R, or the nextflow pipeline, whose maintenance is discontinued or doesn't work with the latest databases).

When running the tutorial, I got many warnings with the prune2df function. But the run still finished, and I ended up with 46 regulons... which did not seem like a lot. Especially because in the (Davie et al, 2018) paper the authors appear to have 163 regulons (Table S4). The main issue is that I didn't get the (dati, pros, scro) important regulons that I was looking for.

Problem: When running the cisTarget part (specifically, the prune2df function), I quickly get tons of Warnings: Less than 80% of the genes in ... could be mapped to ... which was already reported (but never really solved) in other issues: pyscenic warning: less than 80% of the genes in xxx could be mapped to ... #466 [BUG] pyscenic.transform warning/error #515 [BUG]pyscenic ctx --no_pruning #506 Empty file output from pyscenic ctx using original/modified databases #325 Empty regulons after cisTarget pruning step[results] #177 and pull requests: Changeable cutoff for number of genes found in DB for module2df #387
Problem: Kinda related to problem 1, but the number of regulons I get is very small as compared to what I would expect
Solution 1: I found out, that using the rho_mask_dropouts=True option in the step before (modules_from_adjacencies), it increases the number of regulons to 81, but still without my looked-for regulons (and still with all the warnings).
Solution 2: So I've set up to remove the warnings in the prune2df function. For this, I thought that the problem was coming from the *feather files, where genes were not overlapping with the genes I have in my expression matrix. So, I've extracted the genes used in the feather file, transformed them to Flybase/Ensembl IDs (btw using gene symbols instead of IDs is not a good idea I think), and updated the gene names in my expression matrix. So now, almost all of the feather genes are in my expression matrix. Then I've run again the pipeline, but was very surprised that it did not solve the warning issues... which did not make sense to me? It still worked to some extent, since got 99 regulons, but the increase was very marginal. And I did not understand why it was not solving the warning issues.
Solution 3: [Bug?] So I've thought that maybe the issue was coming from the fact that my expression matrix was containing extra genes that were NOT in the .feather files. So I've added this piece of code at the beginning of my script, to restrict my expression matrix to ONLY genes present in the .feather file:

import pandas as pd
# ex_matrix is my expression matrix
ranking_feather = pd.read_feather("dm6_v10_clust.genes_vs_motifs.rankings.feather")
overlap_values = ex_matrix.index[pd.Series(ex_matrix.index).isin(ranking_feather.columns)].unique()
ex_matrix = ex_matrix.loc[overlap_values, :]

This solves completely the warning issues in the pruning step, and I end up with 108 regulons. But this is not an expected behavior of the method, is it? I mean I understand the issues if some genes of the feather database are missing from the expression matrix, but it looks ok to have extra genes in your expression matrix? So why would this cause a warning? And limits the number of regulons generated?

Solution 4: Just to complete my post. Even with all these updates, I could not get the regulons I was looking for. I was finally able to get the regulons I wanted (total of 156) by setting the auc_threshold=0.01 parameter in the prune2df function (instead of 0.05 by default). I'm not sure though what is the real impact of this, as I could not find clear explanation of what this parameter is doing. Another way is to completely deactivate the pruning/filtering by using the filter_for_annotation=False argument in the prune2df function.

Hope this can help other users. And to the authors/maintainers please give me your opinion on point 3, as I really thought this behavior was weird/unexpected?

The text was updated successfully, but these errors were encountered:

caochch · 2024-04-08T05:46:05Z

Any progresses? What's the expected number of modules or regulons in the ctx step (the row of output files)?

Flu09 · 2024-06-24T13:40:32Z

any news?

rrydbirk · 2024-06-25T12:12:19Z

Can confirm that following solution 2 and 3 improves no. regulons significantly! Also, completely omitted warnings for me.

colin893 · 2024-07-03T11:02:54Z

Hi. I have an issue while running pyscenic on Zebrafish dataset. I downloaded the Stanaka motifs database and processed my pyscenic analysis. Just to add, I successfully ran pyscenic on both mouse and human datasets.

First I got an error due to the wrong version format of the file that I converted into v2 without problem. Then I generated the adj.csv file that threw no error and is filled. The problem arises with the command pyscenic ctx - I got the usual warnings "Less than 80% of the genes in Regulon for XXXX could be mapped to v2_zf1.genes_vs_motifs.rankings. Skipping this module." but at the end, my reg.csv file is empty.

So I imported the genes from the feather file of Stanaka, and intersected these with the genes from my expression matrix, just to ensure that there is a significant overlap between gene names. For respectively 23000 and 20000 genes in both lists, I got +- 15000 genes in common. Given this, in my understanding, I would expect some regulons to survive the pruning step?
So I am wondering if it's not because of something else?

Thank you in advance for your time,

VincentGardeux · 2024-07-04T08:53:02Z

Hi @colin893,

I'm still following the updates on this issue, but so far did not see any answer from the devs/authors. :(

For your issue, did you try all the solutions I suggested in my post? In my case they substantially increased the number of regulons.

If you still loose all your regulons at the pruning stage, you can probably remove the pruning completely using filter_for_annotation=False
I have to admit that I tend to do that now, because my "preferred regulons" are always filtered out for reasons I don't understand.

Cheers

colin893 · 2024-07-08T14:15:13Z

Hi @VincentGardeux ,

When I posted I had only verified that there exist overlap between gene names in my expression matrix and the gene names in the databases.

Thanks for your different hints, I just launched the ctx command with --no_pruning and I got 599 regulons. I guess there might be a way to allow the pruning but, since I successfully analyzed human/mouse data, what comes to my mind is that the default values used for these are too stringent for zebrafish analyses? I'll have to do tests and modify some threshold values to ensure that hypothesis.

I'll update my response depending on what I get, thanks for your help!

ramadatta · 2024-10-22T20:10:25Z

@VincentGardeux I have been struggling to find a solution for this from past 3 days. Thank you for posting this issue. My preferred regulons does not show up in the output.

While --no_pruning option has increased my regulon number from ~220 to ~1260, still I miss some of my regulons.
I am directly running pyscenic ctx and do not see the option filter_for_annotation.

Am I missing something here?

VincentGardeux · 2024-10-23T08:18:23Z

Hi @ramadatta,

I guess you are running the tool in command line and not within Python. So the equivalent method for filter_for_annotation seems to be the --no_pruning option that you already use.
When you do that, you end up with the full list of TFs that you've provided at step 1 in the corresponding file. So in your case you probably have a list of ~1260 TFs, without the ones you are looking for.

However, something else that I realized since my initial post-at least for the Drosophila pipeline that I'm using-is that the TF list that you need to provide at Step 1 (pyscenic grn, Building the GRN), and that is provided by SCENIC, is partially wrong.
For example, in my case I was using the allTFs_dmel.txt file that they provide. But after a careful examination of the TFs by collaborators, we found that this list contains lots of genes that are NOT TFs... Similarly, some of the TFs that we were looking for were not in the list either (maybe this is your issue here). So we ended up providing another list, computed from FlyBase (the main resource for Drosophila). And it provided a completely different result !

Hope this helps

NB: To updated the other users, I still don't have any news from the maintainers of the tool :(

ramadatta · 2024-10-28T12:50:57Z

@VincentGardeux Thank you for the detailed answer. This could be a possible explanation of why some of the TFs are not coming up in the final analyis. It seems to be better to overlap the TFs with other studies as well to generate a curated list of markers.

On other note, I have 1881 TFs found in my object out of 1892 TFs provided from as an input with no_pruning option, I ended up with ~1260 TFs. Probably, the cutoffs as you mentioned auc_threshold, rho_mask_dropouts may also be filtering out some of the regulons.

VincentGardeux added the bug Something isn't working label Mar 21, 2024

baptisteavot-ukdri mentioned this issue Sep 27, 2024

[BUG] The whole AUC matrix are 0 values #581

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons #534

[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons #534

VincentGardeux commented Mar 21, 2024 •

edited

Loading

caochch commented Apr 8, 2024

Flu09 commented Jun 24, 2024

rrydbirk commented Jun 25, 2024

colin893 commented Jul 3, 2024

VincentGardeux commented Jul 4, 2024

colin893 commented Jul 8, 2024 •

edited

Loading

ramadatta commented Oct 22, 2024

VincentGardeux commented Oct 23, 2024

ramadatta commented Oct 28, 2024 •

edited

Loading

[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons #534

[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons #534

Comments

VincentGardeux commented Mar 21, 2024 • edited Loading

caochch commented Apr 8, 2024

Flu09 commented Jun 24, 2024

rrydbirk commented Jun 25, 2024

colin893 commented Jul 3, 2024

VincentGardeux commented Jul 4, 2024

colin893 commented Jul 8, 2024 • edited Loading

ramadatta commented Oct 22, 2024

VincentGardeux commented Oct 23, 2024

ramadatta commented Oct 28, 2024 • edited Loading

VincentGardeux commented Mar 21, 2024 •

edited

Loading

colin893 commented Jul 8, 2024 •

edited

Loading

ramadatta commented Oct 28, 2024 •

edited

Loading