Criteria for barplot output #59
-
Hi all, Quick question - can you explain the criteria used to determine which barplots are automatically generated in the output files? The github page says "barplots with GO annotations ranked by relevance for each input gene that GeneWalk was able to generate results for," but I'm not sure if this is supposed to explain why only a subset of the bar plots get produced and on what basis they are selected. Thanks, |
Beta Was this translation helpful? Give feedback.
Replies: 8 comments
-
Hi @npokorzynski , In principle barplots are generated for all input genes. If a barplot for a gene is not generated, it is either because there are no GO annotations for that gene or it could not be included in the GeneWalk network for analysis. In the former case the gene is still listed in genewalk_results.csv, but with 0 GO annotations. The latter can happen when there is no uniprot ID for example: all those drop out genes with the reasons for drop out are listed in Cheers, |
Beta Was this translation helpful? Give feedback.
-
@ri23 Thank you for that reply. Perhaps I'm misunderstanding then: When I run genewalk, the genewalk_results.csv contains far more genes with associated GO terms (many of which are significant below FDR p 0.01) than bar plots produced. I also am observing that between conditions a variable number of bar plots are generated, so that gave me the impression that there was some criteria selecting which genes barplots would be produced for under a given condition. Perhaps you can clarify what is causing this? |
Beta Was this translation helpful? Give feedback.
-
This is unexpected, at the moment I am not sure what is happening. Can you clarify: do you mean some barplots are not generated in the Can you check what the confidence interval cilow_global_padj values are for genes that are present in genewalk_results.csv but without plot? Moving forward, would you be able to share your genewalk_results.csv and genewalk_all.log files and mention a few genes that did not have a bar plot generated? My email is robert_ietswaart@hms.harvard.edu if you don't want your files here on github in the public domain. Alternatively a screen shot of the above could also help to figure this out. Cheers, |
Beta Was this translation helpful? Give feedback.
-
@ri23 With some troubleshooting I may have revealed at least one part of the problem, though I'm not sure why this would cause issues: It turns out that if I run the analysis with Ensembl IDs instead of HGNC symbols, I produce the expected index file and barplots folder containing all barplots. In the above, I was getting the results csv file with all the relevant genes, but only a portion of the barplots and no index.html file. Here are the log and results files for the analysis I ran using hgnc_symbol as the ID: This has led me to a related but distinct question - in the analysis run with ensembl_id (from the same starting gene list just with the IDs changed, which contains 1500 genes) I receive an output of 975 genes with associated GO terms (which seems high to me, e.g. 65% of my input is returned as functionally relevant?), with very few even reaching FDR p < 0.1. In fact, 0.0946 is the lowest p-value I get from any of the results. If needed, I can start a new issue for this, but I've attached the log and results files for this analysis as well. |
Beta Was this translation helpful? Give feedback.
-
Hi @npokorzynski , From the log file with hgnc symbols as input, I suspect GeneWalk got killed after 3h. Could it be that you ran a job on a cluster or something with this time limit (from 14:52 to 17:52)? For the ensembl input, you ran it with parallelization (great!) so GeneWalk ran faster (~1h15min) and finished as expected. According to the log file: If you want to check you can rerun the hgnc symbol input with parallelization and it should provide all the barplots and index.html within the same time frame as your ensembl ID input run. Regarding your second question: Out of the 975 a lot can have 1 or more significant (global_padj<0.1) GO annotation per gene. In our publication, we mention that for the QKI context, this was the case for 54% of the genes. We interpret this as a lot genes have some significant GO annotation, but not all annotations are relevant for a given gene and different annotations are relevant for different genes (unlike the typical GO enrichment results, which test for global relevance, so you get the 1 GO term for a set of genes). Also yes the global_padj does not extend much below 0.1, this is because of the FDR multiple testing correction across all ~11k gene-GO term pairs. In the paper we do mention there is limited stochasticity between independent replicate runs (Supplementary Fig. S2), because GeneWalk relies on random walk sampling. As a result the cosine similarities, resulting p-value and FDR adjust p-values can vary slightly between runs. This is also why we provide 95% confidence intervals on the p-values. And this explains why some of the global_padj of your hgnc symbol run are slightly lower than 0.0946, the lowest value of your second run. If such variability due to stochasticity bothers you: you can choose to rerun GeneWalk with If you are interested in further gene selection, then focus in on the regulator scatter plot in the top right corner. There are the genes with a high fraction of relevant GO annotations and many gene-gene connections. Once you find a gene of interest from there, then look at the barplot for that gene to understand what their relevant functions are. So in conclusion all your output is as expected. I hope these clarifications help you interpret the results further. Best, |
Beta Was this translation helpful? Give feedback.
-
@ri23 Thank you that was incredibly helpful! Do you happen to have a code snippet to recreate the scatterplots? If not, I can close out this issue. |
Beta Was this translation helpful? Give feedback.
-
Hi @npokorzynski |
Beta Was this translation helpful? Give feedback.
-
@ri23 okay, I guess this must have also been a consequence of the incomplete run from my non-parallelized attempts - I had only previously observed the .png file as an output (the resolution of which is understandably not great). In the complete/parallelized run I get the .pdf file. In that case I'm all good, for now at least. Thanks again for all the help! |
Beta Was this translation helpful? Give feedback.
Hi @npokorzynski ,
In principle barplots are generated for all input genes. If a barplot for a gene is not generated, it is either because there are no GO annotations for that gene or it could not be included in the GeneWalk network for analysis. In the former case the gene is still listed in genewalk_results.csv, but with 0 GO annotations. The latter can happen when there is no uniprot ID for example: all those drop out genes with the reasons for drop out are listed in
genewalk_all.log
.Cheers,
Robert