FASTA input files for Garnet #31
Comments
@sgosline do you have any ideas about where to look next in the Garnet code? I could help debug, but I haven't read through many of the Garnet and chipsequtil files before. |
Yeah, the Fastq files are read in by the old chip seq util code here. I believe it was last updated by Adam Labadorf but I'm not sure. Is this being maintained at all? I haven't been using Garnet in my research. |
Hi all, I have a strong feeling that this issue has something to do with how values are stored in dictionaries. When these dicts are keyed by different values (ie. Could someone familiar with the codebase follow up on this? @sgosline @agitter |
@iamjli interesting observation. I'm not familiar with the Garnet codebase, but I'll note that dictionary ordering also caused test case failures in the Forest code long ago. In particular, msgsteiner was sensitive to the order in which information was presented. |
Is anyone planning to debug this? Otherwise we may need to deprecate this code. |
That might be wise, I don't have the bandwidth to debug until mid-November. |
Is there an alternative workflow that the Fraenkel lab is using to predict transcription factor activities from epigenomic and transcriptional data? If we could offer a replacement, I would be more supportive of deprecating the Garnet code. If we don't have a replacement, then we would be removing significant functionality only 2 years after the PLOS Comp Bio paper appeared and the same year as @AmandaKedaigle's protocol paper. In that case, I would prefer to wait until @sgosline has time to assess how difficult the debugging would be or recruit someone new to help maintain the code. |
Hi all, I'm finally getting to this now as I'm sitting in a lot of talks/planes this week. Sorry for the delay. I'm having an issue getting the code, created #32 to track. It might be my config or the fact that I'm in Europe, not sure. |
I'm still traveling but have narrowed the problem down to the scoring of the fasta files in |
Thanks for the update @sgosline, that's encouraging. If we need to, I would support making a behavior-breaking change that orders the dict elements as long as we bump the version number. Given our limited time for maintenance, I'd rather err on the side of having a stable implementation than matching past behavior. I'm not sure whether OrderedDict would be relevant here or if we should just sort the keys, which is a change we made in several places in the Forest code to fix dict iteration inconsistencies. |
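For reference, here is a minimal sketch of the "just sort the keys" option mentioned above. It is not taken from the Garnet or Forest code; the helper name iter_sorted and the region keys and scores are made up for illustration. The point is only that iterating over sorted keys gives the same order no matter how the dict was built.

```python
def iter_sorted(d):
    """Yield (key, value) pairs in sorted key order.

    The order depends only on the keys, not on insertion order or hashing.
    """
    for key in sorted(d):
        yield key, d[key]


# Hypothetical region -> score dicts with identical content,
# built in different orders.
scores_a = {"chr1:100-200": 0.7, "chr2:50-150": 0.2, "chr1:5-50": 0.9}
scores_b = {"chr2:50-150": 0.2, "chr1:5-50": 0.9, "chr1:100-200": 0.7}

ordered_a = list(iter_sorted(scores_a))
ordered_b = list(iter_sorted(scores_b))

# Both traversals visit the keys in the same order.
print(ordered_a == ordered_b)
```

An OrderedDict would instead preserve whatever insertion order the upstream code produced, so it only helps if that order is itself deterministic.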
Who is the administrator for this GitHub repo? @zfrenchee? Can you add Anthony Soltis to this project so he can participate?
He's invited. Anthony, check out:
https://github.com/fraenkel-lab/OmicsIntegrator/invitations
--
Alex Lenail
Tufts University '16
|
I'm summarizing an issue that @omigueles reported via email. I'm not familiar enough with the Garnet code to resolve it myself. @sgosline suggested it may be related to TAMO.
@zfrenchee do you know who in the Fraenkel lab is actively working with Garnet and may be able to help @omigueles?
A description of the issue from @omigueles:
I was exploring the fasta files in the A549 example and I have a couple of questions. I would really appreciate it if you could help me.
In the example file the fasta headers look like this:
Playing around, I first removed the "space-1" at the end, which causes the program to crash. I understand that, since the program expects a header from Galaxy. I removed it because I wanted to understand what it means: is it the strand? I thought the strand was defined as + in this specific header. Besides, in the BED file it looks as if the strand was actually not known...
If I change the "space-1" to "space." (a single "."), which should be the standard when you do not know the strand, I get completely different results for this example (33 TFs instead of 75 TFs). Do you know why this could be happening?
If I try with a toy example in Galaxy I see that the fasta header looks like this (even if you have a "point" in the strand field of the bed file):
I completely understand that this could be caused by the Galaxy version used, and I can change the headers so that they look as in the example, but then what do I need to put in my header so that Garnet works properly? i.e. Would the next statements be correct?
if I know the strand :
and if I do not know it:
Thank you for your time and your attention.
(next update:)
I re-checked Galaxy and you can actually get the same format for your header if you choose "character delimited field values" and enter "_" as the delimiter. Interestingly, the last piece of information is the fourth field of the BED file, which is "name". I was also looking through some other scripts in the OmicsIntegrator repository and found that in https://github.com/fraenkel-lab/OmicsIntegrator/blob/master/src/chipsequtil/Fasta.py the field after the whitespace in the header is used as the Fasta ID. Could it be something like that? The program might be expecting unique IDs, and by giving "-1" or "." every time perhaps it is doing something odd.
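One way to test the unique-ID hypothesis is to count how often each ID occurs in a FASTA file. The sketch below is only a diagnostic and does not reproduce the logic in Fasta.py; it assumes the ID is the second whitespace-delimited token of the header, and the helper names and the "regionA"/"regionB" headers are invented for the example.

```python
from collections import Counter


def header_ids(fasta_lines):
    """Collect the assumed ID (second whitespace-delimited token) of each header.

    Falls back to the whole header when there is no second token.
    """
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            tokens = header.split()
            ids.append(tokens[1] if len(tokens) > 1 else header)
    return ids


def duplicated_ids(fasta_lines):
    """Return {id: count} for every ID that appears more than once."""
    counts = Counter(header_ids(fasta_lines))
    return {fasta_id: n for fasta_id, n in counts.items() if n > 1}


# Made-up headers: every record carries "-1" vs. a distinct peak name.
constant_id = [">regionA -1", "ACGT", ">regionB -1", "GGCC"]
named_peaks = [">regionA peak1", "ACGT", ">regionB peak2", "GGCC"]

print(duplicated_ids(constant_id))
print(duplicated_ids(named_peaks))
```

If the real example file reports "-1" once per record, any code that stores sequences in a dict keyed by that ID would keep only one entry per duplicated ID.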
(next update:)
But my question is the same, how does the "-1" affect the output and why?
To begin with, that should not be a "-1"; according to the Galaxy example it should be "peak1", so maybe it is a bug in Galaxy. Still, if I run the example with this "-1" I get 75 TFs just like you do, according to:
https://github.com/fraenkel-lab/OmicsIntegrator/tree/master/tests/integration_test_standard/events_to_genes_with_motifsregression_results_FOREST_INPUT.tsv
but if I put a "." (as in "no name" in the BED file) I get 33 TFs, and if I replace "-1" in every header with the "name" field from the BED file, that is "peak1, peak2, ... peak3108", I get 105 TFs.
This behavior is with OmicsIntegrator at commit 28d9a75, the v0.3.1 release.
One partial resolution is to update our readme to clarify that "character delimited field values" is required in Galaxy.
I've started to look at the Garnet code to see how the FASTA headers are used. Two instances are:
OmicsIntegrator/scripts/get_window_binding_matrix.py, line 23 in 08b5548, where key_func is the identity function, so the entire header is used as the id instead of only the second whitespace-delimited token,
and
OmicsIntegrator/scripts/get_window_binding_matrix.py, lines 39 to 42 in 08b5548, where it looks like the name for the sequence is not being used.
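To make the effect of the key function concrete, here is a rough sketch of a FASTA loader that takes a key_func argument. It is an illustration under assumptions, not the chipsequtil implementation: load_fasta, second_token, and the sample headers are hypothetical, and only the idea of an identity key function versus a second-token key comes from the observations above.

```python
def load_fasta(lines, key_func=lambda header: header):
    """Load FASTA records into a dict keyed by key_func(header).

    The default key_func is the identity, so the whole header is the key.
    """
    records = {}
    key = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            key = key_func(line[1:])
            records[key] = []  # a repeated key overwrites the earlier record
        elif line and key is not None:
            records[key].append(line)
    return {k: "".join(parts) for k, parts in records.items()}


def second_token(header):
    """Return the second whitespace-delimited token, or the header if absent."""
    tokens = header.split()
    return tokens[1] if len(tokens) > 1 else header


# Made-up records that share the trailing "-1" token.
lines = [">regionA -1", "ACGT", ">regionB -1", "GGCC"]

by_header = load_fasta(lines)                         # identity key
by_token = load_fasta(lines, key_func=second_token)   # second-token key

print(sorted(by_header))
print(sorted(by_token))
```

With the identity key the trailing token becomes part of every key, so changing "-1" to "." or to a peak name changes all ids; with the second-token key, records sharing the same token collapse into one entry.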