-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
fix: classify mature miRNAs #146
base: dev
Are you sure you want to change the base?
Conversation
Merge with dev branch
…olanlab/mirflowz into 126-docs-describe-workflow-rationale
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some places, the documentation of the script is written as if the script is a general purpose script for adding intersecting features to SAM files as a tag. But in other places, it becomes clear that there are several assumptions that limit the scope of the script (e.g., miRNA_ID
, shifts/extensions) to miRNAs and isomiRs. I think you could clarify that a bit better and perhaps also make a reference to the scripts that produce valid inputs to this file, because it is highly unlikely that someone would create the inputs for this script manually.
Btw, it would have actually been nice to design this script such that it actually is a general purpose script for adding "name" tags for intersecting features to SAM files and do all the other stuff (dealing with alignments that don't have an intersecting feature, dealing with maximum extensions etc) elsewhere. As it is, this script is quite complicated and has basically zero chance of reuse outside of this workflow.
Anyway, not important now - just lessons for the future :)
So please just clarify the scope of the script in the module-level docstring and I think we are ready to go.
…zavolanlab/mirflowz into 143-fix-classify-correctly-mature-mirna
I believe the script per se is pretty general: it adds a custom tag showing which features an alignment intersects with. I think the problem comes when defining variables and how it is documented. In this sense, the script appears to be only for miRNAs whose annotations might or might not been previously extended. But if the word I suggest to try and make the descriptions and names more general and if you do not see it clear, I will just revert the commit and document a more restricted scope. |
Yes, you can do that if you like. The shift stuff is still quite specific, but I think it's best to keep it, so that we can come to an end on this soon and publish the workflow :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the documentation issues, it should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still quite a number of issues, including potential critical ones (possibly wrong calculation of shift_3
). I think this, together with the huge documentation that is still not sufficient, is very good evidence that the script is way too complex. I'm not suggesting to break it up at this point - but you absolutely need to be/make sure that what you do is 100% correct and properly and unambiguously documented.
|
||
for feat_name, feat_start, feat_end in intersecting_feat: | ||
shift_5 = alignment.reference_start - feat_start + 1 | ||
shift_3 = alignment.reference_start + alignment.query_length - feat_end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that alignment.query_length
is correct here? What if there are insertions or deletions in the query? Wouldn't that throw off the position relative to the reference?
Wouldn't this do the trick instead?
shift_3 = alignment.reference_end - feat_end
if -shift <= shift_5 <= shift and -shift <= shift_3 <= shift: | ||
tags.append(f"{feat_name}|{shift_5}|{shift_3}|{cigar}|{md}|{seq}") | ||
|
||
return set(tags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need a set here? Doesn't the code already ensure that the values in tags are unique?
concatenated using a semicolon as the separator. If no valid intersecting | ||
feature is found, the alignment is skipped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this last sentence accurate? It seems to me that the alignment is only skipped if there are no intersecting features or if none of the intersecting features pass the shift filter.
|
||
|
||
def parse_intersect_output( | ||
intersect_file: Path, ID: str = "name", extension: int = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ID
is a bad choice for a variable name, because:
- it violates the convention that variables are lower case and constants are upper case
- worse: it masks the built-in
id()
function
""" | ||
cigar = alignment.cigarstring | ||
seq = alignment.query_sequence | ||
md = alignment.get_tag("MD") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it isn't there?
for line in bedfile: | ||
fields = Fields(*line.strip().split("\t")) | ||
|
||
feat_name = attributes_dictionary(fields.feat_attributes)[ID] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the ID
attribute doesn't exist?
Fields = namedtuple( | ||
"Fields", | ||
( | ||
"feat_chr", | ||
"source", | ||
"feat_type", | ||
"feat_start", | ||
"feat_end", | ||
"feat_score", | ||
"strand", | ||
"phase", | ||
"feat_attributes", | ||
"read_chr", | ||
"read_start", | ||
"read_end", | ||
"read_name", | ||
"read_score", | ||
"read_strand", | ||
"overlap_len", | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really a BED file? My bioinfo knowledge is a bit rusty, but I thought BED didn't have score, but rather had strand within the first 6 fields. It rather looks like GTF to me.
Also, what's the behavior if your "BED" file doesn't conform to the expected format (which, IMO, is 100% of the cases, because it's not BED at all - or maybe I'm missing sth)? I think you should properly validate it. Also, might be a good idea to use sth like BioPython to check if you are actually working with a valid extension of BED (or GTF) and not just some custom format.
Outcome: | ||
After adjusting the feature coordinates, both the 5' and 3' shifts are | ||
within the allowed range, so the tag | ||
"hsa-miR-1323|0|0|21M|21|TCAAAACTGAGGGGCATTTTC" is added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a shift, shouldn't at least one of the numbers not be a zero?
tags = [] | ||
|
||
for feat_name, feat_start, feat_end in intersecting_feat: | ||
shift_5 = alignment.reference_start - feat_start + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you 100% sure the offsets (0 vs 1) are properly considered, because this is always such a hassle with SAM being 1-based, BED not being 1-based, pysam being whatever-based... And the somewhat complex handling with shifts and extensions doesn't really help.
BTW, I think this script really needs alignments in the examples, so that it becomes perfectly clear what is what.
Outcome: | ||
The tag "hsa-miR-1323|0|0|21M|21|TCAAAACTGAGGGGCATTTTC" is appended | ||
because the alignment perfectly matches the feature coordinates. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think examples would be useful that (1) use only --shift
and (2) use only --extension
. It really doesn't help clarity that the available examples use either none or both.
Also, I think the examples would benefit from:
- Stating the use case in words (to clarify the use of
--shift
,--extensions
) etc. - Follow the use case with the command.
- Then show the BED and SAM records.
- Then add an actual alignment to visualize what is happening here.
- Add the actual output record.
- Finally add the verbal description/reasoning for the output.
This PR closes #143 .
The isomiR notation used until now was not unambiguous and lead to incorrect counts provided that the CIGAR and MD strings could be the same for different isomiR sequences. To account for this fact, the read sequence is added to the isomiR name.
The changes required to accomplish that are:
iso_name_tagging.py
script to add the read sequence on the isomiR nameiso_name_tagging.py
to account for the isomiR new name format, and the corresponding files. Given that the previous unit test file was not testing for the functions alone, those tests have been added.mirna_quantification.py
script to account for the new name format.mirna_quantification.py
to account for the new name format, and the corresponding files.In addition and to generalize the script scope:
iso_name_tagging.py
(--shift
)iso_name_tagging.py
in a more general wayquantify.smk
workflowiso_name_tagging.py
toannotate_sam_with_bed_features.py