fix: classify mature miRNAs #146

deliaBlue · 2024-08-18T17:29:07Z

This PR closes #143.

The isomiR notation used until now was ambiguous and led to incorrect counts, because the CIGAR and MD strings can be identical for different isomiR sequences. To resolve this, the read sequence is now added to the isomiR name.
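To illustrate the ambiguity, here is a minimal sketch. It assumes a hypothetical "|"-separated name layout and made-up reads; the actual name format written by iso_name_tagging.py is not shown in this description. It relies only on the fact that the MD tag records the reference base at a mismatch, not the read base, so two reads that differ in their last nucleotide can share both CIGAR and MD.

```python
from collections import Counter
from typing import NamedTuple


class Alignment(NamedTuple):
    feature: str  # name of the intersecting miRNA annotation (made up)
    cigar: str    # SAM CIGAR string
    md: str       # SAM MD tag value
    seq: str      # read sequence


def name_without_seq(aln: Alignment) -> str:
    # Hypothetical old-style name: the read sequence is not part of it.
    return f"{aln.feature}|{aln.cigar}|{aln.md}"


def name_with_seq(aln: Alignment) -> str:
    # Hypothetical new-style name: the read sequence is appended.
    return f"{name_without_seq(aln)}|{aln.seq}"


# Two different reads with a mismatch at the last position. The reference
# base there is 'A', so both reads get MD "21A0" and CIGAR "22M".
reads = [
    Alignment("example-miR", "22M", "21A0", "TAGCTTATCAGACTGATGTTGC"),
    Alignment("example-miR", "22M", "21A0", "TAGCTTATCAGACTGATGTTGG"),
]

counts_old = Counter(name_without_seq(r) for r in reads)
counts_new = Counter(name_with_seq(r) for r in reads)

print(len(counts_old))  # number of distinct names without the sequence
print(len(counts_new))  # number of distinct names with the sequence
```

Without the sequence, both reads collapse into a single name and are counted together; with the sequence appended, each isomiR keeps its own count.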

The changes required to accomplish that are:

Modify the iso_name_tagging.py script to add the read sequence to the isomiR name.
Modify the unit tests for iso_name_tagging.py, and the corresponding test files, to account for the new isomiR name format. Because the previous unit tests did not test the functions in isolation, such tests have been added.
Modify the mirna_quantification.py script to account for the new name format.
Modify the unit tests for mirna_quantification.py, and the corresponding test files, to account for the new name format.
Update the expected output of the pipeline to include the read sequence in the isomiR names.

In addition, to generalize the scope of the script:

Add a new CLI argument to iso_name_tagging.py (--shift)
Document iso_name_tagging.py in a more general way
Modify unit tests to account for the new CLI argument
Add new argument to the quantify.smk workflow
Rename iso_name_tagging.py to annotate_sam_with_bed_features.py

Merge with dev branch

merge with dev

…olanlab/mirflowz into 126-docs-describe-workflow-rationale

uniqueg

In some places, the documentation of the script is written as if the script is a general purpose script for adding intersecting features to SAM files as a tag. But in other places, it becomes clear that there are several assumptions that limit the scope of the script (e.g., miRNA_ID, shifts/extensions) to miRNAs and isomiRs. I think you could clarify that a bit better and perhaps also make a reference to the scripts that produce valid inputs to this file, because it is highly unlikely that someone would create the inputs for this script manually.

Btw, it would have actually been nice to design this script such that it actually is a general purpose script for adding "name" tags for intersecting features to SAM files and do all the other stuff (dealing with alignments that don't have an intersecting feature, dealing with maximum extensions etc) elsewhere. As it is, this script is quite complicated and has basically zero chance of reuse outside of this workflow.

Anyway, not important now - just lessons for the future :)

So please just clarify the scope of the script in the module-level docstring and I think we are ready to go.

scripts/iso_name_tagging.py

…zavolanlab/mirflowz into 143-fix-classify-correctly-mature-mirna

deliaBlue · 2024-08-31T20:54:18Z

I believe the script per se is pretty general: it adds a custom tag showing which features an alignment intersects with. I think the problem lies in how the variables are named and how the script is documented. As it stands, the script appears to be only for miRNAs whose annotations might or might not have been previously extended. But if the word "extension" is changed to "shift" or "range", and the description changes to "Allowed shift range between either end of the alignment and the intersecting feature" (or maybe a better description, but just for you to get the idea), then it becomes more general and the script does not have to change that much.
Another example would be the way the tag format is specified. If I use intersecting_feature instead of miRNA_ID, any kind of sequence can be used as long as it has been intersected using Bedtools intersect.

I suggest that I try to make the descriptions and names more general, and if the result is still not clear to you, I will just revert the commit and document a more restricted scope.
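For reference, one possible reading of "allowed shift range between either end of the alignment and the intersecting feature" can be sketched as follows. This is not the script's actual logic: the function names, the 1-based inclusive coordinates, and the sign convention (positive means the alignment end lies downstream of the feature end in transcript direction) are all assumptions made for illustration.

```python
from typing import Tuple


def end_shifts(
    aln_start: int, aln_end: int,
    feat_start: int, feat_end: int,
    strand: str,
) -> Tuple[int, int]:
    """Return (shift_5p, shift_3p) of an alignment relative to a feature.

    Coordinates are assumed to be 1-based and inclusive. On the minus
    strand the 5' end is the higher coordinate, so the roles of start
    and end are swapped and the sign is flipped.
    """
    if strand == "+":
        shift_5p = aln_start - feat_start
        shift_3p = aln_end - feat_end
    elif strand == "-":
        shift_5p = feat_end - aln_end
        shift_3p = feat_start - aln_start
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return shift_5p, shift_3p


def within_shift(
    aln_start: int, aln_end: int,
    feat_start: int, feat_end: int,
    strand: str, max_shift: int,
) -> bool:
    """True if both alignment ends lie within max_shift of the feature ends."""
    shifts = end_shifts(aln_start, aln_end, feat_start, feat_end, strand)
    return all(abs(s) <= max_shift for s in shifts)


# Feature at 100-121; the alignment is moved one base downstream.
print(end_shifts(101, 122, 100, 121, "+"))
print(end_shifts(99, 120, 100, 121, "-"))
```

Written this way, nothing in the check is specific to miRNAs: any feature with a start, an end, and a strand can be compared against an alignment. The strand handling is the part most easily gotten wrong, so it is worth a dedicated unit test.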

uniqueg · 2024-09-02T09:46:04Z

Yes, you can do that if you like. The shift stuff is still quite specific, but I think it's best to keep it, so that we can come to an end on this soon and publish the workflow :)

uniqueg

Apart from the documentation issues, it should be fine.

.github/workflows/tests.yml

scripts/tests/test_iso_name_tagging.py

scripts/iso_name_tagging.py

scripts/tests/test_annotate_sam_with_bed_features.py

uniqueg

There are still quite a number of issues, including potential critical ones (possibly wrong calculation of shift_3). I think this, together with the huge documentation that is still not sufficient, is very good evidence that the script is way too complex. I'm not suggesting to break it up at this point - but you absolutely need to be/make sure that what you do is 100% correct and properly and unambiguously documented.

uniqueg · 2025-03-19T18:13:47Z