Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Inconsistency in MergeBamAlignment #1569

Open
2 tasks
yfarjoun opened this issue Aug 29, 2020 · 3 comments
Open
2 tasks

Inconsistency in MergeBamAlignment #1569

yfarjoun opened this issue Aug 29, 2020 · 3 comments

Comments

@yfarjoun
Copy link
Contributor

A recent refactor in htsjdk and the changes in MergeBamAlignment that took advantage of said refactoring #1484, #1509, and #1513, result in occassionally inconsistent bam records:

for example:

readname 99 chr3 93470586 60 49M102S = 93470586 46 * * MC:Z:102S46M3S MQ:i:60 AS:i:71

Which has a read-length that is based on the aligned positions and clipping that is looking at the unclipped read start (note the 3S in the MateCigar)

While this change in behavior was not intentional, it does raise the question, which is the more correct approach?

  1. Assume that the first 5' base marks one end of the insert
  2. Assume that the first aligned 5' base marks one end of the insert.

Since I think that there should be only one interpretation for the size of the insert, and that interpretation should be used for both TLEN and clipping overlapping reads, and I do not wish to change the definition of TLEN, I think that the right thing to do would be to revert the behaviour of MBA back to the original, i.e. use the aligned 5' ends to determine the ends of the Insert, but I'm open to hearing other arguments and options.

Instructions

  • Use a concise yet descriptive title;
  • Determine whether your issue is a bug report, a feature request, or a documentation request;
  • Choose the corresponding template block below and fill it in, replacing or deleting text in italics (surrounded by _) as appropriate;
  • Delete the other template blocks and this header.

Bug Report

Affected tool(s)

Tool name(s), special parameters?

Affected version(s)

  • Latest public release version [version?]
  • Latest development/master branch as of [date of test?]

Description

Describe the problem below. Provide screenshots , stacktrace , logs where appropriate.

Steps to reproduce

Tell us how to reproduce this issue. If possible, include command lines that reproduce the problem and provide a minimal test case.

Expected behavior

Tell us what should happen

Actual behavior

Tell us what happens instead


Feature request

Tool(s) involved

Tool name(s), special parameters?

Description

Specify whether you want a modification of an existing behavior or addition of a new capability.
Provide examples, screenshots, where appropriate.


Documentation request

Tool(s) involved

Tool name(s), parameters?

Description

Describe what needs to be added or modified.


@yfarjoun
Copy link
Contributor Author

image

This is an example of how the aligned reads look in IGV. The top image is the new behavior, with less clipping, while the bottom has more clipping. As one can see, it seems that (at least in this case) the alignment itself is suspect since the top and bottom reads disagree with each other strongly outside the middle 47 bases.

@yfarjoun
Copy link
Contributor Author

However, in another example it looks like the problem is the quality of the sequencing as the bases agree with their mate a large percentage of the cases in the clipped parts:

image

@kachulis
Copy link
Contributor

kachulis commented Sep 1, 2020

@yfarjoun this comment is mostly a review from our offline discussion, adding here for transparency/documentation.

Personally, if not for agreement with past behavior, I don't think there is a clear "better" choice between the "old" and "new" behavior of MBA. I think it is relatively clear that in both of your examples the old behavior is soft clipping bases which are not adapter readthrough. This is clear in the first example because the clipped bases are different from the bases in the mate, and in the second example because the clipped bases agree with the reference. As you have mentioned though, the old behavior leads to a more consistent agreement between TLEN and the cigar, avoiding situations where we have more matched bases in a read than the supposed length of the template. This seems like a desirable feature of the old behavior.
Also, as you have observed in your two examples, read pairs with 5' clipping are generally pretty messy data to begin with. Since any downstream tools which would care a lot about these bases can (and probably do) just revert the soft clips anyway, it doesn't seem like a particularly consequential difference in behavior, from the perspective of allowing a downstream tool to run correctly and achieve good results.
Since I don't see either behavior as clearly "better" in the case of soft clipping, I agree with you that reverting to the old (pre #1484) behavior for soft clipping makes the most sense. I don't think the change in behavior in #1484 was intentional anyway.

However, I think the situation changes when we are hard clipping. While it seems acceptable to soft clip messy reads when we aren't quite sure what is going on, I think we need to be more conservative with hard clipping. Even though we store the clipped bases in a tag, that's really only helpful if someone notices a problem; I highly doubt any tool will be automatically checking those tags to see if we may have hard clipped some bases we shouldn't have. Both of the examples you provided above are situations where I wouldn't feel great about hard clipping all the bases that the old version of MBA soft clips. But even more concerning is a situation where the soft clip at the 5' end of the mate is due to a fairly large indel near the end of the read. An assembly based caller will be able to resolve this indel given evidence from other spanning reads, and these reads will provide additional evidence for the indel. I would not want to hard clip these bases off the 3' end of the read, and lose that additional evidence.

So, a proposed behavior for MBA:

  1. overlapping reads are soft clipped so that the 3' aligned end does not extend past the 5' aligned end of the mate. This reverts to the "old" MBA behavior when soft clipping
  2. if using hard clipping, reads are additionally hard clipped so that the 3' unclipped end of the read does not extend past the 5' unclipped end of the mate. This implements the "new" MBA behavior when hard clipping.

The result of this behavior is that:

  • When running in soft clipping mode, the results are identical to the "old" MBA
  • When running in hard clipping mode, the hard clipped bases are the same as in the "new" MBA. Additionally, any bases which would have been clipped by the "old" MBA but not by the "new" MBA are soft clipped.
  • Consistency between TLEN and the cigar is maintained for both situations, ie. we won't have reads with longer matches than the 5'-5' TLEN.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants