Duplicates missed when mate has low mapping quality #128
Comments
So this may be more challenging than initially thought. I tried a quick attempt here: https://github.com/broadinstitute/picard/tree/nh_mark_duplicates_with_low_q_end Consider the case where both ends of a read pair map, but one end has low mapping quality (e.g. 0). Let's say we treat such pairs as fragments. What happens to the high-mapq end of the pair? Well, it will be marked as a duplicate if there are other pairs with one end at the same (5' unclipped) position; otherwise it may not be marked as a duplicate. This seems reasonable. But if the other end is randomly assigned to one of, say, 10 equally likely alignment positions (say we have large segmental duplications), then when we treat it as a fragment it may not get duplicate marked. We want the marking to be consistent between both ends (i.e. both are marked as duplicates or neither is). We could treat any end of a pair below the mapping quality threshold as unmapped. Nonetheless, this will require some rigorous testing and some serious thought, so I am deferring this until we have some time to devote to it. |
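To make the "treat the low-quality end as unmapped" idea concrete, here is a minimal sketch, not Picard's actual code. `ReadEnd`, `PairKeySketch`, `pairKey`, and the `minMapq` parameter are all hypothetical names invented for illustration. The sketch builds a grouping key for a pair in which any end below the threshold contributes a placeholder instead of its position, so two copies of a fragment whose low-mapq ends landed in different places still share a key. Because the key belongs to the pair as a whole, both ends would be marked (or not) together.

```java
/** Hypothetical, simplified view of one end of a read pair (not htsjdk's SAMRecord). */
final class ReadEnd {
    final String contig;            // reference sequence name
    final int unclippedFivePrime;   // 5' unclipped position
    final boolean reverseStrand;    // true if aligned to the reverse strand
    final int mappingQuality;       // aligner-reported mapping quality

    ReadEnd(String contig, int unclippedFivePrime, boolean reverseStrand, int mappingQuality) {
        this.contig = contig;
        this.unclippedFivePrime = unclippedFivePrime;
        this.reverseStrand = reverseStrand;
        this.mappingQuality = mappingQuality;
    }
}

/** Hypothetical helper: builds a duplicate-grouping key that ignores low-mapq ends. */
final class PairKeySketch {
    /** Placeholder used in the key for an end that is treated as unmapped. */
    static final String UNMAPPED = "*";

    /** Key for a single end; ends below minMapq are treated as if they were unmapped. */
    static String endKey(ReadEnd end, int minMapq) {
        if (end.mappingQuality < minMapq) {
            return UNMAPPED;
        }
        return end.contig + ":" + end.unclippedFivePrime + ":" + (end.reverseStrand ? "-" : "+");
    }

    /** Key for the whole pair; the two end keys are sorted so input order does not matter. */
    static String pairKey(ReadEnd first, ReadEnd second, int minMapq) {
        String a = endKey(first, minMapq);
        String b = endKey(second, minMapq);
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }
}
```

With `minMapq` set to 0 the key reduces to plain position-based grouping, which is the behavior that misses these duplicates.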
Interesting. So what I think you're saying is that it's going to be hard to remove this class of duplicates given the current implementation (which looks at start/end pos of the paired reads). Wasn't there an implementation of duplicate marking done by Chris Hartl (under the wing of Mauricio at the time and with Tim F advising) that looked at the sequence context of the reads rather than start/end pos? That should theoretically work really well here. I'll investigate. |
@eitanbanks I mean to say that it will take more than 30 minutes to implement so we are deferring. |
Yes, great. Yossi and I just talked through this and think we've come up with a solution that should work (I'll let him describe at his own leisure). I want an answer to a question about the Buick data now - that's why I'm looking at the other implementation today. |
@eitanbanks ran a foghorn implementation of MarkDuplicates tool that uses (a Fourier transform of) the 5' end sequences to identify the fragments. He reports that this resolves the problem described here. This presents two possibilities (as I see it):
It should be said that this problem presents mainly in low-complexity, high depth samples which are not something we are promoting...so this points to the second solution since it would entail minimal disruption and presumably be less time-consuming. |
If you are planning to change the behavior of MarkDuplicates, I would -Bob On 1/6/15, 4:32 PM, Yossi Farjoun wrote:
|
This issue also seems to come up in samples with high chimeric rates, some The screen shot is from sample Yossi. On Fri, Dec 19, 2014 at 11:53 AM, Eric Banks notifications@github.com
|
@yfarjoun should we be posting file paths here? |
@yfarjoun I have a bam with two chimeric read pairs in which:
This doesn't quite fit the bill of the bug you describe above (since both reads do align the same), but seems related. Do you know why this happens? Perhaps it is because there is no mate-cigar tag (why isn't there one, when other reads in the bam have it)? I'm attaching the relevant portion of the .bam:
|
I see that the records are from different readgroups. Could you include Also, could you send the mates? I'd like to see the 5' end of their On Thu, Nov 19, 2015 at 2:24 AM, pmBarlev notifications@github.com wrote:
|
Perhaps the issue is that these reads are not primary alignments (see line 148 in MarkDuplicates.java)? Indeed, both the corresponding primary alignments and their mates are marked as duplicates. In any case, here are the read groups:
Here are the mates:
And the primary alignments
|
That indeed seems to be the issue (the secondary alignments). On Thu, Nov 19, 2015 at 7:54 AM, pmBarlev notifications@github.com wrote:
|
This thread being well over a year old, I'm closing it. Resurrect it if this is still a thing. |
When a paired-end read has one end with good mapping quality and the other with low (e.g. 0) mapping quality, the aligner may place the low-quality end at any of several locations at random (even with the same aligner). This means that a duplicate of such a fragment will incorrectly go unmarked if the two versions of the second read are aligned differently (note that this has nothing to do with secondary or supplementary alignments).
Perhaps the solution could be as simple as adding another condition to line 283 in MarkDuplicates.java (and perhaps a similar condition in MarkDuplicatesWithCigar?) that verifies that the mate is well mapped. I suspect that MQ!=0 would suffice, though it depends on the aligner...so perhaps it could be an @option.
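As a rough illustration of the proposed condition, the sketch below shows a configurable mate-quality check. It is not the code at line 283 of MarkDuplicates.java; `MateMappingCheck`, `usePairedLogic`, and `minMateMapq` are hypothetical names, and "MQ" is read here as the mate's mapping quality. How to handle a record with no mate quality available is also an assumption: the sketch keeps the paired behavior in that case.

```java
/** Hypothetical sketch of a configurable "mate is well mapped" condition. */
final class MateMappingCheck {
    /** Minimum mate mapping quality for the pair to be handled as a pair (assumed option). */
    private final int minMateMapq;

    MateMappingCheck(int minMateMapq) {
        this.minMateMapq = minMateMapq;
    }

    /**
     * Returns true if the read should go through pair-based duplicate logic.
     *
     * @param paired       whether the read is part of a pair
     * @param mateUnmapped whether the mate is flagged as unmapped
     * @param mateMapq     the mate's mapping quality, or null if it is not available
     */
    boolean usePairedLogic(boolean paired, boolean mateUnmapped, Integer mateMapq) {
        if (!paired || mateUnmapped) {
            return false;
        }
        if (mateMapq == null) {
            // Assumption: with no mate quality available, keep the existing paired behavior.
            return true;
        }
        return mateMapq >= minMateMapq;
    }
}
```

A threshold of 1 corresponds to the MQ!=0 suggestion; exposing the threshold as an option would let users tune it for aligners that report mapping quality differently.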