AggregateSvPileup should account for inaccurate split-read breakpoint positions #13

Open
pamelarussell opened this issue May 24, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@pamelarussell
Contributor

Currently AggregateSvPileup merges breakpoint pairs whose left breakpoints and right breakpoints both fall within a distance threshold of each other, regardless of the type of read evidence supporting them: split-read (breakpoint occurs inside a sequenced read) or read-pair (breakpoint occurs in the unsequenced insert between mates).

However, these two types of evidence locate the breakpoint with different precision and should use different distance thresholds. While split-read evidence is likely to point to a very precise position, the position for a read-pair event can be off by as much as the inner distance (insert size minus the lengths of both reads). Something similar to the following procedure should be used instead:

  1. "Seed" clusters by clustering only breakpoints that have split-read evidence
  2. "Seed" additional clusters with breakpoints that have read-pair evidence
  3. Use read-pair events to aggregate clusters when the distance is within the inner distance (computed empirically by sampling)
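
A minimal Python sketch of one way to read the three steps above, not a description of how AggregateSvPileup works. All names (`Breakpoint`, `estimate_inner_distance`, `greedy_cluster`, `aggregate`) are hypothetical. It assumes breakpoints on a single pair of contigs with strand ignored, uses the median of sampled insert sizes as the empirical estimate, and uses a simple greedy clustering rather than any particular published method.

```python
from statistics import median
from typing import List, NamedTuple


class Breakpoint(NamedTuple):
    left: int      # position of the left breakend
    right: int     # position of the right breakend
    evidence: str  # "split_read" or "read_pair"


def estimate_inner_distance(insert_sizes: List[int], read_length: int) -> int:
    """Inner distance = insert size minus both read lengths (median of a sample)."""
    return max(0, int(median(insert_sizes)) - 2 * read_length)


def close(a: Breakpoint, b: Breakpoint, max_dist: int) -> bool:
    """True if the left sides and the right sides are each within max_dist."""
    return abs(a.left - b.left) <= max_dist and abs(a.right - b.right) <= max_dist


def greedy_cluster(bps: List[Breakpoint], max_dist: int) -> List[List[Breakpoint]]:
    """Add each breakpoint to the first cluster containing a close member."""
    clusters: List[List[Breakpoint]] = []
    for bp in sorted(bps):
        for cluster in clusters:
            if any(close(bp, member, max_dist) for member in cluster):
                cluster.append(bp)
                break
        else:
            clusters.append([bp])
    return clusters


def aggregate(bps: List[Breakpoint], split_read_dist: int, inner_dist: int):
    # Steps 1 and 2: seed clusters separately for each evidence type.
    split = greedy_cluster([b for b in bps if b.evidence == "split_read"], split_read_dist)
    pair = greedy_cluster([b for b in bps if b.evidence == "read_pair"], inner_dist)
    # Step 3: fold a read-pair cluster into a split-read cluster when any of its
    # members lies within the inner distance of a split-read member.
    merged = [list(c) for c in split]
    for pc in pair:
        idx = next(
            (i for i, sc in enumerate(split)
             if any(close(p, m, inner_dist) for p in pc for m in sc)),
            None,
        )
        if idx is None:
            merged.append(list(pc))
        else:
            merged[idx].extend(pc)
    return merged


inner = estimate_inner_distance([300, 350, 400], read_length=100)
clusters = aggregate(
    [
        Breakpoint(1000, 5000, "split_read"),
        Breakpoint(1002, 5001, "split_read"),
        Breakpoint(1100, 5080, "read_pair"),
        Breakpoint(9000, 20000, "read_pair"),
    ],
    split_read_dist=5,
    inner_dist=inner,
)
```

Here the two split-read breakpoints seed one tight cluster, the nearby read-pair event joins it because both of its sides are within the estimated inner distance, and the distant read-pair event stays in its own cluster. A high percentile of the sampled insert sizes may be a better threshold than the median; that choice is left open here.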
@pamelarussell pamelarussell added the enhancement New feature or request label May 24, 2022
@tfenne
Member

tfenne commented Sep 28, 2023

Agreed - I think a multi-pass strategy would work, though I think I would suggest something different:

  • Have parameters max-split-read-distance and read-pair-inner-distance (or compute the latter)
  • Aggregate events with split-read evidence within max-split-read-distance; this parameter should probably be set based on aligner parameters (e.g. how far from the breakpoint could a single sequencing error be and still cause the read to get clipped at that point?)
  • Take all read-pair evidence and see if it can be said to support a single event defined by aggregating split reads, and if so assign it; in this case I think it should determine compatibility by whether the sum of the distances on both sides is < the max inner distance, rather than evaluating each side independently.
  • Take the remaining read pairs that could support multiple events, and try to break ties based on position, or split the count?
  • Take the remaining read pairs and cluster those independently
