Add support for targeted sequencing #163
base: master
Conversation
The bed file is an excellent idea, and indeed I think the only way to support targeted sequencing well is (IMO) to know the targets and estimate depth on each independently. I also think it makes sense to set some limit and, once coverage goes above that, subsample reads randomly per target/amplicon. Yes, it is typical to get much higher depth with amplicon seq; it is sometimes also an intention to use this to spot minor alleles. I think we do need to set some boundaries on what we are trying to do, though; we can't support everything, and handling everything between 10x and 10,000x depth is complicated. If people want to do exploratory work with super-high depth, they can always select reads for each amplicon, subsample those, use racon to get a consensus, and then remap all of the reads to that consensus to spot minor alleles. Personally, I'd vote for something like a max read depth of 100x, and then feed straight into standard mykrobe. What do you think of that?
Are you saying to set a max read depth on a variant, or are you talking about subsampling the reads within mykrobe?
I'm talking about subsampling.
That seems well outside the scope of mykrobe. Besides, subsampling in this way is not trivial, and I don't think we want to go there. You can't just do a random thing like rasusa because, if there's uneven coverage on your targets, it's quite possible you will lose reads from one or more targets. See mbhall88/rasusa#17
Don't see how it's more outside scope than taking a bedfile, or coping with 100000x depth |
When we originally added "support" for targeted sequencing we were rather cavalier, hoping a rough average depth would work. After the covid work, I'm more inclined to just accept we cannot properly support targeted sequencing and offer something limited, if anything at all. Properly handling multiple (>1) minor alleles when you have huge depth, and distinguishing nanopore homopolymer noise, is going to be hard. That's why I proposed subsampling. Which, yes, requires specific code to subsample per target.
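To make the per-target subsampling idea above concrete, here is a minimal sketch. All names are hypothetical (this is not mykrobe code): it assumes reads have already been grouped by target, and it keeps reads for one target until the cumulative bases reach the depth cap.

```python
import random


def subsample_target(reads, target_len, max_depth=100, seed=42):
    """Randomly subsample one target's reads down to roughly `max_depth`.

    `reads` is a list of (read_id, read_length) tuples for reads assigned
    to this target; `target_len` is the target/amplicon length in bases.
    Reads are shuffled, then accumulated until cumulative bases divided by
    the target length reaches `max_depth`.
    """
    rng = random.Random(seed)  # fixed seed so subsampling is reproducible
    shuffled = list(reads)
    rng.shuffle(shuffled)
    kept, bases = [], 0
    for read_id, length in shuffled:
        if bases / target_len >= max_depth:
            break
        kept.append(read_id)
        bases += length
    return kept
```

Run per target/amplicon, this avoids the rasusa-style pitfall of losing an unevenly covered target entirely, because each target is capped independently.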
Open to suggestions/disagreement! Could have a wrapper repo that does the bedfile and subsampling? I have a collaborator wanting to do targeted sequencing and am about to help them do subsampling plus racon (i.e. completely avoiding Mykrobe), so I'm interested in your opinion.
We are definitely going to need to use this in the future, so I am happy to work up a solution (I might also try drprg on this data and see what happens). So we're going to need to do some kind of alignment to be able to subsample targets, right? We could use minimap2 as it has a Python API? The only pitfall here is I don't know whether Windows can deal with that.
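Whatever aligner we end up using (minimap2 or otherwise), the alignment step reduces to assigning each read to the target it overlaps most. A rough sketch of that assignment, with hypothetical names and alignments represented as plain coordinate tuples rather than any particular aligner's output:

```python
def assign_reads_to_targets(alignments, targets, min_overlap=0.5):
    """Assign each aligned read to the target it overlaps most.

    `alignments` is a list of (read_id, chrom, start, end) tuples, e.g.
    extracted from minimap2 hits; `targets` maps a target name to a
    (chrom, start, end) region. A read is assigned only if at least
    `min_overlap` of its aligned span lies within the winning target.
    """
    assignments = {}
    for read_id, chrom, start, end in alignments:
        best, best_frac = None, 0.0
        for name, (t_chrom, t_start, t_end) in targets.items():
            if chrom != t_chrom:
                continue
            overlap = min(end, t_end) - max(start, t_start)
            span = end - start
            frac = overlap / span if span > 0 else 0.0
            if frac > best_frac:
                best, best_frac = name, frac
        if best is not None and best_frac >= min_overlap:
            assignments.setdefault(best, []).append(read_id)
    return assignments
```

The output maps each target to its read IDs, which is exactly the grouping the per-target subsampling needs.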
My suggestion is this.
Alright. That will be a decent amount of work. Before I do that, I might take a look at how drprg goes on the amplicon data. |
This PR will (eventually/hopefully) close #28
Added:

- `-z`/`--compress`
- `-T`/`--targeted`. Takes a BED file of the regions targeted to aid in the read depth estimation (#28)
- `scripts/combine.py` to combine mykrobe reports into a single CSV file

The major feature addition in this PR is the targeted option. It takes a BED file of the regions the user has amplified and only calculates the expected depth from variants that fall within these regions.
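The targeted option boils down to parsing the BED file and keeping only variants whose position falls inside a targeted region. A minimal sketch of that logic (hypothetical helper names, not the actual functions in this PR):

```python
def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]}.

    BED coordinates are 0-based and half-open, i.e. a region covers
    positions start..end-1. Only the first three columns are used.
    """
    regions = {}
    for line in lines:
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_targets(regions, chrom, pos):
    """True if 0-based position `pos` falls inside any targeted region."""
    return any(start <= pos < end for start, end in regions.get(chrom, []))
```

Variants for which `in_targets` is False would simply be excluded from the expected-depth estimation.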
Motivation
So this implementation has been motivated by some Nanopore targeted sequencing data I have been working with lately. There are two amplification strategies used: PCR and RPA. In particular, RPA cannot amplify an entire gene, so regions of genes with the most variants in a WHO catalogue were chosen. One consequence of this when running vanilla mykrobe is that there are lots of variants with median depth 0/1, which end up drowning out the median depth calculation. To make matters more interesting, some of these samples were sequenced for up to 3 days, so the read depth is crazy high (~100k for some genes).
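To illustrate the drowning-out effect, here is a toy sketch of the expected-depth estimate with and without restricting to targeted regions (hypothetical names; the real calculation in mykrobe is more involved):

```python
from statistics import median


def expected_depth(depths_by_pos, regions=None):
    """Median of per-variant depths, optionally restricted to targets.

    `depths_by_pos` maps (chrom, 0-based pos) to that variant's median
    depth. If `regions` ({chrom: [(start, end), ...]}, BED-style
    half-open) is given, only variants inside a region contribute.
    """
    if regions is None:
        vals = list(depths_by_pos.values())
    else:
        vals = [
            depth
            for (chrom, pos), depth in depths_by_pos.items()
            if any(s <= pos < e for s, e in regions.get(chrom, []))
        ]
    return median(vals) if vals else 0
```

With many off-target variants at depth 0/1, the unrestricted median collapses towards zero, while the targeted median reflects the actual amplicon depth.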
For example, here is a histogram of variant median depths for one sample (subsampled to theoretical coverage 10,000x)
The median depth (expected depth) from mykrobe is 1,074. Using the targeted approach instead, we get the following histogram.
The median depth (expected depth) is then 4,208.
In addition to this, comparing DST predictions: before this option, I was finding lots of poor-quality indels being called (e.g. #139; see our discussion in the Slack channel too). After this change, I didn't see any of those.
One last thing to consider is whether we need to change the minimum proportion of expected depth. This is potentially just because of our high depth, but I have come across a case where we filter out a good call because it only has 14% of the expected depth. That isn't crazy low in the context of targeted sequencing, where it is possible that different genes amplify differently. I'm also not sure whether it is normal for targeted sequencing to produce much higher depths than WGS (I suspect it does), so lowering the default from 0.3 to something like 0.1 when using the targeted option might not be insane? Thoughts?