CNV-induced inversions #340

Open
adf-ncgr opened this issue Sep 10, 2020 · 7 comments
@adf-ncgr
Contributor

Despite #262, we are still seeing some false inversions in regions of copy number variation (CNV), e.g.:
https://legumeinfo.org/gcv2/gene;lis=phavu.Phvul.003G002400?algorithm=repeat&match=10&mismatch=-1&gap=-1&score=30&threshold=25&bmatched=20&bintermediate=10&bmask=10&linkage=average&cthreshold=20&neighbors=10&matched=4&intermediate=5&sources=lis&bregexp=&border=chromosome&regexp=&order=distance

Per @alancleary:

> Yeah, it's an inversion issue. Unfortunately the inversion algorithm is optimizing score, so that inversion is hard to avoid since the CNV makes the inversion's score higher than the non-inverted segment.
>
> Yes, I already attempted to fix this. There are some heuristics in place to prevent inversions of tandem duplications. This may be an edge case caused by the orphans breaking up the duplications, but I can't say for sure. Feel free to open an issue if you want me to attempt another fix.
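
To make the scoring problem concrete, here is a minimal sketch that uses plain Smith-Waterman local alignment as a stand-in for the actual repeat algorithm (which may differ in its details). The `match`, `mismatch`, and `gap` values are taken from the URL parameters above; the function name and the family labels are hypothetical. Because a run of tandem duplicates reads the same in both directions when only gene families are compared, the forward and reversed alignments tie, so any small difference in the flanking genes can tip the choice toward the inversion.

```python
def local_score(query, target, match=10, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between two lists of gene families."""
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical tracks: a tandem run of "famX" flanked by unrelated families.
query = ["fam1", "famX", "famX", "famX", "fam2"]
target = ["fam3", "famX", "famX", "famX", "fam4"]

forward = local_score(query, target)
reverse = local_score(query, target[::-1])
print(forward, reverse)  # the CNV run contributes equally in both orientations
```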

A couple of things to note: in this case the orientation of the genes seems like it would be helpful; see #140 for some discussion of this in another context (not exactly sure it applies here). We could also consider something akin to "homopolymer compression" to minimize the impact of CNV in such situations. If nothing else, perhaps introducing an inversion penalty would make sense.
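
As a rough illustration of the last two ideas, the sketch below collapses runs of consecutive identical gene families (the analogue of homopolymer compression) and applies a fixed inversion penalty when choosing between orientations. Both function names and the penalty value of 15 are assumptions made for illustration; nothing here reflects how GCV is actually implemented.

```python
from itertools import groupby


def compress_runs(families):
    """Collapse consecutive repeats of the same family into a single entry."""
    return [family for family, _ in groupby(families)]


def choose_orientation(forward_score, reverse_score, inversion_penalty=15):
    """Keep the forward alignment unless the reverse one wins by more than the penalty.

    The default penalty is an arbitrary illustrative value.
    """
    if reverse_score - inversion_penalty > forward_score:
        return "reverse"
    return "forward"


track = ["fam1", "famX", "famX", "famX", "fam2"]
print(compress_runs(track))
print(choose_orientation(forward_score=30, reverse_score=40))
```

With compression, a tandem run contributes a single match instead of one per copy, so it can no longer dominate the score of either orientation.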

@alancleary alancleary added the bug label Sep 10, 2020
@adf-ncgr
Contributor Author

Thanks @alancleary, that's an interesting example, although technically I think it is not an inversion but rather a segmental duplication. This is perhaps clearer when looking at the dotplot, although it is somewhat puzzling to me why we see more copies of some of the genes in the dotplot than appear to exist in the aligned track. Maybe some segments are getting ignored due to their scores (I'm trying to fiddle with the params a bit and having trouble getting it to bend to my will; must be getting old!)

image

@alancleary
Contributor

Well shoot. I think you're right; it's just a segmental duplication. I may need to start taking gene orientation into account again when aligning. That would probably make these false inversions way more uncommon (including the CNV case).
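
One way to picture orientation-aware matching is sketched below, assuming each gene is represented as a `(family, strand)` tuple with strand `+1` or `-1`. This representation and the helper names are assumptions, and the position-wise comparison is a deliberate simplification of a real alignment. The point is that reading a track in the inverted direction also flips every strand, so a tandem run on one strand no longer matches itself when reversed, while a genuine inversion still does.

```python
def flip(track):
    """Read a track from the opposite strand: reverse the order and negate each strand."""
    return [(family, -strand) for family, strand in reversed(track)]


def matches(a, b):
    """Count position-wise matches, requiring both family and strand to agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Tandem duplication: same family, same strand, no inversion.
tandem_query = [("famX", 1)] * 3
tandem_target = [("famX", 1)] * 3
print(matches(tandem_query, tandem_target))        # forward orientation
print(matches(tandem_query, flip(tandem_target)))  # inverted orientation

# Genuine inversion: the target is the query read from the other strand.
inv_query = [("famA", 1), ("famB", -1), ("famC", 1)]
inv_target = flip(inv_query)
print(matches(inv_query, inv_target))        # forward orientation
print(matches(inv_query, flip(inv_target)))  # inverted orientation
```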

Regarding the difference between copies of genes in the dot plots vs the micro-synteny view, dot plots generate a circle for all pairs of genes that share the same family. That's why you get grids of dots when there are tandem duplications:
Screenshot 2021-09-17 at 14-43-47 Genome Context Viewer
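
A small sketch of that pairing rule, assuming genes are `(name, family)` tuples (the names and the function are hypothetical, not GCV's code): every query gene is paired with every target gene in the same family, so m copies against n copies yield an m × n grid of dots.

```python
def dot_plot_points(query, target):
    """Return one (query_gene, target_gene) pair for every pair sharing a family."""
    return [
        (q_name, t_name)
        for q_name, q_family in query
        for t_name, t_family in target
        if q_family == t_family
    ]


query = [("q1", "famX"), ("q2", "famX")]
target = [("t1", "famX"), ("t2", "famX"), ("t3", "famX"), ("t4", "famY")]

dots = dot_plot_points(query, target)
print(len(dots))  # 2 query copies x 3 target copies
```

An alignment, by contrast, pairs each gene with at most one partner, which is why the dot plot can show more copies than the aligned track.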

@adf-ncgr
Contributor Author

Not "just" a segmental duplication: it's a very nice demonstration of a less common type of SV that you've handled nicely (IIRC it is the reason we brought in the repeat algorithm in the first place!)

I know about the CNV grid effect, but in the case of the segmental duplication I'd expect to see (for example) three copies of the "gold" gene in the aligned medsa.chr4.4 track to one in the query phavu.Chr02 track, but in fact I see only 2 to 1 in the alignment, as seen here:
image

My guess is that the middle copy in the dotplot is getting hidden in the alignment for some technical reason, but I could certainly be wrong!

@adf-ncgr
Contributor Author

BTW, in the dotplot, when you hover over a circle you only get info about the query track gene. It would be nice to know what the pair represents! Let me know if this is issue-worthy.

@alancleary
Contributor

> My guess is that the middle copy in the dotplot is getting hidden in the alignment for some technical reason, but I could certainly be wrong!

Oh, I see. I guess I'll take a look at it while I'm working on this issue.

> BTW, in the dotplot, when you hover over a circle you only get info about the query track gene. It would be nice to know what the pair represents! Let me know if this is issue-worthy.

I actually had that same thought when playing with the dot plots just now. Definitely issue-worthy!

@adf-ncgr
Contributor Author

Just want to bump this issue slightly (however symbolic an act that may be) after having had to convince myself (and a collaborator) that the "inversion" seen in this example: https://medicago.legumeinfo.org/tools/gcv/gene;medicago=medtr.A17.gnm5.ann1_6.MtrunA17Chr4g0064841?q=medtr.A17.gnm5.MtrunA17Chr4:55872871-56368598&sources=medicago&algorithm=repeat&match=10&mismatch=-1&gap=-1&score=30&threshold=25&bmatched=20&bintermediate=10&bmask=10&bchrgenes=1&bchrlength=100000&linkage=average&cthreshold=20&neighbors=35&matched=4&intermediate=5&bregexp=&border=distance&regexp=&order=distance
is in fact a case of this issue and not something to include in a manuscript. What tipped me off was the mismatch between the orientation of the arrows and the query track in the flipped region, but I confirmed it with whole genome sequence alignment/dotplot inspection. I think this further confirms that gene orientation should be factored into alignment scoring in some way.
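
The by-eye check described here could be automated roughly as follows. This is only a sketch under stated assumptions: strands are encoded as `+1`/`-1` in each genome's own coordinates, the aligned gene pairs of the reported inversion are already available, and the function name and the 0.8 threshold are hypothetical. In a genuine inversion the aligned genes should sit on opposite strands, so a block whose pairs mostly share a strand is suspect.

```python
def inversion_supported(strand_pairs, min_fraction=0.8):
    """Return True if enough aligned pairs have opposite strands.

    strand_pairs: (query_strand, target_strand) tuples for the genes aligned
    in a block that was reported as inverted.
    min_fraction: illustrative threshold, not a tuned value.
    """
    if not strand_pairs:
        return False
    flipped = sum(1 for q, t in strand_pairs if q == -t)
    return flipped / len(strand_pairs) >= min_fraction


real_inversion = [(1, -1), (-1, 1), (1, -1)]
cnv_artifact = [(1, 1), (1, 1), (-1, -1)]
print(inversion_supported(real_inversion))
print(inversion_supported(cnv_artifact))
```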
