
About gray lines... #51

Open
danilkotelnikov opened this issue Nov 11, 2024 · 4 comments

@danilkotelnikov

Greetings! I couldn't find any official contact information, e.g. a mailing list, so I'm writing my question here... First, thank you for developing such user-friendly software! We appreciate it!

Second... Our junior bioinformatics team has been attempting to assemble a large wild soybean genome with two different strategies, so JupiterPlot was exactly the tool we needed to assess them. Both genomes were assembled to contigs, reference-scaffolded, and then compared to the same reference; only the subsets of input data differ between them. I'd like to know how to interpret the gray lines in these plots: are they local misassemblies, or real genomic events/features? Can we say that the different data subsets (used in the different attempts) had a significant impact on the dramatic differences shown in the two plots? Is there any text-formatted output describing how those lines are produced, so we could analyze them statistically?

Thanks!

P.S. The parameters changed were g=10000 and ng=80.
JC_REF-REF_EXONT-QUER_RTCAFFS
JC_REF-REF_DEF-QUER_RTCAFFS

@JustinChu
Owner

At the most basic level, grey links denote the start and end of link bundles. Even the large regions have grey lines at the start and end points of contiguous regions (though they may be hard to see). In theory, the tiny grey links could have colour in them too, but they are so small that the colour is not rendered. The grey lines are always rendered for every region, even very small ones, according to the link settings.

Interpreting these links really depends on the link settings (maxGap and minBundleSize). If both are set the same, I think you can probably interpret the first plot as having more misassemblies or translocations on average (or more that are larger in size; decrease minBundleSize to check). It is hard to determine whether they are real biological differences or misassemblies, but seeing as the other plot is just as contiguous and isn't showing them, this could indicate that they are misassemblies.

In terms of text-formatted information to analyze links statistically, that's a little tougher. Mostly I'm not sure how informative it is as a metric, since links are both joined together and filtered based on upstream settings. However, if you really want to, you can parse the prefix.links.final file combined with prefix.seqOrder.txt to determine which scaffold is syntenic to which chromosome, and then count the number of links that match the syntenic scaffold and the number that do not. If you used something like QUAST, I'd expect its misassembly metrics to correlate.
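A rough sketch of that counting idea in Python. Note this is a hypothetical illustration, not a supported JupiterPlot utility: the exact column layouts of prefix.links.final and prefix.seqOrder.txt are assumptions (Circos-style links with "chrA startA endA chrB startB endB", and a whitespace-separated mapping with the scaffold in column 1 and its assigned chromosome in column 2), so check them against your own files and adjust the field indices.

```python
# Hypothetical example: count links that agree with each scaffold's syntenic
# chromosome vs. those that do not. File formats below are ASSUMPTIONS about
# JupiterPlot's output -- verify the column layout in your own prefix.* files.
from collections import Counter

def load_syntenic_map(seq_order_path):
    """Assumed format: scaffold ID in column 1, assigned chromosome in column 2."""
    syntenic = {}
    with open(seq_order_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:
                syntenic[fields[0]] = fields[1]
    return syntenic

def count_links(links_path, syntenic):
    """Assumed Circos-style link lines: chrA startA endA chrB startB endB [...].
    Adjust which columns hold the reference chromosome and the scaffold
    if your pipeline writes them in the opposite order."""
    counts = Counter()
    with open(links_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 6:
                continue
            ref_chrom, scaffold = fields[0], fields[3]
            if syntenic.get(scaffold) == ref_chrom:
                counts["syntenic"] += 1
            else:
                counts["non_syntenic"] += 1
    return counts
```

The ratio of non-syntenic to total links would then be a crude per-assembly summary to compare across the two plots, with the caveat above that upstream bundling/filtering settings shape what survives into the links file.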

Let me know if that makes sense.

@danilkotelnikov
Author

Thanks for responding, @JustinChu! And thanks for the tip; it now makes sense to me that those links could indeed be affected by mixing raw sequencing data. For context: the first plot depicts our genome built with our own PE data plus public ONT data (from the SRA), while the second used only our PE data. So the ONT+PE-built scaffolds may really be showing us some unique genomic events (even though we mixed genomic data from two different plant lines of the same species), or misassemblies, relative to the reference genome. The second plot doesn't show the same, because reference-based scaffolding there was performed only on PE-derived contigs of very short length (max approx. 150 kb), so that genome ends up very close to the reference. Thanks again for the help!

@JustinChu
Owner

JustinChu commented Nov 13, 2024

Interesting. So you have been doing reference-guided scaffolding. If that is the case, it makes sense why the results are so much cleaner in the second plot. Indeed, the long-read scaffolded assembly could show real variation/translocations.

The algorithm JupiterPlot uses to line up the scaffolds is essentially the same as a reference-guided scaffolder. As originally conceived, in an ideal case you wouldn't use JupiterPlot with the same reference you scaffolded against. However, if your goal is to show that the long reads helped reduce the bias of the reference-guided assembly, I think you could make that argument.

I would encourage looking into more de novo assembly methods in your assembly process, and additionally including reference-free metrics of assembly correctness (e.g. tools that use only the reads to measure misassemblies/inconsistencies).

@danilkotelnikov
Author

Yeah, we've already done the last thing you suggest, but we only needed a compact graphical depiction of the raw numerical results, for internal purposes, to show what those metrics mean to people of different specializations. All types of de novo assembly (with practically every short-read de novo assembler) were attempted as well, but none of them satisfied our goal except MaSuRCA (because PE data alone is insufficient for an adequate plant genome assembly, and we lack any kind of TGS machine in our lab to date), so reference-based scaffolding is simply the better option for, again, our purposes. Thanks a lot for your help and for the useful information and tips on our work!
