
About gray lines... #51

Open
danilkotelnikov opened this issue Nov 11, 2024 · 4 comments

@danilkotelnikov

Greetings! I couldn't find any official contact information, e.g. a mailing list, so I'm writing my question here... First, thank you for developing such user-friendly software! We appreciate it!

Second... Our junior bioinformatics team has been attempting to assemble a large wild soybean genome with two different strategies, so JupiterPlot was exactly the tool we needed to assess them. Both genomes were assembled to contigs, reference-scaffolded, and then compared to the same reference; only the subsets of input data differ between them. I'd like to know how to interpret the gray lines in these plots: are they local misassemblies, or real genomic events/features? Can we say that the different data subsets (used in the different attempts) had a significant impact on the dramatic differences shown in the two plots? Is there any text-formatted output describing how those lines are produced, so we could analyze them statistically?

Thanks!

P.S. The parameters changed were g=10000 and ng=80.
JC_REF-REF_EXONT-QUER_RTCAFFS
JC_REF-REF_DEF-QUER_RTCAFFS

@JustinChu
Owner

At the most basic level, grey links denote the start and end of link bundles. Even the large regions have grey lines at the start and end points of contiguous regions (though they may be hard to see). In theory, the tiny grey links could have colour in them too, but they are so small that the colour is not rendered. The grey lines are always rendered for every region, even very small ones, according to the link settings.

Interpreting these links really depends on the link settings (maxGap and minBundleSize). If both are set the same, I think you can probably interpret the first plot as having more misassemblies or translocations on average (or more that are larger in size; decrease minBundleSize to check). It is hard to determine whether they are real biological differences or misassemblies, but seeing as the other plot is just as contiguous and isn't showing them, this could indicate that they are misassemblies.

In terms of text-formatted information to analyze links statistically, that's a little tougher. Mostly I'm not sure how informative it is as a metric, since links are both joined together and filtered based on upstream settings. However, if you really want to, you can parse the prefix.links.final file combined with prefix.seqOrder.txt to determine which scaffold is syntenic to which chromosome, and then count the number of links that match the syntenic scaffold and the number that do not. If you used something like QUAST, I'd expect its misassembly metrics to correlate.
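A rough sketch of that counting idea in Python. Note this is a hypothetical illustration, not a supported JupiterPlot utility: the exact column layouts of prefix.links.final and prefix.seqOrder.txt are assumptions (Circos-style links with "chrA startA endA chrB startB endB", and a whitespace-separated mapping with the scaffold in column 1 and its assigned chromosome in column 2), so check them against your own files and adjust the field indices.

```python
# Hypothetical example: count links that agree with each scaffold's syntenic
# chromosome vs. those that do not. File formats below are ASSUMPTIONS about
# JupiterPlot's output -- verify the column layout in your own prefix.* files.
from collections import Counter

def load_syntenic_map(seq_order_path):
    """Assumed format: scaffold ID in column 1, assigned chromosome in column 2."""
    syntenic = {}
    with open(seq_order_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:
                syntenic[fields[0]] = fields[1]
    return syntenic

def count_links(links_path, syntenic):
    """Assumed Circos-style link lines: chrA startA endA chrB startB endB [...].
    Adjust which columns hold the reference chromosome and the scaffold
    if your pipeline writes them in the opposite order."""
    counts = Counter()
    with open(links_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 6:
                continue
            ref_chrom, scaffold = fields[0], fields[3]
            if syntenic.get(scaffold) == ref_chrom:
                counts["syntenic"] += 1
            else:
                counts["non_syntenic"] += 1
    return counts
```

The ratio of non-syntenic to total links would then be a crude per-assembly summary to compare across the two plots, with the caveat above that upstream bundling/filtering settings shape what survives into the links file.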

Let me know if that makes sense.

@danilkotelnikov
Author

Thanks for responding, @JustinChu! And thanks for the tip; it now makes sense to me that those links could indeed be affected by mixing raw sequencing data. For context: the first plot depicts our genome built with our own PE data plus public ONT data (from the SRA), while the second used only our PE data. So the ONT+PE-built scaffolds may really be showing us some unique genomic events (even though we mixed genomic data from two different plant lines of the same species), or misassemblies, relative to the reference genome. The second plot doesn't show the same, because reference-based scaffolding there was performed only on PE-derived contigs of very short length (max approx. 150 kb), so that genome ends up very close to the reference. Thanks again for the help!

@JustinChu
Owner

JustinChu commented Nov 13, 2024

Interesting. So you have been doing reference-guided scaffolding. If that is the case, it makes sense why the results are so much cleaner in the second plot. Indeed, the long-read scaffolded assembly could show real variation/translocations.

The algorithm JupiterPlot uses to line up the scaffolds is essentially the same as a reference-guided scaffolder. As originally conceived, in an ideal case you wouldn't use JupiterPlot with the same reference you scaffolded against. However, if your goal is to show that the long reads helped reduce the bias of the reference-guided assembly, I think you could make that argument.

I would encourage looking into more de novo assembly methods in your assembly process, and additionally including reference-free metrics of assembly correctness (e.g. tools that use only the reads to measure misassemblies/inconsistencies).

@danilkotelnikov
Author

Yeah, we've already done the last thing you suggest, but we only needed a compact graphical depiction of the raw numerical results, for internal purposes, to show what those metrics mean to people of different specializations. All types of de novo assembly (with practically every short-read de novo assembler) were attempted as well, but none of them satisfied our goal except MaSuRCA (because PE data alone is insufficient for an adequate plant genome assembly, and we lack any kind of TGS machine in our lab to date), so reference-based scaffolding is simply the better option for, again, our purposes. Thanks a lot for your help and for the useful information and tips on our work!
