updated usage example #39

taylorreiter · 2023-10-26T14:23:18Z

I tried to follow the usage example outlined at https://dib-lab.github.io/kSpider/, but the instructions no longer work. Specifically, the indexing step seems to have changed from kSpider index_kmers... to kSpider index, with many of the arguments in the example no longer options to the new command. Would you be willing to provide updated instructions for how to cluster with kSpider? My use case is clustering isoforms in a de novo transcriptome when we have no knowledge of which genes each isoform/contig encodes. All of my transcripts are in a single FASTA file and I would like to predict which encode the same isoforms by clustering.

taylorreiter · 2023-11-17T14:58:05Z

Hi! I just wanted to check in again and see if this is something you could provide guidance on. Due to changes in the CLI, I wasn't able to figure out how to run kSpider to cluster transcripts by shared k-mers. I'm super excited by the initial results you shared and would love to try it on more data.

mr-eyes · 2023-11-21T21:13:03Z

Hi, thanks for your interest in kSpider! I appreciate your patience on this issue, and I will update it as soon as possible with clear instructions on how to use the dev branch of this project to cluster your sequences. As the project's emphasis has recently shifted towards clustering large datasets rather than individual sequences, resolving this might take some development time. Thank you!

taylorreiter · 2023-11-28T15:52:51Z

Thanks for the update @mr-eyes! I look forward to trying it out when you have the time to put it together!

taylorreiter · 2024-03-07T00:12:49Z

Hi Mo, I just wanted to circle back on this and see if you will have any bandwidth for this issue soon. I built another transcriptome today and experienced simultaneous deep loathing for how I'm clustering isoforms currently and excitement that I could potentially use this approach. I know you're busy, but I just wanted to check in.

If you don't have bandwidth any time soon, do you think it is possible to do this type of clustering with sourmash (either the CLI or the Python API)? I was very excited by the results you shared with me in Slack as a proof of concept that this type of approach works.

mr-eyes · 2024-03-07T01:13:26Z

Hi @taylorreiter
I am very sorry for not following up on the issue; I am swamped with my Ph.D.

I believe Branchwater can tackle this with its efficient brute-force approach at a scale of 1. The pairwise command is implemented in branchwater here: sourmash-bio/sourmash_plugin_branchwater#181
You will need to create one signature for every protein/isoform as input, and branchwater can work on them efficiently. For downstream clustering analysis, you can also use sourmash-bio/sourmash_plugin_branchwater#198 or sourmash-bio/sourmash_plugin_branchwater#234
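To make the brute-force idea concrete, here is a minimal pure-Python sketch of an all-by-all k-mer containment comparison that keeps every k-mer (no subsampling, which is what a scale of 1 amounts to). It is not how branchwater is implemented: it ignores hashing and reverse complements, and the function names, toy sequences, and k=4 are assumptions chosen only for illustration.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-mers in seq (every k-mer kept, no subsampling)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Fraction of the k-mers in set a that are also present in set b."""
    return len(a & b) / len(a) if a else 0.0


def pairwise(seqs, k):
    """Brute-force all-by-all comparison; yields only pairs that share k-mers."""
    sets = {name: kmers(s, k) for name, s in seqs.items()}
    for q, m in combinations(sorted(sets), 2):
        if sets[q] & sets[m]:
            c_qm = containment(sets[q], sets[m])
            c_mq = containment(sets[m], sets[q])
            # Report (query, match, containment, max_containment).
            yield q, m, c_qm, max(c_qm, c_mq)


# Hypothetical toy transcripts; k=4 only keeps the example small.
toy = {
    "tx1": "ACGTACGGTCA",
    "tx2": "ACGTACGGTCATTT",
    "tx3": "GGGGCCCCAAAA",
}
rows = list(pairwise(toy, k=4))
```

In this toy input, tx3 shares no k-mers with the other two sequences, so it never appears in `rows`.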

Please let me know if you have further questions.

mr-eyes · 2024-03-07T01:29:31Z

@ctb @bluegenes Do you have better approaches than having a signature for every sequence?

bluegenes · 2024-03-07T01:46:41Z

No (not yet), but the next branchwater plugin release will be able to sketch singletons to make sketching faster, at least

bluegenes · 2024-03-07T01:59:10Z

@taylorreiter I/we would love your feedback if you try out pairwise --> cluster in the branchwater plugin, though.

A couple tips:

- Use the pairwise --write-all option to make sure you keep all sketches; otherwise pairwise will not write entries for sketches with no similarity to any other sketch (meaning we lose access to sketches that would end up as singleton "clusters"). Note that this only matters if you care about having all input sketches represented in your cluster output.
- Use pairwise --ani if you want to cluster by ANI instead of containment/jaccard/etc.

mr-eyes · 2024-03-07T01:59:26Z

> No (not yet), but the next branchwater plugin release will be able to sketch singletons to make sketching faster, at least

@taylorreiter, then you can write a signature for every sequence through the Sourmash API, and then Branchwater can handle it from here.
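A rough sketch of what "a signature for every sequence" could look like with the sourmash Python API, assuming sourmash 4.x is installed. The helper names `read_fasta` and `sketch_records` are hypothetical, and ksize=31 is an assumed value rather than a recommendation from this thread. Writing the signatures to disk is left to sourmash's own save helpers, which should be checked against the installed version.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) for each record in a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sketch_records(records, ksize=31, scaled=1):
    """Build one sourmash signature per record (requires sourmash)."""
    import sourmash  # imported here so read_fasta works without sourmash

    sigs = []
    for name, seq in records:
        # n=0 with scaled set gives a scaled sketch; scaled=1 keeps all hashes.
        mh = sourmash.MinHash(n=0, ksize=ksize, scaled=scaled)
        # force=True skips k-mers with non-ACGT bases instead of raising.
        mh.add_sequence(seq, force=True)
        sigs.append(sourmash.SourmashSignature(mh, name=name))
    return sigs


# Hypothetical two-record FASTA, held in memory for the example.
fasta = io.StringIO(">tx1 some description\nACGT\nACGG\n>tx2\nTTTT\n")
records = list(read_fasta(fasta))
```

With a real transcriptome, `read_fasta(open("transcripts.fa"))` (file name hypothetical) would feed `sketch_records` directly.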

mr-eyes · 2024-03-07T02:01:25Z

> @taylorreiter I/we would love your feedback if you try out pairwise --> cluster in the branchwater plugin, though.
>
> A couple tips:
>
> use the pairwise --write-all option to make sure you keep all sketches, otherwise pairwise will not write entries with no similarity with other sketches (meaning we lose access to sketches that would end up as singleton "clusters").
>
> use pairwise --ani if you want to cluster by ANI instead of containment/jaccard/etc

@bluegenes I didn't understand why you need to keep edgeless nodes. Could you please elaborate on why this could be a problem in sparse comparisons?

mr-eyes · 2024-03-07T02:04:28Z

Oh, I got it; you mean single-node clusters.

bluegenes · 2024-03-07T02:05:30Z

@mr-eyes it's really just a question of your output expectation. I wanted all of my original sequences to be represented in the "clusters" output, whether they are singletons or part of a larger cluster. If we just use the original pairwise output, sketches with no similarity to any other sketch are not represented in the output file.

Then, when we run cluster on this output file, I keep the 'singletons', i.e., sketches that do show up in the pairwise output but whose similarity did not pass the threshold for connecting via an edge. The cluster output therefore contains those singletons, but not the singletons that were missing from the pairwise output.
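The gap described here can be checked with a few lines of Python. This is only an illustrative sketch: the function name and the toy sketch names are hypothetical, and the pairs stand in for the (query, match) columns of a pairwise output file.

```python
def missing_from_pairwise(all_names, pairs):
    """Return input sketch names that never appear in any pairwise row."""
    seen = set()
    for query, match in pairs:
        seen.update((query, match))
    return sorted(set(all_names) - seen)


# Hypothetical sketch names and pairwise rows given as (query, match).
all_names = ["tx1", "tx2", "tx3", "tx4"]
pairs = [("tx1", "tx2"), ("tx2", "tx3")]
lost = missing_from_pairwise(all_names, pairs)
```

Any name returned by `missing_from_pairwise` is a sequence that cannot show up in the cluster output, not even as a singleton, unless the pairwise step also wrote an entry for it.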

bluegenes · 2024-03-07T02:05:49Z

> Oh, I got it; you mean single-node clusters.

yep!

taylorreiter · 2024-03-07T23:32:41Z

Thank you both! I think I should be able to try this approach out next week! I think Mo clustered with kSpider using a threshold of 16 (if I'm interpreting some graphs he shared with me correctly), and I'm assuming the threshold here means containment of 16% of k-mers. Is this something that sourmash can do now, cluster by containment threshold? (I'll read the docs next week, but I wasn't aware that sourmash had that functionality!)

mr-eyes · 2024-03-07T23:38:23Z

@taylorreiter Yes, branchwater can do that. The columns you will get in the output include containment and max_containment. Once you have the pairwise connections (edges), you can use the cluster command to get the clusters through the weakly_connected_components algorithm or the script I shared in this PR to get a distance matrix.

However, I also recommend exploring the community detection ideas in sourmash-bio/sourmash_plugin_branchwater#252
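For intuition, thresholding edges and then taking connected components can be sketched in plain Python with a union-find. This is not the plugin's cluster command. The containment and max_containment columns are the ones mentioned above, while the query_name and match_name column names, the toy rows (including the row pairing a sketch with itself), and the 0.16 cutoff (taken from the 16% guess earlier in the thread) are assumptions.

```python
import csv
import io


def cluster(rows, column="containment", threshold=0.16):
    """Group names connected by edges whose `column` value >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for row in rows:
        # find() registers both names, so below-threshold rows become singletons.
        q, m = find(row["query_name"]), find(row["match_name"])
        if float(row[column]) >= threshold and q != m:
            parent[q] = m  # merge the two components

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairwise output, held in memory for the example.
pairwise_csv = io.StringIO(
    "query_name,match_name,containment,max_containment\n"
    "tx1,tx2,0.90,0.95\n"
    "tx2,tx3,0.20,0.40\n"
    "tx4,tx5,0.05,0.10\n"
    "tx6,tx6,1.0,1.0\n"
)
clusters = cluster(csv.DictReader(pairwise_csv))
```

Because edge direction is ignored, the groups are the weakly connected components of the thresholded graph. Passing `column="max_containment"` switches the similarity measure without changing the logic.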

mr-eyes · 2024-03-08T20:34:42Z

Possibly a related issue: sourmash-bio/sourmash#2816
