updated usage example #39

Open
taylorreiter opened this issue Oct 26, 2023 · 16 comments

Comments

@taylorreiter
Member

I tried to follow the usage example outlined at https://dib-lab.github.io/kSpider/, but the instructions no longer work. Specifically, the indexing step seems to have changed from kSpider index_kmers... to kSpider index, and many of the arguments in the example are no longer options for the new command. Would you be willing to provide updated instructions for how to cluster with kSpider? My use case is clustering isoforms in a de novo transcriptome when we have no knowledge of which genes each isoform/contig encodes. All of my transcripts are in a single FASTA file, and I would like to predict which encode the same isoforms by clustering.

@taylorreiter
Member Author

Hi! I just wanted to check in again and see if this is something you could provide guidance on. Due to changes in the CLI, I wasn't able to figure out how to run kSpider to cluster transcripts by shared k-mers. I'm super excited by the initial results you shared and would love to try it on more data.

@mr-eyes
Member

mr-eyes commented Nov 21, 2023

Hi, thanks for your interest in kSpider! I appreciate your patience on this issue, and I will update it as soon as possible with clear instructions on how to use the dev branch of this project to cluster your sequences. As the project's emphasis has recently shifted towards clustering large datasets rather than individual sequences, resolving the issue might involve some development time. Thank you!

@taylorreiter
Member Author

Thanks for the update @mr-eyes! I look forward to trying it out when you have the time to put it together!

@taylorreiter
Member Author

Hi Mo, I just wanted to circle back around on this and see if you have any bandwidth for this issue any time soon. I built another transcriptome today and experienced simultaneous deep loathing for how I'm clustering isoforms currently and excitement that I could potentially use this approach. I know you're busy, but I just wanted to check in.

If you don't have bandwidth any time soon, do you think it is possible to do this type of clustering with sourmash (either the CLI or the Python API)? I was very excited by the results you shared with me in Slack as a proof of concept that this type of approach works.

@mr-eyes
Member

mr-eyes commented Mar 7, 2024

Hi @taylorreiter
I am very sorry for not following up on the issue; I am swamped with my Ph.D.

I believe Branchwater can tackle this with its efficient brute force approach with a scale of 1. The pairwise command is implemented in branchwater here sourmash-bio/sourmash_plugin_branchwater#181
You will need to create a signature for every protein/isoform to use as input, and branchwater can work on them efficiently. You can also use sourmash-bio/sourmash_plugin_branchwater#198 or sourmash-bio/sourmash_plugin_branchwater#234 for downstream clustering analysis.

Please let me know if you have further questions.
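
As a rough illustration of what "a signature for every sequence" at a scale of 1 amounts to, the sketch below builds one k-mer set per FASTA record. It is plain Python, not the sourmash API: sourmash stores hashes of canonical k-mers, whereas this keeps raw k-mer strings. The helper names (read_fasta, kmer_set, sketch_each_record) and the default k=21 are assumptions made up for this example.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token as the record name.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def kmer_set(seq: str, k: int = 21) -> Set[str]:
    """Return every distinct k-mer in seq, with no downsampling."""
    # Sequences shorter than k give an empty range and so an empty set.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch_each_record(lines: Iterable[str], k: int = 21) -> Dict[str, Set[str]]:
    """Build one k-mer set per FASTA record, keyed by record name."""
    return {name: kmer_set(seq, k) for name, seq in read_fasta(lines)}


# Toy input: two records, the second split across two lines.
toy_fasta = [">tx1 isoform A", "ACGTACGTAC", ">tx2", "ACGTAC", "GTACGG"]
toy_sketches = sketch_each_record(toy_fasta, k=4)
```

With a real transcriptome, the list of lines would be replaced by an open file handle; keeping every k-mer is what makes the later all-by-all comparison a brute-force one.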

@mr-eyes
Member

mr-eyes commented Mar 7, 2024

@ctb @bluegenes Do you have better approaches instead of having a signature for every sequence?

@bluegenes
Member

bluegenes commented Mar 7, 2024

No (not yet), but the next branchwater plugin release will be able to sketch singletons to make sketching faster, at least

@bluegenes
Member

bluegenes commented Mar 7, 2024

@taylorreiter I/we would love your feedback if you try out pairwise --> cluster in the branchwater plugin, though.

A couple tips:

  • use the pairwise --write-all option to make sure you keep all sketches, otherwise pairwise will not write entries with no similarity with other sketches (meaning we lose access to sketches that would end up as singleton "clusters"). Note, this only matters if you care about having all input sketches represented in your cluster output.
  • use pairwise --ani if you want to cluster by ANI instead of containment/jaccard/etc
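
For intuition about the --ani tip: a containment value can be turned into an ANI point estimate by assuming that each base mutates independently, so a k-mer survives unchanged with probability ANI^k. Inverting that gives ANI ≈ containment^(1/k). The function below is a minimal sketch of that formula only; the name containment_to_ani is hypothetical, and whether the plugin computes exactly this is an assumption.

```python
def containment_to_ani(containment: float, ksize: int) -> float:
    """Point estimate of ANI from k-mer containment: containment ** (1 / ksize).

    Assumes independent per-base mutations; illustrative only.
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)
```

One consequence worth noting is that modest containment values map to high ANI values at typical k-mer sizes, so a threshold chosen for containment does not carry over unchanged to ANI.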

@mr-eyes
Member

mr-eyes commented Mar 7, 2024

> No (not yet), but the next branchwater plugin release will be able to sketch singletons to make sketching faster, at least

@taylorreiter, then you can write a signature for every sequence through the Sourmash API, and then Branchwater can handle it from here.

@mr-eyes
Member

mr-eyes commented Mar 7, 2024

> @taylorreiter I/we would love your feedback if you try out pairwise --> cluster in the branchwater plugin, though.
>
> A couple tips:
>
>   • use the pairwise --write-all option to make sure you keep all sketches, otherwise pairwise will not write entries with no similarity with other sketches (meaning we lose access to sketches that would end up as singleton "clusters").
>   • use pairwise --ani if you want to cluster by ANI instead of containment/jaccard/etc

@bluegenes I didn't understand why you need to keep edgeless nodes. Could you please elaborate on why this could be a problem in sparse comparisons?

@mr-eyes
Member

mr-eyes commented Mar 7, 2024

Oh, I got it; you mean single-node clusters.

@bluegenes
Member

bluegenes commented Mar 7, 2024

@mr-eyes it's really just a question of your output expectation. I was wanting all of my original sequences to be represented in the "clusters" output, whether they are singletons or part of a larger cluster. If we are just using the original pairwise output, sketches with no similarity to any other sketch are not represented in the output file.

Then, when we run cluster on this output file, I keep the 'singletons': sketches that show up in the pairwise output but whose similarity did not pass the threshold for connecting via an edge. So the cluster output includes those singletons, but not the sketches that were missing from the pairwise output entirely.
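
The distinction between the two kinds of singletons can be sketched in a few lines of plain Python. This is a toy model of the behaviour described above, not the plugin's code: the function name classify_sketches, the (name_a, name_b, similarity) row layout, and the 0.16 threshold are all assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

PairwiseRow = Tuple[str, str, float]  # (name_a, name_b, similarity)


def classify_sketches(
    all_names: Iterable[str],
    pairwise_rows: Iterable[PairwiseRow],
    threshold: float,
) -> Dict[str, Set[str]]:
    """Split sketches by how they would fare in pairwise-then-cluster output."""
    seen: Set[str] = set()       # names that appear in any pairwise row
    connected: Set[str] = set()  # names with at least one edge >= threshold
    for name_a, name_b, similarity in pairwise_rows:
        seen.update((name_a, name_b))
        if similarity >= threshold:
            connected.update((name_a, name_b))
    return {
        # Joined to at least one other sketch by an edge.
        "clustered": connected,
        # Present in the pairwise rows, but only below threshold.
        "singletons_kept": seen - connected,
        # Never written by pairwise, so invisible to clustering.
        "singletons_lost": set(all_names) - seen,
    }


# Toy data: t4 shares nothing with any other sketch, so it has no row.
toy_rows = [("t1", "t2", 0.90), ("t2", "t3", 0.05)]
toy_groups = classify_sketches(["t1", "t2", "t3", "t4"], toy_rows, threshold=0.16)
```

In this toy model, writing a row for every input sketch (the effect attributed to --write-all above) would move t4 from the lost group into the kept singletons.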

@bluegenes
Member

> Oh, I got it; you mean single-node clusters.

yep!

@taylorreiter
Member Author

Thank you both! I think I should be able to try this approach out next week! If I'm interpreting the graphs Mo shared with me correctly, he clustered with kSpider using a threshold of 16, and I'm assuming the threshold here means containment of 16% of k-mers. Is this something that sourmash can do now: cluster by containment threshold? (I'll read the docs next week, but I wasn't aware that sourmash had that functionality!)

@mr-eyes
Member

mr-eyes commented Mar 7, 2024

@taylorreiter Yes, branchwater can do that. The columns you will get in the output include containment and max_containment. Once you have the pairwise connections (edges), you can use the cluster command to get the clusters through the weakly_connected_components algorithm or the script I shared in this PR to get a distance matrix.
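
To make the containment, then edges, then connected components flow concrete, here is a small self-contained sketch over plain Python sets. It is not the plugin's implementation: the function names are hypothetical, the 0.16 threshold simply mirrors the "16%" guess earlier in the thread, and choosing max_containment as the edge weight is an assumption. Because the edges here are undirected, weakly connected components reduce to ordinary connected components, which a union-find structure handles.

```python
from itertools import combinations
from typing import Dict, List, Set


def containment(a: Set[str], b: Set[str]) -> float:
    """Fraction of a's k-mers that are also present in b."""
    return len(a & b) / len(a) if a else 0.0


def max_containment(a: Set[str], b: Set[str]) -> float:
    """Larger of the two directional containments."""
    return max(containment(a, b), containment(b, a))


def cluster_by_containment(
    sketches: Dict[str, Set[str]], threshold: float
) -> List[Set[str]]:
    """Group sketch names into connected components of the thresholded graph."""
    parent = {name: name for name in sketches}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Brute-force all-by-all comparison; an edge merges two components.
    for a, b in combinations(sketches, 2):
        if max_containment(sketches[a], sketches[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for name in sketches:
        groups.setdefault(find(name), set()).add(name)
    # Sort for a deterministic order; singletons are kept as 1-member sets.
    return sorted(groups.values(), key=lambda members: sorted(members))


toy = {"tx1": {"AAA", "AAC", "ACG"}, "tx2": {"AAA", "AAC"}, "tx3": {"TTT"}}
toy_clusters = cluster_by_containment(toy, threshold=0.16)
```

Since every input name starts as its own component, sketches with no edges come out as single-member clusters here, which is the outcome discussed above for singletons.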

However, I also recommend exploring the community detection ideas here sourmash-bio/sourmash_plugin_branchwater#252

@mr-eyes
Member

mr-eyes commented Mar 8, 2024

Maybe a related issue sourmash-bio/sourmash#2816

3 participants