Long run time #4

Open
DKaukonen opened this issue Mar 19, 2024 · 1 comment

DKaukonen commented Mar 19, 2024

Hi,

This is a nice package and the documentation is helpful. I am having one issue when running the following command:

consensus_cluster(data, k_max=15, n_reps=100, p_sample=0.8, p_feature=0.8)

My data comes from 8 samples totaling 36,260 cells and 3,664 genes, and it is scaled. When I run that command, the estimated run time is 5 days. I have 256 GB of memory and 64 cores. Is there a way to run this command in parallel? I need to check up to 90 clusters, so at 5 days per 15 clusters this will take a long time.

Also, is it normal for the first 15 clusters to take 5 days?

Thank you,
-Damien

@AndiMunteanu
Contributor

Hello, Damien!

Thank you for your question, and sorry for the late reply!
Unfortunately, I have limited availability to work on the performance of the PAC component, and I do not expect improvements to this part of the package in the near future.

However, as suggested here, you can try downsampling your dataset with a method such as geosketch, infer the appropriate number of clusters on the subsample, and then use that number to cluster your entire dataset.
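
The shape of that workflow can be sketched roughly as below. This is only an illustration in plain Python, not code from this package: `subsample_indices` uses uniform random sampling as a simple stand-in for geosketch, and `select_k` and `cluster` are hypothetical callables that you would replace with the consensus-based selection of the number of clusters and with your usual clustering step.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Row = TypeVar("Row")


def subsample_indices(n_rows: int, n_keep: int, seed: int = 0) -> List[int]:
    """Pick n_keep distinct row indices uniformly at random.

    Assumption: uniform sampling stands in for geosketch here; a
    geometry-aware sketch would preserve rare populations better.
    """
    if not 0 < n_keep <= n_rows:
        raise ValueError("n_keep must be between 1 and n_rows")
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), n_keep))


def cluster_via_subsample(
    data: Sequence[Row],
    n_keep: int,
    select_k: Callable[[Sequence[Row]], int],          # hypothetical: returns the chosen k
    cluster: Callable[[Sequence[Row], int], List[int]],  # hypothetical: returns one label per row
    seed: int = 0,
) -> List[int]:
    """Choose the number of clusters on a subsample, then cluster every row."""
    keep = subsample_indices(len(data), n_keep, seed)
    subsample = [data[i] for i in keep]
    k = select_k(subsample)   # the expensive search runs on the subset only
    return cluster(data, k)   # a single clustering of the full dataset
```

For example, `subsample_indices(36260, 5000)` would select 5,000 of the 36,260 cells; the 5,000 is an arbitrary illustrative size, and the right value depends on how rare the smallest population of interest is.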

Hopefully this helps.

Andi
