
train/test split is inefficient #44

Open
mortonjt opened this issue May 10, 2019 · 4 comments

@mortonjt (Collaborator)

This function currently converts sparse BIOM tables to a dense representation:
https://github.com/biocore/rhapsody/blob/master/rhapsody/util.py#L111

This may be problematic when scaling these algorithms to very large datasets.

@fedarko commented May 15, 2019

I think this is related to biocore/biom-format#808: until recently, the DataFrame produced by biom.Table.to_dataframe() was effectively dense. This caused some silly scaling problems with rankratioviz as well.

For rankratioviz, what I ended up doing to address this problem was extracting the scipy.sparse.csr_matrix data from the BIOM table and using it to construct an actually-sparse DataFrame. Here's the code that does this (it's only ~4 lines of work between creating the DataFrame and populating its IDs). This should work even on BIOM versions affected by the biocore/biom-format#808 bug.
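A minimal sketch of that approach (not fedarko's exact code): build a sparse-backed pandas DataFrame directly from a scipy.sparse matrix, so zeros are never materialized. The matrix and IDs here are toy stand-ins; with a real biom.Table you would pull the sparse matrix and IDs out of the table instead of constructing them by hand.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Stand-in for the sparse matrix a BIOM table holds internally.
mat = csr_matrix(np.array([[0, 0, 3],
                           [1, 0, 0],
                           [0, 2, 0]]))

# Construct an actually-sparse DataFrame (pandas >= 0.25) instead of
# densifying: only the 3 nonzero entries are stored.
df = pd.DataFrame.sparse.from_spmatrix(
    mat,
    index=["F1", "F2", "F3"],    # feature / observation IDs
    columns=["S1", "S2", "S3"],  # sample IDs
)

print(df.sparse.density)  # 3 nonzeros / 9 cells ≈ 0.33
```

The key point is that `df` keeps the csr_matrix's memory profile, unlike the effectively-dense frame that `to_dataframe()` used to return.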

@mortonjt (Collaborator, Author) commented May 15, 2019 via email

@nbokulich (Contributor)

What about performing the train/test split only on the metadata, and then subsampling the BIOM tables accordingly? This is what is done in q2-sample-classifier, and it could be applied here.
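A rough sketch of that idea (hypothetical sample IDs and metadata, not code from q2-sample-classifier): shuffle and split only the small list of sample IDs, then use those IDs to subset the feature table, which never has to be densified.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical per-sample metadata; one row per sample.
metadata = pd.DataFrame({"group": ["a", "b", "a", "b"]},
                        index=["S1", "S2", "S3", "S4"])

# Split only the (tiny) index of sample IDs, not the feature table.
ids = rng.permutation(metadata.index.to_numpy())
n_test = len(ids) // 2
test_ids = set(ids[:n_test])
train_ids = set(ids[n_test:])

# With a biom.Table the subsampling step would then look roughly like:
#   train_table = table.filter(train_ids, axis='sample', inplace=False)
# which keeps the table in its sparse representation throughout.
```

Because only IDs are shuffled, the cost of the split is independent of the number of features.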

@mortonjt (Collaborator, Author)

@nbokulich that makes a lot of sense, and the refactored solution will ultimately require something like that. Getting this right will also help with scalability.

I'm currently looking into interleaving this with PyTorch's DataLoader class: its parallel data loading should keep the GPUs saturated, which should speed things up by at least 2x.
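One way the DataLoader idea could fit together with a sparse table (a hypothetical sketch, not the planned rhapsody implementation): wrap the csr_matrix in a Dataset that densifies one sample at a time, so with `num_workers > 0` the per-row densification happens in worker processes while the GPU trains.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix
from torch.utils.data import DataLoader, Dataset


class SparseTableDataset(Dataset):
    """Hypothetical wrapper: keeps the full table sparse and only
    densifies one row (sample) per __getitem__ call."""

    def __init__(self, mat: csr_matrix):
        self.mat = mat.tocsr()

    def __len__(self):
        return self.mat.shape[0]

    def __getitem__(self, i):
        row = np.asarray(self.mat[i].todense()).ravel()
        return torch.from_numpy(row).float()


mat = csr_matrix(np.eye(4))  # toy 4-sample x 4-feature table
# In real use, num_workers > 0 would run __getitem__ in parallel
# worker processes to keep the GPU fed.
loader = DataLoader(SparseTableDataset(mat), batch_size=2, num_workers=0)
for batch in loader:
    print(batch.shape)  # torch.Size([2, 4])
```

The DataLoader also handles batching and shuffling for free, which dovetails with the metadata-only split suggested above.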

@mortonjt mortonjt mentioned this issue Aug 14, 2019