train/test split is inefficient #44
Comments
I think this is related to biocore/biom-format#808; until recently, the DataFrame produced by biom.Table.to_dataframe() was effectively dense. This caused some silly scaling problems with rankratioviz, also. For rankratioviz, what I ended up doing to address this problem was extracting the scipy.sparse.csr_matrix data from the BIOM table and using that to construct an actually-sparse DataFrame. Here's the code that does this: https://github.com/fedarko/rankratioviz/blob/88c7c6fc8ed19e25c560b549cec02f23f35c7330/rankratioviz/generate.py#L134 (it's only ~4 lines of work between creating the DF and then populating its IDs). This should work even on BIOM versions with the biocore/biom-format#808 bug. |
Yes! Having the train/test split applied only to the biom tables would be a huge plus. That's another item on the todo list... |
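The sparse-DataFrame approach described above can be sketched roughly as follows. This is not the rankratioviz code itself; the toy matrix and ID lists stand in for a real table's `matrix_data`, `ids("observation")`, and `ids("sample")`:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def biom_to_sparse_df(matrix, obs_ids, sample_ids):
    """Build a sparse DataFrame (observations x samples) directly from a
    scipy sparse matrix, avoiding the dense intermediate that
    Table.to_dataframe() used to produce."""
    return pd.DataFrame.sparse.from_spmatrix(
        csr_matrix(matrix), index=obs_ids, columns=sample_ids
    )

# Toy stand-in for table.matrix_data and the table's IDs.
mat = csr_matrix(np.array([[0, 3, 0], [1, 0, 0]]))
df = biom_to_sparse_df(mat, ["O1", "O2"], ["S1", "S2", "S3"])
print(df.sparse.density)  # only the 2 nonzero entries are stored
```

Because only nonzero values are materialized, memory stays proportional to the number of observed counts rather than to observations × samples.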
What about only performing the train/test split on the metadata, and then subsampling the biom tables accordingly? This is what is done in q2-sample-classifier, and it could be applied here. |
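A minimal sketch of that idea, assuming per-sample metadata indexed by sample ID: split the metadata first, then subset the table to each side's sample IDs. The DataFrame here stands in for a real biom.Table (where the equivalent step would be `table.filter(ids, axis="sample", inplace=False)`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical metadata: one row per sample.
metadata = pd.DataFrame(
    {"body_site": ["gut", "gut", "skin", "skin"]},
    index=["S1", "S2", "S3", "S4"],
)

# Split only the metadata...
train_md, test_md = train_test_split(metadata, test_size=0.5, random_state=42)

# ...then subsample the table by the resulting sample IDs.
# Stand-in for a biom.Table with samples as columns:
table = pd.DataFrame(
    [[1, 0, 2, 0], [0, 3, 0, 4]], columns=["S1", "S2", "S3", "S4"]
)
train_table = table[train_md.index]
test_table = table[test_md.index]
```

This keeps the split logic entirely in the (small) metadata, so the (large) count table is only ever indexed, never copied wholesale.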
@nbokulich That makes a lot of sense, and the refactored solution will ultimately require something like that. Getting this right will also help with scalability. I'm currently looking into interleaving this with PyTorch's DataLoader class; this will enable parallel data loading so that the GPUs stay saturated (it should speed things up by at least 2x). |
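One way the DataLoader idea could look (a sketch, not the rhapsody implementation): wrap the sparse count matrix in a Dataset that densifies one row at a time, so worker processes can prepare batches in parallel while the full table never exists as a dense array:

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix
from torch.utils.data import DataLoader, Dataset

class SparseCountDataset(Dataset):
    """Yield one sample's count vector at a time, densifying lazily."""

    def __init__(self, counts: csr_matrix):
        self.counts = counts.tocsr()  # fast row slicing

    def __len__(self):
        return self.counts.shape[0]

    def __getitem__(self, i):
        row = np.asarray(self.counts[i].todense()).ravel()
        return torch.from_numpy(row).float()

# Toy table: 6 samples x 6 features.
counts = csr_matrix(np.eye(6))
loader = DataLoader(SparseCountDataset(counts), batch_size=2)
batches = list(loader)
```

In production one would set `num_workers > 0` (and possibly `pin_memory=True`) so batch preparation overlaps with GPU compute.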
This function currently converts sparse biom tables to a dense representation:
https://github.com/biocore/rhapsody/blob/master/rhapsody/util.py#L111
This may be problematic when scaling these algorithms to very large datasets.
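To make the cost concrete, here is a rough back-of-the-envelope comparison on a synthetic table at 1% density (microbiome tables are typically even sparser); the dimensions are arbitrary:

```python
from scipy.sparse import random as sparse_random

# Synthetic 2000 x 500 table with 1% of entries nonzero.
sp = sparse_random(2000, 500, density=0.01, format="csr", random_state=0)

# Dense float64 storage vs. the CSR arrays actually held in memory.
dense_bytes = sp.shape[0] * sp.shape[1] * 8
sparse_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print(dense_bytes / sparse_bytes)  # dense is tens of times larger here
```

The gap widens as density drops, so for genuinely large, sparse datasets the dense conversion quickly becomes the memory bottleneck.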