
train/test split is inefficient #44

Open
mortonjt opened this issue May 10, 2019 · 4 comments

@mortonjt (Collaborator)

This function currently converts sparse BIOM tables to a dense representation:
https://github.com/biocore/rhapsody/blob/master/rhapsody/util.py#L111

This may be problematic when scaling these algorithms to very large datasets.

@fedarko commented May 15, 2019

I think this is related to biocore/biom-format#808: until recently, the DataFrame produced by biom.Table.to_dataframe() was effectively dense. This caused some silly scaling problems with rankratioviz as well.

For rankratioviz, what I ended up doing to address this problem was extracting the scipy.sparse.csr_matrix data from the BIOM table and using it to construct an actually-sparse DataFrame. Here's the code that does this (it's only ~4 lines of work between creating the DataFrame and populating its IDs). This should work even on BIOM versions affected by the biocore/biom-format#808 bug.
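A minimal sketch of that approach (not fedarko's exact code): build a sparse-backed pandas DataFrame directly from a scipy.sparse matrix, so zeros are never materialized. The matrix and IDs here are toy stand-ins; with a real biom.Table you would pull the sparse matrix and IDs out of the table instead of constructing them by hand.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Stand-in for the sparse matrix a BIOM table holds internally.
mat = csr_matrix(np.array([[0, 0, 3],
                           [1, 0, 0],
                           [0, 2, 0]]))

# Construct an actually-sparse DataFrame (pandas >= 0.25) instead of
# densifying: only the 3 nonzero entries are stored.
df = pd.DataFrame.sparse.from_spmatrix(
    mat,
    index=["F1", "F2", "F3"],    # feature / observation IDs
    columns=["S1", "S2", "S3"],  # sample IDs
)

print(df.sparse.density)  # 3 nonzeros / 9 cells ≈ 0.33
```

The key point is that `df` keeps the csr_matrix's memory profile, unlike the effectively-dense frame that `to_dataframe()` used to return.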

@mortonjt (Collaborator, Author) commented May 15, 2019 via email

@nbokulich (Contributor)

What about performing the train/test split only on the metadata, and then subsampling the BIOM tables accordingly? This is what is done in q2-sample-classifier, and it could be applied here.
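A rough sketch of that idea (hypothetical sample IDs and metadata, not code from q2-sample-classifier): shuffle and split only the small list of sample IDs, then use those IDs to subset the feature table, which never has to be densified.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical per-sample metadata; one row per sample.
metadata = pd.DataFrame({"group": ["a", "b", "a", "b"]},
                        index=["S1", "S2", "S3", "S4"])

# Split only the (tiny) index of sample IDs, not the feature table.
ids = rng.permutation(metadata.index.to_numpy())
n_test = len(ids) // 2
test_ids = set(ids[:n_test])
train_ids = set(ids[n_test:])

# With a biom.Table the subsampling step would then look roughly like:
#   train_table = table.filter(train_ids, axis='sample', inplace=False)
# which keeps the table in its sparse representation throughout.
```

Because only IDs are shuffled, the cost of the split is independent of the number of features.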

@mortonjt (Collaborator, Author)

@nbokulich that makes a lot of sense, and the refactored solution will ultimately require something like that. Getting this right will also help with scalability.

I'm currently looking into interleaving this with PyTorch's DataLoader class: its parallel data loading should keep the GPUs saturated, which should speed things up by at least 2x.
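One way the DataLoader idea could fit together with a sparse table (a hypothetical sketch, not the planned rhapsody implementation): wrap the csr_matrix in a Dataset that densifies one sample at a time, so with `num_workers > 0` the per-row densification happens in worker processes while the GPU trains.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix
from torch.utils.data import DataLoader, Dataset


class SparseTableDataset(Dataset):
    """Hypothetical wrapper: keeps the full table sparse and only
    densifies one row (sample) per __getitem__ call."""

    def __init__(self, mat: csr_matrix):
        self.mat = mat.tocsr()

    def __len__(self):
        return self.mat.shape[0]

    def __getitem__(self, i):
        row = np.asarray(self.mat[i].todense()).ravel()
        return torch.from_numpy(row).float()


mat = csr_matrix(np.eye(4))  # toy 4-sample x 4-feature table
# In real use, num_workers > 0 would run __getitem__ in parallel
# worker processes to keep the GPU fed.
loader = DataLoader(SparseTableDataset(mat), batch_size=2, num_workers=0)
for batch in loader:
    print(batch.shape)  # torch.Size([2, 4])
```

The DataLoader also handles batching and shuffling for free, which dovetails with the metadata-only split suggested above.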

@mortonjt mortonjt mentioned this issue Aug 14, 2019