Dataset storage needs improvement #80

Closed
ardunn opened this issue Oct 1, 2018 · 8 comments

@ardunn
Contributor

ardunn commented Oct 1, 2018

Encoding and decoding structures and compositions to/from CSV causes problems with oxidation states and is also slow. Is anyone open to converting the matbench dataset format to JSON or pickle to make loading less of a hassle?
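To illustrate the kind of problem described above, here is a minimal sketch using only the standard library. `ToyComposition` is a hypothetical stand-in, not pymatgen's real class, and the data is made up. It only shows the general point that CSV stores every cell as text, so anything richer than a string has to be re-parsed on load.

```python
import csv
import io


class ToyComposition:
    """Hypothetical stand-in for a composition object (not pymatgen's class)."""

    def __init__(self, species):
        self.species = species  # e.g. {"Fe3+": 2, "O2-": 3}

    def __repr__(self):
        return f"ToyComposition({self.species})"


row = {"formula": ToyComposition({"Fe3+": 2, "O2-": 3}), "gap": 2.1}

# Write one row to an in-memory CSV; non-string values are converted to text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["formula", "gap"])
writer.writeheader()
writer.writerow(row)

# Read the row back: both fields are now plain strings.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
```

After the round trip, `loaded["formula"]` is a string rather than an object, so the oxidation states would have to be recovered by parsing that text.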

@utf
Member

utf commented Oct 1, 2018

I'd recommend using JSON over pickle here. Even though JSON will be slower, other people might not be able to unpickle the file if they are using a different architecture, e.g. Intel vs. ARM processors.

If the data load is really really slow, perhaps we could look into using HDF5?

@ardunn
Contributor Author

ardunn commented Oct 1, 2018

@utf have you ever run into "OverflowError: Maximum recursion level reached" when using pandas.DataFrame.to_json? When converting any df containing pmg structures to JSON, I get that error.

@utf
Member

utf commented Oct 1, 2018

Yep. I have a solution for it in the store_dataframe_as_json function in matminer: https://github.com/hackingmaterials/matminer/blob/538940afd4816e37333ae07811157328d79074a0/matminer/utils/io.py#L39

Might be easier to import those methods?

But essentially, you convert the dataframe to a dict and serialize it as JSON using the Monty encoder.
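A rough, stdlib-only sketch of that idea follows. It is not matminer's actual `store_dataframe_as_json` implementation: `AsDictEncoder` is a simplified stand-in for `monty.json.MontyEncoder`, `ToyStructure` and `revive` are hypothetical, and the "dataframe" is reduced to a plain dict of columns. The assumption is that each object exposes an `as_dict()` method, which is the convention the Monty encoder relies on as far as I understand it.

```python
import json


class AsDictEncoder(json.JSONEncoder):
    """Simplified stand-in for MontyEncoder (illustration only)."""

    def default(self, o):
        # Called only for objects json cannot serialize natively.
        if hasattr(o, "as_dict"):
            return o.as_dict()
        return super().default(o)


class ToyStructure:
    """Hypothetical stand-in for a pymatgen Structure."""

    def __init__(self, species, coords):
        self.species = species
        self.coords = coords

    def as_dict(self):
        return {"@class": "ToyStructure",
                "species": self.species,
                "coords": self.coords}

    @classmethod
    def from_dict(cls, d):
        return cls(d["species"], d["coords"])


# A dataframe converted to a plain dict of columns (made-up data).
table = {
    "structure": [ToyStructure(["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])],
    "gap": [5.0],
}

text = json.dumps(table, cls=AsDictEncoder)


def revive(d):
    # Rebuild objects from dicts tagged with "@class"; pass others through.
    if d.get("@class") == "ToyStructure":
        return ToyStructure.from_dict(d)
    return d


restored = json.loads(text, object_hook=revive)
```

With the real libraries, the usual counterpart would be `json.dumps(..., cls=MontyEncoder)` and `json.loads(..., cls=MontyDecoder)` from `monty.json`, though importing the matminer helpers linked above is probably simpler.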

@ardunn
Contributor Author

ardunn commented Oct 1, 2018

OK, cool. Does anyone have issues with eventually converting all the data over to JSON? @albalu @Qi-max

Also, I think we can eventually have all the data loaded seaborn-style, which is currently an open issue on matminer.

@Doppe1g4nger
Contributor

Something I'd like to do soon is get the dataset handling transferred over to the matminer style. Once the seaborn style handling code is implemented in matminer it should be as simple as importing the loader code and defining a dictionary of file metadata.

If the datasets are going to stay in the release package, it would also be nice to eventually get whatever our final format is stored on figshare so they don't all take up so much disk space. That is assuming they will be used for examples and are not part of the core package.
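The "loader code plus a dictionary of file metadata" idea could be sketched roughly as below. This is only an illustration under assumptions: `DATASET_METADATA`, `load_dataset`, and the `toy_band_gaps` dataset are hypothetical names with made-up data, not matminer's actual loader API, and the files are read from a local directory rather than fetched from figshare.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical registry: dataset name -> file metadata.
DATASET_METADATA = {
    "toy_band_gaps": {
        "file": "toy_band_gaps.json",
        "description": "Made-up demo data, not a real dataset",
    },
}


def load_dataset(name, data_dir):
    """Look up a dataset by name and load its JSON file from data_dir."""
    if name not in DATASET_METADATA:
        available = sorted(DATASET_METADATA)
        raise KeyError(f"Unknown dataset {name!r}; available: {available}")
    path = Path(data_dir) / DATASET_METADATA[name]["file"]
    with open(path) as f:
        return json.load(f)


# Demo: write a tiny JSON file to a temporary directory, then load it by name.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "toy_band_gaps.json"
    demo_file.write_text(json.dumps({"formula": ["NaCl"], "gap": [5.0]}))
    data = load_dataset("toy_band_gaps", tmp)
```

Adding a dataset then only means adding one entry to the dictionary; the loading code stays unchanged.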

@ardunn
Contributor Author

ardunn commented Oct 1, 2018

@Doppe1g4nger that seems like the best course of action. In fact, it would be good to have all these datasets shared between matminer and matbench if possible.

@ardunn ardunn changed the title from "Datasets stored as csv are problematic" to "Dataset storage needs improvement" Oct 2, 2018
@ardunn
Contributor Author

ardunn commented Oct 19, 2018

@Doppe1g4nger and @ADA110, feel free to close this when we are done migrating the data over.

@ardunn
Contributor Author

ardunn commented Nov 15, 2018

Closed thanks to @Doppe1g4nger and @ADA110. Good work, guys.

@ardunn ardunn closed this as completed Nov 15, 2018

4 participants