
Read/write large datasets in JupyterLab environment #1

Closed
nelsonni opened this issue Oct 1, 2020 · 1 comment

nelsonni commented Oct 1, 2020

Repository mining often requires trial and error to properly answer complex data-driven questions. When working with large datasets (e.g. large repositories with a long development history), pulling data down from the GitHub API can take time. Being able to cache or save these large datasets to and from files eliminates this delay and avoids hitting the GitHub API rate limit.

The naïve approach would be to use the standard Python 3 built-in input/output functions and the data encoding/decoding capabilities of the json library. For example:

import json

# write the data to a file so we don't have to query the GitHub API again
with open("pulls.json", "w") as f:
    f.write(json.dumps(pulls))

# if the data has already been extracted, simply load it back into the environment
with open("pulls.json", "r") as f:
    pulls = json.loads(f.read())

However, for larger datasets this solution will throw the following errors if executed in a JupyterLab notebook environment:

IOStream.flush timed out
[W 13:37:54.357 LabApp] IOPub data rate exceeded.
    The notebook server will temporarily stop sending output
    to the client in order to avoid crashing it.
    To change this limit, set the config variable
    `--NotebookApp.iopub_data_rate_limit`.
    
    Current values:
    NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
    NotebookApp.rate_limit_window=3.0 (secs)
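
As the warning itself suggests, one stopgap (not the fix ultimately adopted below) is to raise the IOPub data rate limit in jupyter_notebook_config.py or via the equivalent command-line flag. A minimal sketch, assuming the default config file is in use:

# jupyter_notebook_config.py
# raise the IOPub data rate limit from the 1 MB/s default (value in bytes/sec)
c.NotebookApp.iopub_data_rate_limit = 1.0e10

This only raises the ceiling on notebook output, though; it does not reduce the amount of data being pushed through the kernel.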
@nelsonni nelsonni added the bug Something isn't working label Oct 1, 2020
@nelsonni nelsonni self-assigned this Oct 1, 2020

nelsonni commented Oct 2, 2020

Resolved in c5eadb5 by adding readData(path) and writeData(path, data) methods that use bigjson to handle I/O issues with large JSON files.
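
The commit itself isn't reproduced in this thread, but based on that description the helpers might look roughly like the sketch below. The method names come from the comment above; everything else (streaming writes via json.dump, lazy reads via bigjson.load) is an assumption, not necessarily the implementation in c5eadb5:

import json
import bigjson

def writeData(path, data):
    # stream the encoded JSON directly to disk instead of building one huge string in memory
    with open(path, "w") as f:
        json.dump(data, f)

def readData(path):
    # bigjson maps the file lazily, so the whole document is not parsed into memory at once;
    # to_python() then materializes it into plain dicts/lists (assumed conversion step)
    with open(path, "rb") as f:
        return bigjson.load(f).to_python()

Writing with json.dump(data, f) avoids the intermediate string that json.dumps produces, which is usually the first thing to blow up on very large pull-request datasets.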

@nelsonni nelsonni closed this as completed Oct 2, 2020