
Read/write large datasets in JupyterLab environment #1

Closed
nelsonni opened this issue Oct 1, 2020 · 1 comment

nelsonni commented Oct 1, 2020

Repository mining often requires trial and error to properly answer complex data-driven questions. When working with large datasets (e.g. large repositories with a long development history), pulling data down from the GitHub API can take time. Being able to cache or save these large datasets to and from files eliminates this delay and avoids hitting the GitHub API rate limit.

The naïve approach would be to use the standard Python 3 built-in input/output functions and the data encoding/decoding capabilities of the json library. For example:

import json

# write the data to a file so we don't have to query the GitHub API again
with open("pulls.json", "w") as f:
    f.write(json.dumps(pulls))

# if the data has already been extracted, simply load it back into the environment
with open("pulls.json", "r") as f:
    pulls = json.loads(f.read())

However, for larger datasets this solution will throw the following errors if executed in a JupyterLab notebook environment:

IOStream.flush timed out
[W 13:37:54.357 LabApp] IOPub data rate exceeded.
    The notebook server will temporarily stop sending output
    to the client in order to avoid crashing it.
    To change this limit, set the config variable
    `--NotebookApp.iopub_data_rate_limit`.
    
    Current values:
    NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
    NotebookApp.rate_limit_window=3.0 (secs)
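
As the warning itself suggests, one stopgap (not the fix ultimately adopted below) is to raise the IOPub data rate limit in jupyter_notebook_config.py or via the equivalent command-line flag. A minimal sketch, assuming the default config file is in use:

# jupyter_notebook_config.py
# raise the IOPub data rate limit from the 1 MB/s default (value in bytes/sec)
c.NotebookApp.iopub_data_rate_limit = 1.0e10

This only raises the ceiling on notebook output, though; it does not reduce the amount of data being pushed through the kernel.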
@nelsonni nelsonni added the bug Something isn't working label Oct 1, 2020
@nelsonni nelsonni self-assigned this Oct 1, 2020

nelsonni commented Oct 2, 2020

Resolved in c5eadb5 by adding readData(path) and writeData(path, data) methods that use bigjson to handle I/O issues with large JSON files.
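
The commit itself isn't reproduced in this thread, but based on that description the helpers might look roughly like the sketch below. The method names come from the comment above; everything else (streaming writes via json.dump, lazy reads via bigjson.load) is an assumption, not necessarily the implementation in c5eadb5:

import json
import bigjson

def writeData(path, data):
    # stream the encoded JSON directly to disk instead of building one huge string in memory
    with open(path, "w") as f:
        json.dump(data, f)

def readData(path):
    # bigjson maps the file lazily, so the whole document is not parsed into memory at once;
    # to_python() then materializes it into plain dicts/lists (assumed conversion step)
    with open(path, "rb") as f:
        return bigjson.load(f).to_python()

Writing with json.dump(data, f) avoids the intermediate string that json.dumps produces, which is usually the first thing to blow up on very large pull-request datasets.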

@nelsonni nelsonni closed this as completed Oct 2, 2020