Repository mining often requires trial and error to properly answer complex data-driven questions. When working with large datasets (e.g. large repositories with a long development history), pulling data down from the GitHub API can take time. Being able to cache these large datasets to and from files removes that delay and avoids any risk of hitting the GitHub rate limit.
The naïve approach would be to use the standard Python 3 built-in input/output functions and the data encoding/decoding capabilities of the json library. For example:
import json

# write the data to a file so we don't have to pull it again
with open("pulls.json", "w") as f:
    f.write(json.dumps(pulls))
# if the data has already been extracted, simply load it back into the environment
with open("pulls.json", "r") as f:
    pulls = json.loads(f.read())
However, for larger datasets this solution will throw the following errors if executed in a JupyterLab notebook environment:
IOStream.flush timed out
[W 13:37:54.357 LabApp] IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
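As the warning itself suggests, the limit can be raised as a stopgap, either on the command line (e.g. jupyter lab --NotebookApp.iopub_data_rate_limit=1e10) or in a Jupyter config file. The snippet below is only a sketch of the config-file route; the path and the chosen value are assumptions, and raising the limit masks the symptom rather than addressing the underlying data volume:

# ~/.jupyter/jupyter_notebook_config.py (path assumed; generate it with
# `jupyter notebook --generate-config` if it doesn't already exist)
c.NotebookApp.iopub_data_rate_limit = 1e10  # arbitrary, much higher limit than the 1 MB/s default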
Resolved in c5eadb5 by adding readData(path) and writeData(path, data) methods and using bigjson to handle I/O issues with large JSON files.
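For reference, a minimal sketch of what such helpers might look like; the readData/writeData names come from the commit message, but the bodies below are assumptions based on bigjson's lazy-loading API and may differ from the actual implementation in c5eadb5:

import json
import bigjson

def writeData(path, data):
    # json.dump streams the encoded output straight to the file,
    # avoiding the single huge in-memory string that json.dumps builds
    with open(path, "w") as f:
        json.dump(data, f)

def readData(path):
    # bigjson parses the file lazily, so very large JSON documents never
    # have to fit in memory all at once; it expects binary mode, and the
    # handle must stay open while the returned object is in use
    f = open(path, "rb")
    return bigjson.load(f)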