six million sha256 hashes for 300K packages - a use case #2969
Comments
Thanks for the information! Cool use case. We also have DoltHub (https://www.dolthub.com), where we host public databases for free. It has similar features to GitHub (pull requests, forks, issues, etc.). If GitHub was a selling point of Git, we have you covered there. 6M rows should definitely be within our scaling boundaries. We'll fix the binary parsing bug and see if that gets you going.
all good, i'm currently using a workaround = 4x uint64 fields
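For illustration, a minimal sketch of what that "4x uint64" workaround can look like on the client side, assuming the hashes arrive as 64-character hex digests (the four integers would then map onto four unsigned 64-bit columns, e.g. BIGINT UNSIGNED); this is not the poster's actual code:

```python
import hashlib

def sha256_to_uint64x4(hex_digest: str):
    """Split a 64-char sha256 hex digest into four big-endian unsigned 64-bit ints."""
    raw = bytes.fromhex(hex_digest)  # 32 bytes
    return tuple(int.from_bytes(raw[i:i + 8], "big") for i in range(0, 32, 8))

def uint64x4_to_sha256(parts):
    """Reassemble the original hex digest from the four integers."""
    return b"".join(p.to_bytes(8, "big") for p in parts).hex()

# round-trip check
digest = hashlib.sha256(b"example").hexdigest()
parts = sha256_to_uint64x4(digest)
assert all(0 <= p < 2**64 for p in parts)
assert uint64x4_to_sha256(parts) == digest
```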
Every time the history is squashed, a new branch is created over at https://github.com/rust-lang/crates.io-index-archive with the old history. At least the last time, the squashed commit (rust-lang/crates.io-index@d511f68) referenced the previous commit. Not sure if that happened every time, though.
Resolving. Thanks for sharing your use case.
i'm trying to mirror the pypi.org python package index as a distributed versioned database
it's a toy project, not (yet) useful
could be useful to "outsource" the boring part of nixpkgs: versions, file hashes
a package update in nixpkgs is *mostly* just an update of version and file hash
this updating could be automated (guarded by tests), and manual intervention should be the exception
could be useful to provide a shared database for multiple consumers: nixos, guix, bazel, ...
→ solution for "How do I download the entire pypi Python Package Index"
i found dolt via "How can I put a database under git (version control)?"
i'm posting my use case here, since i don't like the "chattiness" of discord
feel free to move this to a github discussion
status
i'm currently scraping json files from pypi.org (boring old json api ...); a sketch of that step is shown below, after this list
json size for 360K packages:
next steps:
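For illustration, a minimal sketch of the scraping step mentioned above, assuming the per-project endpoint https://pypi.org/pypi/<name>/json and that every file entry carries a sha256 digest; "requests" is just an example project, and error handling is omitted:

```python
import json
import urllib.request

def release_files(project: str):
    """Yield (version, filename, sha256, url) for every file of every
    release of `project`, via the pypi.org JSON API."""
    url = f"https://pypi.org/pypi/{project}/json"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # "releases" maps each version string to a list of file dicts;
    # each file dict carries its digests, filename and download url
    for version, files in data["releases"].items():
        for f in files:
            yield version, f["filename"], f["digests"]["sha256"], f["url"]

if __name__ == "__main__":
    for row in release_files("requests"):
        print(row)
```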
similar projects
https://github.com/DavHau/nix-pypi-fetcher
database with filenames (urls) and hashes of all python packages
json in git
1 GB data, 250 MB compressed
https://github.com/DavHau/pypi-deps-db
dependency graph of all python packages
json in git
1.2 GB data, 80 MB compressed
bigquery-public-data:pypi hosted by google
rate limited, commercial
see also https://warehouse.pypa.io/api-reference/bigquery-datasets.html
table: bigquery-public-data:pypi.distribution_metadata
has 6,414,893 rows and 31.37 GB → "six million sha256 hashes" at 32 bytes each ≈ 192 MByte of raw sha256 data ("the payload")
number of packages is around 360K
data is highly redundant, hopefully can be compressed to 5% = 1.5 GB
can be useful to find popular versions of python packages, to further reduce the size of my mirror to (let's say) 500 MByte per snapshot
some packages (*-nightly) literally have one release every day → mostly junk data
if a user requests a version that is missing in my mirror, i can call it a "cache miss"
and expect the user to come up with a workaround = manually add the dataset (version, filename, hash) to their app
https://github.com/rust-lang/crates.io-index
release metadata for crates.io (rust) packages
textfile based, jsonlines format
commit history is truncated regularly → rust-lang/crates-io-cargo-teams#47
https://github.com/NixOS/nixpkgs
collection of packages
build scripts, file hashes, file URLs, dependencies
boring part: hashes, URLs
https://github.com/on-nix/python
handmade collection of popular python packages
subset of the pypi index
versions, file hashes, file names, dependencies
example package: PySide6-6.2.3
alternatives considered
[diff "sqlite3"]\n textconv = sqlite3 $1 .dump
to.gitconfig