
six million sha256 hashes for 300K packages - a use case #2969

Closed
milahu opened this issue Mar 13, 2022 · 4 comments


milahu commented Mar 13, 2022

i'm trying to mirror the pypi.org python package index as a distributed versioned database

it's a toy project, not (yet) useful

could be useful to "outsource" the boring part of nixpkgs: versions, file hashes
a package update in nixpkgs is *mostly* just an update of version and file hash
this updating could be automated (guarded by tests), and manual intervention should be the exception

could be useful to provide a shared database for multiple consumers: nixos, guix, bazel, ...
→ solution for How do I download the entire pypi Python Package Index

i found dolt via How can I put a database under git (version control)?

i'm posting my use case here, since i don't like the "chattiness" of discord
feel free to move this to a github discussion

status

i'm currently scraping json files from pypi.org (boring old json api ...)
json size for 360K packages:

  • 5.7GB raw
  • 1.2GB tar.gz
  • 0.9GB 7z
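
the scraping step can be sketched like this (a hedged illustration, not my actual scraper; it assumes the shape of the legacy pypi.org /pypi/&lt;name&gt;/json response, where per-version file lists live under "releases" — the function name and sample payload are made up):

```python
import json

def extract_hashes(pypi_json: str):
    """Extract (version, filename, sha256) triples from a
    pypi.org /pypi/<package>/json response body."""
    data = json.loads(pypi_json)
    rows = []
    for version, files in data.get("releases", {}).items():
        for f in files:
            sha = f.get("digests", {}).get("sha256")
            if sha:
                rows.append((version, f["filename"], sha))
    return rows

# example payload shaped like the real API response (abridged, fake digest)
sample = json.dumps({
    "info": {"name": "example"},
    "releases": {
        "1.0": [{"filename": "example-1.0.tar.gz",
                 "digests": {"sha256": "ab" * 32}}],
    },
})
print(extract_hashes(sample))
```

the interesting payload per file is just (version, filename, sha256) — everything else in the json is redundant for this mirror.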

next steps:

  • find a SQL schema. optimize for ...
    1. small size
    2. fast read
    3. fast write
  • throw six million rows at dolt, test performance as it grows
  • get an update feed from pypi.org or bigquery-public-data
  • how often should i insert new rows, to keep the database small? more inserts = more size from the versioning overhead
  • use a different system to push frequent updates: matrix chat, IRC chat, OrbitDB, GunDB, websockets, IMAP, ...
    • this allows me to update my mirror with a lower frequency, say every 10 days, to reduce overhead from versioning = hybrid of few snapshots + many diffs versus the last snapshot
    • should be a "push" system = no RSS feed
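
to make the "small size" goal concrete, here is a minimal sqlite sketch (a hypothetical schema, not the one i'll necessarily end up with): storing the sha256 as a 32-byte BLOB instead of a 64-char hex string halves the hash payload, and normalizing package names avoids repeating them for every file:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# hypothetical schema: one row per package name, one row per release file,
# sha256 stored as a raw 32-byte BLOB
con.executescript("""
CREATE TABLE package (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE release_file (
    package_id INTEGER NOT NULL REFERENCES package(id),
    version    TEXT NOT NULL,
    filename   TEXT NOT NULL,
    sha256     BLOB NOT NULL CHECK (length(sha256) = 32),
    PRIMARY KEY (package_id, filename)
);
""")

con.execute("INSERT INTO package (name) VALUES ('example')")
pkg_id = con.execute(
    "SELECT id FROM package WHERE name = 'example'").fetchone()[0]
con.execute(
    "INSERT INTO release_file VALUES (?, ?, ?, ?)",
    (pkg_id, "1.0", "example-1.0.tar.gz", bytes.fromhex("ab" * 32)),
)
# round-trip the stored digest back to hex to confirm nothing is lost
row = con.execute("SELECT hex(sha256) FROM release_file").fetchone()
print(row[0].lower())
```

whether dolt stores BLOBs as compactly as sqlite is a separate question to benchmark.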

similar projects

https://github.com/DavHau/nix-pypi-fetcher
database with filenames (urls) and hashes of all python packages
json in git
1 GB data, 250 MB compressed

https://github.com/DavHau/pypi-deps-db
dependency graph of all python packages
json in git
1.2 GB data, 80 MB compressed

bigquery-public-data:pypi hosted by google
rate limited, commercial
see also https://warehouse.pypa.io/api-reference/bigquery-datasets.html
table bigquery-public-data:pypi.distribution_metadata
has 6,414,893 rows and 31.37 GB → "six million sha256 hashes" = 192 MByte of raw sha256 data ("the payload")
number of packages is around 360K
data is highly redundant, hopefully can be compressed to 5% = 1.5 GB
can be useful to find popular versions of python packages, to further reduce the size of my mirror to (let's say) 500 MByte per snapshot
some packages (*-nightly) literally have one release every day → mostly junk data
if a user requests a version that is missing in my mirror, i can call it a "cache miss"
and expect the user to come up with a workaround = manually add the dataset (version, filename, hash) to their app

https://github.com/rust-lang/crates.io-index
release metadata for crates.io (rust) packages
textfile based, jsonlines format
commit history is truncated regularly → rust-lang/crates-io-cargo-teams#47

https://github.com/NixOS/nixpkgs
collection of packages
build scripts, file hashes, file URLs, dependencies
boring part: hashes, URLs

https://github.com/on-nix/python
handmade collection of popular python packages
subset of the pypi index
versions, file hashes, file names, dependencies
example package: PySide6-6.2.3

alternatives considered

  • other distributed databases
  • graph databases
    • edgedb
      • not distributed (?)
      • prettier schema than SQL
  • git, hosted on github. also used by crates.io-index
    • git-friendly database filesystems
      • sqlite-diffable to load/store a sqlite database from/to json files (via sirdb via yc)
        • one ndjson (jsonlines) file for all records = good usage of filesystem blocks
      • sirdb
        • one file per record = poor usage of filesystem blocks (4 KByte)
      • binary files in git
        • not space efficient?
        • git packfiles: compress, pack, compress = poor compression?
        • to see pretty diffs for sqlite: add [diff "sqlite3"]\n textconv = sqlite3 $1 .dump to .gitconfig
        • custom format: protocolbuffers, msgpack, ...
    • why not use git?
      • git packfiles give poor compression
      • implementation cost for file parsing and text compression
    • why use git?
      • free hosting on github ...
  • IPFS, OrbitDB, GunDB, BigChainDB, ...
    • i'm not sure if these systems allow the kind of "moderation" we know from github, to defend against abuse, but to allow collaboration and forking

timsehn commented Mar 13, 2022

Thanks for the information! Cool use case.

We also have DoltHub (https://www.dolthub.com) where we host public databases for free. It has similar features to GitHub (pull requests, forks, issues, etc). If GitHub was a selling point of Git, we have you covered there.

6M rows should definitely be in our scaling boundaries.

We'll fix the binary parsing bug and see if that gets you going.


milahu commented Mar 14, 2022

We'll fix the binary parsing bug and see if that gets you going.

all good, i'm currently using a workaround = 4x uint64 fields
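
for reference, the split is lossless; a sketch in python (struct ">4Q" yields four unsigned big-endian uint64 values — note that a signed BIGINT column can't hold values ≥ 2^63 directly, so a real schema may need a signed reinterpretation):

```python
import hashlib
import struct

def sha256_to_u64x4(digest: bytes):
    """Split a 32-byte sha256 digest into four big-endian uint64
    fields, as a workaround for databases without a 32-byte type."""
    assert len(digest) == 32
    return struct.unpack(">4Q", digest)

def u64x4_to_sha256(parts):
    """Reassemble the four uint64 fields into the original digest."""
    return struct.pack(">4Q", *parts)

digest = hashlib.sha256(b"hello").digest()
parts = sha256_to_u64x4(digest)
assert u64x4_to_sha256(parts) == digest  # lossless round trip
print(parts)
```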


bjorn3 commented Jul 2, 2022

https://github.com/rust-lang/crates.io-index
release metadata for crates.io (rust) packages
textfile based, jsonlines format
commit history is truncated regularly → rust-lang/crates-io-cargo-teams#47

Every time the history is squashed, a new branch is created over at https://github.com/rust-lang/crates.io-index-archive with the old history. At least the last time the squashed commit (rust-lang/crates.io-index@d511f68) referenced the previous commit. Not sure if that happened every time though.


timsehn commented Aug 3, 2022

Resolving. Thanks for sharing your use case.

@timsehn timsehn closed this as completed Aug 3, 2022