
six million sha256 hashes for 300K packages - a use case #2969

Closed
milahu opened this issue Mar 13, 2022 · 4 comments


milahu commented Mar 13, 2022

i'm trying to mirror the pypi.org python package index as a distributed versioned database

it's a toy project, not (yet) useful

could be useful to "outsource" the boring part of nixpkgs: versions, file hashes
a package update in nixpkgs is *mostly* just an update of version and file hash
this updating could be automated (guarded by tests), and manual intervention should be the exception

could be useful to provide a shared database for multiple consumers: nixos, guix, bazel, ...
→ solution for How do I download the entire pypi Python Package Index

i found dolt via How can I put a database under git (version control)?

i'm posting my use case here, since i don't like the "chattiness" of discord
feel free to move this to a github discussion

status

i'm currently scraping json files from pypi.org (boring old json api ...)
json size for 360K packages:

  • 5.7GB raw
  • 1.2GB tar.gz
  • 0.9GB 7z
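
the scraping step can be sketched like this (a hedged illustration, not my actual scraper; it assumes the shape of the legacy pypi.org /pypi/&lt;name&gt;/json response, where per-version file lists live under "releases" — the function name and sample payload are made up):

```python
import json

def extract_hashes(pypi_json: str):
    """Extract (version, filename, sha256) triples from a
    pypi.org /pypi/<package>/json response body."""
    data = json.loads(pypi_json)
    rows = []
    for version, files in data.get("releases", {}).items():
        for f in files:
            sha = f.get("digests", {}).get("sha256")
            if sha:
                rows.append((version, f["filename"], sha))
    return rows

# example payload shaped like the real API response (abridged, fake digest)
sample = json.dumps({
    "info": {"name": "example"},
    "releases": {
        "1.0": [{"filename": "example-1.0.tar.gz",
                 "digests": {"sha256": "ab" * 32}}],
    },
})
print(extract_hashes(sample))
```

the interesting payload per file is just (version, filename, sha256) — everything else in the json is redundant for this mirror.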

next steps:

  • find a SQL schema. optimize for ...
    1. small size
    2. fast read
    3. fast write
  • throw six million rows at dolt, test performance as it grows
  • get an update feed from pypi.org or bigquery-public-data
  • how often should i insert new rows, to keep the database small? more inserts = more size from the versioning overhead
  • use a different system to push frequent updates: matrix chat, IRC chat, OrbitDB, GunDB, websockets, IMAP, ...
    • this allows me to update my mirror with a lower frequency, say every 10 days, to reduce overhead from versioning = hybrid of few snapshots + many diffs versus the last snapshot
    • should be a "push" system = no RSS feed
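
to make the "small size" goal concrete, here is a minimal sqlite sketch (a hypothetical schema, not the one i'll necessarily end up with): storing the sha256 as a 32-byte BLOB instead of a 64-char hex string halves the hash payload, and normalizing package names avoids repeating them for every file:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# hypothetical schema: one row per package name, one row per release file,
# sha256 stored as a raw 32-byte BLOB
con.executescript("""
CREATE TABLE package (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE release_file (
    package_id INTEGER NOT NULL REFERENCES package(id),
    version    TEXT NOT NULL,
    filename   TEXT NOT NULL,
    sha256     BLOB NOT NULL CHECK (length(sha256) = 32),
    PRIMARY KEY (package_id, filename)
);
""")

con.execute("INSERT INTO package (name) VALUES ('example')")
pkg_id = con.execute(
    "SELECT id FROM package WHERE name = 'example'").fetchone()[0]
con.execute(
    "INSERT INTO release_file VALUES (?, ?, ?, ?)",
    (pkg_id, "1.0", "example-1.0.tar.gz", bytes.fromhex("ab" * 32)),
)
# round-trip the stored digest back to hex to confirm nothing is lost
row = con.execute("SELECT hex(sha256) FROM release_file").fetchone()
print(row[0].lower())
```

whether dolt stores BLOBs as compactly as sqlite is a separate question to benchmark.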

similar projects

https://github.com/DavHau/nix-pypi-fetcher
database with filenames (urls) and hashes of all python packages
json in git
1 GB data, 250 MB compressed

https://github.com/DavHau/pypi-deps-db
dependency graph of all python packages
json in git
1.2 GB data, 80 MB compressed

bigquery-public-data:pypi hosted by google
rate limited, commercial
see also https://warehouse.pypa.io/api-reference/bigquery-datasets.html
table bigquery-public-data:pypi.distribution_metadata
has 6,414,893 rows and 31.37 GB → "six million sha256 hashes" = 192 MByte of raw sha256 data ("the payload")
number of packages is around 360K
data is highly redundant, hopefully can be compressed to 5% = 1.5 GB
can be useful to find popular versions of python packages, to further reduce the size of my mirror to (let's say) 500 MByte per snapshot
some packages (*-nightly) literally have one release every day → mostly junk data
if a user requests a version that is missing in my mirror, i can call it a "cache miss"
and expect the user to come up with a workaround = manually add the dataset (version, filename, hash) to their app

https://github.com/rust-lang/crates.io-index
release metadata for crates.io (rust) packages
textfile based, jsonlines format
commit history is truncated regularly → rust-lang/crates-io-cargo-teams#47

https://github.com/NixOS/nixpkgs
collection of packages
build scripts, file hashes, file URLs, dependencies
boring part: hashes, URLs

https://github.com/on-nix/python
handmade collection of popular python packages
subset of the pypi index
versions, file hashes, file names, dependencies
example package: PySide6-6.2.3

alternatives considered

  • other distributed databases
  • graph databases
    • edgedb
      • not distributed (?)
      • prettier schema than SQL
  • git, hosted on github. also used by crates.io-index
    • git-friendly database filesystems
      • sqlite-diffable to load/store a sqlite database from/to json files (via sirdb via yc)
        • one ndjson (jsonlines) file for all records = good usage of filesystem blocks
      • sirdb
        • one file per record = poor usage of filesystem blocks (4 KByte)
      • binary files in git
        • not space efficient?
        • git packfiles: compress, pack, compress = poor compression?
        • to see pretty diffs for sqlite: add [diff "sqlite3"]\n textconv = sqlite3 $1 .dump to .gitconfig
        • custom format: protocolbuffers, msgpack, ...
    • why not use git?
      • git packfiles give poor compression
      • implementation cost for file parsing and text compression
    • why use git?
      • free hosting on github ...
  • IPFS, OrbitDB, GunDB, BigChainDB, ...
    • i'm not sure if these systems allow the kind of "moderation" we know from github, to defend against abuse, but to allow collaboration and forking

timsehn commented Mar 13, 2022

Thanks for the information! Cool use case.

We also have DoltHub (https://www.dolthub.com) where we host public databases for free. It has similar features to GitHub (pull requests, forks, issues, etc). If GitHub was a selling point of Git, we have you covered there.

6M rows should definitely be in our scaling boundaries.

We'll fix the binary parsing bug and see if that gets you going.


milahu commented Mar 14, 2022

We'll fix the binary parsing bug and see if that gets you going.

all good, i'm currently using a workaround = 4x uint64 fields
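
for reference, the split is lossless; a sketch in python (struct ">4Q" yields four unsigned big-endian uint64 values — note that a signed BIGINT column can't hold values ≥ 2^63 directly, so a real schema may need a signed reinterpretation):

```python
import hashlib
import struct

def sha256_to_u64x4(digest: bytes):
    """Split a 32-byte sha256 digest into four big-endian uint64
    fields, as a workaround for databases without a 32-byte type."""
    assert len(digest) == 32
    return struct.unpack(">4Q", digest)

def u64x4_to_sha256(parts):
    """Reassemble the four uint64 fields into the original digest."""
    return struct.pack(">4Q", *parts)

digest = hashlib.sha256(b"hello").digest()
parts = sha256_to_u64x4(digest)
assert u64x4_to_sha256(parts) == digest  # lossless round trip
print(parts)
```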


bjorn3 commented Jul 2, 2022

https://github.com/rust-lang/crates.io-index
release metadata for crates.io (rust) packages
textfile based, jsonlines format
commit history is truncated regularly → rust-lang/crates-io-cargo-teams#47

Every time the history is squashed, a new branch is created over at https://github.com/rust-lang/crates.io-index-archive with the old history. At least the last time the squashed commit (rust-lang/crates.io-index@d511f68) referenced the previous commit. Not sure if that happened every time though.


timsehn commented Aug 3, 2022

Resolving. Thanks for sharing your use case.

@timsehn timsehn closed this as completed Aug 3, 2022