Introduce a cache-friendly Pex disk format. #1424

jsirois · 2021-08-27T15:43:06Z

There are two fundamental on-disk formats for PEXes:

zip - 1 file
loose - O(10k) files is common.

The zip format defeats cache entry sharing. Each zip is unique if any content changes.

The loose format is perfectly fine-grained for caching, but the number of entries to cache can be large, and this can make storing and retrieving the PEX from a cache slow, depending on IO characteristics.

There is a natural intermediate set of items to cache: each .deps/X.whl, .bootstrap/ + __main__.py, user code, and PEX-INFO.
This format would allow wheel cache entries to be re-used and bootstrap cache entries to be re-used (these are the same for a given version of Pex).
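To make the sharing argument concrete, here is a minimal sketch that content-addresses each intermediate item and compares two PEXes that differ only in user code. The entry names and byte contents are hypothetical stand-ins, and `monolithic_key` merely models a single-file zip; none of this is Pex code.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content-address one cache entry by its SHA-256 hash."""
    return hashlib.sha256(data).hexdigest()


def cache_keys(entries: dict) -> set:
    """One cache key per intermediate item (bootstrap, wheel, user code)."""
    return {digest(content) for content in entries.values()}


def monolithic_key(entries: dict) -> str:
    """A single key over all content, standing in for a zip PEX."""
    h = hashlib.sha256()
    for name in sorted(entries):
        h.update(name.encode())
        h.update(entries[name])
    return h.hexdigest()


# Hypothetical contents; real entries would be files on disk.
pex_a = {
    ".bootstrap": b"bootstrap for one Pex version",
    ".deps/requests.whl": b"requests wheel bytes",
    "user code": b"print(1)",
}
pex_b = dict(pex_a)
pex_b["user code"] = b"print(2)"  # Only the user code changed.

shared = cache_keys(pex_a) & cache_keys(pex_b)
print(len(shared), "entries are reusable between the two PEXes")
print("zip-style keys match:", monolithic_key(pex_a) == monolithic_key(pex_b))
```

The bootstrap and wheel entries keep their keys across both builds, while the single zip-style key changes as soon as any content does.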

So the idea is to add a --spread CLI option. That would, instead of emitting a PEX zip when -o my.pex is used, emit this structure:

my.pex/
  __main__.py
  PEX-INFO
  <user code>
  .bootstrap
  .deps/<wheel 1>.whl
  .deps/<wheel 2>.whl
  ...
  .deps/<wheel N>.whl

Each entry except __main__.py is identical to a standard zipped or unzipped PEX with the following exceptions:

.bootstrap is a zip file containing everything in .bootstrap/ plus the original __main__.py
.deps/<wheel X>.whl is a zip file containing everything in the corresponding .deps/ installed wheel chroot
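As a rough sketch of how an installed wheel chroot (or the bootstrap directory) could be packed into one of these zips, the helper below walks a directory and stores every file under its path relative to that directory. `zip_chroot` is a hypothetical name, not a Pex function, and byte-for-byte reproducible zips would additionally need fixed timestamps, which this sketch does not attempt.

```python
import os
import zipfile


def zip_chroot(chroot_dir: str, zip_path: str) -> None:
    """Zip every file under chroot_dir, storing paths relative to it."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(chroot_dir):
            dirs.sort()  # Visit subdirectories in a stable order.
            for name in sorted(files):
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, chroot_dir))
```

The output path is expected to live outside `chroot_dir`; otherwise the walk would pick up the zip being written.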

The special top-level __main__.py would contain code to check for a ~/.pex/unzipped_pexes/<pex hash> entry and create one if not present and re-exec to it.
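The check-and-create step could look roughly like the sketch below. The function name `ensure_unzipped` and the extraction layout (bootstrap zip unpacked at the top level, each wheel zip unpacked under .deps/<wheel name>/) are assumptions for illustration, and copying user code and PEX-INFO is left out.

```python
import os
import shutil
import tempfile
import zipfile


def ensure_unzipped(spread_dir, pex_root, pex_hash):
    """Return the unzipped PEX dir for pex_hash, extracting it on first use."""
    target = os.path.join(pex_root, "unzipped_pexes", pex_hash)
    if os.path.isdir(target):
        return target  # Already extracted by an earlier run.

    parent = os.path.dirname(target)
    os.makedirs(parent, exist_ok=True)
    # Extract into a sibling temp dir and rename it into place so other
    # processes never see a partially populated target.
    work = tempfile.mkdtemp(dir=parent)
    # Assumed layout: the .bootstrap zip unpacks at the top level and each
    # wheel zip unpacks under .deps/<wheel name>/.
    with zipfile.ZipFile(os.path.join(spread_dir, ".bootstrap")) as zf:
        zf.extractall(work)
    deps_dir = os.path.join(spread_dir, ".deps")
    if os.path.isdir(deps_dir):
        for whl in sorted(os.listdir(deps_dir)):
            with zipfile.ZipFile(os.path.join(deps_dir, whl)) as zf:
                zf.extractall(os.path.join(work, ".deps", whl))
    try:
        os.rename(work, target)
    except OSError:
        # Another process created the target first; discard our copy.
        shutil.rmtree(work, ignore_errors=True)
    return target
```

A real entry point would then re-exec into the result, for example with `os.execv(sys.executable, [sys.executable, target] + sys.argv[1:])`, which is omitted here so the sketch stays side-effect free.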

This would net a full-performance PEX that always ran in either today's --unzip mode or, if --venv were specified in the build, would do a one-time venv extract and run in today's --venv mode.

The cache characteristics would be intermediate between loose and zip with O(100) files being common. An option to control whether the user code was loose or itself zipped up might make sense.

This idea could first be implemented as a pex tool: pex-tools <a.pex> spread -o <here> that could convert a standard loose or zip PEX into a spread format PEX. That code could then be re-used to implement a --spread option, analogous to how the --venv mode evolved.


stuhood · 2021-08-27T16:36:44Z

Thanks for sketching this out.

One critical bit of the use case that needs consideration as well is that we currently create a large repository.pex (from a lockfile) in order to actually run a network-reaching resolve a single time, and then construct subset PEXes using --pex-repository. I say currently, because it's possible that if resolving "from a lockfile" is fast enough, we might be able to skip creating the monolithic artifact, and instead create all of the more-granular PEXes that we need directly from the lockfile... this seems likely?

But while we're using repository.pex, it would be important that if both the input PEX and the output PEX are --spreaded, that --pex-repository would just copy/hardlink the files rather than extracting/recompressing... otherwise the work done to create each subset is proportional to the size of the subset.
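For illustration, the copy/hardlink idea might be sketched as below: each selected wheel zip is linked from the repository's .deps/ into the subset's .deps/, with a plain copy as fallback. `link_or_copy` and `subset_spread` are hypothetical helpers, not part of Pex or of --pex-repository.

```python
import os
import shutil


def link_or_copy(src: str, dst: str) -> None:
    """Hardlink src to dst, falling back to a copy if linking fails."""
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    try:
        os.link(src, dst)
    except OSError:
        # For example, src and dst live on different filesystems.
        shutil.copy2(src, dst)


def subset_spread(repo_dir: str, out_dir: str, wheel_names: list) -> None:
    """Populate out_dir/.deps with the chosen wheel zips from repo_dir."""
    for name in wheel_names:
        link_or_copy(
            os.path.join(repo_dir, ".deps", name),
            os.path.join(out_dir, ".deps", name),
        )
```

With hardlinks, the per-wheel cost is a directory entry rather than an extract and recompress of the wheel's contents.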

One advantage of the PEX_PATH approach is that when creating N subsets of a PEX, the work you do to prepare each subset is completely incremental: if you have 150 wheels/single-entry PEXes, you can externally compose them without relying on the input and output of the subsetting process being identical. Instead, to construct a subset, you'd build the eggs/single-entry-PEXes in graph order, and the output of any particular entry is 100% additive: you capture only the single new output.
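A small sketch of "graph order" composition, assuming a hypothetical dependency map from each single-entry PEX to the entries it needs: collect the transitive closure of the requested roots, then order it dependencies-first. The names and the `compose` helper are illustrative only, and whether PEX_PATH would honor such an ordering is exactly the open question in this thread.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def compose(roots, deps):
    """Return roots plus transitive dependencies, dependencies first."""
    needed, stack = set(), list(roots)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(deps.get(name, ()))
    # TopologicalSorter takes node -> predecessors, so mapping each entry
    # to its dependencies yields dependencies before dependents.
    graph = {name: set(deps.get(name, ())) for name in needed}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical graph of single-entry PEXes.
deps = {
    "requests.pex": ["urllib3.pex", "idna.pex"],
    "urllib3.pex": [],
    "idna.pex": [],
    "flask.pex": ["werkzeug.pex"],
    "werkzeug.pex": [],
}
order = compose(["requests.pex"], deps)
print(":".join(order))  # A candidate PEX_PATH-style value.
```

Each entry appears once and only after everything it depends on, so preparing a new subset only requires building the entries that were not already built for an earlier one.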

So it seems like a bit of a tossup: if the PEX_PATH worked the way the JVM classpath did, it would be very efficient. But perhaps lockfiles will allow for skipping the creation of the repository.pex... in which case, --spread could make a lot of sense.

jsirois · 2021-08-27T16:55:04Z

> But while we're using repository.pex, it would be important that if both the input PEX and the output PEX are --spreaded, that --pex-repository would just copy/hardlink the files rather than extracting/recompressing... otherwise the work done to create each subset is proportional to the size of the subset.

This optimization can't be done in the pex tool implementation of course, but it could be done when that graduated to a --spread option. The difference between "important" and "critical" (your words) and "optimization" (mine) seems like it requires some measurement. There are a lot of really slow things here in Python land, and some of the optimizations don't end up mattering much because they're dwarfed by X. But either way I agree it would be an excellent property to have.

Beyond the top level `pex` and `__main__.py` files, the layout is an implementation detail, but the current structure sheds some light on the cache-friendly characteristics:

Given loose sources:
```
$ cat src/main.py
from pex.version import __version__; print(__version__)
```

A spread layout PEX looks like:
```
$ python -mpex pex -Dsrc -emain -opex.spread.venv --spread \
    --seed verbose --venv prepend | jq .
{
  "pex_root": "/home/jsirois/.pex",
  "python": "/usr/bin/python3.9",
  "pex": "/home/jsirois/.pex/venvs/d2f743b9c1ebb156f794419c01b3422653cbdb61/2d1d404c3de23b1810386195be7410700b1feb14/pex"
}
$ tree -a pex.spread.venv/
pex.spread.venv/
├── .bootstrap
├── .deps
│   └── pex-2.1.46-py2.py3-none-any.whl
├── __main__.py
├── pex -> __main__.py
├── PEX-SPREAD-INFO
└── src
    ├── __main__.py
    ├── main.py
    └── PEX-INFO
```

And the runtime spreading is suggested by the new PEX-SPREAD-INFO manifest:
```json
{
  "sources": [
    "PEX-INFO",
    "__main__.py",
    "main.py"
  ],
  "spreads": [
    {
      "strip_zip_relpath": false,
      "unpack_relpath": "bootstraps/a54b6ae5e64e5b229388fdffc8adac141f3c416b",
      "zip_relpath": ".bootstrap"
    },
    {
      "strip_zip_relpath": true,
      "unpack_relpath": "installed_wheels/f627f0368a0e29be24aa8cadba74044b9ad990d7/pex-2.1.46-py2.py3-none-any.whl",
      "zip_relpath": ".deps/pex-2.1.46-py2.py3-none-any.whl"
    }
  ]
}
```

The layout adds new PEX_ROOT caches for the `.bootstrap` zip and installed wheel chroot zips such that neither bootstraps - which are tied to a version of Pex, nor installed wheel chroot zips, which are constant for a given distribution version, are created more than once.

Closes pex-tool#1424
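One way to read the manifest shown above is as a plan mapping each zip inside the spread PEX to a destination under PEX_ROOT. The sketch below does only that mapping. It assumes `unpack_relpath` is relative to the PEX_ROOT, it ignores `strip_zip_relpath` (whose exact semantics are not spelled out here), and `unpack_plan` is a hypothetical helper rather than Pex's own loader.

```python
import json
import os

# The manifest content from the example above.
MANIFEST = """
{
  "sources": ["PEX-INFO", "__main__.py", "main.py"],
  "spreads": [
    {
      "strip_zip_relpath": false,
      "unpack_relpath": "bootstraps/a54b6ae5e64e5b229388fdffc8adac141f3c416b",
      "zip_relpath": ".bootstrap"
    },
    {
      "strip_zip_relpath": true,
      "unpack_relpath": "installed_wheels/f627f0368a0e29be24aa8cadba74044b9ad990d7/pex-2.1.46-py2.py3-none-any.whl",
      "zip_relpath": ".deps/pex-2.1.46-py2.py3-none-any.whl"
    }
  ]
}
"""


def unpack_plan(manifest_text: str, pex_root: str) -> dict:
    """Map each zip_relpath to its assumed destination under pex_root."""
    manifest = json.loads(manifest_text)
    return {
        spread["zip_relpath"]: os.path.join(pex_root, spread["unpack_relpath"])
        for spread in manifest["spreads"]
    }


plan = unpack_plan(MANIFEST, "/home/jsirois/.pex")
for zip_relpath, destination in plan.items():
    # An existing destination would mean the zip was already unpacked.
    print(zip_relpath, "->", destination)
```

Because the destinations are keyed by a bootstrap hash and by a wheel fingerprint plus file name, two spread PEXes built with the same Pex version and the same wheel would resolve to the same cache directories.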

jsirois · 2021-08-31T02:04:22Z

The optimization was trivial so I went ahead and added it for both .bootstrap and installed wheel chroot zips.


jsirois added the feature request label Aug 27, 2021

jsirois mentioned this issue Aug 27, 2021

PEX_PATH transitivity #1423

Closed

jsirois mentioned this issue Aug 31, 2021

Introduce the spread app layout. #1431

Merged

jsirois self-assigned this Aug 31, 2021

jsirois added the in progress label Aug 31, 2021

jsirois closed this as completed in #1431 Sep 2, 2021

jsirois removed the in progress label Sep 2, 2021
