Introduce a cache-friendly Pex disk format. #1424
Comments
Thanks for sketching this out. One critical bit of the use case that needs consideration as well is that we currently create a large But while we're using One advantage of the So it seems like a bit of a tossup: if the PEX_PATH worked the way the JVM classpath did, it would be very efficient. But perhaps lockfiles will allow for skipping the creation of the |
This optimization can't be done in the pex tool implementation of course but it could be done when that graduated to a |
Beyond the top level `pex` and `__main__.py` files, the layout is an implementation detail, but the current structure sheds some light on the cache-friendly characteristics:

Given loose sources:
```
$ cat src/main.py
from pex.version import __version__; print(__version__)
```

A spread layout PEX looks like:
```
$ python -mpex pex -Dsrc -emain -opex.spread.venv --spread \
    --seed verbose --venv prepend | jq .
{
  "pex_root": "/home/jsirois/.pex",
  "python": "/usr/bin/python3.9",
  "pex": "/home/jsirois/.pex/venvs/d2f743b9c1ebb156f794419c01b3422653cbdb61/2d1d404c3de23b1810386195be7410700b1feb14/pex"
}
$ tree -a pex.spread.venv/
pex.spread.venv/
├── .bootstrap
├── .deps
│   └── pex-2.1.46-py2.py3-none-any.whl
├── __main__.py
├── pex -> __main__.py
├── PEX-SPREAD-INFO
└── src
    ├── __main__.py
    ├── main.py
    └── PEX-INFO
```

And the runtime spreading is suggested by the new PEX-SPREAD-INFO manifest:
```json
{
  "sources": [
    "PEX-INFO",
    "__main__.py",
    "main.py"
  ],
  "spreads": [
    {
      "strip_zip_relpath": false,
      "unpack_relpath": "bootstraps/a54b6ae5e64e5b229388fdffc8adac141f3c416b",
      "zip_relpath": ".bootstrap"
    },
    {
      "strip_zip_relpath": true,
      "unpack_relpath": "installed_wheels/f627f0368a0e29be24aa8cadba74044b9ad990d7/pex-2.1.46-py2.py3-none-any.whl",
      "zip_relpath": ".deps/pex-2.1.46-py2.py3-none-any.whl"
    }
  ]
}
```

The layout adds new PEX_ROOT caches for the `.bootstrap` zip and installed wheel chroot zips such that neither bootstraps, which are tied to a version of Pex, nor installed wheel chroot zips, which are constant for a given distribution version, are created more than once.

Closes pex-tool#1424
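To make the manifest-driven unpacking concrete, here is a minimal sketch of how a reader of PEX-SPREAD-INFO might populate the PEX_ROOT caches. It is not Pex's actual code: the function name `unpack_spreads` is hypothetical, the `strip_zip_relpath` flag is deliberately ignored because its exact semantics are not spelled out here, and concurrent writers are not handled.

```python
import json
import os
import tempfile
import zipfile


def unpack_spreads(pex_dir, pex_root):
    """Unpack each spread zip into PEX_ROOT unless it is already cached.

    Hypothetical helper, not part of Pex. Returns the unpack_relpath of
    every entry that had to be created (i.e. the cache misses).
    """
    with open(os.path.join(pex_dir, "PEX-SPREAD-INFO")) as fp:
        manifest = json.load(fp)

    created = []
    for spread in manifest["spreads"]:
        dest = os.path.join(pex_root, spread["unpack_relpath"])
        if os.path.isdir(dest):
            continue  # Cache hit: some earlier PEX already unpacked this.
        parent = os.path.dirname(dest)
        os.makedirs(parent, exist_ok=True)
        # Extract into a temporary sibling directory, then rename it into
        # place so a half-extracted entry never appears under its final name.
        work_dir = tempfile.mkdtemp(dir=parent)
        with zipfile.ZipFile(os.path.join(pex_dir, spread["zip_relpath"])) as zf:
            zf.extractall(work_dir)
        os.rename(work_dir, dest)
        created.append(spread["unpack_relpath"])
    return created
```

Because the `unpack_relpath` values embed a hash, a second PEX built with the same Pex version or the same wheel would hit the `os.path.isdir` check and skip the extraction entirely.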
The optimization was trivial so I went ahead and added it for both |
There are two fundamental on-disk formats for PEXes: zip and loose.
The zip format defeats cache entry sharing. Each zip is unique if any content changes.
The loose format is cacheable at the finest grain, but the number of entries to cache can be large, and this can make storing and retrieving the PEX from a cache slow, depending on IO characteristics.
There is a natural intermediate set of items to cache: each `.deps/X.whl`, `.bootstrap/` + `__main__.py`, user code and `PEX-INFO`.
This format would allow wheel cache entries to be re-used and bootstrap cache entries to be re-used (these are the same for a given version of Pex).
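The re-use argument can be illustrated with a toy model. This is only a sketch: the byte strings stand in for real file contents, and `zip_cache_key` and `spread_cache_keys` are hypothetical helpers, not anything Pex or a particular cache provides. Two PEXes that differ only in user code share nothing when cached as one zip, but share their bootstrap and wheel entries when cached per item.

```python
import hashlib


def zip_cache_key(items):
    """One key for the whole PEX: it changes if any single item changes."""
    h = hashlib.sha256()
    for name in sorted(items):
        h.update(name.encode("utf-8"))
        h.update(items[name])
    return {h.hexdigest()}


def spread_cache_keys(items):
    """One content-derived key per item, so identical items share a key."""
    return {hashlib.sha256(content).hexdigest() for content in items.values()}


# Stand-in contents; only the user code differs between the two PEXes.
pex_a = {
    ".bootstrap": b"pex 2.1.46 bootstrap",
    ".deps/X.whl": b"wheel X contents",
    "src": b"print('a')",
}
pex_b = dict(pex_a, src=b"print('b')")

shared_zip = zip_cache_key(pex_a) & zip_cache_key(pex_b)
shared_spread = spread_cache_keys(pex_a) & spread_cache_keys(pex_b)
print(len(shared_zip), len(shared_spread))  # prints: 0 2
```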
So the idea is to add a `--spread` CLI option. That would, instead of emitting a PEX zip when `-o my.pex` is used, emit this structure:

Each entry except `__main__.py` is identical to a standard zipped or unzipped PEX with the following exceptions:
- `.bootstrap` is a zip file containing everything in `.bootstrap/` plus the original `__main__.py`
- `.deps/<wheel X>.whl` is a zip file containing everything in the corresponding `.deps/` installed wheel chroot

The special top-level `__main__.py` would contain code to check for a `~/.pex/unzipped_pexes/<pex hash>` entry and create one if not present and re-exec to it.

This would net a full-performance PEX that always ran in either today's `--unzip` mode or, if `--venv` were specified in the build, would do a 1 time venv extract and run in today's `--venv` mode.
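A rough sketch of what that special `__main__.py` might do, under stated assumptions: how the `<pex hash>` is computed is not specified above, so hashing `PEX-INFO` is purely a placeholder, and `ensure_unzipped`, `boot` and the `extract` callback are hypothetical names rather than Pex APIs.

```python
import hashlib
import os
import sys


def pex_hash(pex_dir):
    """Placeholder hash (an assumption): derive the cache key from PEX-INFO."""
    with open(os.path.join(pex_dir, "PEX-INFO"), "rb") as fp:
        return hashlib.sha1(fp.read()).hexdigest()


def ensure_unzipped(pex_dir, pex_root, extract):
    """Return <pex_root>/unzipped_pexes/<pex hash>, extracting on a miss only.

    `extract(dest)` is a caller-supplied callback that must create `dest`.
    """
    dest = os.path.join(pex_root, "unzipped_pexes", pex_hash(pex_dir))
    if not os.path.isdir(dest):
        extract(dest)
    return dest


def boot(pex_dir, extract, pex_root=os.path.expanduser("~/.pex")):
    """Check the cache, fill it if needed, then re-exec into the unzipped PEX."""
    unzipped = ensure_unzipped(pex_dir, pex_root, extract)
    # Running `python <directory>` executes that directory's __main__.py,
    # so this replaces the current process with the unpacked PEX.
    os.execv(sys.executable, [sys.executable, unzipped] + sys.argv[1:])
```

The extraction cost is paid once per distinct PEX; every later run goes straight to the re-exec.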
The cache characteristics would be intermediate between loose and zip with O(100) files being common. An option to control whether the user code was loose or itself zipped up might make sense.
This idea could first be implemented as a pex tool, `pex-tools <a.pex> spread -o <here>`, that could convert a standard loose or zip PEX into a spread format PEX. That code could then be re-used to implement a `--spread` option, analogous to how the `--venv` mode evolved.
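For a loose PEX, the conversion described above could look roughly like the following. This is a sketch under assumptions, not the eventual tool: `zip_tree` and `spread` are hypothetical names, user code and the new special `__main__.py` are left out, and a real implementation would likely also pin zip timestamps so that identical inputs produce byte-identical cache entries.

```python
import os
import shutil
import zipfile


def zip_tree(src_dir, zip_path, extra_files=()):
    """Zip everything under src_dir, plus extra_files stored by basename."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # Walk subdirectories in a stable order.
            for name in sorted(files):
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, src_dir))
        for path in extra_files:
            zf.write(path, os.path.basename(path))


def spread(loose_pex, out_dir):
    """Convert a loose PEX directory into the spread structure (partial)."""
    os.makedirs(os.path.join(out_dir, ".deps"))
    # `.bootstrap` becomes a zip of `.bootstrap/` plus the original __main__.py.
    zip_tree(
        os.path.join(loose_pex, ".bootstrap"),
        os.path.join(out_dir, ".bootstrap"),
        extra_files=[os.path.join(loose_pex, "__main__.py")],
    )
    # Each installed wheel chroot under `.deps/` becomes a zip of the same name.
    deps_dir = os.path.join(loose_pex, ".deps")
    for wheel in sorted(os.listdir(deps_dir)):
        zip_tree(os.path.join(deps_dir, wheel), os.path.join(out_dir, ".deps", wheel))
    shutil.copy(os.path.join(loose_pex, "PEX-INFO"), out_dir)
```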