Introduce a cache-friendly Pex disk format. #1424

Closed
jsirois opened this issue Aug 27, 2021 · 3 comments · Fixed by #1431

@jsirois
Member

jsirois commented Aug 27, 2021

There are two fundamental on-disk formats for PEXes:

  1. zip - 1 file
  2. loose - O(10k) files is common.

The zip format defeats cache entry sharing. Each zip is unique if any content changes.

The loose format is perfectly fine-grained for caching, but the number of entries to cache can be large, and this can make storing and retrieving the PEX from a cache slow depending on IO characteristics.
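
To make the trade-off concrete, here is a minimal sketch (not Pex code; the file names and helper functions are made up for illustration) showing that a one-file change alters the digest of a whole zip, while per-file digests of the untouched files stay the same and can still hit a cache:

```python
import hashlib
import io
import zipfile


def zip_digest(files):
    """Hash a zip archive built from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in sorted(files):
            # Fixed timestamp so that only content differences change the archive.
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            zf.writestr(info, files[name])
    return hashlib.sha256(buffer.getvalue()).hexdigest()


def loose_digests(files):
    """Hash each file separately, as a per-file cache would."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


before = {"main.py": b"print('v1')\n", "dep/lib.py": b"X = 1\n"}
after = dict(before, **{"main.py": b"print('v2')\n"})

# The single zip entry is invalidated by the one-file change...
zip_changed = zip_digest(before) != zip_digest(after)
# ...while the loose entry for the untouched dependency is still reusable.
unchanged = [n for n in before if loose_digests(before)[n] == loose_digests(after)[n]]
```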

There is a natural intermediate set of items to cache: each .deps/X.whl, .bootstrap/ + __main__.py, user code and PEX-INFO.
This format would allow wheel cache entries and bootstrap cache entries to be re-used (the bootstrap entries are the same for a given version of Pex).

So the idea is to add a --spread CLI option. That would, instead of emitting a PEX zip when -o my.pex is used, emit this structure:

my.pex/
  __main__.py
  PEX-INFO
  <user code>
  .bootstrap
  .deps/<wheel 1>.whl
  .deps/<wheel 2>.whl
  ...
  .deps/<wheel N>.whl

Each entry except __main__.py is identical to a standard zipped or unzipped PEX with the following exceptions:

  • .bootstrap is a zip file containing everything in .bootstrap/ plus the original __main__.py
  • .deps/<wheel X>.whl is a zip file containing everything in the corresponding .deps/ installed wheel chroot
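
A rough sketch of how a loose PEX directory could be converted into this structure. The `spread` and `zip_tree` helpers are hypothetical, and the internal arrangement of each zip (paths stored relative to the zipped directory, with the original `__main__.py` at the root of the `.bootstrap` zip) is an assumption rather than something the proposal specifies:

```python
import os
import shutil
import zipfile


def zip_tree(src_dir, dest_zip, extra=()):
    """Zip everything under src_dir into dest_zip, plus (path, arcname) extras."""
    with zipfile.ZipFile(dest_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # Walk in a stable order.
            for name in sorted(files):
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, src_dir))
        for path, arcname in extra:
            zf.write(path, arcname)


def spread(loose_pex, dest):
    """Hypothetical helper: convert a loose PEX directory into the proposed layout."""
    os.makedirs(os.path.join(dest, ".deps"))
    # .bootstrap becomes one zip that also carries the original __main__.py.
    zip_tree(
        os.path.join(loose_pex, ".bootstrap"),
        os.path.join(dest, ".bootstrap"),
        extra=[(os.path.join(loose_pex, "__main__.py"), "__main__.py")],
    )
    # Each installed wheel chroot becomes one zip named after the wheel.
    deps = os.path.join(loose_pex, ".deps")
    for wheel in sorted(os.listdir(deps)):
        zip_tree(os.path.join(deps, wheel), os.path.join(dest, ".deps", wheel))
    # PEX-INFO and user code are copied unchanged. The special top-level
    # __main__.py is not generated by this sketch.
    for entry in os.listdir(loose_pex):
        if entry in (".bootstrap", ".deps", "__main__.py"):
            continue
        src = os.path.join(loose_pex, entry)
        dst = os.path.join(dest, entry)
        if os.path.isdir(src):
            shutil.copytree(src, dst)
        else:
            shutil.copy2(src, dst)
```

Calling `spread("loose.pex", "my.pex")` on a loose PEX would then leave one file per wheel plus one for the bootstrap, which is where the O(100) entry count comes from.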

The special top-level __main__.py would contain code to check for a ~/.pex/unzipped_pexes/<pex hash> entry and create one if not present and re-exec to it.
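
One way the check-create-re-exec logic might look, as a sketch under stated assumptions: the `pex_hash` key in PEX-INFO, the arrangement of files inside the zips, and the `ensure_unzipped` and `main` names are all illustrative and not confirmed by the proposal. The extraction is built in a private temporary directory and published with a rename so that concurrent first runs do not observe a half-extracted PEX.

```python
import json
import os
import shutil
import sys
import tempfile
import zipfile


def ensure_unzipped(spread_pex, pex_root):
    """Return the unzipped PEX directory for spread_pex, creating it on first use."""
    with open(os.path.join(spread_pex, "PEX-INFO")) as fp:
        pex_hash = json.load(fp)["pex_hash"]  # Assumed key name.
    parent = os.path.join(pex_root, "unzipped_pexes")
    target = os.path.join(parent, pex_hash)
    if os.path.isdir(target):
        return target  # Already extracted by an earlier run.

    os.makedirs(parent, exist_ok=True)
    work = tempfile.mkdtemp(dir=parent)  # Build privately, publish with a rename.
    for entry in os.listdir(spread_pex):
        src = os.path.join(spread_pex, entry)
        if entry == "__main__.py":
            continue  # The spread launcher itself is not part of the unzipped PEX.
        elif entry == ".bootstrap":
            with zipfile.ZipFile(src) as zf:
                zf.extractall(os.path.join(work, ".bootstrap"))
            # Assumed: the original __main__.py sits at the root of the zip.
            os.rename(
                os.path.join(work, ".bootstrap", "__main__.py"),
                os.path.join(work, "__main__.py"),
            )
        elif entry == ".deps":
            for wheel in os.listdir(src):
                with zipfile.ZipFile(os.path.join(src, wheel)) as zf:
                    zf.extractall(os.path.join(work, ".deps", wheel))
        elif os.path.isdir(src):
            shutil.copytree(src, os.path.join(work, entry))
        else:
            shutil.copy2(src, os.path.join(work, entry))
    try:
        os.rename(work, target)
    except OSError:
        shutil.rmtree(work)  # Another process won the race; use its result.
    return target


def main():
    here = os.path.dirname(os.path.abspath(__file__))
    unzipped = ensure_unzipped(here, os.path.expanduser("~/.pex"))
    # Re-exec the interpreter against the unzipped PEX directory.
    os.execv(sys.executable, [sys.executable, unzipped] + sys.argv[1:])
```

Only the first run pays the extraction cost; later runs find the directory and go straight to the re-exec.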

This would net a full-performance PEX that always ran in either today's --unzip mode or, if --venv were specified in the build, would do a one-time venv extraction and run in today's --venv mode.

The cache characteristics would be intermediate between loose and zip with O(100) files being common. An option to control whether the user code was loose or itself zipped up might make sense.

This idea could first be implemented as a pex tool: pex-tools <a.pex> spread -o <here> that could convert a standard loose or zip PEX into a spread format PEX. That code could then be re-used to implement a --spread option, analogous to how the --venv mode evolved.

@stuhood

stuhood commented Aug 27, 2021

Thanks for sketching this out.

One critical bit of the use case that needs consideration as well is that we currently create a large repository.pex (from a lockfile) in order to actually run a network-reaching resolve a single time, and then construct subset PEXes using --pex-repository. I say currently, because it's possible that if resolving "from a lockfile" is fast enough, we might be able to skip creating the monolithic artifact, and instead create all of the more-granular PEXes that we need directly from the lockfile... this seems likely?

But while we're using repository.pex, it would be important that if both the input PEX and the output PEX are --spreaded, that --pex-repository would just copy/hardlink the files rather than extracting/recompressing... otherwise the work done to create each subset is proportional to the size of the subset.
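
A minimal sketch of the copy/hardlink idea, assuming both PEXes keep their wheels as `.deps/<wheel>.whl` zip files. The `subset_spread` and `link_or_copy` names are hypothetical and this is not how `--pex-repository` is implemented:

```python
import os
import shutil


def link_or_copy(src, dst):
    """Hardlink src to dst, falling back to a copy (e.g. across filesystems)."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        os.link(src, dst)
    except OSError:
        shutil.copy2(src, dst)


def subset_spread(repository_pex, output_pex, wheel_names):
    """Hypothetical: populate output_pex/.deps from an already-spread repository PEX."""
    for name in wheel_names:
        # No extraction or recompression: the wheel zip is reused byte for byte.
        link_or_copy(
            os.path.join(repository_pex, ".deps", name),
            os.path.join(output_pex, ".deps", name),
        )
```

With hardlinks, the per-subset cost scales with the number of wheels rather than with their total size.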

One advantage of the PEX_PATH approach is that when creating N subsets of a PEX, the work you do to prepare each subset is completely incremental: if you have 150 wheels/single-entry PEXes, you can externally compose them without relying on the input and output of the subsetting process being identical. Instead, to construct a subset, you'd build the eggs/single-entry-PEXes in graph order, and the output of any particular entry is 100% additive: you capture only the single new output.

So it seems like a bit of a tossup: if the PEX_PATH worked the way the JVM classpath did, it would be very efficient. But perhaps lockfiles will allow for skipping the creation of the repository.pex... in which case, --spread could make a lot of sense.

@jsirois
Member Author

jsirois commented Aug 27, 2021

> But while we're using repository.pex, it would be important that if both the input PEX and the output PEX are --spreaded, that --pex-repository would just copy/hardlink the files rather than extracting/recompressing... otherwise the work done to create each subset is proportional to the size of the subset.

This optimization can't be done in the pex tool implementation of course, but it could be done when that graduated to a --spread option. The difference between "important" and "critical" (your words) and "optimization" (mine) seems like it requires some measurement. There are a lot of really slow things here in Python land and some of the optimizations don't end up mattering much because they're dwarfed by X. But either way I agree it would be an excellent property to have.

jsirois added a commit to jsirois/pex that referenced this issue Aug 31, 2021
Beyond the top level `pex` and `__main__.py` files, the layout is an
implementation detail, but the current structure sheds some light on the
cache-friendly characteristics:

Given loose sources:
```
$ cat src/main.py
from pex.version import __version__; print(__version__)
```

A spread layout PEX looks like:
```
$ python -mpex pex -Dsrc -emain -opex.spread.venv --spread \
  --seed verbose --venv prepend | jq .
{
  "pex_root": "/home/jsirois/.pex",
  "python": "/usr/bin/python3.9",
  "pex": "/home/jsirois/.pex/venvs/d2f743b9c1ebb156f794419c01b3422653cbdb61/2d1d404c3de23b1810386195be7410700b1feb14/pex"
}
$ tree -a pex.spread.venv/
pex.spread.venv/
├── .bootstrap
├── .deps
│   └── pex-2.1.46-py2.py3-none-any.whl
├── __main__.py
├── pex -> __main__.py
├── PEX-SPREAD-INFO
└── src
    ├── __main__.py
    ├── main.py
    └── PEX-INFO
```

And the runtime spreading is suggested by the new PEX-SPREAD-INFO
manifest:
```json
{
  "sources": [
    "PEX-INFO",
    "__main__.py",
    "main.py"
  ],
  "spreads": [
    {
      "strip_zip_relpath": false,
      "unpack_relpath": "bootstraps/a54b6ae5e64e5b229388fdffc8adac141f3c416b",
      "zip_relpath": ".bootstrap"
    },
    {
      "strip_zip_relpath": true,
      "unpack_relpath": "installed_wheels/f627f0368a0e29be24aa8cadba74044b9ad990d7/pex-2.1.46-py2.py3-none-any.whl",
      "zip_relpath": ".deps/pex-2.1.46-py2.py3-none-any.whl"
    }
  ]
}
```
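
As a sketch of how such a manifest could drive runtime unpacking: the code below reads `PEX-SPREAD-INFO` and pairs each zip with its cache directory. It assumes `unpack_relpath` is relative to the PEX_ROOT, it does not interpret `strip_zip_relpath` since its semantics are not spelled out here, and the `plan_spreads` name is hypothetical.

```python
import json
import os


def plan_spreads(spread_pex, pex_root):
    """Yield (zip_path, unpack_dir, already_cached) for each manifest entry."""
    with open(os.path.join(spread_pex, "PEX-SPREAD-INFO")) as fp:
        manifest = json.load(fp)
    for spread in manifest["spreads"]:
        zip_path = os.path.join(spread_pex, spread["zip_relpath"])
        # Assumption: unpack_relpath is resolved against the PEX_ROOT.
        unpack_dir = os.path.join(pex_root, spread["unpack_relpath"])
        # An existing directory means an earlier PEX already unpacked this zip.
        yield zip_path, unpack_dir, os.path.isdir(unpack_dir)
```

Because `unpack_relpath` embeds a hash, two PEXes that share a wheel or a Pex version would resolve to the same cache directory and only the first would need to unpack it.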

The layout adds new PEX_ROOT caches for the `.bootstrap` zip and
installed wheel chroot zips so that neither bootstraps (which are
tied to a version of Pex) nor installed wheel chroot zips (which are
constant for a given distribution version) are created more than once.

Closes pex-tool#1424
@jsirois jsirois self-assigned this Aug 31, 2021
@jsirois
Member Author

jsirois commented Aug 31, 2021

The optimization was trivial so I went ahead and added it for both .bootstrap and installed wheel chroot zips.

jsirois added a commit that referenced this issue Sep 2, 2021