TorchSnapshot (Beta Release)

A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind.

Install

Requires Python >= 3.8 and PyTorch >= 2.0.0

From pip or conda:

# Stable
pip install torchsnapshot
# Or, using conda
conda install -c conda-forge torchsnapshot

# Nightly
pip install --pre torchsnapshot-nightly

From source:

git clone https://github.com/pytorch/torchsnapshot
cd torchsnapshot
pip install -r requirements.txt
python setup.py install

Why TorchSnapshot

Performance

TorchSnapshot provides a fast checkpointing implementation that employs several optimizations, including zero-copy serialization for most tensor types, overlapped device-to-host copy and storage I/O, and parallelized storage I/O.
TorchSnapshot greatly speeds up checkpointing for DistributedDataParallel workloads by distributing the write load across all ranks (benchmark).
When host memory is abundant, TorchSnapshot allows training to resume before all storage I/O completes, reducing the time blocked by checkpoint saving.
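
The last point can be illustrated with a small, library-independent sketch. It is not TorchSnapshot's implementation; it only shows the general pattern of staging state in host memory and letting a background thread finish the storage write while the caller continues. The names `stage_and_write_async` and `training_state` are hypothetical, and JSON stands in for a real tensor format.

```python
import copy
import json
import os
import tempfile
import threading


def stage_and_write_async(state: dict, path: str) -> threading.Thread:
    """Copy `state` into host memory, then write it to `path` in the background.

    The caller is blocked only for the in-memory copy; the storage I/O
    runs on a separate thread.
    """
    staged = copy.deepcopy(state)  # needs spare host memory for the copy

    def _write() -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(staged, f)
        os.replace(tmp_path, path)  # publish only once fully written

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


training_state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

pending = stage_and_write_async(training_state, checkpoint_path)

# "Training" resumes immediately and mutates the live state.
training_state["step"] += 1
training_state["weights"][0] = 9.9

pending.join()  # wait for the write before relying on the file

with open(checkpoint_path) as f:
    restored = json.load(f)
```

Because the write operates on the staged copy, the file reflects the state at the moment the checkpoint was requested, not the later mutations. The cost is the extra host memory held by the copy, which is why this mode suits hosts with memory to spare.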

Memory Usage

TorchSnapshot's memory usage adapts to the host's available resources, greatly reducing the chance of out-of-memory issues when saving and loading checkpoints.
TorchSnapshot supports efficient random access to individual objects within a snapshot, even when the snapshot is stored in cloud object storage.
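
One common way to get random access of this kind is to keep a manifest that maps each logical path to a byte range, so a reader fetches only the bytes it needs. The sketch below assumes that layout purely for illustration; the file names, the manifest format, and the functions `write_snapshot` and `read_entry` are hypothetical and do not describe TorchSnapshot's on-disk format.

```python
import json
import os
import tempfile
from typing import Dict


def write_snapshot(objects: Dict[str, bytes], directory: str) -> None:
    """Write all payloads into one data file plus a manifest of byte ranges."""
    manifest = {}
    offset = 0
    with open(os.path.join(directory, "data.bin"), "wb") as data:
        for logical_path, payload in objects.items():
            data.write(payload)
            manifest[logical_path] = {"offset": offset, "length": len(payload)}
            offset += len(payload)
    with open(os.path.join(directory, "manifest.json"), "w") as f:
        json.dump(manifest, f)


def read_entry(directory: str, logical_path: str) -> bytes:
    """Read one object by seeking to its byte range; other payloads are skipped."""
    with open(os.path.join(directory, "manifest.json")) as f:
        entry = json.load(f)[logical_path]
    with open(os.path.join(directory, "data.bin"), "rb") as data:
        data.seek(entry["offset"])
        return data.read(entry["length"])


snapshot_dir = tempfile.mkdtemp()
write_snapshot(
    {
        "model/weight": b"\x01\x02\x03\x04",
        "model/bias": b"\x05\x06",
        "optimizer/lr": b"\x07",
    },
    snapshot_dir,
)
bias = read_entry(snapshot_dir, "model/bias")
```

On object storage the `seek` plus `read` pair would typically become a ranged read request, so the amount of data transferred scales with the object requested rather than with the whole snapshot.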

Usability

Simple APIs that are consistent between distributed and non-distributed workloads.
Out-of-the-box integration with commonly used cloud object storage systems.
Automatic resharding (elasticity) on world size change for supported workloads (more details).
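
To make the resharding idea concrete, here is a minimal sketch of what restoring under a different world size involves: each rank works out which global range it now owns and collects the overlapping pieces from the saved shards. This is a plain-Python illustration over a list, not TorchSnapshot's algorithm, and the helpers `shard_bounds`, `save_shards`, and `load_shard` are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def shard_bounds(total: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) slice of a length-`total` sequence owned by `rank`."""
    base, extra = divmod(total, world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def save_shards(values: List[Any], world_size: int) -> List[Dict[str, Any]]:
    """Simulate each rank saving its shard together with its global offset."""
    shards = []
    for rank in range(world_size):
        start, end = shard_bounds(len(values), world_size, rank)
        shards.append({"offset": start, "data": values[start:end]})
    return shards


def load_shard(
    shards: List[Dict[str, Any]], total: int, world_size: int, rank: int
) -> List[Any]:
    """Rebuild the shard for `rank` under a new world size from saved shards."""
    start, end = shard_bounds(total, world_size, rank)
    out: List[Any] = []
    for shard in shards:  # shards are ordered by offset
        s_start = shard["offset"]
        s_end = s_start + len(shard["data"])
        lo, hi = max(start, s_start), min(end, s_end)
        if lo < hi:  # saved shard overlaps the range this rank now owns
            out.extend(shard["data"][lo - s_start : hi - s_start])
    return out


values = list(range(10))
saved = save_shards(values, world_size=4)  # checkpoint written by 4 ranks
resharded = [load_shard(saved, len(values), world_size=2, rank=r) for r in range(2)]
```

Storing the global offset with every shard is what makes the restore independent of the world size used at save time.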

Security

Secure tensor serialization without a pickle dependency [WIP].

Getting Started

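import torch
from torchsnapshot import Snapshot

# Placeholder model and optimizer; substitute your own
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Taking a snapshot
app_state = {"model": model, "optimizer": optimizer}
snapshot = Snapshot.take(path="/path/to/snapshot", app_state=app_state)

# Restoring from a snapshot
snapshot.restore(app_state=app_state)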

See the documentation for more details.
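
The `app_state` in the example holds a model and an optimizer, both of which expose `state_dict()` and `load_state_dict()`. The sketch below assumes that any object offering that pair of methods can be placed in `app_state`; that assumption and the `Progress` class are illustrative and not stated in this README, so check the documentation before relying on it.

```python
from typing import Any, Dict


class Progress:
    """Hypothetical training-progress tracker with a state_dict-style interface."""

    def __init__(self) -> None:
        self.epoch = 0
        self.step = 0

    def state_dict(self) -> Dict[str, Any]:
        return {"epoch": self.epoch, "step": self.step}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.epoch = state_dict["epoch"]
        self.step = state_dict["step"]


progress = Progress()
progress.epoch, progress.step = 3, 1200

# Round trip without any checkpointing library, just to show the contract.
saved = progress.state_dict()
resumed = Progress()
resumed.load_state_dict(saved)

# Assumed usage alongside the model and optimizer from the example above:
# app_state = {"model": model, "optimizer": optimizer, "progress": progress}
```

Keeping progress counters in the same snapshot as the model and optimizer means a resumed run picks up all three from one consistent point.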

License

torchsnapshot is BSD licensed, as found in the LICENSE file.
