Requesting example to use PyTorch FSDP #19

Open
abdulmuneer opened this issue Apr 22, 2024 · 1 comment
Hi,
Does Determined support the PyTorch FSDP way of distributed training? I can see examples for DeepSpeed, but I have a requirement to specifically use the native FSDP feature of PyTorch 2.2 (something like https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?highlight=pre%20training).

ioga (Contributor) commented Apr 22, 2024

Hello, we haven't added one here yet, but there's an unofficial example here: https://github.com/garrett361/determined/tree/scratchwork/scratchwork/fsdp_min

For context: PyTorchTrial does not support FSDP, and there are no plans to add it. For FSDP you should use the Core API instead; it behaves effectively the same as torch DDP: the standard torch distributed launcher works the same, and metrics logging and hyperparameter search work the same. If you checkpoint the full model from rank=0, that works the same as well. If you want sharded checkpointing, use the shard=True option of checkpoint.store_path.
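To make that concrete, here is a rough, untested sketch of what an FSDP script on the Core API can look like when launched with torchrun. The model, data, loss, and step count are placeholders (not from the linked example); the Determined calls (det.core.init, DistributedContext.from_torch_distributed, report_training_metrics, checkpoint.store_path) are standard Core API, and both checkpointing styles from above are shown:

```python
# Rough sketch: FSDP + Determined Core API, launched with torchrun.
# Placeholder model/data/loss; an outline, not a supported example.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullStateDictConfig, StateDictType
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

import determined as det


def train(core_context: det.core.Context) -> None:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    model = FSDP(nn.Linear(128, 128).cuda())  # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(1, 101):  # placeholder training loop
        batch = torch.randn(32, 128, device="cuda")  # placeholder data
        loss = model(batch).square().mean()  # placeholder loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % 10 == 0 and core_context.distributed.rank == 0:
            # Metrics logging works exactly as it does with DDP.
            core_context.train.report_training_metrics(
                steps_completed=step, metrics={"loss": loss.item()}
            )

    # Option A: consolidate a full state dict and save from rank 0 only.
    full_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, full_cfg):
        state = model.state_dict()
    if core_context.distributed.rank == 0:
        with core_context.checkpoint.store_path({"steps_completed": 100}) as (
            path,
            _,
        ):
            torch.save(state, path / "model.pt")

    # Option B: sharded checkpointing. Every rank saves its own shard and
    # passes shard=True so the per-rank files are merged into one checkpoint.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        shard = model.state_dict()
    with core_context.checkpoint.store_path(
        {"steps_completed": 100}, shard=True
    ) as (path, _):
        torch.save(shard, path / f"rank_{core_context.distributed.rank}.pt")


if __name__ == "__main__":
    dist.init_process_group("nccl")  # torchrun supplies the rendezvous env vars
    distributed = det.core.DistributedContext.from_torch_distributed()
    with det.core.init(distributed=distributed) as core_context:
        train(core_context)
```

In a Determined experiment you'd typically launch this through the torch distributed launch module (determined.launch.torch_distributed) in the entrypoint, same as you would for a DDP script.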
