
[QUESTION] Splitting big models over multiple GPUs #207

Open
zouharvi opened this issue Mar 5, 2024 · 6 comments
Labels: question (Further information is requested)

Comments

@zouharvi (Contributor) commented Mar 5, 2024

When specifying the number of GPUs during inference, is that only for data parallelism, or is the model loaded piece-wise across multiple GPUs when it is larger than a single GPU's memory? For example, I'd like to use XCOMET-XXL and our cluster has many 12 GB GPUs.

At first I thought that the model's parts would be spread across all GPUs, e.g.:

comet-score -s data/xcomet_ennl.src -t data/xcomet_ennl_T1.tgt --gpus 5 --model "Unbabel/XCOMET-XL"

However, I'm getting a GPU OOM error on the first GPU:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 10.75 GiB of which 11.62 MiB is free. ...
  1. Is it correct that, in the above setting, the full model is being loaded 5 times, once on each of the 5 GPUs?
  2. Is there a way to split the model across multiple GPUs?

Thank you!

  • unbabel-comet 2.2.1
  • pytorch-lightning 2.2.0.post0
  • torch 2.2.1
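
(For context, the CLI call above is roughly equivalent to the Python API usage below. This is a minimal sketch, with the data dicts standing in for the contents of the .src/.tgt files; the gpus argument is handed to the underlying Lightning trainer as a device count, so, as the replies below confirm, it controls data parallelism rather than sharding the model.)

# Minimal sketch of the Python-API equivalent of the comet-score call above.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/XCOMET-XL")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Source sentence 1", "mt": "Vertaalde zin 1"},
    {"src": "Source sentence 2", "mt": "Vertaalde zin 2"},
]

# gpus=5 spawns 5 workers, each holding a full copy of the model
# (DDP-style data parallelism); it does not shard the weights.
output = model.predict(data, batch_size=8, gpus=5)
print(output.system_score)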
zouharvi added the question (Further information is requested) label on Mar 5, 2024
@zwhe99 commented Mar 14, 2024

same question here

@ricardorei (Collaborator)

Last time I checked, this was not very easy to do with pytorch-lightning.

We actually used a custom-made FSDP implementation to train these larger models (without using pytorch-lightning). I have to double-check whether the newer versions support FSDP better than the currently used pytorch-lightning version (2.2.0.post0).

But the short answer is: model parallelism is not something we support in the current codebase.
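
(For reference, a bare-bones illustration of the FSDP approach mentioned above. This is only a minimal sketch, not Unbabel's internal implementation; the encoder name and the training loop are placeholders.)

# Sketch: sharding a large encoder across GPUs with plain PyTorch FSDP,
# outside pytorch-lightning. Launch with e.g.:
#   torchrun --nproc_per_node=5 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModel

def main():
    dist.init_process_group("nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Each rank builds the encoder on CPU; FSDP then shards the parameters,
    # so no single 12 GB card has to hold the whole checkpoint at once.
    encoder = AutoModel.from_pretrained("facebook/xlm-roberta-xl")
    encoder = FSDP(encoder, device_id=local_rank)

    # ... training / scoring loop would go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()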

@vince62s

An idea here: CTranslate2 just integrated tensor parallelism. It also supports XLM-RoBERTa, so I'm wondering if we could adapt the converter a bit so that we could run the model within CT2, which is very fast.
How different is it from XLM-RoBERTa at inference?
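
(Roughly what that could look like, as a speculative sketch only: it assumes the CTranslate2 transformers converter accepts the XLM-R checkpoint and that the Encoder class takes the same tensor_parallel flag documented for Translator/Generator; neither is confirmed in this thread, and the COMET regression head would still need to be handled separately.)

# Speculative sketch of the CTranslate2 route. Conversion (shell), assuming
# the converter handles the XLM-R checkpoint:
#   ct2-transformers-converter --model xlm-roberta-large --output_dir ct2_xlmr
# Then run under MPI so each rank holds a shard of the weights, e.g.:
#   mpirun -np 2 python encode_ct2.py
import ctranslate2
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-large")
# tensor_parallel is assumed to be accepted by Encoder, as it is for
# Translator/Generator -- unverified.
encoder = ctranslate2.Encoder("ct2_xlmr", device="cuda", tensor_parallel=True)

tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world"))
output = encoder.forward_batch([tokens])
print(output.last_hidden_state.shape)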

@ricardorei (Collaborator)

Does it support XLM-R XL? The architecture also differs from XLM-R.

@ricardorei (Collaborator)

It seems like they have actually improved the documentation a lot: https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html
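
(For the record, the Lightning-side usage from those docs boils down to something like the sketch below; this assumes a recent lightning 2.x install and is not wired into the COMET codebase.)

# Sketch: enabling FSDP through pytorch-lightning, per the linked docs.
import lightning.pytorch as pl
from lightning.pytorch.strategies import FSDPStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=5,
    strategy=FSDPStrategy(),   # shards parameters/gradients across the 5 devices
    precision="16-mixed",
)
# trainer.fit(model) or trainer.predict(model, dataloaders=...) as usual,
# where `model` is a LightningModule wrapping the COMET checkpoint.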

@vince62s

> Does it support XLM-R XL? The architecture also differs from XLM-R.

We can adapt it if we have a detailed description somewhere.
cc @minhthuc2502
