
Multi-GPU training #311

Open
ghost opened this issue Jul 16, 2020 · 21 comments

@ghost

ghost commented Jul 16, 2020

I have trained SBERT model from scratch using the code https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_nli.py and https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_stsbenchmark_continue_training.py on a single GPU.

Now, I would like to train the model from scratch using two GPUs. I'm not sure what changes I have to make in the above code so that I can train the model using two GPUs.

@nreimers

@nreimers
Member

Hi @kalyanks0611

I did some preliminary experiments with wrapping the model in DataParallel and training on two GPUs.

However, the speed was worse compared to training on a single GPU. So I didn't follow up on this.

If someone gets this working (+ speedup compared to training on one GPU), I would be happy if the code could be shared here.

@ghost
Author

ghost commented Jul 17, 2020

In general, when a model is trained using multiple GPUs, training should be much faster. Any thoughts on why the speed was worse compared to training on a single GPU?
@nreimers

@nreimers
Member

Hi @kalyanks0611
A challenge with multi-GPU training is the communication overhead between the GPUs. Sending data from one GPU to the other is often quite slow, and after each gradient step the gradients have to be synced between the GPUs. This drastically decreases performance.

At least in 2017, Pytorch DataParallel was not really efficient:
facebookresearch/fairseq#34

I don't know if this has improved since then. As mentioned, on the servers I tested, I saw a significant speed drop. Maybe this has changed with more recent versions of Pytorch / Transformers.
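For anyone who wants to reproduce that kind of experiment, here is a minimal DataParallel sketch. The toy encoder, loss, and dummy batch are placeholders, not the actual sentence-transformers training loop:

import torch
import torch.nn as nn

# Toy stand-in for a sentence encoder; the real SBERT model is of course much larger.
encoder = nn.Sequential(nn.Linear(768, 768), nn.Tanh(), nn.Linear(768, 256))
encoder = nn.DataParallel(encoder).to("cuda")   # replicate the module on all visible GPUs
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

# Dummy batch; in practice this would come from a DataLoader of sentence features.
features = torch.randn(64, 768, device="cuda")
targets = torch.randn(64, 256, device="cuda")

for step in range(10):
    embeddings = encoder(features)       # forward pass is scattered over the GPUs,
                                         # outputs are gathered back on GPU 0
    loss = loss_fn(embeddings, targets)  # loss and backward start on GPU 0, so gradients
    loss.backward()                      # must be synced back to it on every step
    optimizer.step()
    optimizer.zero_grad()

The per-step scatter/gather and gradient sync is exactly the overhead described above, which is why small per-GPU batches often end up slower than single-GPU training.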

@zhangdan8962

What about using DistributedDataParallel?

@nreimers
Member

DistributedDataParallel is meant for multiple servers. I haven't tested it, but there the communication overhead is even larger.

@zhangdan8962

In fact, DDP can also be used on one machine, and as stated in the following tutorial, DDP is faster than DataParallel even on a single node.
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
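For reference, a minimal single-node DDP sketch in the spirit of that tutorial (toy model and dummy data only, launched with torchrun --nproc_per_node=<num_gpus> script.py):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a sentence encoder, one full copy per process/GPU.
    model = nn.Sequential(nn.Linear(768, 768), nn.Tanh(), nn.Linear(768, 256)).cuda()
    model = DDP(model, device_ids=[local_rank])
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Dummy batch; in practice each rank would get its own shard via a DistributedSampler.
    features = torch.randn(32, 768, device=local_rank)
    targets = torch.randn(32, 256, device=local_rank)

    for step in range(10):
        loss = loss_fn(model(features), targets)
        loss.backward()        # gradient all-reduce overlaps with the backward pass,
        optimizer.step()       # unlike DataParallel's gather-on-GPU-0 approach
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Because each GPU runs its own process and gradients are all-reduced during backward, DDP avoids the single-GPU bottleneck that DataParallel has.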

@nreimers
Member

Hi @zhangdan8962
That is interesting. I will have a look.

@ghost
Author

ghost commented Jul 17, 2020

To overcome this issue with DataParallel, there is a PyTorch package called PyTorch-Encoding.

from parallel import DataParallelModel, DataParallelCriterion

parallel_model = DataParallelModel(model)             # Encapsulate the model
parallel_loss  = DataParallelCriterion(loss_function) # Encapsulate the loss function

predictions = parallel_model(inputs)      # Parallel forward pass
                                          # "predictions" is a tuple of n_gpu tensors
loss = parallel_loss(predictions, labels) # Compute loss function in parallel
loss.backward()                           # Backward pass
optimizer.step()                          # Optimizer step
predictions = parallel_model(inputs)      # Parallel forward pass with new parameters

(this code taken from https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255 )

@nreimers

@liuyukid

A simple implementation: https://github.com/liuyukid/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py
I don't know if the speed can be improved, but it at least supports a larger batch_size.
You can try it!

@genaunit

Hi, has anyone had success with parallelizing SentenceTransformer training across multiple GPUs using the PyTorch-Encoding approach that @kalyanks0611 brought up two comments above?

@ajmcgrail

Hey, +1ing the above comment. Any update on multi-GPU training?

@genaunit

Hey @challos, I was able to make it work using a pretty ancient version of sentence-transformers (0.38, because I had to). I think that if you can use the up-to-date version, it has some native multi-GPU support. If not, I found this article from one of the Hugging Face folks instrumental. He refers to a piece of code from zhanghang1989 (on GitHub), which I was able to use almost verbatim (I think there was a small bug in it for my use case, but it is mostly usable as is; if you see a crash you'll know how to fix it):

https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

Work through the explanation in that article; it is somewhat dense but useful in the end, and the code does just that.

@prvnktech

Do we have any update on Multi GPU Training?

@shoegazerstella

Any update on this? Thanks.

@liqi6811

A simple implementation: https://github.com/liuyukid/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py I don't know if the speed can be improved, but it at least supports a larger batch_size. You can try it!

I tried this code to train on 1 worker with 4 GPUs; it is not faster, about the same speed as 1 worker with 1 GPU. Does anybody have good ideas?

@sangyongjia

Cannot find a solution.

@dkchhetri

https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

Got the same result here with 4 GPUs: no acceleration (only the batch size increased by 4x).

@zhanxlin

Hi @kalyanks0611

I did some preliminary experiments with wrapping the model in DataParallel and training on two GPUs.

However, the speed was worse compared to training on a single GPU. So I didn't follow up on this.

If someone gets this working (+ speedup compared to training on one GPU), I would be happy if the code could be shared here.

Hi, will you implement multi-GPU code? With the improvement of computing resources, people are no longer satisfied with using 2 GPUs and want to use more.

@tomaarsen
Collaborator

tomaarsen commented May 10, 2024

Hello @zhanxlin,

Multi-GPU support is being introduced in the upcoming v3.0 release of Sentence Transformers (planned in a few weeks). See v3.0-pre-release for the code, in case you already want to play around with it. I think the following should work:

pip install git+https://github.com/UKPLab/sentence-transformers@v3.0-pre-release

There are some details in #2449 about how training will change and how to use multi-GPU training. But to give you a sneak peek of the latter:

  • Data Parallelism is automatically applied if you use multiple GPUs
  • Distributed Data Parallelism is automatically applied if you run the training script with torchrun or accelerate instead of python.

As you can imagine, this results in very notable training speedups.
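To make that concrete, here is a rough sketch of what a v3-style training script might look like; the class names, dataset, and columns below are my reading of the pre-release and #2449, so treat them as assumptions until v3.0 is out:

# train_stsb.py -- sketch based on the assumed v3 Trainer API; verify names against the v3 docs.
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("distilbert-base-uncased")
# Assumed dataset with columns (sentence1, sentence2, score).
train_dataset = load_dataset("sentence-transformers/stsb", split="train")
loss = CosineSimilarityLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="output/multi-gpu-stsb",
    num_train_epochs=1,
    per_device_train_batch_size=32,  # per GPU; effective batch size scales with the number of GPUs
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()

# Single process (DataParallel if several GPUs are visible):  python train_stsb.py
# DDP on e.g. 4 GPUs:  torchrun --nproc_per_node=4 train_stsb.py
# or:                  accelerate launch train_stsb.py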

  • Tom Aarsen

@bely66

bely66 commented May 27, 2024

Hi @tomaarsen
Any idea what the exact release date is?

@tomaarsen
Collaborator

Hello @bely66,

I'm preparing for the release to be this week. I can't promise an exact date as there might be some unexpected issues.

  • Tom Aarsen
