
Why not data_parallel? #34

Closed
LMescheder opened this issue Oct 20, 2017 · 4 comments

@LMescheder

I wonder why you implemented multi-GPU training using a custom event loop instead of torch.nn.DataParallel. I suppose it is for performance reasons?
If so, what is the main bottleneck in data_parallel that prevents you from using it? Do you have an estimate of how large the speedup is compared to the (simpler) DataParallel solution?

@myleott

myleott commented Oct 20, 2017

Yes, it's for performance reasons.

DataParallel relies on Python threading, which is slow due to the GIL [1][2]. When we tried nn.DataParallel initially, we saw a negative speedup with multiple GPUs (e.g., training on one GPU was faster than training on four).
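For reference, the threading-based approach being compared against looks roughly like the sketch below (a minimal example with a placeholder model and batch, not anything from fairseq):

```python
import torch
import torch.nn as nn

# Placeholder model; nn.DataParallel replicates it across visible GPUs and
# splits each input batch along dim 0, using one Python thread per replica.
model = nn.Linear(128, 128)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(256, 128).cuda()  # batch is scattered across GPUs
out = model(x)                    # outputs are gathered back on the default GPU
```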

The custom event loop in fairseq-py uses multiprocessing (i.e., one Process per GPU), which gets around the GIL and gives much better multi-GPU performance. We typically see ~5.5-6x speedup with 8 GPUs.
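Schematically, the one-process-per-GPU pattern looks something like this (a rough sketch, not the actual fairseq event loop; the model, data, and gradient-averaging step are placeholders for illustration):

```python
import torch
import torch.multiprocessing as mp


def train_worker(gpu_id, num_steps):
    # Each process owns exactly one GPU, so there is no GIL contention
    # between devices.
    torch.cuda.set_device(gpu_id)
    model = torch.nn.Linear(128, 128).cuda(gpu_id)            # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(num_steps):
        x = torch.randn(32, 128, device="cuda:%d" % gpu_id)   # placeholder batch
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        # A real trainer would average gradients across processes here
        # (e.g., via an all-reduce) before stepping the optimizer.
        optimizer.step()


if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=train_worker, args=(gpu, 100))
             for gpu in range(torch.cuda.device_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```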

[1] OpenNMT/OpenNMT-py#89 (comment)
[2] pytorch/pytorch#54

@LMescheder
Author

Okay, I see. Thanks for the prompt reply. Have you tried DistributedDataParallel, which shouldn't have issues with the GIL? Amazing work, by the way!

@myleott

myleott commented Oct 24, 2017

I haven't tried DistributedDataParallel yet, but it looks promising. I'll look into it when I get some time. This discussion also seems relevant (albeit a little discouraging):

@myleott myleott closed this as completed Oct 24, 2017
@jekbradbury

It's definitely worked for our use cases, including speech and MT. I think it's ultimately very similar to the implementation you built into fairseq, except that the user must explicitly launch N copies of the script, and each copy should have its own data loader or data loader shard.
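A rough sketch of that pattern, assuming the script is launched once per GPU with RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT set in the environment (the model and dataset here are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# One copy of this script runs per GPU; each joins the process group
# and reads its own shard of the data via DistributedSampler.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(rank)

model = torch.nn.Linear(128, 128).cuda(rank)                # placeholder model
model = DistributedDataParallel(model, device_ids=[rank])   # grads synced via all-reduce

dataset = TensorDataset(torch.randn(1024, 128))             # placeholder data
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for (x,) in loader:
    loss = model(x.cuda(rank)).sum()
    optimizer.zero_grad()
    loss.backward()        # DDP all-reduces gradients across processes during backward
    optimizer.step()
```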
