This repository has been archived by the owner on Oct 30, 2019. It is now read-only.

Multi-GPU slow down. #61

Open
shuni1001 opened this issue May 29, 2016 · 21 comments

Comments

@shuni1001

shuni1001 commented May 29, 2016

[screenshot: training log]

I'm trying to train the network on an 8-GPU machine, but I've run into a weird situation.

In the first epoch the computation time is very steady (about 0.5 seconds per batch) and the GPU utilization (from nvidia-smi) is steady as well: all 8 GPUs sit at 100%.

But after some epochs (one or two epochs are fine, then it happens) the computation time starts varying with huge gaps and the GPU utilization varies too, even though the data-loading time stays near zero (about 0.001 seconds).

When this slowdown happens, restarting the main script (main.lua) is the only way I've found to fix it. Interestingly, after a restart everything goes back to normal.

It never happens in the middle of an epoch; the slowdown appears at the transition between training epochs.

Could anybody give me a hint for resolving this problem?
I'm currently using NCCL and cuDNN v4.
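
For reference, a query like the following logs per-GPU utilization, temperature and throttle reasons once per second (assuming the installed driver/nvidia-smi supports these query fields):

# per-GPU utilization, temperature and active throttle reasons, refreshed every second
nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu,clocks_throttle_reasons.active --format=csv -l 1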

@colesbury
Contributor

That's quite strange. Could the GPUs be getting throttled due to high temperatures?

@shuni1001
Author

Thank you for the reply.
The GPU temperatures look acceptable to me: I'm using 8 Titan X cards and each one stays under 75 °C, so I don't think they are being throttled.

When the slowdown happens, the GPUs show very different utilizations (for example, GPU1 is above 90% while the others are below 20%); this never happens in the first epoch.
So I'm guessing that from the second epoch on the GPUs are no longer well synchronized, and that causes the degradation.

Is there any way to check GPU synchronization?
I attached the GPU topology of my system.
[image: GPU topology matrix]
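
For reference, a topology matrix like this can be dumped with the command below (assuming a driver recent enough to support the topo subcommand); it also shows which GPU pairs share a PCIe switch or CPU socket:

# print the GPU interconnect matrix
nvidia-smi topo -m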

@colesbury
Contributor

Strange. It sounds like it might be an NCCL or DataParallelTable issue. Does it always happen on the second epoch? What if you decrease the size of an epoch?

@shuni1001
Author

I read another closed issue (#62).
To check for a DataParallelTable issue, I ran the test_DataParallelTable.lua script and got the (error) result below.

[screenshot: test_DataParallelTable.lua error output]

To resolve these errors, I reinstalled the cunn package and ran test_DataParallelTable.lua again.
[screenshot: test output after reinstalling cunn]

The nogradInput error is gone, but the 'profile DataParallelTable' error still occurs.
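
For reference, reinstalling and re-running the test is roughly the following (the test script path assumes a local clone of torch/cunn; exact commands may differ):

# reinstall cunn (or run `luarocks make` from a local clone of torch/cunn)
luarocks install cunn
# re-run the DataParallelTable test from the cunn source tree
th test/test_DataParallelTable.lua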

@colesbury
Contributor

I think that's just a bug in the test code.

@shuni1001
Author

shuni1001 commented Jun 1, 2016

After reinstalling, I tested the training code on a smaller dataset.

[screenshot: training log on the smaller dataset]

Some slowdowns still appear, but training now recovers a few (about 20) minibatches later, unlike training on the full dataset before the cunn update, where it never seemed to recover.
It also no longer happens specifically at the second epoch; it just appears at some epoch, at some minibatch.
And once it has appeared in an epoch, it happens in every epoch after that.

I'll test on the full dataset to confirm.

@shuni1001
Author

On the full ImageNet dataset the slowdown still occurs.
This time it didn't happen at the second epoch, but it again started right at the beginning of an epoch.

I added a sys.sleep(100) after saving the model, and the slowdown hasn't appeared so far (about 7 epochs).
I hope sys.sleep solves the problem, but I can't be sure yet.

@shuni1001
Author

It happened again after 20 epochs.

I don't know why it happens.
I can only say that it starts at the beginning of some epoch.
When it happens, the data-loading time increases a little: normally it stays steady below 0.005 seconds, but during a slowdown it fluctuates randomly (the largest value was 4 seconds), and even when the data time is near zero the computation time is doubled or tripled.

@xternalz

xternalz commented Jun 23, 2016

I'm facing this problem too, and I found that all the overhead time is spent copying data to the GPUs (Trainer:copyInputs). It usually happens after the first epoch.

Here's the list of things I tried, all to no avail (roughly the variations sketched after the list):

  • not saving checkpoints/models
  • disabling NCCL
  • using a smaller number of GPUs (instead of all 8)
  • turning shareGradInput off
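
The reduced-GPU / shareGradInput variations look roughly like this (flag names follow the upstream fb.resnet.torch README; the depth, batch size and data path here are placeholders, and disabling NCCL is done in the DataParallelTable setup in models/init.lua rather than via a flag, as far as I can tell):

# fewer GPUs and shareGradInput off, other options unchanged
th main.lua -depth 50 -batchSize 128 -nGPU 4 -nThreads 8 -shareGradInput false -data /path/to/imagenet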

@karandwivedi42

@shuni1001 @xternalz did any of you solve this?

@shuni1001
Author

No, the same problem happens in every experiment.
It appears more frequently the more GPUs I use, so I'm currently trying to use fewer GPUs.

@colesbury
Contributor

Can you please report:

  • The GPU model
  • The operating system and version you're using
  • The CUDA and cuDNN versions you're using

I haven't seen this myself and don't know what's causing it, but maybe we can find something common in the setups.

@shuni1001
Author

The configuration of my machine is:

  • GPU model: Titan X Founder's Edition * 8
  • System: Gigabyte G250
  • Operating system: Ubuntu 14.04.4 LTS
  • Kernel: 4.2.0-36-generic
  • Driver: 361.42
  • cuDNN: v4
  • CUDA version: 7.5.17

@alexkongy

I am having exactly the same problem when training from scratch. It happens in the first epoch: data loading sometimes takes up to 4.5 seconds, though it usually takes 0.001 seconds. My configuration is:

  • Ubuntu 14.04
  • cuDNN v5
  • CUDA 7.5
  • GPU: GeForce Titan X * 2
  • Driver 352.63

@szagoruyko
Contributor

@alexkongy you can try clearing page memory periodically https://gist.github.com/szagoruyko/4cbbba5ac8b53b0fe32f43e7b3d0cda6
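
(For reference, in case the link goes stale: dropping the Linux page cache is usually done with something along these lines, run as root; the exact command and schedule used in the gist may differ.)

# flush dirty pages, then drop clean page-cache pages
sync && echo 1 > /proc/sys/vm/drop_caches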

@alexkongy

@szagoruyko Thanks. I will try it.

@bearpaw

bearpaw commented Oct 25, 2016

Same problem here, and it's not a memory-capacity problem (I have 512 GB of memory).

[screenshot: training log with unstable data-loading times]

The data-loading time becomes very unstable after some iterations. Even increasing nThreads does not help.

@Coldmooon

Coldmooon commented Feb 28, 2017

I'm facing the same problem when training ResNet-50 on ImageNet (stored on a 512 GB SSD). I found that the file-loading speed is very low, only 7-8 MB/s (and sometimes 2 MB/s), which is far below what ImageNet training needs. Changing batchSize and nThreads doesn't help.
[screenshot: observed disk read speed]

But if I train the same model with Caffe reading from an LMDB database, the loading speed reaches 40 MB/s.
I'm not sure whether the problem is that reading lots of small files degrades disk performance.
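
(One way to check whether the disk itself is the bottleneck, rather than the data loader, is to watch per-device read throughput and utilization while training runs; this assumes the sysstat package is installed.)

# per-device read throughput in MB/s and %util, refreshed every second
iostat -xm 1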

@pedropgusmao

@szagoruyko's solution worked for me. I would recommend putting some flashing lights, whistles and flags around this answer or at least adding it to the README file. It is a life saver.

@dex1990

dex1990 commented Oct 22, 2017

Hi @szagoruyko, it seems that your solution link (https://gist.github.com/szagoruyko/4cbbba5ac8b53b0fe32f43e7b3d0cda6) is dead. Could you please post it again? Thanks!

@Coldmooon

Coldmooon commented Dec 26, 2017

After nearly a year of struggling with this problem, I have finally solved it on my PC and on our lab's server.
The solution given by @szagoruyko is right, but simply putting it in crontab is not enough; it also depends on some other factors. After lots of attempts, I found the following points to be very important.

  1. An SSD is crucial.

I compared training on an HDD with training on an SSD and found that on the HDD, even when clearing the page cache periodically, the slowdown appears shortly after training starts. In other words, the cache-clearing command alone doesn't help in that case.

Then I moved the training data onto an SSD without changing anything else. As expected, the entire training run was very smooth, with no lag at all.

  2. Keep the page cache clear before training starts.
    It seems that once you hit this problem, the slowdown stays for the rest of the training run, even if you start clearing the page cache periodically at that point. So starting from a clear page cache and a clean environment is important.

  3. Clear the page cache periodically.
    If the training process is healthy and smooth, you will find that the cached memory grows linearly.

free -mh -s 1

Use the above command to monitor the cached memory; clearing it periodically helps.
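
As a sketch, the periodic clearing can go in root's crontab; the interval here is a guess and should be tuned to how fast the cached memory grows in free:

# crontab -e (as root): drop the page cache every 30 minutes
*/30 * * * * sync && echo 1 > /proc/sys/vm/drop_caches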
