This repository has been archived by the owner on Oct 30, 2019. It is now read-only.

Multi-GPU slow down. #61

Open
shuni1001 opened this issue May 29, 2016 · 21 comments

Comments

@shuni1001

shuni1001 commented May 29, 2016

[screenshot: training log]

I'm trying to train the network on an 8-GPU machine, but I've run into a weird situation.

In the first epoch the computation time is very steady (about 0.5 seconds per batch) and the GPU utilization (from nvidia-smi) is steady as well: all 8 GPUs sit at 100%.

But after some epochs (one or two epochs are fine, then it happens) the computation time starts varying with huge gaps and the GPU utilization varies too, even though the data-loading time stays near zero (about 0.001 seconds).

When this slowdown happens, restarting the main script (main.lua) is the only way I've found to fix it. Interestingly, after a restart everything goes back to normal.

It never happens in the middle of an epoch; the slowdown appears at the transition between training epochs.

Could anybody give me a hint for resolving this problem?
I'm currently using NCCL and cuDNN v4.
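
For reference, a query like the following logs per-GPU utilization, temperature and throttle reasons once per second (assuming the installed driver/nvidia-smi supports these query fields):

# per-GPU utilization, temperature and active throttle reasons, refreshed every second
nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu,clocks_throttle_reasons.active --format=csv -l 1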

@colesbury
Contributor

That's quite strange. Could the GPUs be getting throttled due to high temperatures?

@shuni1001
Author

Thank you for the reply.
The GPU temperatures look acceptable to me: I'm using 8 Titan X cards and each one stays under 75 °C, so I don't think they are being throttled.

When the slowdown happens, the GPUs show very different utilizations (for example, GPU1 is above 90% while the others are below 20%); this never happens in the first epoch.
So I'm guessing that from the second epoch on the GPUs are no longer well synchronized, and that causes the degradation.

Is there any way to check GPU synchronization?
I attached the GPU topology of my system.
[image: GPU topology matrix]
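
For reference, a topology matrix like this can be dumped with the command below (assuming a driver recent enough to support the topo subcommand); it also shows which GPU pairs share a PCIe switch or CPU socket:

# print the GPU interconnect matrix
nvidia-smi topo -m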

@colesbury
Contributor

Strange. It sounds like it might be an NCCL or DataParallelTable issue. Does it always happen on the second epoch? What if you decrease the size of an epoch?

@shuni1001
Author

I read another closed issue (#62).
To check for a DataParallelTable issue, I ran the test_DataParallelTable.lua script and got the (error) result below.

[screenshot: test_DataParallelTable.lua error output]

To resolve these errors, I reinstalled the cunn package and ran test_DataParallelTable.lua again.
[screenshot: test output after reinstalling cunn]

The nogradInput error is gone, but the 'profile DataParallelTable' error still occurs.
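
For reference, reinstalling and re-running the test is roughly the following (the test script path assumes a local clone of torch/cunn; exact commands may differ):

# reinstall cunn (or run `luarocks make` from a local clone of torch/cunn)
luarocks install cunn
# re-run the DataParallelTable test from the cunn source tree
th test/test_DataParallelTable.lua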

@colesbury
Contributor

I think that's just a bug in the test code.

@shuni1001
Author

shuni1001 commented Jun 1, 2016

After reinstalling, I tested the training code on a smaller dataset.

[screenshot: training log on the smaller dataset]

Some slowdowns still appear, but training now recovers a few (about 20) minibatches later, unlike training on the full dataset before the cunn update, where it never seemed to recover.
It also no longer happens specifically at the second epoch; it just appears at some epoch, at some minibatch.
And once it has appeared in an epoch, it happens in every epoch after that.

I'll test on the full dataset to confirm.

@shuni1001
Author

On the full ImageNet dataset the slowdown still occurs.
This time it didn't happen at the second epoch, but it again started right at the beginning of an epoch.

I added a sys.sleep(100) after saving the model, and the slowdown hasn't appeared so far (about 7 epochs).
I hope sys.sleep solves the problem, but I can't be sure yet.

@shuni1001
Author

It happened again after 20 epochs.

I don't know why it happens.
I can only say that it starts at the beginning of some epoch.
When it happens, the data-loading time increases a little: normally it stays steady below 0.005 seconds, but during a slowdown it fluctuates randomly (the largest value was 4 seconds), and even when the data time is near zero the computation time is doubled or tripled.

@xternalz

xternalz commented Jun 23, 2016

I'm facing this problem too, and I found that all the overhead time is spent copying data to the GPUs (Trainer:copyInputs). It usually happens after the first epoch.

Here's the list of things I tried, all to no avail (roughly the variations sketched after the list):

  • not saving checkpoints/models
  • disabling NCCL
  • using a smaller number of GPUs (instead of all 8)
  • turning shareGradInput off
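
The reduced-GPU / shareGradInput variations look roughly like this (flag names follow the upstream fb.resnet.torch README; the depth, batch size and data path here are placeholders, and disabling NCCL is done in the DataParallelTable setup in models/init.lua rather than via a flag, as far as I can tell):

# fewer GPUs and shareGradInput off, other options unchanged
th main.lua -depth 50 -batchSize 128 -nGPU 4 -nThreads 8 -shareGradInput false -data /path/to/imagenet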

@karandwivedi42

@shuni1001 @xternalz did any of you solve this?

@shuni1001
Author

No, the same problem happens in every experiment.
It appears more frequently the more GPUs I use, so I'm currently trying to use fewer GPUs.

@colesbury
Contributor

Can you please report:

  • The GPU model
  • The operating system and version you're using
  • The CUDA and cuDNN versions you're using

I haven't seen this myself and don't know what's causing it, but maybe we can find something common in the setups.

@shuni1001
Author

The configuration of my machine is:

  • GPU model: Titan X Founder's Edition * 8
  • System: Gigabyte G250
  • Operating system: Ubuntu 14.04.4 LTS
  • Kernel: 4.2.0-36-generic
  • Driver: 361.42
  • cuDNN: v4
  • CUDA version: 7.5.17

@alexkongy

I am having exactly the same problem when training from scratch. It happens in the first epoch: data loading sometimes takes up to 4.5 seconds, though it usually takes 0.001 seconds. My configuration is:

  • Ubuntu 14.04
  • cuDNN v5
  • CUDA 7.5
  • GPU: GeForce Titan X * 2
  • Driver 352.63

@szagoruyko
Contributor

@alexkongy you can try clearing page memory periodically https://gist.github.com/szagoruyko/4cbbba5ac8b53b0fe32f43e7b3d0cda6
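
(For reference, in case the link goes stale: dropping the Linux page cache is usually done with something along these lines, run as root; the exact command and schedule used in the gist may differ.)

# flush dirty pages, then drop clean page-cache pages
sync && echo 1 > /proc/sys/vm/drop_caches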

@alexkongy

@szagoruyko Thanks. I will try it.

@bearpaw

bearpaw commented Oct 25, 2016

Same problem here, and it's not a memory-capacity problem (I have 512 GB of memory).

[screenshot: training log with unstable data-loading times]

The data-loading time becomes very unstable after some iterations. Even increasing nThreads does not help.

@Coldmooon

Coldmooon commented Feb 28, 2017

I'm facing the same problem when training ResNet-50 on ImageNet (stored on a 512 GB SSD). I found that the file-loading speed is very low, only 7-8 MB/s (and sometimes 2 MB/s), which is far below what ImageNet training needs. Changing batchSize and nThreads doesn't help.
[screenshot: observed disk read speed]

But if I train the same model with Caffe reading from an LMDB database, the loading speed reaches 40 MB/s.
I'm not sure whether the problem is that reading lots of small files degrades disk performance.
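
(One way to check whether the disk itself is the bottleneck, rather than the data loader, is to watch per-device read throughput and utilization while training runs; this assumes the sysstat package is installed.)

# per-device read throughput in MB/s and %util, refreshed every second
iostat -xm 1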

@pedropgusmao

@szagoruyko's solution worked for me. I would recommend putting some flashing lights, whistles and flags around this answer or at least adding it to the README file. It is a life saver.

@dex1990

dex1990 commented Oct 22, 2017

Hi @szagoruyko, it seems that your solution link (https://gist.github.com/szagoruyko/4cbbba5ac8b53b0fe32f43e7b3d0cda6) is dead. Could you please post it again? Thanks!

@Coldmooon

Coldmooon commented Dec 26, 2017

After nearly a year of struggling with this problem, I have finally solved it on my PC and on our lab's server.
The solution given by @szagoruyko is right, but simply putting it in crontab is not enough; it also depends on some other factors. After lots of attempts, I found the following points to be very important.

  1. An SSD is crucial.

I compared training on an HDD with training on an SSD and found that on the HDD, even when clearing the page cache periodically, the slowdown appears shortly after training starts. In other words, the cache-clearing command alone doesn't help in that case.

Then I moved the training data onto an SSD without changing anything else. As expected, the entire training run was very smooth, with no lag at all.

  2. Keep the page cache clear before training starts.
    It seems that once you hit this problem, the slowdown stays for the rest of the training run, even if you start clearing the page cache periodically at that point. So starting from a clear page cache and a clean environment is important.

  3. Clear the page cache periodically.
    If the training process is healthy and smooth, you will find that the cached memory grows linearly.

free -mh -s 1

Use the above command to monitor the cached memory; clearing it periodically helps.
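
As a sketch, the periodic clearing can go in root's crontab; the interval here is a guess and should be tuned to how fast the cached memory grows in free:

# crontab -e (as root): drop the page cache every 30 minutes
*/30 * * * * sync && echo 1 > /proc/sys/vm/drop_caches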
