Multi-GPU slowdown #61
That's quite strange. Could the GPUs be getting throttled due to high temperatures?
Strange. It sounds like it might be an NCCL or DataParallelTable issue. Does it always happen on the second epoch? What if you decrease the size of an epoch?
I read another closed issue (#62). To resolve those errors, I reinstalled the cunn package and ran test_DataParallelTable.lua again. The nogradInput error disappeared, but the 'profile DataParallelTable' error still occurs.
I think that's just a bug in the test code.
After reinstalling, I tested the training code on a smaller dataset. Some slowdowns still appeared, but it recovered a few (about 20) minibatches later, unlike training on the full dataset before updating cunn (back then it seemed to never recover). I'll test on the full dataset to confirm.
On the full ImageNet dataset there are still slowdowns. I added sys.sleep(100) after saving the model, and the slowdown has not appeared so far (about 7 epochs).
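As a hedged sketch of where such a pause would sit (placeholder names and a dummy checkpoint, not the actual main.lua code):

```lua
-- Minimal sketch, not the repo's main.lua: pause right after the per-epoch
-- checkpoint is written, before the next epoch starts hitting the data loader.
require 'torch'
require 'sys'

local model = torch.randn(1000, 1000)                 -- stand-in for a real model

for epoch = 1, 3 do
   -- ... train and test for one epoch ...
   torch.save(('model_%d.t7'):format(epoch), model)   -- checkpointing step
   sys.sleep(100)   -- give disk writes / the page cache time to settle
end
```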
It happened again after 20 epochs. I don't know why it happened.
I'm facing this problem too, and I found that all the overhead time is spent on copying data to the GPUs. Here's the list of things I tried, but to no avail: …
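As a hedged illustration of that observation (not code from the repo), one can time the host-to-GPU copy in isolation; the batch shape below is arbitrary:

```lua
-- Rough sketch: time a host-to-device copy by itself, to see whether the
-- stall is in the copy rather than in the forward/backward pass.
require 'torch'
require 'cutorch'

local cpuBatch = torch.FloatTensor(256, 3, 224, 224):uniform()
local gpuBatch = torch.CudaTensor(256, 3, 224, 224)

local timer = torch.Timer()
gpuBatch:copy(cpuBatch)     -- host -> device transfer
cutorch.synchronize()       -- wait for the (possibly asynchronous) copy
print(('copy time: %.4f s'):format(timer:time().real))
```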
@shuni1001 @xternalz did any of you solve this?
No, the same problem happened in every experiment.
Can you please report your setup details? I haven't seen this myself and don't know what's causing it, but maybe we can find something common in the setups.
The configuration of my machine is: …
I am having exactly the same problem when training from scratch. It happens in the first epoch. Data loading sometimes takes up to 4.5 seconds, but it usually takes about 0.001 seconds. My configuration is:
- Ubuntu 14.04
@alexkongy you can try clearing the page memory periodically: https://gist.github.com/szagoruyko/4cbbba5ac8b53b0fe32f43e7b3d0cda6
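In case the gist becomes unreachable: as a rough guess at the general idea, and not necessarily the gist's actual contents, periodically dropping the Linux page cache looks roughly like this (it must run as root; the interval is arbitrary):

```lua
-- Hedged sketch, not the gist itself: flush dirty pages and ask the kernel
-- to drop its page cache on a fixed interval. Run as root.
require 'sys'

while true do
   os.execute("sync && echo 1 > /proc/sys/vm/drop_caches")
   sys.sleep(600)   -- repeat every 10 minutes (arbitrary interval)
end
```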
@szagoruyko Thanks. I will try it.
@szagoruyko's solution worked for me. I would recommend putting some flashing lights, whistles, and flags around this answer, or at least adding it to the README file. It is a lifesaver.
Hi @szagoruyko, it seems that your solution link (https://gist.github.com/szagoruyko/4cbbba5ac8b53b0fe32f43e7b3d0cda6) has rotted. Could you please post it again? Thanks!
After nearly a year of struggling with this problem, I finally solved it on my PC and on our lab's server.
I compared training on an HDD with training on an SSD and found that when training on the HDD, even with the page memory cleared periodically, the slowdown still appears. Then I switched training onto the SSD without changing any other element or configuration. As expected, the entire training process was very smooth; there was no lag at all in this case.
Use the above command to monitor the …
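As a hedged sketch of one way to check the same thing (whether the input stage is disk-bound): time just the data-loading step of each minibatch and flag the outliers. The loader below is a self-contained stand-in, not the repo's code.

```lua
-- Self-contained sketch: a fake loader stands in for the real data pipeline
-- and occasionally "reads slowly"; the loop flags batches whose data-loading
-- time blows up, which is the signature of a disk-bound (HDD) input stage.
require 'torch'
require 'sys'

local function fakeLoader(nBatches)                    -- placeholder pipeline
   local n = 0
   return function()
      n = n + 1
      if n > nBatches then return nil end
      sys.sleep(math.random() < 0.1 and 2.0 or 0.01)   -- simulate a slow read
      return n, { input = torch.randn(8, 3, 224, 224) }
   end
end

local dataTimer = torch.Timer()
for n, sample in fakeLoader(50) do
   local dataTime = dataTimer:time().real
   if dataTime > 1.0 then
      print(('batch %d: data loading took %.2f s (likely disk-bound)'):format(n, dataTime))
   end
   -- ... forward/backward pass on sample.input would go here ...
   dataTimer:reset()
end
```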
I'm trying to train the network on an 8-GPU machine.
However, I ran into a weird situation.
In the first epoch the computation time is very steady (about 0.5 seconds) and the GPU utilization (from nvidia-smi) is also steady: all 8 GPUs are at 100%.
But after some epochs (sometimes one or two epochs are fine, then it happens) the computation time varies wildly and the GPU utilization varies as well, even though the data time stays near zero (about 0.001 seconds).
When the slowdown happens, restarting the main script (main.lua) is the only way to resolve it. Interestingly, after restarting the script it goes back to normal.
This does not happen in the middle of an epoch; the slowdown appears at the transition between training epochs.
Could anybody give me a hint on how to resolve this problem?
I'm currently using NCCL and cuDNN v4.
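For reference, a small hedged sketch for logging the utilization pattern described above over time, instead of watching nvidia-smi live (the interval and log file name are arbitrary):

```lua
-- Sketch: sample per-GPU utilization every few seconds and append it to a
-- log file, so dips at epoch boundaries show up after the fact.
local cmd = "nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader"
while true do
   os.execute("date >> gpu_util.log")
   os.execute(cmd .. " >> gpu_util.log")
   os.execute("sleep 5")
end
```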