Training time error in multi gpu #444

Closed
legolas123 opened this issue Nov 23, 2017 · 1 comment

legolas123 commented Nov 23, 2017

I installed the latest caffe-0.16 with CUDA 8, cuDNN 7 and nccl-1.3.4-1. The network reads its training data through an ImageData layer defined as follows:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 224
    mean_value: 103.940002441
    mean_value: 116.779998779
    mean_value: 123.680000305
  }
  image_data_param {
    source: "/home/ubuntu/caffe_1/models/project/train.txt"
    batch_size: 12
    shuffle: true
    new_height: 240
    new_width: 240
  }
}
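
For anyone reproducing this setup, it can help to rule out a malformed `source` listing before looking at the multi-GPU code path. The sketch below assumes the usual ImageData listing format of one `<image path> <integer label>` pair per line. The helper name `parse_image_list` and the sample file names are hypothetical, and treating everything before the last whitespace as the path is an assumption, not something confirmed for this Caffe build.

```python
def parse_image_list(text):
    """Split an ImageData-style listing into (path, label) pairs.

    Returns the parsed entries and the 1-based numbers of lines that
    do not look like '<path> <non-negative integer label>'.
    """
    entries, bad_lines = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Assumption: the label is the last whitespace-separated token.
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            entries.append((parts[0], int(parts[1])))
        else:
            bad_lines.append(lineno)
    return entries, bad_lines


# Hypothetical listing contents, for illustration only.
SAMPLE_LIST = "cat/001.jpg 0\ndog/002.jpg 1\nbroken_line\n\nbird/003.jpg two\n"

if __name__ == "__main__":
    entries, bad_lines = parse_image_list(SAMPLE_LIST)
    print(len(entries), "valid entries; malformed lines:", bad_lines)
```

To check a real listing, read the file named in `source` and pass its contents to `parse_image_list`. A clean listing does not explain the device check failure below, but it removes one variable.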

The command I ran is:

./build/tools/caffe train -solver models/seres_inception/solver.prototxt -gpu=0,1

This results in the following error message:

F1123 09:52:47.834031 15880 syncedmem.cpp:178] Check failed: Caffe::current_device() == gpu_device_ (0 vs. 1) 

With the option -gpu=all, training instead hangs at a random point without any error message.
The last lines of output before it hangs are:

I1123 09:54:54.429864 18272 common.cpp:228] New stream 0x7fe958334c20 on device 3, thread 140642958915328
I1123 09:54:54.432126 18048 common.cpp:228] New stream 0x7fea1c33a420 on device 0, thread 140644109477632
I1123 09:54:54.439190 18271 common.cpp:228] New stream 0x7fe954334c20 on device 2, thread 140641080178432
I1123 09:54:54.441017 18273 common.cpp:228] New stream 0x7fe960001060 on device 0, thread 140641071785728
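
When triaging a hang like this, a quick tally of the "New stream" lines per device shows which GPUs had started work before the stall. The following is a minimal sketch: the regular expression is inferred only from the four lines quoted above, the helper names are hypothetical, and the assumption of four GPUs (devices 0 to 3) is a guess based on the device indices in the excerpt.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt above; other builds may log differently.
STREAM_RE = re.compile(r"New stream (0x[0-9a-f]+) on device (\d+), thread (\d+)")


def streams_per_device(log_text):
    """Count 'New stream' log lines for each GPU device index."""
    counts = Counter()
    for line in log_text.splitlines():
        match = STREAM_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


def devices_without_streams(log_text, expected_devices):
    """Return the expected device indices that never appear in the log text."""
    counts = streams_per_device(log_text)
    return [dev for dev in expected_devices if counts[dev] == 0]


SAMPLE_LOG = """\
I1123 09:54:54.429864 18272 common.cpp:228] New stream 0x7fe958334c20 on device 3, thread 140642958915328
I1123 09:54:54.432126 18048 common.cpp:228] New stream 0x7fea1c33a420 on device 0, thread 140644109477632
I1123 09:54:54.439190 18271 common.cpp:228] New stream 0x7fe954334c20 on device 2, thread 140641080178432
I1123 09:54:54.441017 18273 common.cpp:228] New stream 0x7fe960001060 on device 0, thread 140641071785728
"""

if __name__ == "__main__":
    print(dict(streams_per_device(SAMPLE_LOG)))
    # Assumption: the machine has four GPUs, indexed 0 to 3.
    print(devices_without_streams(SAMPLE_LOG, range(4)))
```

On the excerpt above this reports two streams on device 0 and none on device 1. Because the excerpt is only the tail of the log, a missing device here is a hint about where to look in the full log, not proof that the device never started.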

Issue #357 seems to suggest that this type of deadlock has already been fixed in the latest version. Can somebody please help me figure out what the problem could be?

@drnikolaev

Please check v0.16.5 and reopen the issue if the problem still exists.
