This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

Issue about multiple GPUs #1064

Closed
korabelnikov opened this issue Sep 14, 2016 · 8 comments

Comments

@korabelnikov

Training doesn't work with multiple GPUs.

  1. Create an MNIST dataset and a LeNet model.
  2. Select several GPUs.
  3. The graphs stay blank instead of showing figures.

It works fine with a single GPU. Tested with both the Caffe and Torch engines.

image

@lukeyeager
Member

So it just hangs? Do you get any error messages in the Caffe/Torch logs?

Which GPUs do you have? You can use digits/device_query.py.
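
If `digits/device_query.py` is not at hand, a rough alternative is to ask `nvidia-smi` directly. The sketch below assumes `nvidia-smi` is on the PATH; the helper names `parse_gpu_list` and `list_gpus` are hypothetical and not part of DIGITS.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse 'index, name' CSV rows as printed by nvidia-smi into (int, str) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split only on the first comma so a name containing commas stays intact.
        index, name = line.split(",", 1)
        gpus.append((int(index), name.strip()))
    return gpus


def list_gpus():
    """Run nvidia-smi and return the visible GPUs (requires an NVIDIA driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        universal_newlines=True,
    )
    return parse_gpu_list(out)
```

Calling `list_gpus()` inside the container where the training job runs shows which devices that environment can actually see.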

@korabelnikov
Author

korabelnikov commented Sep 14, 2016

@lukeyeager Yes. I left it running for a few hours and got this:
image
I have 4 K80 GPUs.

torch log:

tput: No value for $TERM and no -T specified
2016-09-14 15:22:50 [INFO ] Loading mean tensor from /usr/share/digits/digits/jobs/20160914-152248-263f/mean.jpg file
2016-09-14 15:22:50 [INFO ] Loading label definitions from /usr/share/digits/digits/jobs/20160914-150454-2a0a/labels.txt file
2016-09-14 15:22:50 [INFO ] found 10 categories
2016-09-14 15:22:50 [INFO ] creating data readers
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] found 45002 images in train db/usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] found 14998 images in train db/usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:51 [INFO ] Loading network definition from /usr/share/digits/digits/jobs/20160914-152248-263f/model
Using CuDNN backend
2016-09-14 15:22:51 [INFO ] Train batch size is 64 and validation batch size is 32
2016-09-14 15:22:51 [INFO ] Network definition:
DataParallelTable: 2 x nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> output]
(1): nn.MulConstant
(2): cudnn.SpatialConvolution(1 -> 20, 5x5)
(3): cudnn.SpatialMaxPooling(2x2, 2,2)
(4): cudnn.SpatialConvolution(20 -> 50, 5x5)
(5): cudnn.SpatialMaxPooling(2x2, 2,2)
(6): nn.View(-1)
(7): nn.Linear(800 -> 500)
(8): cudnn.ReLU
(9): nn.Linear(500 -> 10)
(10): nn.LogSoftMax
}
2016-09-14 15:22:51 [INFO ] Network definition ends
2016-09-14 15:22:51 [INFO ] switching to CUDA
2016-09-14 15:22:52 [INFO ] initializing the parameters for learning rate policy: step
2016-09-14 15:22:52 [INFO ] initializing the parameters for Optimizer
2016-09-14 15:22:52 [INFO ] During training. details will be logged after every 5000 images
2016-09-14 15:22:52 [INFO ] Training epochs to be completed for each validation : 1
2016-09-14 15:22:52 [INFO ] Training epochs to be completed before taking a snapshot : 1
2016-09-14 15:22:52 [INFO ] While logging, epoch value will be rounded to 3 significant digits
2016-09-14 15:22:52 [INFO ] started training the model
2016-09-14 15:23:39 [INFO ] Validation (epoch 0): loss = -14.095491760067, accuracy = 0.11988265102014
2016-09-14 15:23:39 [INFO ] Training (epoch 0.001): loss = 1.1556785106659, lr = 0.01
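
The last two log lines carry the only metrics reported before the job stalls. A minimal sketch for pulling them out of a log like the one above follows; the function names are hypothetical, and the line format is taken only from this log. Flagging a negative loss assumes the criterion is a negative log-likelihood on the `nn.LogSoftMax` output, which the log itself does not show.

```python
import re

# Matches lines such as:
# 2016-09-14 15:23:39 [INFO ] Validation (epoch 0): loss = -14.09, accuracy = 0.119
METRIC_LINE = re.compile(
    r"\[INFO\s*\]\s+(Training|Validation) \(epoch ([\d.]+)\): (.*)$"
)


def parse_metrics(log_text):
    """Return a list of (phase, epoch, {metric: value}) tuples from the log text."""
    rows = []
    for line in log_text.splitlines():
        match = METRIC_LINE.search(line)
        if not match:
            continue
        phase, epoch, rest = match.groups()
        metrics = {}
        for pair in rest.split(","):
            key, _, value = pair.partition("=")
            if value.strip():
                metrics[key.strip()] = float(value)
        rows.append((phase, float(epoch), metrics))
    return rows


def negative_loss_rows(rows):
    """Return the rows whose loss is below zero (unexpected for an NLL criterion)."""
    return [row for row in rows if row[2].get("loss", 0.0) < 0]
```

Run against the log above, `negative_loss_rows` would single out the validation line, whose loss is about -14.1.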

@lukeyeager
Member

Looks like you're using the deb package to install, right? Can you send me the output of this command:

$ dpkg -l | grep 'cudart\|libcudnn\|libnccl\|caffe\|torch\|digits'

@korabelnikov
Author

korabelnikov commented Sep 15, 2016

@lukeyeager I'm using the nvidia-docker DIGITS image.

root@15966d15b7e8:/usr/share/digits# dpkg -l | grep 'cudart\|libcudnn\|libnccl\|caffe\|torch\|digits'
ii  caffe-nv                           0.15.9-1+cuda7.5                        amd64        Fast open framework for Deep Learning
ii  caffe-nv-tools                     0.15.9-1+cuda7.5                        amd64        Fast open framework for Deep Learning (Tools)
ii  cuda-cudart-7-5                    7.5-18                                  amd64        CUDA Runtime native Libraries
ii  digits                             4.0.0-1                                 amd64        NVIDIA DIGITS webserver
ii  libcaffe-nv0                       0.15.9-1+cuda7.5                        amd64        Fast open framework for Deep Learning (Libs)
ii  libcudnn5                          5.1.3-1+cuda7.5                         amd64        cuDNN runtime libraries
ii  libnccl1                           1.2.3-1+cuda7.5                         amd64        NVIDIA Collectives Communication Library (NCCL) Runtime
ii  python-caffe-nv                    0.15.9-1+cuda7.5                        amd64        Fast open framework for Deep Learning (Python)
ii  torch7-nv                          0.9.99-1+cuda7.5                        amd64        NVidia Torch Bundle (with CUDA). Made for DIGITS.
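
To compare a listing like this at a glance, the rows can be reduced to a package-to-version map. This is only a sketch: the function names are hypothetical, and it assumes the usual `dpkg -l` column order (status, name, version, architecture, description).

```python
def parse_dpkg_list(text):
    """Map package name -> version from `dpkg -l` rows (lines starting with 'ii')."""
    versions = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: status, name, version, architecture, description...
        if len(fields) >= 3 and fields[0] == "ii":
            versions[fields[1]] = fields[2]
    return versions


def cuda_suffixes(versions):
    """Collect the '+cudaX.Y' version suffixes so mismatched CUDA builds stand out."""
    return {v.split("+", 1)[1] for v in versions.values() if "+" in v}
```

For the listing above, every package that carries a CUDA suffix reports `cuda7.5`, so `cuda_suffixes` would return a single entry.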

@lukeyeager
Member

Software looks fine. I bet it's a GPU and/or system problem.

Do you have a particularly fancy motherboard? See NVIDIA/caffe#10 - that might be related.

@korabelnikov
Author

@lukeyeager Thanks, I will try it.

@korabelnikov
Author

@mpkh, please take a look.

@korabelnikov
Author

NVIDIA/caffe#10 solved the issue.
