
Bad performance on training #25

Open
xhzhao opened this issue Jan 17, 2017 · 6 comments

Comments

@xhzhao

xhzhao commented Jan 17, 2017

When I train this model (split=1) on a GPU (M40), the training speed is so slow that I can hardly wait for the result. The speed is about 3789 sec per 600 iterations, and the batch size is 20, so one epoch (121,512 images) is about 6,075 iterations. The total time to train the full 250 epochs would be about 3,121 hours.
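For reference, that estimate is just arithmetic on the numbers above; here is a minimal sketch of it (the exact total depends on the measured per-iteration time, which drifts slightly over the log below):

-- rough training-time estimate from the figures quoted above
local sec_per_iter    = 3789 / 600      -- ~6.3 s per iteration
local iters_per_epoch = 121512 / 20     -- ~6076 iterations at batch size 20
local total_hours     = sec_per_iter * iters_per_epoch * 250 / 3600
print(string.format('~%.0f hours for 250 epochs', total_hours))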

Is this a normal training speed?

The training log is as follows :
iter 600: 2.853175, 3.237587, 0.000398, 3986.911091
iter 1200: 2.764053, 2.890393, 0.000397, 7775.439858
iter 1800: 4.232092, 2.867312, 0.000395, 11630.438926
iter 2400: 3.718163, 2.826459, 0.000394, 15476.820226
iter 3000: 2.618628, 2.725287, 0.000392, 19317.520915
iter 3600: 1.765032, 2.695489, 0.000391, 23153.146176
iter 4200: 2.188749, 2.651189, 0.000389, 26987.516512
iter 4800: 2.323560, 2.658700, 0.000388, 30819.787844
iter 5400: 2.473269, 2.561281, 0.000386, 34647.776586
iter 6000: 1.226942, 2.620815, 0.000385, 38466.522832

When I train the model from this GitHub, the training speed is very fast and I could train the model to the target accuracy in 2 hours.

@zhwis

zhwis commented Mar 3, 2017

Hello, when I run train.lua I get the following error:

HDF5-DIAG: Error detected in HDF5 (1.8.16) thread 140697829553984:
#000: ../../../src/H5Dio.c line 161 in H5Dread(): selection+offset not within extent
major: Dataspace
minor: Out of range
/home/caffe/torch/install/bin/luajit: /home/caffe/torch/install/share/lua/5.1/hdf5/dataset.lua:160: HDF5DataSet:partial() - failed reading data from [HDF5DataSet (83886080 /images_train DATASET)]
stack traceback:
[C]: in function 'assert'
/home/caffe/torch/install/share/lua/5.1/hdf5/dataset.lua:160: in function 'partial'
./misc/DataLoaderDisk.lua:123: in function 'getBatch'
train.lua:245: in function 'lossFun'
train.lua:310: in main chunk
[C]: in function 'dofile'
...affe/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d50

I don't know where the problem is. I use the dataset provided by the author. I hope to get your help.

@jiasenlu
Owner

Hi @xhzhao, I wonder why you don't use a larger batch size? I think I used 300 as the batch size.

@jiasenlu
Owner

Hi @zhihuiwisdom, I think this means there is some problem with the data preprocessing. Could you tell me how you generated the HDF5 data?
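The "selection+offset not within extent" error from H5Dread means the slice that DataLoaderDisk.lua requests extends past the dataset actually stored in the file. A quick, hedged way to check the generated file outside of train.lua is to open it and read a single entry with the same partial() call the loader uses; the file path and the dimension ranges below are assumptions, so adjust them to whatever shape your preprocessing script saved:

require 'hdf5'
-- path is an assumption; point it at the .h5 file your prepro script produced
local f = hdf5.open('data_img_train.h5', 'r')
-- ranges assume features stored as N x 14 x 14 x 512; adjust to your layout
local one = f:read('/images_train'):partial({1, 1}, {1, 14}, {1, 14}, {1, 512})
print(one:size())
f:close()

If this small read already fails with the same error, the HDF5 file itself is incomplete or has a different shape than the loader expects.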

@zhwis

zhwis commented Mar 30, 2017

@jiasenlu Thank you very much for your reply. You are right, I didn't get the HDF5 data because my computer ran out of memory during the data processing. Could you share the image feature data (.h5) with me? Or do you have any suggestions? Thank you.

@panfengli

panfengli commented Apr 9, 2017

@zhihuiwisdom
I made a small change, as follows:

-- extract the 14 x 14 x 512 conv feature maps for the training images
local ndims = 4096
local batch_size = opt.batch_size
local sz = #img_list_train
local feat_train = torch.FloatTensor(sz, 14, 14, 512) --ndims)
print(string.format('processing %d images...', sz))
for i = 1, sz, batch_size do
    xlua.progress(i, sz)
    local r = math.min(sz, i + batch_size - 1)
    local ims = torch.CudaTensor(r - i + 1, 3, 448, 448)
    for j = 1, r - i + 1 do
        ims[j] = loadim(img_list_train[i + j - 1]):cuda()
    end
    net:forward(ims)
    -- net.modules[37] outputs batch x 512 x 14 x 14; permute to batch x 14 x 14 x 512
    feat_train[{{i, r}, {}}] = net.modules[37].output:permute(1, 3, 4, 2):contiguous():float()
    collectgarbage()
end

local train_h5_file = hdf5.open(opt.out_name_train, 'w')
train_h5_file:write('/images_train', feat_train)
train_h5_file:close()
feat_train = nil

-- same procedure for the test images
local ndims = 4096
local batch_size = opt.batch_size
local sz = #img_list_test
local feat_test = torch.FloatTensor(sz, 14, 14, 512)
print(string.format('processing %d images...', sz))
for i = 1, sz, batch_size do
    xlua.progress(i, sz)
    local r = math.min(sz, i + batch_size - 1)
    local ims = torch.CudaTensor(r - i + 1, 3, 448, 448)
    for j = 1, r - i + 1 do
        ims[j] = loadim(img_list_test[i + j - 1]):cuda()
    end
    net:forward(ims)
    feat_test[{{i, r}, {}}] = net.modules[37].output:permute(1, 3, 4, 2):contiguous():float()
    collectgarbage()
end

local test_h5_file = hdf5.open(opt.out_name_test, 'w')
test_h5_file:write('/images_test', feat_test)
test_h5_file:close()
feat_test = nil

For me, it takes about 40 GB of CPU memory for the new VQA v1.9 dataset. I think you could try saving the data into the HDF5 file iteratively, rather than all at once.
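A minimal sketch of that "save iteratively" idea, reusing the names from the snippet above (net, loadim, img_list_train, opt, sz, batch_size); the chunk size and the per-chunk dataset naming are my own assumptions, and since torch-hdf5's write() stores one whole tensor per dataset, each chunk goes to its own dataset, so DataLoaderDisk.lua would need a matching change to read this layout:

require 'hdf5'

local chunk_size = 5000                     -- images kept in CPU memory at a time (assumption)
local train_h5_file = hdf5.open(opt.out_name_train, 'w')
for c = 1, sz, chunk_size do
    local last = math.min(sz, c + chunk_size - 1)
    local feat_chunk = torch.FloatTensor(last - c + 1, 14, 14, 512)
    for i = c, last, batch_size do
        local r = math.min(last, i + batch_size - 1)
        local ims = torch.CudaTensor(r - i + 1, 3, 448, 448)
        for j = 1, r - i + 1 do
            ims[j] = loadim(img_list_train[i + j - 1]):cuda()
        end
        net:forward(ims)
        feat_chunk[{{i - c + 1, r - c + 1}, {}}] = net.modules[37].output:permute(1, 3, 4, 2):contiguous():float()
        collectgarbage()
    end
    -- one dataset per chunk, e.g. /images_train/chunk_1, /images_train/chunk_5001, ...
    train_h5_file:write(string.format('/images_train/chunk_%d', c), feat_chunk)
end
train_h5_file:close()

This keeps the peak CPU memory at one chunk of features instead of the whole split.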

@badripatro

badripatro commented Apr 9, 2017 via email
