ImageDataLayer deadlock problem #357

Closed
smuelpeng opened this issue Jun 18, 2017 · 11 comments
@smuelpeng

Hello, I have used NVIDIA Caffe for months, and this week I cloned the newest version, caffe-0.16.
I found that the ImageData layer in NVIDIA Caffe does not work, while the Data layer with LMDB works as usual (0.5 times faster on our K40 servers, which is very cool).
I tried the ImageData layer on a simple 5-class classification task with ResNet-18, using a prototxt like this:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  transform_param {
    scale: 1
    crop_size: 224
  }
  image_data_param {
    source: "data/age_5/train.txt"
    batch_size: 64
    mirror: true
    shuffle: true
    new_height: 250
    new_width: 250
    root_folder: "/home/yuzhipeng/data/"
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 3
    kernel_size: 7
    stride: 2
    weight_filler {
      type: "xavier"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
·····
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "classifier"
  bottom: "label"
  top: "loss"
}

After the first batch, training stalls at:
4510 I0618 23:22:02.069660 28888 image_data_layer.cpp:79] output data size: 32, 3,224, 224
I then debugged base_data_layer.cpp by adding the following:
[image]
I found that "qid:0 queryid:0" was printed only once.
So I suspect that the ImageDataLayer thread is being reused more than once through improper use of shared_ptr, or that a lock is acquired and never released, or something else.
I hope you can look into this issue or point out any mistake I have made.
Thank you!

@drnikolaev

Hi @smuelpeng, thanks!
Seems like a bug; we are looking into it.

@drnikolaev drnikolaev added the bug label Jun 19, 2017
@mathmanu

mathmanu commented Jun 20, 2017

I faced a similar issue when creating a custom layer. It finally worked when I overrode InternalThreadEntryN, commented out the block under
if (this->auto_mode_)
in that override, and set threads and parser_threads to 1.

template<typename Ftype, typename Btype>
void ImageLabelListDataLayer<Ftype, Btype>::InternalThreadEntryN(size_t thread_id) {
  .
  .
  .
  // Comment this out to avoid the hang.
  //if (this->auto_mode_) {
  //  break;
  //}  // manual otherwise, thus keep rolling

  iter0 = false;
  .
  .
  .
}

@drnikolaev

@smuelpeng @mathmanu This was reproduced and fixed. Pre-release: https://github.com/drnikolaev/caffe/tree/caffe-0.16
Please verify. Thank you.

@mathmanu

@drnikolaev I can see that you have added a fix in DataLayer. But the layer that I am using is similar to ImageDataLayer: it derives directly from BasePrefetchingDataLayer.

So the fix that you applied doesn't help me; it still hangs.

Shouldn't the fix be applied to BasePrefetchingDataLayer, since both DataLayer and ImageDataLayer are derived from this class?

@drnikolaev

@mathmanu The hang I reproduced was fixed by this commit:

drnikolaev@232d38b#diff-8c16c57bbe3538ff698add14cb67a7dd

It seems there is something else going on. May I ask you for a self-contained sample?

@mathmanu

scripts_debug_nvcaffe_issue357_v1.zip

The attachment is a self-contained example.

  1. Unzip the attachment.
  2. Copy image_label_list_data_layer.cpp and image_label_list_data_layer.hpp into the respective folders in your Caffe tree.
  3. Build.
  4. Change directory into the scripts_debug folder (of the unzipped attachment).
  5. Run ./run_all.sh.
    You can see that it hangs.
  6. To remove the hang, open image_label_list_data_layer.cpp and find the line
    #if 1 //set this to 0 to remove the hang
    Set it to 0 to disable those three lines of code, then rebuild.
    It should now run fine.

@mathmanu

mathmanu commented Aug 28, 2017

Please correct the path to your caffe.bin in the script
scripts_debug/training/cityscapes5_jsegnet21v2_2017-08-28_14-33-49/initial/run.sh
before running.

@drnikolaev

Well, the code given is not complete:

/home/snikolaev/CODE/caffe/drnikolaev/caffe/src/caffe/layers/image_label_list_data_layer.cpp:72:13: error: ‘Slice’ in namespace ‘caffe’ does not name a type
       const caffe::Slice &label_slice, Dtype *slice_data) {
etc. etc.

The whole concept of auto_mode_ doesn't make any sense in file-based storage. This is why in my recent fix I added virtual bool auto_mode();. Please give it a try, i.e. replace the line

if (this->auto_mode_) {

with the line

if (this->auto_mode()) {
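
For readers following along, here is a minimal sketch of why routing the check through a virtual function helps. It is not NVIDIA Caffe's actual code: the class names PrefetchingLayerSketch and FileListLayerSketch and the method RunPrefetchLoop are hypothetical, and the loop is reduced to the single auto-mode check quoted in this thread. It also assumes that in auto mode the prefetch thread is meant to stop after its first batch and wait to be restarted.

```cpp
#include <cstddef>

// Hypothetical stand-in for a prefetching data layer (not the real Caffe class).
class PrefetchingLayerSketch {
 public:
  explicit PrefetchingLayerSketch(bool auto_mode) : auto_mode_(auto_mode) {}
  virtual ~PrefetchingLayerSketch() = default;

  // Virtual hook: a subclass can opt out of auto mode whatever the flag says.
  virtual bool auto_mode() const { return auto_mode_; }

  // Simplified prefetch loop. Returns how many batches were produced
  // before the loop exited.
  std::size_t RunPrefetchLoop(std::size_t max_batches) const {
    std::size_t produced = 0;
    while (produced < max_batches) {
      ++produced;  // pretend one batch was loaded and queued
      if (auto_mode()) {
        break;  // assumed behaviour: stop after the first batch
      }
      // manual mode: keep rolling
    }
    return produced;
  }

 private:
  bool auto_mode_;
};

// Hypothetical file-backed layer: the flag is set, but the override wins.
class FileListLayerSketch : public PrefetchingLayerSketch {
 public:
  FileListLayerSketch() : PrefetchingLayerSketch(true) {}
  bool auto_mode() const override { return false; }
};
```

With the flag set, the base class leaves the loop after one batch, so a consumer waiting for a second batch would never get one unless something restarts the thread. The subclass that overrides auto_mode() to return false keeps producing. Checking the member auto_mode_ directly, as in the original line, bypasses the override entirely.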

@mathmanu

Sorry - my mistake. The following definition is required in caffe.proto to make the example compile.
message Slice {
  repeated uint32 dim = 1;
  repeated uint32 stride = 2;
  repeated uint32 offset = 3;
}

@mathmanu

mathmanu commented Aug 29, 2017

I looked at how you did it in image_data_layer.hpp and I added the same definition for auto_mode in my image_label_list_data_layer.hpp header file as well:
bool auto_mode() const override {
  return false;
}

Now I can remove the overriding function InternalThreadEntryN() from my image_label_list_data_layer class. (That function was added only to work around this auto_mode issue by overriding the base class implementation.)

And it works! No hang.

Thanks for the suggestion!

@drnikolaev

Excellent!
