ImageDataLayer deadlock problem #357
Hi @smuelpeng thanks!
I faced a similar issue when I was creating a custom layer. It finally worked when I commented out the highlighted portion of the code under template<typename Ftype, typename Btype>
@smuelpeng @mathmanu This was reproduced and fixed. Pre-release: https://github.com/drnikolaev/caffe/tree/caffe-0.16
@drnikolaev I can see that you have added a fix in DataLayer. But the layer that I am using is similar to ImageDataLayer: it derives directly from BasePrefetchingDataLayer. So the fix that you applied doesn't help me; it still hangs. Shouldn't the fix be applied to BasePrefetchingDataLayer, since DataLayer and ImageDataLayer are both derived from this class?
@mathmanu The hang I reproduced was fixed by this commit: drnikolaev@232d38b#diff-8c16c57bbe3538ff698add14cb67a7dd It seems like there is something else. May I ask you to give me a self-contained sample?
scripts_debug_nvcaffe_issue357_v1.zip The attachment is a self-contained example.
Please correct the path to your caffe.bin in the script.
Well, the code given is not complete:
The whole concept of auto_mode_ doesn't make any sense in file-based storage. This is why in my recent fix I added
with the line
Sorry - my mistake. The following definition is required in caffe.proto to make the example compile.
I looked at how you did it in image_data_layer.hpp and added the same definition for auto_mode in my image_label_list_data_layer.hpp header file as well. Now I can remove the overriding function InternalThreadEntryN() from my image_label_list_data_layer class. (That function was added only to take care of this auto_mode issue by overriding the base class implementation.) And it works! No hang. Thanks for the suggestion!
Excellent!
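The fix discussed above, where a file-list layer derived from BasePrefetchingDataLayer declares that it does not use auto mode, can be illustrated with a toy sketch. This is not NVCaffe code: the class names `PrefetchingBaseSketch` and `FileListLayerSketch`, the method `first_pass_action()`, and the assumption that the base class waits for a tuning signal when auto mode is on are all hypothetical stand-ins chosen for illustration.

```cpp
#include <string>

// Hypothetical stand-in for a prefetching base layer. NOT the NVCaffe class.
class PrefetchingBaseSketch {
 public:
  virtual ~PrefetchingBaseSketch() = default;

  // Assumption: the base class defaults to auto mode being enabled.
  virtual bool auto_mode() const { return true; }

  // Describes what a prefetch worker would do on its first pass.
  // Assumption: in auto mode the worker waits for a tuning signal,
  // which a file-based layer would never receive.
  std::string first_pass_action() const {
    return auto_mode() ? "wait for auto-tuning signal"
                       : "load batches immediately";
  }
};

// Hypothetical file-list layer: opts out of auto mode in its header,
// so no override of the thread entry function is needed.
class FileListLayerSketch : public PrefetchingBaseSketch {
 public:
  bool auto_mode() const override { return false; }
};
```

With this shape, the derived layer changes only the one virtual answer and inherits the rest of the base class thread logic unchanged, which is the outcome reported in the thread.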
Hello, I've used nvidia-caffe for months, and I cloned the newest version, caffe-0.16, this week.
However, I found that the ImageData layer in nvidia-caffe does not work, while DataLayer with LMDB works as usual (0.5 times faster on our K40 servers, which is very cool).
When I try ImageDataLayer in a simple 5-class classification task using ResNet-18, with a prototxt like:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  transform_param {
    scale: 1
    crop_size: 224
  }
  image_data_param {
    source: "data/age_5/train.txt"
    batch_size: 64
    mirror: true
    shuffle: true
    new_height: 250
    new_width: 250
    root_folder: "/home/yuzhipeng/data/"
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 3
    kernel_size: 7
    stride: 2
    weight_filler {
      type: "xavier"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
·····
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "classifier"
  bottom: "label"
  top: "loss"
}
I found that after the first batch, training stalls at:
4510 I0618 23:22:02.069660 28888 image_data_layer.cpp:79] output data size: 32, 3,224, 224
I then debugged base_data_layer.cpp by setting:
I found that " qid:0 queryid:0" was printed only once.
So I suspect that the ImageDataLayer thread may be reused more than once through improper use of shared_ptr,
or that a lock is acquired and never released, or something else.
I hope you can look into this issue or point out any mistake I've made.
Thank you!
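One way to confirm a stall like the one described above, where a consumer waits forever for a batch that the prefetch thread never delivers, is to replace an unbounded wait with a timed one and log when it expires. The following is a minimal sketch under that assumption. `TimedBlockingQueue` and `pop_for` are hypothetical names for illustration and are not part of NVCaffe or its BlockingQueue.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical diagnostic queue: like a blocking queue, but the consumer
// gives up after a timeout instead of hanging silently.
template <typename T>
class TimedBlockingQueue {
 public:
  // Producer side: enqueue a value and wake one waiting consumer.
  void push(T value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(value));
    }
    cv_.notify_one();
  }

  // Consumer side: wait up to `timeout` for a value.
  // Returns true and fills *out on success, false if the wait timed out,
  // which suggests the producer thread has stalled.
  bool pop_for(T* out, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (!cv_.wait_for(lock, timeout, [this] { return !queue_.empty(); })) {
      return false;
    }
    *out = std::move(queue_.front());
    queue_.pop();
    return true;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<T> queue_;
};
```

A caller would check the return value of `pop_for` and print the queue id on a timeout, which narrows down whether the producer never ran a second time or the consumer is waiting on the wrong queue.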