
Error for creating ImageNet LMDB #1

Closed · wlw208dzy opened this issue Jun 11, 2016 · 3 comments

wlw208dzy commented Jun 11, 2016

Thanks for releasing the training script for Inception-ResNet-v2. When I ran create-imagenet-lmdb.lua, it crashed with the following error:

*** Error in `PATH-OF-TORCH/install/bin/luajit': double free or corruption (!prev): 0x00007e61b30fbe10 ***
Aborted (core dumped)

Have you run into this problem? Thanks!
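For context, the failing step boils down to an lmdb.torch write loop. This is only a minimal sketch following the pattern in the lmdb.torch README; the path, name, and dummy tensors are placeholders, not this repo's actual create-imagenet-lmdb.lua code:

-- Minimal lmdb.torch write loop (sketch only; Path/Name and the random
-- tensors are placeholders for real decoded training images).
require 'lmdb'

local db = lmdb.env{
   Path = './imagenet-train-lmdb',  -- placeholder output directory
   Name = 'imagenet-train-lmdb'
}
db:open()

local txn = db:txn()
for i = 1, 10 do
   -- the real script would put a decoded ImageNet image here
   txn:put(i, torch.rand(3, 224, 224))
end
txn:commit()  -- the crash above occurs while running this kind of loop
db:close()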

lim0606 (Owner) commented Jun 11, 2016

@wlw208dzy

Oh... I used the normal imagenet.lua. The problem might be related to eladhoffer/lmdb.torch#8, but I'm not sure, since I haven't touched that code for a long time.

As far as I remember, Torch's LMDB support was not stable in my case (the process frequently died when I ran create-imagenet-lmdb.lua), so I just bought an SSD and ran the script with the normal imagenet.lua instead.

I made a mistake in not cleaning up the irrelevant files.

I'm sorry for the inconvenience.

Best regards,

Jaehyun

wlw208dzy (Author) commented Jun 11, 2016

@lim0606 Thanks very much for your reply; you are very kind. Strangely, though, we also bought an SSD and ran the script with the normal imagenet.lua, and our training speed is still not stable (mainly because of the data loading time), as shown below:

| Epoch: [1][6252/8008] Time 1.856 Data 1.156 Err 4.5771 top1 86.875 top5 66.250 LR 0.045000
| Epoch: [1][6253/8008] Time 1.696 Data 0.922 Err 4.6205 top1 81.250 top5 68.125 LR 0.045000
| Epoch: [1][6254/8008] Time 6.023 Data 5.354 Err 4.1523 top1 83.125 top5 62.500 LR 0.045000
| Epoch: [1][6255/8008] Time 1.894 Data 1.209 Err 4.1505 top1 83.750 top5 63.125 LR 0.045000
| Epoch: [1][6256/8008] Time 7.237 Data 6.376 Err 4.1333 top1 83.750 top5 61.875 LR 0.045000
| Epoch: [1][6257/8008] Time 2.363 Data 1.634 Err 4.4003 top1 88.125 top5 64.375 LR 0.045000
| Epoch: [1][6258/8008] Time 3.415 Data 2.793 Err 4.2830 top1 80.000 top5 61.875 LR 0.045000
| Epoch: [1][6259/8008] Time 4.514 Data 3.685 Err 4.3248 top1 81.875 top5 66.250 LR 0.045000
| Epoch: [1][6260/8008] Time 1.063 Data 0.353 Err 4.3800 top1 87.500 top5 64.375 LR 0.045000
| Epoch: [1][6261/8008] Time 1.116 Data 0.490 Err 4.3797 top1 85.625 top5 63.750 LR 0.045000

As for the rest of the configuration, 4 Titan X GPUs are used in parallel and the batch size is 160 (4 x 40). I noticed that your training speed is quite stable in your logs, so could you tell me whether you have any other methods to improve the data loading speed? Thanks again!

@lim0606
Copy link
Owner

lim0606 commented Jun 11, 2016

Actually, I've never tried a 4-GPU setting, since I only have 2 GPUs. However, my friend working with me on this project (@shuni1001) suffered from the same problem; see facebookarchive/fb.resnet.torch#61.

I think it is a fairly general problem for settings with more than 2 GPUs in Torch.
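For what it's worth, fb.resnet.torch-style training scripts load data with a pool of worker threads, and raising the thread count (the -nThreads option, which defaults to 2 there) is the usual first thing to try when the Data time fluctuates. Below is only a minimal, hypothetical sketch of that pattern using the torch 'threads' package, not this repo's actual loader:

-- Hypothetical sketch of the fb.resnet.torch-style threaded loader:
-- worker threads prepare batches while the main thread trains.
local Threads = require 'threads'
Threads.serialization('threads.sharedserialize')

local nThreads = 8      -- try raising this from the default of 2 when Data time dominates
local batchSize = 160

local pool = Threads(
   nThreads,
   function()
      require 'torch'   -- each worker initializes its own torch state
   end
)

for i = 1, 100 do       -- 100 dummy iterations for illustration
   pool:addjob(
      function()
         -- worker thread: stands in for reading/decoding a real batch
         return torch.randn(batchSize, 3, 299, 299)
      end,
      function(batch)
         -- main thread: the batch is ready; the training step would go here
      end
   )
end
pool:synchronize()      -- wait for all outstanding jobs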

Best regards,

Jaehyun
