caffe-0.16 converges slower and produces lower accuracy (compared to caffe-0.15) #347
Comments
The same trend on my side.
Hi @mathmanu @ChenFengAndy
Mine is ResNet-50, no Python layers.
@ChenFengAndy do you observe the issue using a multi-GPU setup? If so, do you use NVLink or straight PCIe?
Yes, NVLink.
I don't use NVLink, only PCIe with two GTX 1080 cards. I had this observation on image classification and segmentation networks. When I saw the problem I was curious whether it's related to multi-GPU, so I ran training with a single GPU. If I recall correctly, the trend was similar there as well, but I am not completely sure now. @ChenFengAndy, can you start the training with one GPU and see if the trend is similar with that?
@mathmanu @ChenFengAndy thank you. I'll need some time to verify this. So far, a quick AlexNet + ImageNet + cuDNN v6 + DGX-1 comparison between 0.15 and 0.16 shows that 0.16 trains it almost twice as fast. We also observe a performance boost on other nets.
Maybe there is a miscommunication. I was talking about the loss and accuracy, not about speed.
@mathmanu yeah, thanks for pointing this out! We actually have some accuracy and determinism improvements in the pipeline; you can give it a try here: https://github.com/drnikolaev/caffe/tree/caffe-0.16
Thanks. I am working on it.
I have attached training logs that explain this issue. Following are the results. ImageNet classification, top-1 accuracy:
Conclusion: caffe-0.16 achieves lower classification accuracy. Cityscapes segmentation, pixel accuracy trend after 2000 iterations:
I also have (but have not attached) the full training logs for some (but not all) of the above segmentation scenarios, which show a lower final accuracy in caffe-0.16. Conclusion: the training loss drops very slowly in caffe-0.16 and the final segmentation accuracy achieved is also lower. (For segmentation, I used a custom ImageLabelData layer. It was especially needed in caffe-0.15, which did not have a fixed random seed for the DataLayer; source code for the new layer is also included in the attached zip file.) Let me know if you need any other information. By the way, thank you for all the great work that you are doing; I get about a 25% speedup when using caffe-0.16.
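For reference, here is a minimal sketch of how a fixed seed is usually pinned in the solver so that shuffling and augmentation become repeatable across runs; the file paths and hyperparameter values below are illustrative placeholders, not taken from the attached logs:

```
# Hypothetical solver.prototxt fragment; paths and values are placeholders.
# random_seed is a SolverParameter field that seeds Caffe's RNG so repeated
# runs draw the same random numbers. Layers that keep their own unseeded RNG
# (as older DataLayers did) are not covered by it, hence the custom layer.
net: "models/segmentation/train_val.prototxt"
random_seed: 1337
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
lr_policy: "step"
stepsize: 20000
max_iter: 60000
solver_mode: GPU
```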
Hi @mathmanu, thank you very much for the detailed report. You are right, accuracy comes first and we do test it. It seems we missed something here. Marked as a bug, work in progress...
Thanks. Kindly review my ImageLabelData layer as well and let me know if I missed anything.
I just noticed that the BatchNorm parameters that I used for the logs I shared are not correct for caffe-0.16 (which needs slightly different parameters). I will correct these and give it a run, but training takes too much time for me since I have just two GTX 1080s; if you could try it on your DGX-1 after correcting the BN params, that would be great. I have noticed the issue even when I use the correct BN parameters.
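To make the "slightly different parameters" concrete, here is a hedged sketch of the two BatchNorm idioms; layer names and values are illustrative, and the 0.16 fields reflect my reading of the NVCaffe proto rather than the attached prototxt files:

```
# caffe-0.15 / BVLC idiom: BatchNorm statistics plus a separate Scale layer
# that provides the learnable scale and bias.
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
}
layer {
  name: "scale1"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }
}

# caffe-0.16 idiom (assumed): scale and bias fused into the BatchNorm layer
# itself via batch_norm_param, so the separate Scale layer is dropped.
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param {
    scale_bias: true             # fused learnable scale/bias (NVCaffe field)
    moving_average_fraction: 0.99
    eps: 1e-5
  }
}
```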
Is this similar: #276 (comment)?
Hold on, I will update the results with corrected params tomorrow.
I have re-run the simulations after correcting the params for the new BN. The issue is very much there and the conclusions remain unchanged.
ImageNet classification, top-1 accuracy:
Cityscapes segmentation, pixel accuracy trend after 2000 iterations:
The logs are in the train.log files in the following attachment:
Looking forward to a solution. Thanks.
@mathmanu @ChenFengAndy - we have reproduced and fixed the issue. Thanks again for reporting it. We are working on a new release now, but if you want early access to the fix, please clone https://github.com/drnikolaev/caffe/tree/caffe-0.16 - it's still under construction, but it does produce the same accuracy as 0.15 (at least on those nets we have tested so far), like this one:
Great! I'll wait for the release.
As far as I understand from the fix (in BN), it only changes the output of test/validation. So if I run a test with my previous model (trained in caffe-0.16, which had this bug) using the bug-fixed version, I should get the expected correct accuracy - is that right?
No. The bug was in the code where the local learning rate was set for scale and bias in the BN layers. You have to retrain the model.
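For context, "local learning rate" here refers to the per-blob lr_mult/decay_mult multipliers. The sketch below shows where they would sit on a fused BatchNorm layer; it illustrates the mechanism the fix touches, not the fix itself, and the blob ordering (three statistics blobs, then scale, then bias) is an assumption:

```
# Illustrative only: per-blob local learning rates for a fused BatchNorm.
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  # statistics blobs: updated by running averages, never by the solver
  param { lr_mult: 0 decay_mult: 0 }   # mean
  param { lr_mult: 0 decay_mult: 0 }   # variance
  param { lr_mult: 0 decay_mult: 0 }   # scale factor
  # learnable affine blobs: the ones whose local LR the 0.16 bug affected
  param { lr_mult: 1 decay_mult: 0 }   # scale (gamma)
  param { lr_mult: 1 decay_mult: 0 }   # bias (beta)
  batch_norm_param { scale_bias: true }
}
```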
Thank you. I hope the cuDNN BN will get integrated into BVLC/caffe soon.
The loss comes down more slowly and the final accuracy is also lower. Has anyone else observed a similar issue? A friend of mine had another observation: the tendency of the loss to explode to NaN is higher in caffe-0.16.
The same issue exists even if I don't use cuDNN. What could be the reason?
Thanks for your help.
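As a side note, layers that expose an engine field can be pinned to the native implementation directly in the prototxt, which is one way to compare the cuDNN and non-cuDNN paths for a single layer without rebuilding; the fragment below is hypothetical, with placeholder names and sizes:

```
# Hypothetical fragment: force the built-in Caffe kernel for one layer so
# its behaviour can be compared against the cuDNN path.
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 64
    kernel_size: 3
    pad: 1
    engine: CAFFE    # CAFFE = native kernel, CUDNN = cuDNN, DEFAULT = auto
  }
}
```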