caffe-0.16 converges slower and produces lower accuracy (compared to caffe-0.15) #347
Comments
The same trend on my side.
Hi @mathmanu @ChenFengAndy
Mine is ResNet-50, no Python layers.
@ChenFengAndy do you observe the issue using a multi-GPU setup? If so, do you use NVLink or straight PCIe?
Yes, NVLink.
I don't use NVLink, only PCIe with two GTX 1080 cards. I had this observation on image classification and segmentation networks. When I saw the problem I was curious whether it's related to multi-GPU, so I ran training with a single GPU. If I recall correctly, the trend was similar there as well, but I am not completely sure now. @ChenFengAndy, can you start the training with one GPU and see if the trend is similar with that?
@mathmanu @ChenFengAndy thank you. I'll need some time to verify this. So far, a quick AlexNet + ImageNet + cuDNN v6 + DGX-1 comparison between 0.15 and 0.16 shows that 0.16 trains it almost twice as fast. We also observe a performance boost on other nets.
Maybe there is a miscommunication. I was talking about the loss and accuracy, not about speed.
@mathmanu yeah, thanks for pointing this out! We actually have some accuracy and determinism improvements in the pipeline; you can give it a try here: https://github.com/drnikolaev/caffe/tree/caffe-0.16
Thanks. I am working on it.
I have attached training logs that explain this issue. Following are the results. ImageNet classification, top-1 accuracy:
Conclusion: caffe-0.16 achieves lower classification accuracy. Cityscapes segmentation, pixel accuracy trend after 2000 iterations:
I also have (but have not attached) the full training logs for some (but not all) of the above segmentation scenarios, which show a lower final accuracy in caffe-0.16. Conclusion: the training loss drops very slowly in caffe-0.16 and the final segmentation accuracy achieved is also lower. (For segmentation, I used a custom ImageLabelData layer. It was especially needed in caffe-0.15, which did not have a fixed random seed for the DataLayer; source code for the new layer is also included in the attached zip file.) Let me know if you need any other information. By the way, thank you for all the great work that you are doing; I get about a 25% speedup when using caffe-0.16.
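For reference, here is a minimal sketch of how a fixed seed is usually pinned in the solver so that shuffling and augmentation become repeatable across runs; the file paths and hyperparameter values below are illustrative placeholders, not taken from the attached logs:

```
# Hypothetical solver.prototxt fragment; paths and values are placeholders.
# random_seed is a SolverParameter field that seeds Caffe's RNG so repeated
# runs draw the same random numbers. Layers that keep their own unseeded RNG
# (as older DataLayers did) are not covered by it, hence the custom layer.
net: "models/segmentation/train_val.prototxt"
random_seed: 1337
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
lr_policy: "step"
stepsize: 20000
max_iter: 60000
solver_mode: GPU
```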
Hi @mathmanu, thank you very much for the detailed report. You are right, accuracy comes first and we do test it. It seems we missed something here. Marked as a bug, work in progress...
Thanks. Kindly review my ImageLabelData layer as well and let me know if I missed anything.
I just noticed that the BatchNorm parameters that I used for the logs I shared are not correct for caffe-0.16 (which needs slightly different parameters). I will correct these and give it a run, but training takes too much time for me since I have just two GTX 1080s; if you could try it on your DGX-1 after correcting the BN params, that would be great. I have noticed the issue even when I use the correct BN parameters.
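To make the "slightly different parameters" concrete, here is a hedged sketch of the two BatchNorm idioms; layer names and values are illustrative, and the 0.16 fields reflect my reading of the NVCaffe proto rather than the attached prototxt files:

```
# caffe-0.15 / BVLC idiom: BatchNorm statistics plus a separate Scale layer
# that provides the learnable scale and bias.
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
}
layer {
  name: "scale1"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }
}

# caffe-0.16 idiom (assumed): scale and bias fused into the BatchNorm layer
# itself via batch_norm_param, so the separate Scale layer is dropped.
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param {
    scale_bias: true             # fused learnable scale/bias (NVCaffe field)
    moving_average_fraction: 0.99
    eps: 1e-5
  }
}
```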
Is this similar: #276 (comment)?
Hold on, I will update the results with corrected params tomorrow.
I have re-run the simulations after correcting the params for the new BN. The issue is very much there and the conclusions remain unchanged.
ImageNet classification, top-1 accuracy:
Cityscapes segmentation, pixel accuracy trend after 2000 iterations:
The logs are in the train.log files in the following attachment:
Looking forward to a solution. Thanks.
@mathmanu @ChenFengAndy - we have reproduced and fixed the issue. Thanks again for reporting it. We are working on a new release now, but if you want early access to the fix, please clone https://github.com/drnikolaev/caffe/tree/caffe-0.16 - it's still under construction, but it does produce the same accuracy as 0.15 (at least on those nets we have tested so far), like this one:
Great! I'll wait for the release.
As far as I understand from the fix (in BN), it only changes the output of test/validation. So if I run a test with my previous model (trained in caffe-0.16, which had this bug) using the bug-fixed version, I should get the expected correct accuracy - is that right?
No. The bug was in the code where the local learning rate was set for scale and bias in the BN layers. You have to retrain the model.
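For context, "local learning rate" here refers to the per-blob lr_mult/decay_mult multipliers. The sketch below shows where they would sit on a fused BatchNorm layer; it illustrates the mechanism the fix touches, not the fix itself, and the blob ordering (three statistics blobs, then scale, then bias) is an assumption:

```
# Illustrative only: per-blob local learning rates for a fused BatchNorm.
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  # statistics blobs: updated by running averages, never by the solver
  param { lr_mult: 0 decay_mult: 0 }   # mean
  param { lr_mult: 0 decay_mult: 0 }   # variance
  param { lr_mult: 0 decay_mult: 0 }   # scale factor
  # learnable affine blobs: the ones whose local LR the 0.16 bug affected
  param { lr_mult: 1 decay_mult: 0 }   # scale (gamma)
  param { lr_mult: 1 decay_mult: 0 }   # bias (beta)
  batch_norm_param { scale_bias: true }
}
```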
Thank you. I hope the cuDNN BN will get integrated into BVLC/caffe soon.
The loss comes down more slowly and the final accuracy is also lower. Has anyone else observed a similar issue? A friend of mine had another observation: the tendency of the loss to explode to NaN is higher in caffe-0.16.
The same issue exists even if I don't use cuDNN. What could be the reason?
Thanks for your help.
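As a side note, layers that expose an engine field can be pinned to the native implementation directly in the prototxt, which is one way to compare the cuDNN and non-cuDNN paths for a single layer without rebuilding; the fragment below is hypothetical, with placeholder names and sizes:

```
# Hypothetical fragment: force the built-in Caffe kernel for one layer so
# its behaviour can be compared against the cuDNN path.
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 64
    kernel_size: 3
    pad: 1
    engine: CAFFE    # CAFFE = native kernel, CUDNN = cuDNN, DEFAULT = auto
  }
}
```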