WRONG metrics implementation when calculating the image level F1 score! #30
Adding on to the points made by @SunnyHaze:
As stated in Section 4.1 of the paper:

> For pixel-level evaluation, in order to compare with previous works, we calculate F1 using the harmonic average of precision and recall. For image-level evaluation, F1 is calculated as the harmonic average of sensitivity and specificity, since sensitivity and specificity are not affected by the distribution of positive and negative samples in the test data, so a more reliable evaluation conclusion can be obtained.
Thank you for your reply!
"F1 score" is a specialized term that refers to the harmonic mean of precision and recall (Wikipedia). Hardly any reader with a decent mathematical education would take the "F1" in the sentence above to mean the harmonic mean of sensitivity and specificity. The better choice would be to describe the metric explicitly as "the harmonic mean of sensitivity and specificity". You say this gives a more reliable conclusion, but you neither cite a source for this particular quantity nor distinguish it from the standard F1, which invites confusion. If this inflated indicator is not clearly called out, it will cause considerable trouble for subsequent Image Manipulation Detection reviewers and make it difficult to truly evaluate the current SOTA models, because they will only glance at the image-level classification table and think, "this is the ordinary F1 score." I hope you will take this issue seriously and maintain academic honesty; I will investigate this matter further.
This is not an issue and has nothing to do with academic honesty. Case closed.
In fact, the harmonic mean of sensitivity and specificity should not be used at all; one should simply take their arithmetic mean, (sensitivity + specificity) / 2, which has its own name: balanced accuracy (BA).
There is a severe error in the calculation of the image-level F1 score. The problem can be found exactly here:
MVSS-Net/common/utils.py, lines 44–45 at commit cc2aed7
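(The embedded snippet does not reproduce here. Going by the paper's own description quoted above, the computation at those lines is presumably of the following form; this is a hedged reconstruction with illustrative variable names, not the repository's verbatim code:)

```python
# Presumed shape of the image-level metric, per the paper's description
# ("harmonic average of sensitivity and specificity") -- NOT verbatim repo code.
def image_f1_as_described(true_pos, true_neg, false_pos, false_neg, eps=1e-6):
    sen = true_pos / (true_pos + false_neg + eps)  # sensitivity (= recall)
    spe = true_neg / (true_neg + false_pos + eps)  # specificity
    return 2 * sen * spe / (sen + spe + eps)       # not the standard F1
```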
The correct metric would be
F1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg + eps),
where eps is a numerical stability factor. Or, since sensitivity is the same thing as recall, calculate it from Precision and Recall (substituting Precision for Specificity) with the standard formula:
F1 = 2 * Precision * Recall / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN)
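To make the size of the discrepancy concrete, here is a minimal, self-contained comparison on made-up confusion-matrix counts (the numbers are hypothetical, chosen only to mimic an imbalanced test set):

```python
# Hypothetical confusion-matrix counts for an imbalanced test set:
# 100 manipulated images, 1000 authentic images.
true_pos, false_neg = 80, 20
false_pos, true_neg = 100, 900
eps = 1e-6  # numerical stability factor

# Standard F1: harmonic mean of precision and recall.
f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg + eps)

# The paper's definition: harmonic mean of sensitivity and specificity.
sen = true_pos / (true_pos + false_neg + eps)
spe = true_neg / (true_neg + false_pos + eps)
f1_sen_spe = 2 * sen * spe / (sen + spe + eps)

# Balanced accuracy, as suggested above.
ba = (sen + spe) / 2

print(f"standard F1        = {f1:.3f}")          # ~0.571
print(f"harmonic(sen, spe) = {f1_sen_spe:.3f}")  # ~0.847
print(f"balanced accuracy  = {ba:.3f}")          # 0.850
```

On counts like these, reporting the harmonic mean of sensitivity and specificity as "F1" overstates the standard F1 by nearly 0.3.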
Your pixel-level F1 implementation, on the other hand, is correct; you can refer to this part of the code:
MVSS-Net/common/utils.py, lines 57–59 at commit cc2aed7
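(That embedded snippet does not render here either; what it is said to implement is the standard per-pixel F1, along the lines of this sketch. The function name and signature are illustrative, not taken from the repository:)

```python
import numpy as np

def pixel_f1(pred_mask, gt_mask, eps=1e-6):
    """Standard F1 over binarized pixel masks: harmonic mean of
    precision and recall, computed from per-pixel TP/FP/FN counts."""
    pred = np.asarray(pred_mask, dtype=bool).ravel()
    gt = np.asarray(gt_mask, dtype=bool).ravel()
    true_pos = np.count_nonzero(pred & gt)
    false_pos = np.count_nonzero(pred & ~gt)
    false_neg = np.count_nonzero(~pred & gt)
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg + eps)
```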
I hope you will recalculate and revise the image-level F1 results, or at least publish a corrected table of metrics on GitHub; otherwise it will be very unfair to future research, because this calculation inflates your F1 values.
To create a better research environment, we hope you will value research integrity. Thank you very much for your inspiring work!