
WRONG metrics implementation when calculating the image level F1 score! #30

Closed
SunnyHaze opened this issue Apr 18, 2023 · 6 comments
@SunnyHaze

SunnyHaze commented Apr 18, 2023

There is a severe error in the calculation of the image-level F1 score. The problem can be found exactly here:

MVSS-Net/common/utils.py

Lines 44 to 45 in cc2aed7

spe = true_neg / (true_neg + false_pos + 1e-6)
f1 = 2 * sen * spe / (sen + spe)

The correct metric would be F1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg + eps), where eps is a numerical stability factor.

Alternatively, compute it from Precision and Recall rather than Specificity:

F1 = 2 * Precision * Recall / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
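To make the difference concrete, here is a minimal sketch (the function names and counts below are hypothetical, for illustration only, and not taken from the repository) that puts the two formulas side by side:

def image_level_score_as_in_repo(tp, tn, fp, fn, eps=1e-6):
    # What utils.py currently reports: the harmonic mean of sensitivity and specificity.
    sen = tp / (tp + fn + eps)
    spe = tn / (tn + fp + eps)
    return 2 * sen * spe / (sen + spe)

def standard_f1(tp, tn, fp, fn, eps=1e-6):
    # The standard F1: the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn + eps)

# Hypothetical confusion-matrix counts for one test set:
tp, tn, fp, fn = 50, 900, 30, 20
print(image_level_score_as_in_repo(tp, tn, fp, fn))  # ~0.82
print(standard_f1(tp, tn, fp, fn))                   # ~0.67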

Your implementation of the pixel-level F1 score, on the other hand, is correct; you can refer to this part of the code:

MVSS-Net/common/utils.py

Lines 57 to 59 in cc2aed7

f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg + 1e-6)
precision = true_pos / (true_pos + false_pos + 1e-6)
recall = true_pos / (true_pos + false_neg + 1e-6)

I hope you will recalculate and revise the image-level F1 results, or at least publish a corrected table of metrics on GitHub. Otherwise it will be very unfair to future research, because this way of calculating F1 yields inflated scores.

To help create a better research environment, we hope you will take research integrity seriously. Thank you very much for your inspiring work!

@kostino

kostino commented Apr 18, 2023

Adding on to the points made by @SunnyHaze:
Concretely, the difference between the computed metric and the true F1 lies here:
[figure omitted: derivation contrasting the harmonic mean of sensitivity and specificity with the standard F1]
As we can see, depending on the class distribution of the dataset we evaluate on, the reported result can be far from the true F1 value. In particular, in real-world scenarios where naturally TN >> TP, it will produce a value significantly higher than the true F1.
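For instance, with hypothetical counts where authentic images dominate (the numbers are purely illustrative):

tp, fn, fp, tn = 10, 40, 40, 10000      # hypothetical counts with TN >> TP

sen = tp / (tp + fn)                    # 0.20
spe = tn / (tn + fp)                    # ~0.996
reported = 2 * sen * spe / (sen + spe)  # ~0.33, harmonic mean of sen and spe
true_f1 = 2 * tp / (2 * tp + fp + fn)   # 0.20, standard F1
print(reported, true_f1)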

@Chenxr1999
Collaborator

(Quoting @SunnyHaze's original report above.)

As stated in Section 4.1 of the paper,

For the pixel-level manipulation detection, following previous works, we compute pixel-level precision and recall, and report their F1. For image-level manipulation detection, in order to measure the miss detection rate and false alarm rate, we report sensitivity, specificity and their F1.

For pixel-level evaluation, in order to compare with previous works, we calculate F1 as the harmonic mean of precision and recall. For image-level evaluation, F1 is calculated as the harmonic mean of sensitivity and specificity: sensitivity and specificity are not affected by the distribution of positive and negative samples in the test data, so a more reliable evaluation conclusion can be obtained.
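To make the distribution argument concrete, here is a small sketch (the sensitivity/specificity values and test-set sizes are hypothetical): a classifier with fixed sensitivity and specificity keeps those two numbers unchanged across test sets, while its precision and standard F1 shift with the positive/negative ratio.

def precision_and_standard_f1(n_pos, n_neg, sen=0.8, spe=0.9):
    # A classifier with fixed sensitivity and specificity, applied to a test
    # set containing n_pos positive and n_neg negative images.
    tp, fn = sen * n_pos, (1 - sen) * n_pos
    tn, fp = spe * n_neg, (1 - spe) * n_neg
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, f1

print(precision_and_standard_f1(1000, 1000))  # balanced set:       precision ~0.89, F1 ~0.84
print(precision_and_standard_f1(100, 10000))  # negative-dominated: precision ~0.07, F1 ~0.14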

@SunnyHaze
Author

Thank you for your reply!

we report sensitivity, specificity and their F1.

The F1 score is a specialized term that refers to the harmonic mean of precision and recall (Wikipedia). Readers with a solid mathematical background are very unlikely to read the "F1" in the sentence above as the harmonic mean of sensitivity and specificity. A better choice would be to describe it explicitly as "the harmonic mean of sensitivity and specificity".

You say this gives a more reliable conclusion, but the paper does not make clear how this particular value is computed, which invites confusion. If this inflated metric is not pointed out clearly, it will cause considerable trouble for future Image Manipulation Detection reviewers and make it difficult to fairly evaluate the current SOTA models, because they will only glance at the image-level classification table and assume, "this is the ordinary F1 score."

I hope you can seriously consider this issue and maintain academic honesty, and I will further investigate this matter.

@li-xirong

This is not an issue and has nothing to do with academic honesty. Case closed.

@SunnyHaze
Author

6

@erliufashi

In fact, the harmonic mean of sensitivity and specificity should not be used at all; their plain arithmetic mean should be used instead, which has a special name: balanced accuracy (BA).
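For reference, a minimal sketch of balanced accuracy (the labels below are made up; balanced_accuracy_score is scikit-learn's helper, used only as a cross-check and not part of this repository):

from sklearn.metrics import balanced_accuracy_score

# Hypothetical image-level labels: 1 = manipulated, 0 = authentic.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # 4
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1

ba = 0.5 * (tp / (tp + fn) + tn / (tn + fp))        # arithmetic mean of sen and spe, ~0.73
print(ba, balanced_accuracy_score(y_true, y_pred))  # both ~0.73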
