Averaging of CE Metrics #11

Open
ChantalMP opened this issue Aug 9, 2023 · 3 comments

@ChantalMP

Hi,

thanks for sharing your work.

As I understand it, Table 6 in your paper reports example-based metrics (an F1 score is computed for every report and the scores are then averaged), not micro- or macro-averaged F1. Is that correct?

How did you determine that the other papers you compare against also use example-based F1 and not micro- or macro-averaged F1?

Any hint would be appreciated.

Thanks in advance! :)

@fuying-wang

fuying-wang commented Feb 29, 2024

Hi,

Thanks very much for the awesome work. I have the same question. The baseline results in Table 6 appear to be the same as those in the original papers. However, according to the R2Gen code, it seems that they use macro- or micro-averaged CE metrics.

@anicolson
Member

Hi ChantalMP and fuying-wang,

Thank you for pointing this out. Our reported results in Table 6 are indeed averaged over each example (example-based CE metrics). The results for the other methods are taken from their respective papers. We found it difficult to determine how the CE scores were averaged in those papers, as this detail was not included. Because papers prior to R2Gen stated which averaging method they used (macro- or micro-averaging), we assumed that papers such as R2Gen, which do not mention it, were using neither. Instead, we assumed they were averaging over all examples (this may have been a bad assumption).

To alleviate this discrepancy, we made sure to report how we averaged our results. Hopefully, this ambiguity can be avoided in future papers on the topic. Unfortunately, we may have made the mistake of comparing against results obtained with a different averaging strategy.

If they indeed used micro- or macro-averaging, and not averaging over each example, then the micro- and macro-averaged results for CvT2DistilGPT2 can be found here: https://github.com/aehrc/cvt2distilgpt2?tab=readme-ov-file#results.
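
For readers comparing the three strategies discussed in this thread, the following is a minimal sketch of how example-based, micro-, and macro-averaged F1 can differ on the same predictions. It is not the evaluation code of this repository or of R2Gen. The label matrices are made-up toy data (rows are reports, columns are observations), the function names are hypothetical, and returning 0.0 when a report or label has no positives is an assumed convention that real implementations may handle differently.

```python
def f1_from_counts(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); assumed convention: 0.0 when undefined.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def confusion(true, pred):
    # Count true positives, false positives and false negatives
    # for two equal-length binary sequences.
    tp = sum(1 for t, p in zip(true, pred) if t and p)
    fp = sum(1 for t, p in zip(true, pred) if not t and p)
    fn = sum(1 for t, p in zip(true, pred) if t and not p)
    return tp, fp, fn


def example_based_f1(y_true, y_pred):
    # One F1 per report (row), then the mean over reports.
    scores = [f1_from_counts(*confusion(t, p)) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


def micro_f1(y_true, y_pred):
    # Pool the counts over every report and label, then one F1.
    flat_true = [v for row in y_true for v in row]
    flat_pred = [v for row in y_pred for v in row]
    return f1_from_counts(*confusion(flat_true, flat_pred))


def macro_f1(y_true, y_pred):
    # One F1 per label (column), then the mean over labels.
    cols_true = list(zip(*y_true))
    cols_pred = list(zip(*y_pred))
    scores = [f1_from_counts(*confusion(t, p)) for t, p in zip(cols_true, cols_pred)]
    return sum(scores) / len(scores)


# Toy data: 3 reports x 3 observations (hypothetical values).
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 1]]

print(f"example-based F1: {example_based_f1(y_true, y_pred):.4f}")  # 0.7778
print(f"micro F1:         {micro_f1(y_true, y_pred):.4f}")          # 0.7500
print(f"macro F1:         {macro_f1(y_true, y_pred):.4f}")          # 0.6000
```

On this toy input the three strategies give three different scores, which is why numbers computed with different averaging strategies are not directly comparable.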

@fuying-wang

fuying-wang commented Mar 1, 2024

Hi,

Thanks very much for the detailed clarification! I also noticed that R2Gen and other papers didn't mention their averaging method, which confused me as well. Apart from this, your code and detailed results are awesome!
