When the same source, target, and reference files are evaluated with the same wmt22-comet-da checkpoint, unbabel-comet 2.2.1 under python3.9 and unbabel-comet 1.1.2 under python3.7 give dramatically different scores.
To Reproduce
In python3.7, pip install --upgrade unbabel-comet gives 1.1.2 as the latest version, while in python3.9 it gives 2.2.1.
Scoring the same source, target, and reference files under the above two environments gave different scores. unbabel-comet 1.1.2 results in a score of 0.86 while the 2.2.1 version gave 0.79. I used WMT22-COMET-DA downloaded from Hugging Face https://huggingface.co/Unbabel/wmt22-comet-da.
Attaching the files which gave 0.79 and 0.86 below, but I think any file combination can be used to reproduce this behaviour since it's associated with the COMET package version. target.en.txt source.mt.txt hypothesis.en.txt
Expected behaviour
I would expect different COMET package versions to give the same score if the same checkpoint and files are given.
Environment
Managed python3.7 and python3.9 with conda.
Additional context
If there is indeed some mismatch between unbabel-comet 1.1.2 and 2.2.1, it might be difficult to go back and fix the problem. Users are probably unaware of this and will not update. Moreover, python3.7 only gets 1.1.2 as the latest version even if users upgrade COMET there. Maybe this behaviour can be highlighted in the README to encourage users to use specific Python and unbabel-comet versions. On the other hand, this could imply that research papers should report the COMET package version in addition to the COMET model (checkpoint) version. Would it be possible to implement some kind of COMET signature, just like the one in sacrebleu?
This confirms what we learnt for BLEU, too: one should ALWAYS report version numbers (signatures), also for COMET!
Side note: in my MATEO, I added a custom signature for neural metrics like bertscore, bleurt and comet, too. For COMET it looks like this (inspired by sacrebleu):
where c stands for the checkpoint used and version is self-explanatory. I wasn't sure how far one has to go with this, because differences in torch, CUDA and transformers versions may or may not also lead to differences in results. Hell, even then CUDA optimisations might lead to different results on different hardware.
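For illustration, here is a rough sketch of how such a signature could be assembled. This is not MATEO's actual format and not something unbabel-comet provides: the function names (`installed_version`, `comet_signature`), the field keys other than `c` and `version`, and the choice to include the Python, torch and transformers versions are all assumptions made for this example.

```python
import platform
from importlib import metadata  # stdlib from Python 3.8; 3.7 would need the importlib_metadata backport


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or "unknown" if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "unknown"


def comet_signature(checkpoint: str, extra_packages=("torch", "transformers")) -> str:
    """Build a sacrebleu-style 'key:value|key:value' string (hypothetical format)."""
    fields = [
        ("c", checkpoint),                                # checkpoint used for scoring
        ("version", installed_version("unbabel-comet")),  # COMET package version
        ("python", platform.python_version()),            # interpreter version
    ]
    # Optionally record dependencies that may also influence the scores.
    fields += [(name, installed_version(name)) for name in extra_packages]
    return "|".join(f"{key}:{value}" for key, value in fields)


if __name__ == "__main__":
    print(comet_signature("Unbabel/wmt22-comet-da"))
```

Reporting a string like this next to each score would at least make it visible when two numbers come from different package versions, even though it cannot capture hardware-level differences.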
Admittedly the README currently says it requires Python 3.8, so maybe I installed COMET in the stone age and pip install --upgrade unbabel-comet never warned me. Anyway, I think the score mismatch should not be expected.