
How to calculate all BLEU scores during evaluation #37

Closed
HarryDresden1 opened this issue Jan 20, 2019 · 9 comments

Comments

@HarryDresden1

Hi,

Thanks for the well-documented code and tutorial. I trained my model from scratch using your code, but now that I want to evaluate it, I'm not sure how to get all the BLEU scores, not just BLEU-4 as currently computed in eval.py.

@kmario23
Contributor

kmario23 commented Jan 21, 2019

The current codebase uses NLTK to calculate the BLEU-4 score. However, BLEU-1 through BLEU-n can easily be implemented yourself if you want to. If you'd rather not, you can simply use NLTK, which provides a nice interface for this (see code below).

Here is the explanation of how BLEU score computation is defined:

BLEU-n is just the geometric mean of the 1-gram through n-gram precisions.

(Precisely, it's string matching between references and hypotheses at different n-gram levels; that's why the metric has drawn much criticism. But people still use it anyway, because it has stuck with the community for ages.)

For example, BLEU-1 is simply the unigram precision, BLEU-2 is the geometric mean of the unigram and bigram precisions, BLEU-3 is the geometric mean of the unigram, bigram, and trigram precisions, and so on.
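As a rough sketch of the averaging step (ignoring the brevity penalty, and using made-up precision values purely for illustration):

```python
import math

def geometric_mean(precisions):
    # Geometric mean of the modified 1-gram ... n-gram precisions;
    # the brevity penalty is left out to keep the sketch short.
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Hypothetical precision values, for illustration only:
p1, p2, p3 = 0.8, 0.6, 0.4   # unigram, bigram, trigram precision

print(geometric_mean([p1]))          # BLEU-1 ≈ 0.8
print(geometric_mean([p1, p2]))      # BLEU-2 = sqrt(0.48) ≈ 0.693
print(geometric_mean([p1, p2, p3]))  # BLEU-3 = 0.192 ** (1/3) ≈ 0.577
```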


Having said that, if you want to compute specific n-gram BLEU scores, you have to pass a weights parameter when you call corpus_bleu. Note that if you don't pass this weights parameter, the BLEU-4 score is returned by default, which is what happens in the evaluation here.

To compute BLEU-1, you can call corpus_bleu with weights as

weights = (1.0/1.0,)
corpus_bleu(references, hypotheses, weights)

To compute BLEU-2, you can call corpus_bleu with weights as

weights=(1.0/2.0, 1.0/2.0,)
corpus_bleu(references, hypotheses, weights)

To compute BLEU-3, you can call corpus_bleu with weights as

weights=(1.0/3.0, 1.0/3.0, 1.0/3.0,)
corpus_bleu(references, hypotheses, weights)

To compute BLEU-5, you can call corpus_bleu with weights as

weights=(1.0/5.0, 1.0/5.0, 1.0/5.0, 1.0/5.0, 1.0/5.0,)
corpus_bleu(references, hypotheses, weights)

Here is a demonstration using a toy example adapted from NLTK webpage:

[Screenshot: bleu-n-grams toy example]
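In case the screenshot doesn't load, here is a minimal runnable sketch of the same idea (the sentences are invented for illustration; only corpus_bleu and its weights parameter come from NLTK):

```python
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with one reference (both made up for this demo).
# corpus_bleu expects a list of reference-lists and a list of hypotheses,
# each sentence pre-tokenized into a list of words.
references = [["the fast brown fox jumped over the lazy dog".split()]]
hypotheses = ["the quick brown fox jumps over the lazy dog".split()]

b1 = corpus_bleu(references, hypotheses, weights=(1.0,))
b2 = corpus_bleu(references, hypotheses, weights=(1.0 / 2.0,) * 2)
b3 = corpus_bleu(references, hypotheses, weights=(1.0 / 3.0,) * 3)
b4 = corpus_bleu(references, hypotheses, weights=(1.0 / 4.0,) * 4)
default = corpus_bleu(references, hypotheses)  # no weights: BLEU-4 by default

print(b1, b2, b3, b4)  # the scores decrease as n grows
print(default == b4)   # the default matches the explicit 4-gram weights
```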

Note how the BLEU score keeps decreasing as we increase n in the n-grams via the weights parameter. Also note how not passing the weights parameter yields the same score as passing the quadrigram weights, because that's the default NLTK uses when we don't pass one.


Refer to this page for more information on the NLTK BLEU score implementation.

@HarryDresden1
Author

Thank you very much for your explanation

@kmario23
Contributor

@sgrvinod Maybe it's a good idea to incorporate this into the documentation somewhere? Someone may want such a comprehensive evaluation to report.

Do you think the Remarks section would be an apt place?

@kmario23
Contributor

@sgrvinod ping!

@sgrvinod
Owner

sgrvinod commented Mar 16, 2019

Oops, didn't see this. Yes, it's a good idea, I'll add it tomorrow with credit to you, thanks!

I think the entire detailed explanation is too long for the Remarks section. I'll either link to your post here from the Remarks section, or add a question to the FAQ with your answer (and crediting you), or both. You could also submit a pull request if you wish, and I'll make minor edits to it if needed.

@kmario23
Contributor

@sgrvinod done!

@sgrvinod
Owner

Merged #52.

@forence

forence commented Mar 22, 2019

@kmario23 Thanks for your brilliant explanation; I got the procedure to calculate BLEU with NLTK. But I'm still confused: if I have 3 references and only 1 hypothesis, does the tool calculate the <ref, hyp> pairs one by one, and then take the mean of them, or the maximum?

@kmario23
Contributor

kmario23 commented Mar 22, 2019

Hello @forence, thanks! Contrary to our intuition, that's not how the BLEU score is computed. Luckily, the paper that proposed BLEU is very well written (and easy to understand). Please have a look at Section 2 of "BLEU: a Method for Automatic Evaluation of Machine Translation" to see how they compute a modified unigram precision, which is better than simple precision.
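To make the "modified" part concrete, here is a stdlib-only sketch of modified unigram precision, using the classic example from Section 2 of that paper: each candidate word's count is clipped to its maximum count in any single reference, then the clipped counts are summed and divided by the candidate length.

```python
from collections import Counter

def modified_unigram_precision(references, hypothesis):
    # Clip each hypothesis word's count to its maximum count in any
    # single reference, then divide by the hypothesis length.
    hyp_counts = Counter(hypothesis)
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(c, max_ref_counts[w]) for w, c in hyp_counts.items())
    return clipped / len(hypothesis)

# Example from the BLEU paper: "the" occurs at most twice in a single
# reference, so the degenerate candidate scores 2/7, not 7/7.
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
hyp = "the the the the the the the".split()
print(modified_unigram_precision(refs, hyp))  # 2/7 ≈ 0.2857
```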
