
How to calculate all BLEU scores during evaluation #37

Closed
HarryDresden1 opened this issue Jan 20, 2019 · 9 comments

Comments

@HarryDresden1

Hi,

Thanks for the well-documented code and tutorial. I trained my model from scratch using your code, but now that I want to evaluate it, I'm not sure how to get all the BLEU scores, not just BLEU-4 as currently computed in eval.py.

@kmario23
Contributor

kmario23 commented Jan 21, 2019

The current codebase uses NLTK to calculate the BLEU-4 score. However, BLEU-1 through BLEU-n can easily be implemented yourself if you want to. If you'd rather not, you can simply use NLTK, which provides a nice interface for this (see code below).

Here is the explanation of how BLEU score computation is defined:

BLEU-n is just the geometric mean of the 1-gram through n-gram precisions.

(Precisely, it's string matching between references and hypotheses at different n-gram levels; that's why the metric has drawn much criticism. But people still use it anyway, because it has stuck with the community for ages.)

For example, BLEU-1 is simply the unigram precision, BLEU-2 is the geometric mean of the unigram and bigram precisions, BLEU-3 is the geometric mean of the unigram, bigram, and trigram precisions, and so on.
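As a rough sketch of the averaging step (ignoring the brevity penalty, and using made-up precision values purely for illustration):

```python
import math

def geometric_mean(precisions):
    # Geometric mean of the modified 1-gram ... n-gram precisions;
    # the brevity penalty is left out to keep the sketch short.
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Hypothetical precision values, for illustration only:
p1, p2, p3 = 0.8, 0.6, 0.4   # unigram, bigram, trigram precision

print(geometric_mean([p1]))          # BLEU-1 ≈ 0.8
print(geometric_mean([p1, p2]))      # BLEU-2 = sqrt(0.48) ≈ 0.693
print(geometric_mean([p1, p2, p3]))  # BLEU-3 = 0.192 ** (1/3) ≈ 0.577
```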


Having said that, if you want to compute specific n-gram BLEU scores, you have to pass a weights parameter when you call corpus_bleu. Note that if you don't pass this weights parameter, the BLEU-4 score is returned by default, which is what happens in the evaluation here.

To compute BLEU-1, you can call corpus_bleu with weights as

weights = (1.0/1.0,)
corpus_bleu(references, hypotheses, weights)

To compute BLEU-2, you can call corpus_bleu with weights as

weights=(1.0/2.0, 1.0/2.0,)
corpus_bleu(references, hypotheses, weights)

To compute BLEU-3, you can call corpus_bleu with weights as

weights=(1.0/3.0, 1.0/3.0, 1.0/3.0,)
corpus_bleu(references, hypotheses, weights)

To compute BLEU-5, you can call corpus_bleu with weights as

weights=(1.0/5.0, 1.0/5.0, 1.0/5.0, 1.0/5.0, 1.0/5.0,)
corpus_bleu(references, hypotheses, weights)

Here is a demonstration using a toy example adapted from NLTK webpage:

[Screenshot: bleu-n-grams toy example]
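In case the screenshot doesn't load, here is a minimal runnable sketch of the same idea (the sentences are invented for illustration; only corpus_bleu and its weights parameter come from NLTK):

```python
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with one reference (both made up for this demo).
# corpus_bleu expects a list of reference-lists and a list of hypotheses,
# each sentence pre-tokenized into a list of words.
references = [["the fast brown fox jumped over the lazy dog".split()]]
hypotheses = ["the quick brown fox jumps over the lazy dog".split()]

b1 = corpus_bleu(references, hypotheses, weights=(1.0,))
b2 = corpus_bleu(references, hypotheses, weights=(1.0 / 2.0,) * 2)
b3 = corpus_bleu(references, hypotheses, weights=(1.0 / 3.0,) * 3)
b4 = corpus_bleu(references, hypotheses, weights=(1.0 / 4.0,) * 4)
default = corpus_bleu(references, hypotheses)  # no weights: BLEU-4 by default

print(b1, b2, b3, b4)  # the scores decrease as n grows
print(default == b4)   # the default matches the explicit 4-gram weights
```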

Note how the BLEU score keeps decreasing as we increase n in the n-grams via the weights parameter. Also note how not passing the weights parameter yields the same score as passing the quadrigram weights, because that's the default NLTK uses when we don't pass one.


Refer to this page for more information on the NLTK BLEU score implementation.

@HarryDresden1
Author

Thank you very much for your explanation

@kmario23
Contributor

@sgrvinod Maybe it's a good idea to incorporate this into the documentation somewhere? Someone may want such a comprehensive evaluation to report.

Do you think the Remarks section would be an apt place?

@kmario23
Contributor

@sgrvinod ping!

@sgrvinod
Owner

sgrvinod commented Mar 16, 2019

Oops, didn't see this. Yes, it's a good idea, I'll add it tomorrow with credit to you, thanks!

I think the entire detailed explanation is too long for the Remarks section. I'll either link to your post here from the Remarks section, or add a question to the FAQ with your answer (and crediting you), or both. You could also submit a pull request if you wish, and I'll make minor edits to it if needed.

@kmario23
Contributor

@sgrvinod done!

@sgrvinod
Owner

Merged #52.

@forence

forence commented Mar 22, 2019

@kmario23 Thanks for your brilliant explanation; I got the procedure to calculate BLEU with NLTK. But I'm still confused: if I have 3 references and only 1 hypothesis, does the tool calculate the <ref, hyp> pairs one by one, and then take the mean of them, or the maximum?

@kmario23
Contributor

kmario23 commented Mar 22, 2019

Hello @forence, thanks! Contrary to our intuition, that's not how the BLEU score is computed. Luckily, the paper that proposed BLEU is very well written (and easy to understand). Please have a look at Section 2 of "BLEU: a Method for Automatic Evaluation of Machine Translation" to see how they compute a modified unigram precision, which is better than simple precision.
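To make the "modified" part concrete, here is a stdlib-only sketch of modified unigram precision, using the classic example from Section 2 of that paper: each candidate word's count is clipped to its maximum count in any single reference, then the clipped counts are summed and divided by the candidate length.

```python
from collections import Counter

def modified_unigram_precision(references, hypothesis):
    # Clip each hypothesis word's count to its maximum count in any
    # single reference, then divide by the hypothesis length.
    hyp_counts = Counter(hypothesis)
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(c, max_ref_counts[w]) for w, c in hyp_counts.items())
    return clipped / len(hypothesis)

# Example from the BLEU paper: "the" occurs at most twice in a single
# reference, so the degenerate candidate scores 2/7, not 7/7.
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
hyp = "the the the the the the the".split()
print(modified_unigram_precision(refs, hyp))  # 2/7 ≈ 0.2857
```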
