

Misc. fixes for Pytorch QA examples #16958

Merged: 1 commit into huggingface:main on Apr 27, 2022

Conversation

@searchivarius (Contributor) commented Apr 27, 2022

Thank you for the great library! This PR fixes a number of issues with the PyTorch QA examples. All numbers are either unchanged or improved. However, there are still some issues (in one example) that I wasn't able to fix. Please see the notes and benchmark results below.

What does this PR do?

  1. Fixes evaluation errors that pop up when you train/eval on SQuAD v2 (one was newly encountered, and one was previously reported in "Running SQuAD 1.0 sample command raises IndexError" #15401 but not completely fixed).
  2. Removes boolean arguments that don't use `store_true`. Please don't use these: ANY non-empty string is converted to `True` in this case, which is clearly undesired behavior and creates a LOT of confusion (see the first sketch after this list).
  3. All no-trainer test scripts now save metric values in the same way (with the proper `eval_` prefix), which is consistent with the trainer-based versions (see the second sketch below).
  4. Adds the forgotten `model.eval()` in the no-trainer versions. This improved some results, but not everything (see the third sketch below); please see the F1 scores and the discussion at the end.
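To make item 2 concrete, here is a minimal, self-contained sketch of the pitfall and the fix; the flag name mirrors the QA examples, but the snippet itself is illustrative rather than this PR's actual diff:

```python
import argparse

# Buggy pattern: type=bool applies bool() to the raw string, and bool()
# of ANY non-empty string is True -- even the strings "False" and "0".
buggy = argparse.ArgumentParser()
buggy.add_argument("--version_2_with_negative", type=bool, default=False)
args = buggy.parse_args(["--version_2_with_negative", "False"])
print(args.version_2_with_negative)  # True, despite the user passing "False"

# Fixed pattern: with store_true, the flag's mere presence means True.
fixed = argparse.ArgumentParser()
fixed.add_argument("--version_2_with_negative", action="store_true")
print(fixed.parse_args([]).version_2_with_negative)  # False
print(fixed.parse_args(["--version_2_with_negative"]).version_2_with_negative)  # True
```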
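For item 3, the convention amounts to prefixing every metric key with `eval_` before writing it out; a minimal sketch, assuming metrics land in an `all_results.json` file as in the example scripts (the metric values here are made up):

```python
import json

# SQuAD-style metrics computed by a no-trainer script (values made up).
eval_metric = {"f1": 88.5, "exact": 81.2}

# Write them with the eval_ prefix so the keys (eval_f1, eval_exact)
# match what the trainer-based scripts produce.
with open("all_results.json", "w") as f:
    json.dump({f"eval_{k}": v for k, v in eval_metric.items()}, f)
```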
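And for item 4, a sketch of why the missing call matters, assuming a standard PyTorch evaluation loop over a Hugging Face QA model (the helper name is made up):

```python
import torch

@torch.no_grad()  # also disable gradient tracking during evaluation
def evaluate(model, dataloader):
    # Without model.eval(), dropout stays active and any batch-norm layers
    # keep updating their running statistics, so predictions are noisy and
    # the reported F1/exact scores come out lower than they should be.
    model.eval()
    start_logits, end_logits = [], []
    for batch in dataloader:
        outputs = model(**batch)  # QA models return start/end span logits
        start_logits.append(outputs.start_logits)
        end_logits.append(outputs.end_logits)
    return torch.cat(start_logits), torch.cat(end_logits)
```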
  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. This is a reduced PR, as discussed here.
  • Did you make sure to update the documentation with your changes? I believe the examples aren't covered by the documentation.
  • Did you write any new necessary tests? I trained SQuAD and SQuAD v2 models and compared the results (see the discussion below), but I am not sure whether running more QA tests automatically would be feasible. Do note that the existing "unit test" is very crude and does not permit detecting small regressions in model quality.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Perhaps this will be of most interest to @sgugger, who reviewed a prior version of this PR.

Comparing old and new performance + some potential issues

Some remaining issues:

  1. Despite the fixes & improvements, there is still a discrepancy between the no-trainer and the original versions for SQuAD v2 and for the beam-search variant.
  2. In particular, for the SQuAD v2 beam-search variant without a trainer, both the old and the new numbers look very wrong to me.

Please note that, to be able to run the SQuAD v2 tests, I had to apply the `utils_qa.py` fixes to the old code as well; otherwise, it would simply have failed.
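For reference, this is the kind of guard involved: an assumed reconstruction of the SQuAD v2 edge case in the post-processing code of `utils_qa.py`, not this PR's exact diff. When every candidate answer span is filtered out, indexing the n-best list raises the reported `IndexError`:

```python
def guard_nbest(predictions):
    # Assumed reconstruction, not the actual patch: in the rare SQuAD v2
    # edge case where no valid (non-null) span survives filtering,
    # predictions[0] would raise the reported IndexError, so insert a
    # dummy "empty" prediction before anything indexes the n-best list.
    if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""):
        predictions.insert(0, {"text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0})
    return predictions

print(guard_nbest([]))  # dummy "empty" entry instead of an IndexError downstream
```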

The metric is F1; the exact-match scores follow the same pattern:

| configuration | previous | new |
| --- | --- | --- |
| squad v1 | 88.4 | 88.4 |
| squad v1 (no trainer) | 86.7 | 88.5 |
| squad v2 | N/A | 75.2 |
| squad v2 (no trainer) | N/A | 77.1 |
| squad v1 (beam search) | 92.1 | 92.1 |
| squad v1 (beam search, no trainer) | 90.2 | 91.0 |
| squad v2 (beam search) | 83.2 | 83.2 |
| squad v2 (beam search, no trainer) | 4.9 | 50.1 |

@HuggingFaceDocBuilderDev commented Apr 27, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) left a comment

Thanks a lot for all the fixes!
cc @patil-suraj (Flax) and @gante @Rocketknight1 (TF) for information (it shouldn't break any of your examples; it's just propagating a fix :-) )

@sgugger merged commit c82e017 into huggingface:main on Apr 27, 2022
chamidullinr pushed a commit to chamidullinr/transformers that referenced this pull request Apr 28, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022