Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Postprocessor Improvement #75

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Conversation

lh5844
Copy link
Contributor

@lh5844 lh5844 commented Sep 5, 2023

This addresses the issue of the sentence piece model not correcting when two words should be together. For example, if there were two text segments with the first one ending in "with" and the second one beginning with "out", the model would identify it as two different words. However, we want the two to be together as "without", and this would involve correcting the prediction list, delays list, and elapsed list for latency accuracy.


To run the spm_detokenizer_agent.py, use this command in the SimulEval directory:

simuleval  \ 
    --user-dir examples  \
    --agent-class examples.quick_start.spm_detokenizer_agent.DummyPipeline \
    --source examples/quick_start/spm_source.txt \
    --target examples/quick_start/spm_target.txt  \
    --output tmp_output \
     --segment-k 3  \
    --sentencepiece-model examples/quick_start/tokenizer.model \
    --detokenize-only 

This is the expected output for

  • instances.log
{"index": 0, "prediction": "Let's do it without hesitation.", "delays": [3, 6, 6, 9, 9], "elapsed": [0, 0, 0, 0, 0],
"prediction_length": 5, "reference": "Let's do it without hesitation.\n", "source": "\u2581Let ' s \u2581do 
\u2581it \u2581with out \u2581hesitation .", "source_length": 9} 
  • metrics.tsv
LAAL	AL	AP	DAL
3.3	3.3	0.733	3.96
  • scores.tsv
BLEU	LAAL	AL	AP	DAL	ATD
100.0	3.3	3.3	0.733	3.96	5.0

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 5, 2023
Copy link
Contributor

@annasun28 annasun28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this makes sense, thanks Lucy! Could you add the test command you used and the outputs produced to your commit summary?

second_half = prediction_list[0]
complete_word = first_half + second_half
self.prediction_list.pop()
self.delays.pop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also pop the last from self.elapsed? It's similar to self.delays except that it also includes the actual inference time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! I have made those changes now

@lh5844 lh5844 removed the request for review from xutaima September 6, 2023 18:40
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants