Postprocessor Improvement #75
base: main
Conversation
Detokenizer agent
…us/next text segments
Overall this makes sense, thanks Lucy! Could you add the test command you used and the outputs produced to your commit summary?
# Join the piece left over from the previous segment with the first
# piece of the new one, e.g. "with" + "out" -> "without".
second_half = prediction_list[0]
complete_word = first_half + second_half
# Drop the stale half-word so the stored predictions and delays stay aligned.
self.prediction_list.pop()
self.delays.pop()
Could we also pop the last element from self.elapsed? It's similar to self.delays except that it also includes the actual inference time.
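For context, a rough way to picture the relationship between the two lists (illustrative only; the variable names and the exact bookkeeping are placeholders, not SimulEval's actual implementation):

```python
# Hypothetical illustration: `delays` records when each predicted word could
# be emitted relative to the source, while `elapsed` adds the wall-clock
# inference time spent producing that prediction on top of the delay.
delays = [1.0, 2.0, 3.0]              # per-word delay (placeholder values)
inference_time = [0.12, 0.08, 0.10]   # per-word compute time (placeholder values)
elapsed = [d + t for d, t in zip(delays, inference_time)]

# If the last two words get merged into one, every parallel list must drop
# its final entry, otherwise the latency bookkeeping ends up misaligned.
for lst in (delays, elapsed):
    lst.pop()
```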
Yup! I have made those changes now
This addresses the issue of the SentencePiece model not joining two pieces that belong to a single word when they arrive in separate text segments. For example, if the first text segment ends in "with" and the second begins with "out", the model identifies them as two different words. We want them joined as "without", which requires correcting the prediction list, the delays list, and the elapsed list so that latency is measured accurately.
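A minimal sketch of that merge, assuming the stored predictions, delays, and elapsed times are kept as parallel lists (the function and variable names here are illustrative, not the actual code in spm_detokenizer_agent.py):

```python
# Illustrative sketch only: the list names are stand-ins for the agent's state.
def merge_split_word(stored_words, delays, elapsed, new_segment):
    """Glue the previous segment's final piece to the new segment's first
    piece, e.g. "with" + "out" -> "without", and fix the latency lists."""
    first_half = stored_words[-1]    # e.g. "with"
    second_half = new_segment[0]     # e.g. "out"
    complete_word = first_half + second_half

    # Drop the stale half-word entry from every parallel list so the
    # predictions, delays, and elapsed times stay aligned in length.
    stored_words.pop()
    delays.pop()
    elapsed.pop()

    # The merged word takes the place of the new segment's first piece.
    return [complete_word] + new_segment[1:]


# Example: a segment boundary that splits "without".
words, delays, elapsed = ["translate", "with"], [1.0, 2.0], [1.1, 2.2]
print(merge_split_word(words, delays, elapsed, ["out", "delay"]))
# -> ['without', 'delay']
```

The key point is that all three lists are popped together; popping only the prediction list and the delays would leave the elapsed list one entry too long, which is what the review comment above addresses.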
To run spm_detokenizer_agent.py, use this command in the SimulEval directory:

simuleval \
    --user-dir examples \
    --agent-class examples.quick_start.spm_detokenizer_agent.DummyPipeline \
    --source examples/quick_start/spm_source.txt \
    --target examples/quick_start/spm_target.txt \
    --output tmp_output \
    --segment-k 3 \
    --sentencepiece-model examples/quick_start/tokenizer.model \
    --detokenize-only
This is the expected output for