Postprocessor Improvement #75
base: main
Conversation
Detokenizer agent
…us/next text segments
Overall this makes sense, thanks Lucy! Could you add the test command you used and the outputs produced to your commit summary?
# Join the piece left over from the previous segment with the first
# piece of the new one, e.g. "with" + "out" -> "without".
second_half = prediction_list[0]
complete_word = first_half + second_half
# Drop the stale half-word so the stored predictions and delays stay aligned.
self.prediction_list.pop()
self.delays.pop()
Could we also pop the last element from self.elapsed? It's similar to self.delays except that it also includes the actual inference time.
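For context, a rough way to picture the relationship between the two lists (illustrative only; the variable names and the exact bookkeeping are placeholders, not SimulEval's actual implementation):

```python
# Hypothetical illustration: `delays` records when each predicted word could
# be emitted relative to the source, while `elapsed` adds the wall-clock
# inference time spent producing that prediction on top of the delay.
delays = [1.0, 2.0, 3.0]              # per-word delay (placeholder values)
inference_time = [0.12, 0.08, 0.10]   # per-word compute time (placeholder values)
elapsed = [d + t for d, t in zip(delays, inference_time)]

# If the last two words get merged into one, every parallel list must drop
# its final entry, otherwise the latency bookkeeping ends up misaligned.
for lst in (delays, elapsed):
    lst.pop()
```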
Yup! I have made those changes now
This addresses the issue of the SentencePiece model not joining two pieces that belong to a single word when they arrive in separate text segments. For example, if the first text segment ends in "with" and the second begins with "out", the model identifies them as two different words. We want them joined as "without", which requires correcting the prediction list, the delays list, and the elapsed list so that latency is measured accurately.
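A minimal sketch of that merge, assuming the stored predictions, delays, and elapsed times are kept as parallel lists (the function and variable names here are illustrative, not the actual code in spm_detokenizer_agent.py):

```python
# Illustrative sketch only: the list names are stand-ins for the agent's state.
def merge_split_word(stored_words, delays, elapsed, new_segment):
    """Glue the previous segment's final piece to the new segment's first
    piece, e.g. "with" + "out" -> "without", and fix the latency lists."""
    first_half = stored_words[-1]    # e.g. "with"
    second_half = new_segment[0]     # e.g. "out"
    complete_word = first_half + second_half

    # Drop the stale half-word entry from every parallel list so the
    # predictions, delays, and elapsed times stay aligned in length.
    stored_words.pop()
    delays.pop()
    elapsed.pop()

    # The merged word takes the place of the new segment's first piece.
    return [complete_word] + new_segment[1:]


# Example: a segment boundary that splits "without".
words, delays, elapsed = ["translate", "with"], [1.0, 2.0], [1.1, 2.2]
print(merge_split_word(words, delays, elapsed, ["out", "delay"]))
# -> ['without', 'delay']
```

The key point is that all three lists are popped together; popping only the prediction list and the delays would leave the elapsed list one entry too long, which is what the review comment above addresses.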
To run spm_detokenizer_agent.py, use this command in the SimulEval directory:

simuleval \
    --user-dir examples \
    --agent-class examples.quick_start.spm_detokenizer_agent.DummyPipeline \
    --source examples/quick_start/spm_source.txt \
    --target examples/quick_start/spm_target.txt \
    --output tmp_output \
    --segment-k 3 \
    --sentencepiece-model examples/quick_start/tokenizer.model \
    --detokenize-only
This is the expected output for