This PR introduces optimized versions of the T5 and Marian transformer annotators
Description
The PR contains the following changes:
- The code is restructured so that the TensorFlow-specific and the ONNX-specific code live in separate classes and the general logic is shared (a rough sketch follows below).
- Both ONNX implementations use caching.

NOTE: I've added caching support for TF as I already had the code and it was easy to add. However, the ONNX version is always faster, and it makes no sense to export new models (or re-export existing ones) to TF, with or without caching. The existing TF models can be run by the optimized annotator, but they can't benefit from caching (they would need to be re-exported). Let me know if you think there is no point in having the TF caching functionality and I will remove it (it is about 10-20 lines of code).
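As a rough illustration of the split (all names here are hypothetical; the actual classes in the PR may differ), the shared generation loop stays framework-agnostic while each backend only supplies a single decoder step:

```scala
// Hypothetical sketch of the restructuring: the greedy-decoding loop is shared,
// and each backend (TF or ONNX) only implements one decoder step. A backend may
// keep a key/value cache between decodeStep calls to avoid recomputation.
trait Seq2SeqBackend {
  def decodeStep(generatedSoFar: Seq[Int]): Array[Float] // logits for the next token
}

object SharedGeneration {
  def generate(backend: Seq2SeqBackend,
               maxNewTokens: Int,
               stopAtEos: Boolean,
               eosId: Int): Seq[Int] = {
    var tokens = Seq.empty[Int]
    var done = false
    while (!done && tokens.length < maxNewTokens) {
      val logits = backend.decodeStep(tokens)
      val next = logits.indices.maxBy(logits(_)) // greedy pick
      tokens = tokens :+ next
      done = stopAtEos && next == eosId // optional early stop at EOS
    }
    tokens
  }
}
```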
Notebooks for exporting T5 models from HuggingFace:
- TF: https://colab.research.google.com/drive/1hQR9OgVG0cWbcem05Fm0Bo3cWVJNP261?usp=sharing
- ONNX: https://colab.research.google.com/drive/1l9KDgpYnbnqKSVjImKtmy2pi_7HcXGx9?usp=sharing

Notebook for exporting ONNX Marian models from HuggingFace:
- https://colab.research.google.com/drive/1Cf0gBZuGMe--OYGftL1G3DNDj26I5_1I?usp=sharing
A couple of new params are added to the T5 annotator:
- maxNewTokens: the maximum number of tokens to be generated (default is 512)
- stopAtEos: whether to stop generating when the EOS token is encountered (default is True)

The generation continues until one of the following conditions is met: the number of generated tokens reaches maxNewTokens, or the EOS token is encountered (when stopAtEos is set to True). See the example below.
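A minimal Scala sketch of configuring the new params (the setter names follow the usual Spark NLP conventions; "t5_small" is just an example model name):

```scala
import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer

val t5 = T5Transformer
  .pretrained("t5_small")  // example model name; any T5 model works
  .setTask("summarize:")
  .setInputCols("documents")
  .setOutputCol("summaries")
  .setMaxNewTokens(256)    // stop after at most 256 generated tokens
  .setStopAtEos(true)      // also stop as soon as the EOS token is produced
```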
T5 transformer: there is a new param which is used only internally: useCache. It should only be set when exporting the model (see the TF notebook above).
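As a hedged sketch of where that call belongs (the paths are just examples, and an active SparkSession `spark` is assumed; loadSavedModel is the usual Spark NLP entry point for importing exported models):

```scala
// Hypothetical import-time flow (see the TF export notebook): useCache is set once,
// when importing the freshly exported saved model, never at inference time.
val t5Cached = T5Transformer
  .loadSavedModel("/tmp/exported_t5_tf", spark) // example path to the exported model
  .setUseCache(true)

t5Cached.write.overwrite().save("/tmp/t5_tf_cached_spark_nlp")
```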
Motivation and Context
T5 and Marian performance can be significantly improved by using ONNX, or TF with caching.
How Has This Been Tested?
I've tested both existing and new models in Python and Scala.
Screenshots (if appropriate):
Types of changes
Checklist: