
CHORE: Use faster test translation scenario, cut CI time by ~5mins #3046

Merged
merged 8 commits into from
Jun 27, 2023

Conversation

@connortann (Collaborator) commented Jun 26, 2023

Supports #3045

Overview

Changes the model used in the translation scenario to a much smaller one that runs much faster:
https://huggingface.co/mesolitica/finetune-translation-t5-super-super-tiny-standard-bahasa-cased

Timings

The change saves roughly 5 minutes on Linux, and 7+ minutes on macOS.

On the Linux GitHub runner with Python 3.11, test timings before:

80.04s call     tests/explainers/test_partition.py::test_translation
76.44s call     tests/explainers/test_partition.py::test_translation_auto
76.27s call     tests/explainers/test_partition.py::test_translation_algorithm_arg
74.24s call     tests/explainers/test_partition.py::test_serialization
73.27s call     tests/explainers/test_partition.py::test_serialization_custom_model_save
69.42s call     tests/explainers/test_partition.py::test_serialization_no_model_or_masker

Test timings after:

22.75s call     tests/explainers/test_partition.py::test_translation
<19s   call     tests/explainers/test_partition.py::test_translation_auto
<19s   call     tests/explainers/test_partition.py::test_translation_algorithm_arg
20.38s call     tests/explainers/test_partition.py::test_serialization
20.70s call     tests/explainers/test_partition.py::test_serialization_custom_model_save
19.80s call     tests/explainers/test_partition.py::test_serialization_no_model_or_masker

Overall that's ~328 seconds faster on Python 3.11 🎉

Timings vary between python versions and platforms, so the overall average speedup may differ.
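As a sanity check on that figure, the per-test savings above can be summed directly. Note the two "<19s" entries are approximated as 19.0 s below, since the exact values were not reported:

```python
# Per-test wall-clock timings from the CI runs above, in seconds.
before = [80.04, 76.44, 76.27, 74.24, 73.27, 69.42]
# The two "<19s" entries are approximated as 19.0 (assumption).
after = [22.75, 19.0, 19.0, 20.38, 20.70, 19.80]

saved = sum(before) - sum(after)
print(f"Total time saved: ~{saved:.0f}s")  # prints "Total time saved: ~328s"
```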

Note about protobuf

This new model requires protobuf<=3.20.x; otherwise a TypeError is raised. A related Stack Overflow thread discusses this error.

Here is the full traceback:

____________ ERROR at setup of test_serialization_custom_model_save ____________

    @pytest.mark.skipif(sys.platform == 'win32', reason="Integer division bug in HuggingFace on Windows")
    @pytest.fixture(scope="session")
    def basic_translation_scenario():
        """ Create a basic transformers translation model and tokenizer.
        """
        AutoTokenizer = pytest.importorskip("transformers").AutoTokenizer
        AutoModelForSeq2SeqLM = pytest.importorskip("transformers").AutoModelForSeq2SeqLM
    
        # Use a very small model, for speed
        name = "mesolitica/finetune-translation-t5-super-super-tiny-standard-bahasa-cased"
>       tokenizer = AutoTokenizer.from_pretrained(name)

tests/explainers/conftest.py:16: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:691: in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1825: in from_pretrained
    return cls._from_pretrained(
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1988: in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5_fast.py:133: in __init__
    super().__init__(
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:114: in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:1307: in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:445: in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/utils/sentencepiece_model_pb2.py:91: in <module>
    _descriptor.EnumValueDescriptor(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'google.protobuf.descriptor.EnumValueDescriptor'>, name = 'UNIGRAM'
index = 0, number = 1, type = None, options = None, serialized_options = None
create_key = <object object at 0x7f886b0750d0>

    def __new__(cls, name, index, number,
                type=None,  # pylint: disable=redefined-builtin
                options=None, serialized_options=None, create_key=None):
>     _message.Message._CheckCalledFromGeneratedFile()
E     TypeError: Descriptors cannot not be created directly.
E     If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
E     If you cannot immediately regenerate your protos, some other possible workarounds are:
E      1. Downgrade the protobuf package to 3.20.x or lower.
E      2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
E     
E     More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

In the future, we might be able to relax this pin if the transformers library is updated, or if we find an alternative tokenizer model that was trained with a more recent version of protobuf.
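For reference, the pin could be expressed in a test-requirements file along these lines (the exact file name and whether the pin belongs in a test extra are assumptions, not something this PR specifies):

```
# Pin protobuf below 4.x: the tiny T5 tokenizer model triggers a
# "Descriptors cannot not be created directly" TypeError on newer protobuf.
protobuf<=3.20.3
```

Using `<=3.20.3` rather than `==3.20.x` keeps the constraint as loose as the error allows, which makes it easier to relax later.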

@connortann connortann added the ci Relating to Continuous Integration / GitHub Actions label Jun 26, 2023
@connortann connortann changed the title Use faster test translation scenario CHORE: Use faster test translation scenario Jun 26, 2023

codecov bot commented Jun 26, 2023

Codecov Report

Merging #3046 (1bb00b8) into master (9d72ec7) will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #3046   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files          90      90           
  Lines       12850   12850           
======================================
  Misses      12850   12850           


@connortann connortann changed the title CHORE: Use faster test translation scenario CHORE: Use faster test translation scenario, faster tests by ~5min Jun 27, 2023
@connortann connortann added the enhancement Indicates new feature requests label Jun 27, 2023
@connortann connortann self-assigned this Jun 27, 2023
@connortann connortann changed the title CHORE: Use faster test translation scenario, faster tests by ~5min CHORE: Use faster test translation scenario Jun 27, 2023
@connortann connortann marked this pull request as ready for review June 27, 2023 10:04
@connortann connortann requested a review from thatlittleboy June 27, 2023 10:05
@connortann connortann changed the title CHORE: Use faster test translation scenario CHORE: Use faster test translation scenario, cut CI time by ~5mins Jun 27, 2023
@thatlittleboy (Collaborator) left a comment


Great work!

