
tokenizer started throwing this warning, "Truncation was not explicitely activated but max_length is provided a specific value, please use truncation=True to explicitely truncate examples to max length. Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior." #5397

Closed
saahiluppal opened this issue Jun 30, 2020 · 15 comments
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments

@saahiluppal

Recently while experimenting, BertTokenizer started throwing this warning:

Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior.

I know the warning asks me to provide a truncation value.
I'm asking here because the warning only started appearing this morning.

@LysandreJik
Member

This is because we recently upgraded the library to version v3.0.0, which has an improved tokenizers API. You can either disable warnings or pass truncation=True to remove that warning (as indicated in the warning).
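
For example, a minimal sketch of the second option (the model name here is just a placeholder):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Passing truncation=True together with max_length silences the warning.
encoded = tokenizer.encode_plus(
    "some example text",
    max_length=128,
    truncation=True,
)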

@rainean

rainean commented Jul 2, 2020

How do you disable the warnings for this? I'm encountering the same issue, but I don't want to set truncation=True.

@LysandreJik
Member

LysandreJik commented Jul 2, 2020

You can disable the warnings with:

import logging
logging.basicConfig(level=logging.ERROR)
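
If you'd rather not raise the level for every logger, a more targeted sketch is to silence only the module that emits this message (the logger name is assumed from the WARNING: prefix the message carries):

import logging

# Silence only the tokenizer-base module instead of all loggers.
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)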

@tutmoses

I've changed the logging level and removed max_length, but am still getting this warning:

WARNING:transformers.tokenization_utils_base:Truncation was not explicitely activated but max_length is provided a specific value, please use truncation=True to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.

@LysandreJik
Member

Which version are you running? Can you try installing v3.0.2 to see if it fixes this issue?

@wise-east

I've tried with v3.0.2 and I'm still getting the same warning messages, even after changing the logging level with the code snippet above.

@thomwolf
Member

@tutmoses @wise-east can you give us a self-contained code example reproducing the behavior?

@iamxinxin

I'm running into the same issue.

@saahiluppal
Author

Update the transformers library to v3 and explicitly pass truncation=True when encoding text with the tokenizer.

@RBeaudet

RBeaudet commented Aug 13, 2020

I could reproduce the warning with this code:

from transformers import CamembertTokenizer
from transformers.data.processors.utils import SingleSentenceClassificationProcessor

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

texts = ["hi", "hello", "salut", "bonjour"]
labels = [0, 0, 1, 1,]

processor = SingleSentenceClassificationProcessor().create_from_examples(texts, labels)
dataset = processor.get_features(tokenizer=tokenizer)

davidalami added a commit to davidalami/transformers-tutorials that referenced this issue Aug 31, 2020
Fixes the issue similar to the mentioned in huggingface/transformers#5397, starting from transformers version v3.0.0.
@jusugac1

jusugac1 commented Sep 1, 2020

Hello,

Using the following command solved the problem:

import logging
logging.basicConfig(level=logging.ERROR)

However, since 15:40 today (Paris time), it no longer works, and the following warning keeps popping up until it crashes Google Colab:

Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

Could you please tell me how to solve this? I also tried deactivating truncation in the encode_plus call:

encoded_dict = tokenizer.encode_plus(
    sent,                        # Sentence to encode.
    add_special_tokens=True,     # Add '[CLS]' and '[SEP]'.
    max_length=128,              # Pad & truncate all sentences.
    pad_to_max_length=True,
    return_attention_mask=True,  # Construct attn. masks.
    return_tensors='pt',         # Return pytorch tensors.
    truncation=False,
)

But it did not work.

Thanks for your help/replies,

----------EDIT---------------

I modified my code in the following way by setting truncation=True, as suggested in this thread. It worked perfectly! From what I understand, this takes into account the max_length I'm applying and prevents the warning from coming up.

encoded_dict = tokenizer.encode_plus(
    sent,                        # Sentence to encode.
    add_special_tokens=True,     # Add '[CLS]' and '[SEP]'.
    max_length=128,              # Pad & truncate all sentences.
    pad_to_max_length=True,
    return_attention_mask=True,  # Construct attn. masks.
    return_tensors='pt',         # Return pytorch tensors.
    truncation=True,
)

J.

@Kerry-zzx

truncation=True solves the problem:

tokenizer = BertTokenizer.from_pretrained(cfg.text_model.pretrain)
lengths = [len(tokenizer.tokenize(c)) + 2 for c in captions]
captions_ids = [torch.LongTensor(tokenizer.encode(c, max_length=max_len, pad_to_max_length=True, truncation=True))
                for c in captions]

@swuxyj

swuxyj commented Nov 9, 2020

Not an elegant solution: modify the transformers source code (~/python/site-packages/transformers/tokenization_utils_base.py) at line 1751 to avoid this warning:

            if 0:       #if verbose:
                logger.warning(
                    "Truncation was not explicitely activated but `max_length` is provided a specific value, "
                    "please use `truncation=True` to explicitely truncate examples to max length. "
                    "Defaulting to 'longest_first' truncation strategy. "
                    "If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy "
                    "more precisely by providing a specific strategy to `truncation`."
                )
            truncation = "longest_first"

@IamHimon

Add truncation=True to tokenizer.encode_plus(truncation=True).
Worked for me!

@liuyang1123

This line is effective:
tokenizer.deprecation_warnings["Truncation-not-explicitly-activated"] = True
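
For context, a minimal sketch of how this can be used (the model name is a placeholder): the tokenizer appears to emit the warning only when that key is unset, flipping it to True after the first emission, so pre-setting it suppresses the message for that tokenizer instance.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-mark the warning as already shown for this tokenizer instance.
tokenizer.deprecation_warnings["Truncation-not-explicitly-activated"] = True
ids = tokenizer.encode("some text", max_length=128)  # no truncation warning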
