AttributeError: 'BertTokenizerFast' object has no attribute 'max_len' #8739

Closed
zcain117 opened this issue Nov 23, 2020 · 13 comments

Comments

@zcain117
Contributor

Environment info

  • transformers version: 4.0.0-rc-1
  • Platform: Linux-4.9.0-14-amd64-x86_64-with-debian-9.13
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.8.0a0+4ed7f36 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: 8-core TPU training
  • Using TPU

Who can help

albert, bert, GPT2, XLM: @LysandreJik

Information

Model I am using (Bert, XLNet ...): bert and roberta

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: mlm
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

2 examples of failing commands:

python examples/xla_spawn.py \
  --num_cores 8 \
  examples/contrib/legacy/run_language_modeling.py \
  --logging_dir ./tensorboard-metrics \
  --cache_dir ./cache_dir \
  --train_data_file /datasets/wikitext-103-raw/wiki.train.raw \
  --do_train \
  --do_eval \
  --eval_data_file /datasets/wikitext-103-raw/wiki.valid.raw \
  --overwrite_output_dir \
  --output_dir language-modeling \
  --logging_steps 100 \
  --save_steps 3000 \
  --overwrite_cache \
  --tpu_metrics_debug \
  --mlm --model_type=bert \
  --model_name_or_path bert-base-cased \
  --num_train_epochs 3 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16

Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 359, in _mp_fn
    main()
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 279, in main
    data_args.block_size = tokenizer.max_len
AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'

python examples/xla_spawn.py \
  --num_cores 8 \
  examples/contrib/legacy/run_language_modeling.py \
  --logging_dir ./tensorboard-metrics \
  --cache_dir ./cache_dir \
  --train_data_file /datasets/wikitext-103-raw/wiki.train.raw \
  --do_train \
  --do_eval \
  --eval_data_file /datasets/wikitext-103-raw/wiki.valid.raw \
  --overwrite_output_dir \
  --output_dir language-modeling \
  --logging_steps 100 \
  --save_steps 3000 \
  --overwrite_cache \
  --tpu_metrics_debug \
  --mlm --model_type=roberta \
  --tokenizer=roberta-base \
  --num_train_epochs 5 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8

Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 359, in _mp_fn
    main()
  File "/transformers/examples/contrib/legacy/run_language_modeling.py", line 279, in main
    data_args.block_size = tokenizer.max_len
AttributeError: 'RobertaTokenizerFast' object has no attribute 'max_len'

The timing of this issue lines up with #8586
Tests started failing on the evening of Nov 17, a few hours after that PR was submitted

@LysandreJik
Member

It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.

Is it possible for you to switch to one of these newer scripts? If not, the fix is to change max_len to model_max_length. We welcome PRs to fix it, but we won't be maintaining that script ourselves, as there exist better alternatives now (which run on TPU too 🙂)
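
For anyone who still needs the legacy script, the patch is the single line shown in the traceback above. A minimal sketch of the idea, assuming a standard transformers v4 install and leaving the rest of the script untouched:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# transformers v4 removed `tokenizer.max_len`; the renamed attribute is:
block_size = tokenizer.model_max_length  # 512 for bert-base-cased

# So in examples/contrib/legacy/run_language_modeling.py (line 279 in the traceback),
# `data_args.block_size = tokenizer.max_len` becomes:
# data_args.block_size = tokenizer.model_max_length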

@zcain117
Contributor Author

Thanks for taking a look! I will try out the new script

@zcain117
Contributor Author

The new runner is working for us on TPUs. Thanks again for the tip!

@Shafi2016

Shafi2016 commented Nov 25, 2020

Hello, everything was working fine until a few days ago. Now I am getting the same error: "data_args.block_size = min(data_args.block_size, tokenizer.max_len)
AttributeError: 'RobertaTokenizerFast' object has no attribute 'max_len'".

I can't switch to a new script as you mentioned. Kindly help me with this error; I do not know how to fix it. Here is my chunk of code.


`!python "/content/transformers/examples/contrib/legacy/run_language_modeling.py" \
    --output_dir "/content/drive/MyDrive/Vancouver" \
    --model_name_or_path roberta-base \
    --do_train \
    --per_gpu_train_batch_size 8 \
    --seed 42 \
    --train_data_file "/content/input_textOC.txt" \
    --block_size 256 \
    --line_by_line \
    --learning_rate 6e-4 \
    --num_train_epochs 3 \
    --save_total_limit 2 \
    --save_steps 200 \
    --weight_decay 0.01 \
     --mlm`

@LysandreJik
Member

It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.

Is it possible for you to switch to one of these newer scripts? If not, the fix is to change max_len to model_max_length. We welcome PRs to fix it, but we won't be maintaining that script ourselves, as there exist better alternatives now (which run on TPU too 🙂)

The fix is mentioned above:

fix is to change max_len to model_max_length
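
Applied to the line from that error, the rename looks roughly like this (a sketch assuming the rest of the legacy script stays as-is):

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
block_size = 256

# Before (fails on transformers v4, where `max_len` was removed):
#   block_size = min(block_size, tokenizer.max_len)
# After:
block_size = min(block_size, tokenizer.model_max_length)  # min(256, 512) -> 256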

@LysandreJik
Member

If you cannot switch scripts, I recommend pinning the library. You're having this error because you're using a legacy script with a master version that is not compatible.

You could pin it to v3.5.1.

@Shafi2016

Thanks, I appreciate your response. However, I am still a beginner. Can you please explain it a bit? How do I pin it to v3.5.1? Does that mean using the old version of Hugging Face Transformers?

@LysandreJik
Member

If you wish to stick to that deprecated example, yes! You can do so by checking out the tag v3.5.1:

git checkout v3.5.1

If you have installed transformers from pypi (and not from source), you should also update your transformers version:

pip install -U transformers==3.5.1

Please note that the script won't be in "/content/transformers/examples/contrib/legacy/run_language_modeling.py" anymore, but in "/content/transformers/examples/language-modeling/run_language_modeling.py"

LysandreJik pushed a commit that referenced this issue Feb 24, 2021
This fixes the deprecated reference to `tokenizer.max_len` by replacing it with `tokenizer.model_max_length` - similar to [issue 8739](#8739) and [PR 8604](#8604).
Example [here](https://colab.research.google.com/gist/poedator/f8776349e5c625ce287fc6fcd312fa1e/tokenizer-max_len-error-in-transformers_glue.ipynb). The error happens when `glue_convert_examples_to_features` is called without the `max_length` parameter specified. In that case, line 119 with the wrong reference gets called. This simple fix should do it.
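
In spirit the commit is the same one-attribute rename, applied inside the GLUE helper when the caller omits `max_length` (a rough sketch of the idea, not the verbatim upstream diff):

# Inside glue_convert_examples_to_features(...), where max_length may be None:
if max_length is None:
    # was: max_length = tokenizer.max_len  (attribute removed in v4)
    max_length = tokenizer.model_max_length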
@jayachaturvedi

It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.

Hello, I am facing the same issue with run_language_modeling.py (and more). Where can I find this new file language-modeling/run_{clm, plm, mlm}.py? Thanks!

@LysandreJik
Member

@jayachaturvedi

Thank you!

@mzhadigerov

It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.

Is it possible for you to switch to one of these newer scripts? If not, the fix is to change max_len to model_max_length. We welcome PRs to fix it, but we won't be maintaining that script ourselves, as there exist better alternatives now (which run on TPU too 🙂)

Change max_len to model_max_length where?

@riaraju

riaraju commented Jul 22, 2024

fix is to change max_len to model_max_length

In which files should this be fixed?

JasperD-UGent added a commit to JasperD-UGent/diaparser that referenced this issue Dec 10, 2024
Fix deprecated argument that was removed (see huggingface/transformers#8739).