run_language_modeling.py #4
Update: I have found the replacement files (run_clm.py, run_mlm.py, run_plm.py). When running run_clm_no_trainer.py (since I'm using MIMIC data to train), I get this error:

ModuleNotFoundError: No module named 'datasets_modules.datasets.mimic_string'

I am running on Colab. This is the command:

!python3 'gdrive/My Drive/UmlsBERT-master/language-modeling/run_clm_no_trainer.py' --output_dir 'gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1' --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --learning_rate 5e-5 --block_size 128 --seed 42 --dataset_config_name 'gdrive/My Drive/UmlsBERT-master/language-modeling/config.json' --dataset_name 'gdrive/My Drive/UmlsBERT-master/language-modeling/mimic_string.txt'

Here is the full output:

2021-07-06 10:08:00.087779: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
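From the error, my understanding (and I may be wrong) is that --dataset_name expects a dataset on the Hugging Face Hub, so passing a local .txt path makes the datasets library look for a loading script called mimic_string. If I read the example scripts correctly, local files should go in via --train_file instead (and --dataset_config_name is for Hub dataset configurations, not the model config), so I would try something like this (same paths as above, untested):

```bash
python3 'gdrive/My Drive/UmlsBERT-master/language-modeling/run_clm_no_trainer.py' \
  --output_dir 'gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1' \
  --model_name_or_path emilyalsentzer/Bio_ClinicalBERT \
  --learning_rate 5e-5 \
  --block_size 128 \
  --seed 42 \
  --train_file 'gdrive/My Drive/UmlsBERT-master/language-modeling/mimic_string.txt'
```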
Hello, I am facing issues with the run_language_modeling.py script when running the example for pretraining Bio_ClinicalBERT, using this command:
python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt
Issue 1 - the script failed because the tokenizer does not have an attribute called max_len. This was the error:
'AttributeError: 'BertTokenizerFast' object has no attribute 'max_len' '
Based on advice online, I updated the script from 'tokenizer.max_len' to 'tokenizer.model_max_length', which seems to have resolved this issue.
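For reference, this is the kind of one-line change I made in run_language_modeling.py (a sketch only; the exact line differs between versions of the script):

```python
# Before (fails on recent transformers releases, where max_len was removed):
# block_size = tokenizer.max_len

# After (model_max_length is the current attribute on fast tokenizers):
block_size = tokenizer.model_max_length
```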
Issue 2 - the current error message I am getting is:
'TypeError: __init__() got an unexpected keyword argument 'tui_ids''
While looking for answers to these online, I came across a comment on the Hugging Face Transformers issue tracker at huggingface/transformers#8739.
They said: 'It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.'
Does this apply to the script for UmlsBERT as well? If so, how can I access the updated script? If not, how can I resolve the tui_ids issue?
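In case it is useful, here is a small diagnostic I put together (my own check, not from the UmlsBERT repo) to see whether the dataset class being imported actually accepts tui_ids:

```python
# Check whether the LineByLineTextDataset being imported accepts tui_ids.
# LineByLineTextDataset is the stock transformers class; if 'tui_ids' is
# missing from this signature, the script is resolving the stock class
# (or a transformers version newer than the one the repo was written
# against) rather than a UmlsBERT-modified one.
import inspect
from transformers import LineByLineTextDataset

print(inspect.signature(LineByLineTextDataset.__init__))
```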
I am running the scripts on Google Colab. This is the complete output I get when I run:
python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt
output:
2021-07-05 09:47:55.207129: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
07/05/2021 09:47:57 - WARNING - main - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
07/05/2021 09:47:57 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1/runs/Jul05_09-47-57_d7624bb0fdc5,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=150000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=clinicalBert-v1,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
save_on_each_node=False,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
/usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:847: FutureWarning: The class AutoModelWithLMHead is deprecated and will be removed in a future version. Please use AutoModelForCausalLM for causal language models, AutoModelForMaskedLM for masked language models and AutoModelForSeq2SeqLM for encoder-decoder models.
FutureWarning,
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
Traceback (most recent call last):
File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 355, in
main()
File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 248, in main
tui_ids=tui_ids) if training_args.do_train else None
File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 136, in get_dataset
tui_ids=tui_ids)
TypeError: __init__() got an unexpected keyword argument 'tui_ids'
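Reading the traceback, get_dataset passes tui_ids=tui_ids into a dataset constructor whose __init__ does not accept that keyword. My rough mental model of what the repo intends (my own illustration, not the actual UmlsBERT code) is a subclass along these lines; the error would then mean the stock class is being constructed instead:

```python
# My own illustration of the mechanism, not the actual UmlsBERT code:
# a UMLS-aware dataset would need an __init__ that accepts tui_ids;
# constructing the stock class with that keyword raises exactly this TypeError.
from transformers import LineByLineTextDataset

class UmlsLineByLineTextDataset(LineByLineTextDataset):
    def __init__(self, tokenizer, file_path, block_size, tui_ids=None):
        super().__init__(tokenizer=tokenizer, file_path=file_path, block_size=block_size)
        self.tui_ids = tui_ids  # kept for UMLS-aware masking later (my assumption)
```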
P.S. I am not an expert programmer, so do let me know if I should provide any further information, as this is the first time I'm submitting an issue.
Thank you.
Best,
Jaya