Code for the paper "Structural Pre-training for Dialogue Comprehension".
Because the full datasets exceed the GitHub file size limit, we only upload part of the data as examples. After downloading the full data, remember to point the scripts to the full data directory.
- Datasets can be downloaded from Ubuntu dataset, Douban dataset, and ECD dataset.
- Unzip the dataset and put the data directory into data/.
Use the scripts svo_annotate.py and svo_combine.py to generate the pre-annotated SVO files for quick access during model training.
Note: To keep the process from being killed for excessive memory use, we separate the annotation into chunks with start_id and end_id in svo_annotate.py, and then combine the annotated files with svo_combine.py.
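The chunked annotation described in the note can be sketched as follows. This is only an illustration of the idea, not the code in svo_annotate.py or svo_combine.py: the helper names chunk_ranges and combine_chunks are hypothetical, and treating end_id as exclusive is an assumption.

```python
from typing import Iterator, List, Tuple


def chunk_ranges(total: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start_id, end_id) ranges covering 0..total; end_id is exclusive (assumed)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start_id in range(0, total, chunk_size):
        yield start_id, min(start_id + chunk_size, total)


def combine_chunks(chunks: List[List[str]]) -> List[str]:
    """Concatenate per-chunk results in order, mirroring a combine step."""
    combined: List[str] = []
    for chunk in chunks:
        combined.extend(chunk)
    return combined


if __name__ == "__main__":
    examples = [f"utterance {i}" for i in range(10)]
    # Stand-in for the real SVO annotation: upper-case each line, one chunk at a time.
    annotated = [
        [line.upper() for line in examples[start:end]]
        for start, end in chunk_ranges(len(examples), chunk_size=4)
    ]
    print(list(chunk_ranges(len(examples), chunk_size=4)))
    print(len(combine_chunks(annotated)))
```

Processing one range per run keeps peak memory bounded by the chunk size, and combining the chunks in order preserves the original example order.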
The steps for further pre-training BERT with the NUP strategy are described below. We also provide our NUP language model trained on the Ubuntu training set, which can be downloaded here:
https://www.dropbox.com/s/d1earb9ta6drqoy/ubuntu_nup_bert_base.zip?dl=0
You can unzip the model into the ubuntu_nup_bert_base directory and then use it during model training.
The following scripts use training with the SBR objective as an example.
- Run make_lm_data.py to convert the original training data into a single file with one sentence (utterance) per line and one blank line between documents (dialogue contexts).
python nup_lm_finetuning/make_lm_data.py \
--data_file ../data/ubuntu_data/train.txt \
--output_file data/ubuntu_data/lm_train.txt
- Use pregenerate_training_data_sbr.py to pre-process the data into training examples following the NUP methodology.
python nup_lm_finetuning/pregenerate_training_data_sbr.py \
--train_corpus ../data/ubuntu_data/lm_train.txt \
--bert_model bert-base-uncased \
--do_lower_case \
--output_dir ../data/ubuntu_data/ubuntu_sbr \
--epochs_to_generate 1 \
--max_seq_len 512
- Train on the pregenerated data using finetune_on_pregenerated_sbr.py, pointing it to the folder created by pregenerate_training_data_sbr.py.
python nup_lm_finetuning/finetune_on_pregenerated_sbr.py \
--pregenerated_data ../data/ubuntu_data/ubuntu_sbr \
--bert_model bert-base-uncased \
--train_batch_size 64 \
--do_lower_case \
--output_dir ubuntu_finetuned_lm_sbr \
--epochs 1
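The corpus layout produced in the first step above (one utterance per line, one blank line between dialogue contexts) can be illustrated with a minimal sketch. The functions to_lm_corpus and from_lm_corpus are hypothetical helpers written for this illustration, not part of make_lm_data.py, and they assume each dialogue is already available as a list of utterance strings.

```python
from typing import Iterable, List


def to_lm_corpus(dialogs: Iterable[List[str]]) -> str:
    """Serialize dialogs as one utterance per line, with a blank line between dialogs."""
    documents = []
    for utterances in dialogs:
        # Drop empty utterances so a blank line only ever marks a dialog boundary.
        cleaned = [u.strip() for u in utterances if u.strip()]
        if cleaned:
            documents.append("\n".join(cleaned))
    return "\n\n".join(documents) + "\n"


def from_lm_corpus(text: str) -> List[List[str]]:
    """Parse the corpus text back into a list of dialogs."""
    dialogs = []
    for block in text.strip().split("\n\n"):
        lines = [line for line in block.split("\n") if line.strip()]
        if lines:
            dialogs.append(lines)
    return dialogs


if __name__ == "__main__":
    sample = [["hi", "hello"], ["how to install?", "use apt"]]
    print(to_lm_corpus(sample), end="")
```

Keeping blank lines exclusively for dialogue boundaries is what lets a later stage recover which utterances belong to the same context.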
- Train a model. If needed, change the --bert_model parameter to the path of the pre-trained language model, for example ubuntu_finetuned_lm for the Ubuntu dataset. An example:
python run_bert_sbr.py \
--data_dir data/ubuntu_data \
--task_name ubuntu \
--train_batch_size 64 \
--eval_batch_size 64 \
--max_seq_length 384 \
--max_utterance_num 20 \
--bert_model bert-base-uncased \
--cache_flag ubuntu \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--do_train \
--do_lower_case \
--output_dir experiments/ubuntu
To evaluate on the test set, replace --do_train with --do_eval.
Requirements: Python 3.6 and PyTorch 1.0.1.