We provide the source code for the paper "Structure-Infused Copy Mechanisms for Abstractive Summarization", accepted at COLING'18. If you find the code useful, please cite the following paper.
@inproceedings{song-zhao-liu:2018,
Author = {Kaiqiang Song and Lin Zhao and Fei Liu},
Title = {Structure-Infused Copy Mechanisms for Abstractive Summarization},
Booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING)},
Year = {2018}}
-
Our system seeks to re-write a lengthy sentence, often the 1st sentence of a news article, to a concise, title-like summary. The average input and output lengths are 31 words and 8 words, respectively.
-
The code takes as input a text file with one sentence per line. It writes the output to a text file in the same directory as the input, with a name ending in ".result.summary", in which each source sentence is replaced by a title-like summary.
-
Example input and output are shown below.
An estimated 4,645 people died in Hurricane Maria and its aftermath in Puerto Rico , according to an academic report published Tuesday in a prestigious medical journal .
hurricane maria kills 4,645 in puerto rico .
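As a rough illustration of the file convention described above, the sketch below pairs each source sentence with its generated summary. It assumes the output file name is the input file name with ".result.summary" appended; the helpers summary_path and read_pairs are hypothetical and not part of the released code.

```python
import io


def summary_path(input_path):
    # Assumption: the output name is the input name plus ".result.summary".
    return input_path + ".result.summary"


def read_pairs(input_path):
    """Return a list of (source sentence, summary) tuples, one per input line."""
    with io.open(input_path, encoding="utf-8") as src:
        sources = [line.rstrip("\n") for line in src]
    with io.open(summary_path(input_path), encoding="utf-8") as out:
        summaries = [line.rstrip("\n") for line in out]
    if len(sources) != len(summaries):
        raise ValueError("expected one summary per source sentence")
    return list(zip(sources, summaries))
```

For example, read_pairs("./test_data/test_000.txt") would return the toy sentences alongside their summaries once the testing script has been run.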
The code is written in Python (v2.7) and Theano (v1.0.1). We suggest the following environment:
- A Linux machine (Ubuntu) with GPU (Cuda 8.0)
- Python (v2.7)
- Theano (v1.0.1)
- Stanford CoreNLP
- Pyrouge
To install Python (v2.7), run the command:
$ wget https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
$ bash Anaconda2-5.0.1-Linux-x86_64.sh
$ source ~/.bashrc
To install Theano and its dependencies, run the commands below (you may want to add "export MKL_THREADING_LAYER=GNU" to "~/.bashrc" for future use).
$ conda install numpy scipy mkl nose sphinx pydot-ng
$ conda install theano pygpu
$ export MKL_THREADING_LAYER=GNU
To download the Stanford CoreNLP toolkit and use it as a server, run the command below. The CoreNLP toolkit helps derive structure information (part-of-speech tags, dependency parse trees) from source sentences.
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
$ unzip stanford-corenlp-full-2018-02-27.zip
$ cd stanford-corenlp-full-2018-02-27
$ nohup java -mx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 &
$ cd -
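To check that the server is up and to see the kind of structure information it returns, the sketch below sends one sentence to it and collects part-of-speech tags and dependency edges. It is not how toolkit.py talks to the server (that is not shown here); the helper names build_url, annotate, and extract_structure are hypothetical, and the host assumes the server is running locally on port 9000 as started above.

```python
import json

try:  # Python 3
    from urllib.parse import quote
    from urllib.request import Request, urlopen
except ImportError:  # Python 2.7
    from urllib import quote
    from urllib2 import Request, urlopen


def build_url(host="http://localhost:9000"):
    # Ask the server for POS tags and a dependency parse, returned as JSON.
    props = {"annotators": "tokenize,ssplit,pos,depparse", "outputFormat": "json"}
    return host + "/?properties=" + quote(json.dumps(props))


def annotate(sentence, host="http://localhost:9000"):
    # Requires the CoreNLP server to be running; the sentence is sent as the POST body.
    request = Request(build_url(host), data=sentence.encode("utf-8"))
    return json.loads(urlopen(request, timeout=15).read().decode("utf-8"))


def extract_structure(response):
    """Collect (word, POS tag) pairs and (relation, head, dependent) triples."""
    first = response["sentences"][0]
    tags = [(tok["word"], tok["pos"]) for tok in first["tokens"]]
    edges = [(dep["dep"], dep["governorGloss"], dep["dependentGloss"])
             for dep in first["basicDependencies"]]
    return tags, edges
```

Calling extract_structure(annotate("Hurricane Maria hits Puerto Rico .")) returns the tags and edges for the first sentence in the response.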
To install Pyrouge, run the command below. Pyrouge is a Python wrapper for the ROUGE toolkit, an automatic metric used for summary evaluation.
$ pip install pyrouge
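To give a feel for what ROUGE measures, here is a simplified unigram-overlap (ROUGE-1 style) score in plain Python. It is only an illustration: it is not pyrouge and not the official ROUGE toolkit, which applies its own tokenization and options, so the numbers will generally differ. The function name rouge_1 is hypothetical.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap recall, precision, and F1 on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    recall = overlap / float(sum(ref.values())) if ref else 0.0
    precision = overlap / float(sum(cand.values())) if cand else 0.0
    if recall + precision == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```

The first argument is the system summary and the second is the reference summary; recall is computed against the reference length.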
-
Clone this repo. Download the TAR file (model_coling18.tar.gz) containing vocabulary files and pretrained models. Move the TAR file to the folder "struct_infused_summ" and uncompress it.
$ git clone https://github.com/KaiQiangSong/struct_infused_summ/
$ mv model_coling18.tar.gz struct_infused_summ
$ cd struct_infused_summ
$ tar -xvzf model_coling18.tar.gz
$ rm model_coling18.tar.gz
-
Extract structural features from a list of input files. The file ./test_data/test_filelist.txt contains absolute (or relative) paths to individual files (test_000.txt and test_001.txt are toy files). Each file contains a number of source sentences, one sentence per line. Then, execute the command:
$ python toolkit.py -f ./test_data/test_filelist.txt
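If you have many input files, the file list can be generated rather than written by hand. The sketch below is one way to do it, assuming the list holds one path per line; the helper write_filelist is hypothetical and not part of the released code.

```python
import glob
import os


def write_filelist(data_dir, filelist_path, pattern="*.txt"):
    """Write the absolute path of every matching file, one path per line."""
    target = os.path.abspath(filelist_path)
    paths = sorted(os.path.abspath(p)
                   for p in glob.glob(os.path.join(data_dir, pattern)))
    # Skip the list file itself in case it lives in the same directory.
    paths = [p for p in paths if p != target]
    with open(filelist_path, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return paths
```

For example, write_filelist("./test_data", "./test_data/test_filelist.txt") lists every .txt file in ./test_data except the list file itself.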
-
Generate the model configuration file in the ./settings/ folder.
$ python genTestDataSettings.py ./test_data/test_filelist.txt ./settings/my_test_settings
After that, you need to modify the "dataset" field of the options_loader.py file to point it to the new settings file: 'dataset':'settings/my_test_settings.json'.
-
Run the testing script. The summary files are written to the same directory as the input, with names ending in ".result.summary".
$ python generate.py
struct_edge is the default model. It corresponds to the "2way+relation" architecture described in the paper. To enable the "2way+word" architecture, modify the file generate.py (lines 152-153) by globally replacing struct_edge with struct_node.
-
Create folders to save the model files. ./model/struct_node is for the "2way+word" architecture and ./model/struct_edge is for the "2way+relation" architecture.
$ mkdir -p ./model/struct_node ./model/struct_edge
-
Extract structural features from the input files. source_file.txt and summary_file.txt in the ./train_data/ folder are toy files containing source and summary sentences, one sentence per line. Often, tens of thousands of (source, summary) pairs are required for training.
$ python toolkit.py ./train_data/source_file.txt
$ python toolkit.py ./train_data/summary_file.txt
Adjust the file names using the commands below. The .Ndocument, .dfeature, and .Nsummary files contain the source sentences, structural features of source sentences, and summary sentences, respectively.
$ cd ./train_data/
$ mv source_file.txt.Ndocument train.Ndocument
$ mv source_file.txt.feature train.dfeature
$ mv summary_file.txt.Ndocument train.Nsummary
$ cd -
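Because source and summary sentences are paired line by line, it can help to confirm that the two raw files are aligned before extracting features. The check below is a minimal sketch; count_lines and check_parallel are hypothetical helpers, not part of the released code.

```python
import io


def count_lines(path):
    # Count lines without loading the whole file into memory.
    with io.open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(source_path, summary_path):
    """Raise if the two files do not contain the same number of lines."""
    n_source, n_summary = count_lines(source_path), count_lines(summary_path)
    if n_source != n_summary:
        raise ValueError("%d source lines vs. %d summary lines"
                         % (n_source, n_summary))
    return n_source
```

For example, check_parallel("./train_data/source_file.txt", "./train_data/summary_file.txt") returns the number of pairs or raises an error on a mismatch.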
-
Repeat the previous step for the validation data, which are used for early stopping. ./valid_data contains toy files.
$ python toolkit.py ./valid_data/source_file.txt
$ python toolkit.py ./valid_data/summary_file.txt
$ cd ./valid_data/
$ mv source_file.txt.Ndocument valid.Ndocument
$ mv source_file.txt.feature valid.dfeature
$ mv summary_file.txt.Ndocument valid.Nsummary
$ cd -
-
Generate the model configuration file in the ./settings/ folder.
$ python genTrainDataSettings.py ./train_data/train ./valid_data/valid ./settings/my_train_settings
After that, you need to modify the "dataset" field of the options_loader.py file to point to the new settings file: 'dataset':'settings/my_train_settings.json'.
-
Download the GloVe embeddings and uncompress.
$ wget http://nlp.stanford.edu/data/glove.6B.zip
$ unzip glove.6B.zip
$ rm glove.6B.zip
Modify the "vocab_emb_init_path" field in the file ./settings/vocabulary.json from "vocab_emb_init_path": "../../vocab/glove.6B.100d.txt" to "vocab_emb_init_path": "glove.6B.100d.txt".
-
Create a vocabulary file from ./train_data/train.Ndocument and ./train_data/train.Nsummary. Words appearing fewer than 5 times are excluded.
$ python get_vocab.py my_vocab
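The frequency cutoff can be pictured with the short sketch below. It is not the code in get_vocab.py; it assumes whitespace-tokenized text files, and build_vocab is a hypothetical helper used only to illustrate dropping words seen fewer than 5 times.

```python
from collections import Counter
import io


def build_vocab(paths, min_count=5):
    """Return words seen at least min_count times, most frequent first."""
    counts = Counter()
    for path in paths:
        with io.open(path, encoding="utf-8") as handle:
            for line in handle:
                counts.update(line.split())
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Sort by descending count, then alphabetically for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in kept]
```

With toy data, most words fall below the cutoff, which is one reason a large training set is needed in practice.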
-
Modify the path to the vocabulary file in train.py from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab') to Vocab_Giga = loadFromPKL('my_vocab.Vocab').
-
To train the model, run the command below.
$ THEANO_FLAGS='floatX=float32' python train.py
The training program stops when it reaches the maximum number of epochs (30 epochs). This number can be modified by changing the "max_epochs" field in ./settings/training.json. The model files are saved in the folder ./model/.
"2way+relation" is the default architecture. It uses the settings file ./settings/network_struct_edge.json. To train the "2way+word" architecture, modify the 'network' field of options_loader.py from 'settings/network_struct_edge.json' to './settings/network_struct_node.json'.
-
(Optional) Train the model with early stopping.
You might want to change the parameters used for early stopping. These are specified in ./settings/earlyStop.json and explained below. If early stopping is enabled, the best model files, model_best.npz and options_best.json, will be saved in the ./model/struct_edge/ folder.
{
"sample":true, # enable model checkpoint
"sampleMin":10000, # the first checkpoint occurs after 10K batches
"sampleFreq":2000, # there is a checkpoint every 2K batches afterwards
"sample_path":"./sample/",
"earlyStop":true, # enable early stopping
"earlyStop_method":"valid_err", # based on validation loss
"earlyStop_bound":62000, # the training program stops if the valid loss has no improvement after 62K batches
"rate_bound":24000 # halve the learning rate if the valid loss has no improvement after 24K batches
}
62K batches (used for earlyStop_bound) correspond to about 1 epoch for our dataset. 24K batches (used for rate_bound) is slightly less than half of an epoch.
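The interplay of the two bounds can be sketched as follows. This is one plausible reading of the settings, not the logic in train.py: it assumes the validation loss is measured at a fixed interval (the hypothetical check_every parameter) and that the learning-rate counter restarts after each halving. The function run_schedule is hypothetical.

```python
def run_schedule(valid_losses, check_every, early_stop_bound, rate_bound, lr=1.0):
    """Replay validation losses measured every `check_every` batches.

    Returns (batch at which training stops or None, final learning rate).
    """
    best = float("inf")
    best_batch = 0    # batch of the last improvement
    last_halved = 0   # batch of the last improvement or halving
    for step, loss in enumerate(valid_losses, start=1):
        batch = step * check_every
        if loss < best:
            best, best_batch, last_halved = loss, batch, batch
            continue
        if batch - last_halved >= rate_bound:
            lr /= 2.0  # no improvement for rate_bound batches
            last_halved = batch
        if batch - best_batch >= early_stop_bound:
            return batch, lr  # no improvement for early_stop_bound batches
    return None, lr
```

Under these assumptions, a loss that never improves after the first measurement, checked every 2K batches with the bounds above, halves the learning rate twice before training stops. With 62K batches per epoch, 24K batches is roughly 0.39 of an epoch.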
-
For the 2nd-stage training, switch to the file train_2.py. Modify the path to the vocabulary file in train_2.py from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab') to Vocab_Giga = loadFromPKL('my_vocab.Vocab') to point it to your vocabulary file.
-
Run the command below to perform the 2nd-stage training. Two files, ./model/struct_edge/model_check2_best.npz and ./model/struct_edge/options_check2_best.json, will be generated, containing the best model parameters and system configurations for the "2way+relation" architecture.
$ python train_2.py
This project is licensed under the BSD License - see the LICENSE.md file for details.
We gratefully acknowledge the work of Kelvin Xu, whose code in part inspired this project.