# Pointer Summarization

Accompanying code for a Master's Thesis on *Neural Automatic Summarization*, written at the IT University of Copenhagen.
Focused on reimplementing, optimizing, and improving the Pointer Generator from [See et al. 2017](https://arxiv.org/abs/1704.04368).
For all the experiments described in the thesis, refer to the [experiments](experiments) folder for the corresponding configuration files.
For the chunk-based approaches described in the thesis, a new repository will be published in the near future.
All code has been formatted using [black](https://github.com/python/black).
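As background, the Pointer Generator's final output distribution mixes a vocabulary distribution with the attention (copy) distribution via a generation probability p_gen, which also lets it emit source OOV words. A minimal sketch with toy numbers (function name and values are illustrative, not taken from this codebase):

```python
def pointer_generator_dist(p_gen, p_vocab, attention, src_ids, vocab_size):
    """Mix generation and copy distributions (See et al. 2017).

    p_gen:     scalar in [0, 1], probability of generating from the vocabulary
    p_vocab:   list of length vocab_size, softmax over the fixed vocabulary
    attention: list over source positions, sums to 1
    src_ids:   extended-vocabulary id of each source token (OOVs get ids >= vocab_size)
    """
    n_oov = max((i - vocab_size + 1 for i in src_ids if i >= vocab_size), default=0)
    final = [p_gen * p for p in p_vocab] + [0.0] * n_oov
    for pos, tok_id in enumerate(src_ids):
        final[tok_id] += (1.0 - p_gen) * attention[pos]  # copy mass from the source
    return final

# toy example: vocabulary of 4, source tokens [2, 4] where id 4 is an OOV
dist = pointer_generator_dist(0.7, [0.1, 0.2, 0.3, 0.4], [0.5, 0.5], [2, 4], 4)
```

Note that the OOV token (id 4) receives probability mass purely from copying, which is what allows the model to produce words outside its fixed vocabulary.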

A pretrained base model can be downloaded [here](https://web.tresorit.com/l#TQNev2-hWI5W81dfh79Z6Q).
If set up correctly, evaluation on the CNNDM test set should produce the following ROUGE F1-scores:
```bash
ROUGE-1: 39.03
ROUGE-2: 17.01
ROUGE-L: 36.25
```
A pretrained base model with a vocabulary of size 20K and unknown token blocking can be downloaded [here](https://web.tresorit.com/l#Vk7w4jdkLekIZwJBUQHUcQ).
Evaluating it on the CNNDM test set should produce the following ROUGE F1-scores:
```bash
ROUGE-1: 39.26
ROUGE-2: 17.16
ROUGE-L: 36.42
```
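For reference, ROUGE-N F1 is the harmonic mean of n-gram precision and recall between a candidate summary and a reference. A simplified sketch (the official Perl implementation additionally applies stemming and other preprocessing, so scores will differ):

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1: n-gram overlap, no stemming or other preprocessing."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge_n_f1("the cat sat on the mat", "the cat lay on the mat")
```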

**NOTE**: the code provided supports several experimental features that were not discussed in the thesis.

## Overview
1. [Quick Start](#quick-start)
2. [Setup](#setup)
    1. [Dependencies](#dependencies)
    2. [Data](#data)
        1. [Newsroom](#newsroom)
        2. [New York Times](#new-york-times)
3. [Training](#training)
4. [Evaluation](#evaluation)
5. [License](#license)

# Quick Start
- Install all dependencies according to the [Dependencies](#dependencies) section
- Download the preprocessed CNNDM data [here](https://web.tresorit.com/l#Ha8s-v4PCbsyxe9X00Ojnw) and extract it into the [data](data) folder
- Train and evaluate a base model with `python train.py -cfg experiments/base.yaml --eval`
- Alternatively, evaluate a pretrained model with `python evaluate.py log/base.tar`

# Setup
All development and testing was done using PyTorch 1.0.1 and Python 3.7.3, but other versions may work fine.

## Dependencies
```bash
pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt')"
```

### ROUGE
To use the official ROUGE-1.5.5 Perl implementation, download it [here](https://web.tresorit.com/l#BPSRMOtfRtK3PE8vjL-U9Q) and extract it into the [tools](tools) folder.
You should now have a 'ROUGE-1.5.5' folder inside your tools folder.
The Python wrapper [pyrouge](https://pypi.org/project/pyrouge/) is set up to use the extracted folder.
Alternatively, modify [evaluate.py](evaluate.py) to use a system-wide ROUGE configuration, or evaluate using [py-rouge](https://pypi.org/project/py-rouge/) (see the [Evaluation](#evaluation) section).

The official ROUGE-1.5.5 Perl implementation relies on Perl libraries that may not be installed by default.
We provide instructions for Arch Linux and Ubuntu:
- On Arch Linux: `sudo pacman -S perl-xml-xpath`
- On Ubuntu: `sudo apt-get install libxml-parser-perl`

For installation tips on Windows, or in general, refer to [this](https://stackoverflow.com/questions/47045436/how-to-install-the-python-package-pyrouge-on-microsoft-windows) StackOverflow post.


## Data
We provide preprocessed data for CNNDM [here](https://web.tresorit.com/l#Ha8s-v4PCbsyxe9X00Ojnw).
The tarfile includes the train, dev, and test sets, as well as vocabularies both with and without proper noun filtering.
For easy setup, extract the tarfile into the data directory.

To manually preprocess CNNDM, refer to Abigail See's [repository](https://github.com/abisee/cnn-dailymail).
To download CNNDM data already preprocessed according to that repository, refer to Jaffer Wilson's [repository](https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail).
Either way, you will end up with files in binary format.
We use tsv-files instead; binary files can be converted using the following script:
```bash
# note: 'bin_to_tsv.py' depends on TensorFlow, which is not in requirements.txt
python tools/bin_to_tsv.py path/to/train.bin data/cnndm_abisee_train.tsv
```

To create a new vocabulary file of size 50,000 from a tsv-file, use [vocabulary.py](vocabulary.py):
```bash
python vocabulary.py cnndm_abisee_train.tsv cnndm_abisee.vocab 50000
```
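Conceptually, building such a vocabulary amounts to counting token frequencies over the training text and keeping the top K. A stand-alone sketch of that idea (the tab-separated column layout and whitespace tokenization are assumptions for illustration, not taken from vocabulary.py):

```python
from collections import Counter

def build_vocab(tsv_lines, size):
    """Count whitespace tokens over all tsv columns; keep the `size` most frequent."""
    counts = Counter()
    for line in tsv_lines:
        for column in line.rstrip("\n").split("\t"):
            counts.update(column.split())
    return [token for token, _ in counts.most_common(size)]

# toy corpus of two article<TAB>summary rows
vocab = build_vocab(["the cat sat\tthe cat", "a dog ran\ta dog"], size=3)
```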

Lastly, to train using pretrained GloVe embeddings, download them from the [official GloVe website](https://nlp.stanford.edu/projects/glove/), and convert them to a compatible format using:
```bash
python tools/glove_to_w2v.py path/to/glove.6B.100d.txt data/glove100.w2v
```
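The conversion is likely little more than prepending the header line (vector count and dimensionality) that the word2vec text format expects and GloVe files lack. A sketch under that assumption (function name is illustrative, not the script's actual contents):

```python
def glove_to_w2v(glove_lines):
    """Prepend the 'num_vectors dim' header required by the word2vec text format."""
    lines = [ln.rstrip("\n") for ln in glove_lines if ln.strip()]
    dim = len(lines[0].split()) - 1  # first field on each line is the token itself
    return [f"{len(lines)} {dim}"] + lines

converted = glove_to_w2v(["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"])
```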

### Newsroom
The Newsroom dataset can be downloaded from its [official website](https://summari.es/).
For preprocessing, we supply a [simple script](tools/preprocess_newsroom.py) using NLTK, which can be used as follows:
```bash
python tools/preprocess_newsroom.py release/train.jsonl.gz newsroom_train.tsv
```
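The core of such preprocessing is mapping each jsonl record to an article/summary tsv row. A simplified sketch of that step (the field names `text` and `summary` are assumptions about the Newsroom schema, whitespace collapsing stands in for the script's NLTK tokenization, and gzip reading of the actual `.jsonl.gz` file is omitted):

```python
import json

def newsroom_to_tsv(jsonl_lines):
    """Map Newsroom-style jsonl records to article<TAB>summary rows.

    Assumes 'text' and 'summary' fields; the repository's script additionally
    tokenizes with NLTK rather than just collapsing whitespace.
    """
    rows = []
    for line in jsonl_lines:
        record = json.loads(line)
        article = " ".join(record["text"].split())   # collapse newlines/whitespace
        summary = " ".join(record["summary"].split())
        rows.append(f"{article}\t{summary}")
    return rows

rows = newsroom_to_tsv(['{"text": "A story.\\nMore text.", "summary": "A summary."}'])
```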


### New York Times
The New York Times Annotated Corpus can be acquired through the [LDC](https://catalog.ldc.upenn.edu/LDC2008T19).
For preprocessing, we follow [Paulus et al. 2017](https://arxiv.org/abs/1705.04304), using a [script](tools/preprocess_nyt.py) supplied by the authors. Note that this requires a local [CoreNLP](https://github.com/stanfordnlp/CoreNLP) server and that the script takes a long time to run.

# Training
To train a model using one of the supplied configuration files, use the following command:
```bash
python train.py -cfg experiments/base.yaml
# config-file parameters can also be overridden on the fly
python train.py -cfg experiments/base.yaml --rnn_cell lstm
```

To resume a model that was cancelled or interrupted, use:
```bash
python train.py --resume_from log/model.tar
# optionally, some parameters can be changed when resuming
python train.py --resume_from log/model.tar --batch_size 32
```

To resume a model trained without coverage, converting it to use coverage from that point on, use:
```bash
# note that this is only tested with the default attention configuration
python train.py --resume_from log/model.tar --convert_to_coverage
```
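For context, See et al.'s coverage mechanism keeps a running sum of past attention distributions and adds a loss term that penalizes re-attending to already-covered source positions. A toy sketch of that loss term (plain Python for illustration, not this repository's implementation):

```python
def coverage_loss(attention_steps):
    """Sum over decoder steps of sum_i min(attention_i, coverage_i), where
    coverage is the running sum of all previous attention distributions."""
    coverage = [0.0] * len(attention_steps[0])
    loss = 0.0
    for attn in attention_steps:
        loss += sum(min(a, c) for a, c in zip(attn, coverage))
        coverage = [a + c for a, c in zip(attn, coverage)]
    return loss

# attending the same source position twice incurs a penalty...
repeat_loss = coverage_loss([[1.0, 0.0], [1.0, 0.0]])
# ...while spreading attention over new positions does not
spread_loss = coverage_loss([[1.0, 0.0], [0.0, 1.0]])
```

This is what discourages the repetition that plain attention decoders are prone to.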

See [train.py](train.py) and [config.py](config.py) for all possible options.

# Evaluation
To evaluate a model on a test set using the official ROUGE-1.5.5 Perl implementation:
```bash
python evaluate.py log/model.tar path/to/test_set.tsv
```

To evaluate using [py-rouge](https://pypi.org/project/py-rouge/), a Python reimplementation with fewer dependencies, use:
```bash
# note that py-rouge does not produce ROUGE scores identical to the Perl implementation
python evaluate.py log/model.tar path/to/test_set.tsv --use_python
```

We support many test-time parameters that can be passed to evaluate.py.
Refer to [config.py](config.py), and possibly [beam_search.py](beam_search.py), for all options.
Some example uses follow:
```bash
python evaluate.py log/model.tar path/to/test_set.tsv --length_normalize wu --length_normalize_alpha 1.0
python evaluate.py log/model.tar path/to/test_set.tsv --beam_size 8
python evaluate.py log/model.tar path/to/test_set.tsv --block_ngram_repeat 3
python evaluate.py log/model.tar path/to/test_set.tsv --block_unknown

# save summaries and configuration to a json-file
python evaluate.py log/model.tar path/to/test_set.tsv --save
```
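The length-normalization and n-gram-blocking flags correspond to standard beam-search tricks. A sketch of both (the Wu et al. 2016 length penalty is a published formula; the blocking helper is an illustrative stand-in for what beam_search.py does, not its actual code):

```python
def wu_length_penalty(length, alpha=1.0):
    """Length penalty from Wu et al. 2016: beam scores are divided by this value,
    so longer hypotheses are not unfairly penalized by summed log-probabilities."""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def violates_ngram_block(tokens, next_token, n=3):
    """True if appending next_token would repeat an n-gram already produced."""
    candidate = tokens + [next_token]
    if len(candidate) < n:
        return False
    last = tuple(candidate[-n:])
    seen = {tuple(candidate[i:i + n]) for i in range(len(candidate) - n)}
    return last in seen

penalty = wu_length_penalty(7, alpha=1.0)
blocked = violates_ngram_block([1, 2, 3, 1, 2], 3, n=3)  # would repeat (1, 2, 3)
```

During decoding, a hypothesis whose extension violates the block would have that token's score set to negative infinity, removing it from consideration.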
Note that [util.py](util.py) can be used to easily inspect the attributes of a model (see the module documentation for further information).

# License
**NOTE:** [preprocess_nyt.py](tools/preprocess_nyt.py), [plot.py](tools/plot.py) and [jsonl.py](tools/jsonl.py) all have separate licenses. See each file's header for specifics.
All other code is distributed under MIT:
___

MIT License

Copyright (c) 2019 Emil Lynegaard

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.