Commit b0dd025 (0 parents)

add code, experiments and readme

60 files changed: +7327, -0 lines

.gitignore

+416 lines (large diff not rendered)

.pylintrc

+573 lines (large diff not rendered)

LICENSE.md

+9 lines

MIT License

Copyright (c) 2019 Emil Lynegaard

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md

+164 lines

# Pointer Summarization

Accompanying code for a Master's Thesis on *Neural Automatic Summarization*, written at the IT University of Copenhagen.
The thesis focuses on reimplementing, optimizing and improving the Pointer Generator from [See et al. 2017](https://arxiv.org/abs/1704.04368).
For all the experiments described in the thesis, refer to the [experiments](experiments) folder for the corresponding configuration files.
For the chunk-based approaches described in the thesis, a new repository will be published in the near future.
All code has been formatted using [black](https://github.com/python/black).

A pretrained base model can be downloaded [here](https://web.tresorit.com/l#TQNev2-hWI5W81dfh79Z6Q).
If set up correctly, evaluation on the CNNDM test set should produce the following ROUGE F1-scores:
```bash
ROUGE-1: 39.03
ROUGE-2: 17.01
ROUGE-L: 36.25
```
A pretrained base model with a vocabulary of size 20K and unknown token blocking can be downloaded [here](https://web.tresorit.com/l#Vk7w4jdkLekIZwJBUQHUcQ).
Evaluating it on the CNNDM test set should produce the following ROUGE F1-scores:
```bash
ROUGE-1: 39.26
ROUGE-2: 17.16
ROUGE-L: 36.42
```

**NOTE**: the code provided supports several experimental features that were not discussed in the thesis.

## Overview
1. [Quick Start](#quick-start)
2. [Setup](#setup)
   1. [Dependencies](#dependencies)
   2. [Data](#data)
      1. [Newsroom](#newsroom)
      2. [New York Times](#new-york-times)
3. [Training](#training)
4. [Evaluation](#evaluation)
5. [License](#license)

# Quick Start
- Install all dependencies according to the [Dependencies](#dependencies) section
- Download the preprocessed CNNDM data [here](https://web.tresorit.com/l#Ha8s-v4PCbsyxe9X00Ojnw) and extract it into the [data](data) folder
- Train and evaluate a base model with `python train.py -cfg experiments/base.yaml --eval`
- Alternatively, evaluate a pretrained model with `python evaluate.py log/base.tar`

# Setup
All development and testing was done using PyTorch 1.0.1 and Python 3.7.3, but other versions may work fine.

## Dependencies
```bash
pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt')"
```

### ROUGE
To use the official ROUGE 1.5.5 Perl implementation, download it [here](https://web.tresorit.com/l#BPSRMOtfRtK3PE8vjL-U9Q) and extract it into the [tools](tools) folder.
You should now have a 'ROUGE-1.5.5' folder inside your tools folder.
The Python wrapper [pyrouge](https://pypi.org/project/pyrouge/) is set up to use the newly extracted folder.
Alternatively, modify [evaluate.py](evaluate.py) to use a system-wide ROUGE configuration, or evaluate using [py-rouge](https://pypi.org/project/py-rouge/) (see the [Evaluation](#evaluation) section).

The official ROUGE 1.5.5 Perl implementation relies on libraries that may not be installed by default.
We provide instructions for Arch Linux and Ubuntu:
- On Arch Linux: `sudo pacman -S perl-xml-xpath`
- On Ubuntu: `sudo apt-get install libxml-parser-perl`

For installation tips on Windows, or in general, refer to [this](https://stackoverflow.com/questions/47045436/how-to-install-the-python-package-pyrouge-on-microsoft-windows) StackOverflow post.
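
If you prefer to call pyrouge directly rather than going through [evaluate.py](evaluate.py), the following is a minimal sketch of how the wrapper is typically used; the directory layout and filename patterns are placeholders, not something this repository prescribes:
```python
# Minimal pyrouge sketch; paths and filename patterns below are illustrative placeholders.
from pyrouge import Rouge155

r = Rouge155("tools/ROUGE-1.5.5")           # point the wrapper at the extracted Perl ROUGE
r.system_dir = "tmp/decoded"                # one generated summary per file, e.g. 000001_decoded.txt
r.model_dir = "tmp/reference"               # one reference summary per file, e.g. 000001_reference.txt
r.system_filename_pattern = r"(\d+)_decoded.txt"
r.model_filename_pattern = "#ID#_reference.txt"

output = r.convert_and_evaluate()           # runs the Perl script and returns its raw text output
scores = r.output_to_dict(output)
print(scores["rouge_1_f_score"], scores["rouge_2_f_score"], scores["rouge_l_f_score"])
```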

## Data
We provide preprocessed data for CNNDM [here](https://web.tresorit.com/l#Ha8s-v4PCbsyxe9X00Ojnw).
The tarfile includes the train, dev and test sets, as well as vocabularies both with and without proper noun filtering.
For easy setup, extract the tarfile into the data directory.

To manually preprocess CNNDM, refer to Abigail See's [repository](https://github.com/abisee/cnn-dailymail).
To download CNNDM data already preprocessed according to Abigail See's repository, refer to Jaffer Wilson's [repository](https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail).
Once downloaded/preprocessed, you will have files in binary format.
We use tsv-files instead; the binary files can be converted using the following script:
```bash
# note: 'bin_to_tsv.py' depends on TensorFlow, which is not in requirements.txt
python tools/bin_to_tsv.py path/to/train.bin data/cnndm_abisee_train.tsv
```
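
For reference, a rough sketch of what such a conversion involves, assuming the record layout used by Abigail See's preprocessing (an 8-byte length prefix followed by a serialized `tf.train.Example` with `article` and `abstract` features) and a two-column article/summary tsv output; [bin_to_tsv.py](tools/bin_to_tsv.py) is the authoritative version:
```python
# Illustrative sketch only; assumes abisee-style .bin records and a two-column tsv output.
import struct
import sys

from tensorflow.core.example import example_pb2

def bin_to_tsv(bin_path, tsv_path):
    with open(bin_path, "rb") as reader, open(tsv_path, "w", encoding="utf-8") as writer:
        while True:
            len_bytes = reader.read(8)              # each record starts with its byte length
            if not len_bytes:
                break
            str_len = struct.unpack("q", len_bytes)[0]
            example = example_pb2.Example.FromString(reader.read(str_len))
            article = example.features.feature["article"].bytes_list.value[0].decode()
            abstract = example.features.feature["abstract"].bytes_list.value[0].decode()
            writer.write(f"{article}\t{abstract}\n")

if __name__ == "__main__":
    bin_to_tsv(sys.argv[1], sys.argv[2])
```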

To create a new vocabulary file of size 50,000 from a tsv-file, use [vocabulary.py](vocabulary.py):
```bash
python vocabulary.py cnndm_abisee_train.tsv cnndm_abisee.vocab 50000
```
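
Conceptually, building such a vocabulary amounts to counting token frequencies over the training pairs and keeping the most frequent types. A minimal sketch of that idea, assuming a two-column tsv and a `word<TAB>count` output format (the actual formats used by [vocabulary.py](vocabulary.py) may differ):
```python
# Illustrative sketch; the exact tsv and vocabulary file formats are assumptions.
import sys
from collections import Counter

def build_vocab(tsv_path, vocab_path, size):
    counts = Counter()
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            for column in line.rstrip("\n").split("\t"):
                counts.update(column.split())       # whitespace-tokenized counts
    with open(vocab_path, "w", encoding="utf-8") as out:
        for word, count in counts.most_common(size):
            out.write(f"{word}\t{count}\n")

if __name__ == "__main__":
    build_vocab(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```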

Lastly, in case one wants to train using pretrained GloVe embeddings, download them from the [official GloVe website](https://nlp.stanford.edu/projects/glove/), and convert them to a compatible format using:
```bash
python tools/glove_to_w2v.py path/to/glove.6B.100d.txt data/glove100.w2v
```
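
The word2vec text format differs from GloVe's plain-text format only by a header line giving the number of vectors and their dimensionality, so the conversion is essentially the following (a minimal sketch under that assumption; [glove_to_w2v.py](tools/glove_to_w2v.py) is the authoritative version):
```python
# Illustrative sketch: prepend a "num_vectors dim" header to a GloVe text file.
import sys

def glove_to_w2v(glove_path, w2v_path):
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    dim = len(lines[0].rstrip().split(" ")) - 1     # first field on each line is the word itself
    with open(w2v_path, "w", encoding="utf-8") as out:
        out.write(f"{len(lines)} {dim}\n")
        out.writelines(lines)

if __name__ == "__main__":
    glove_to_w2v(sys.argv[1], sys.argv[2])
```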

### Newsroom
The Newsroom dataset can be downloaded from its [official website](https://summari.es/).
For preprocessing, we supply a [simple script](tools/preprocess_newsroom.py) using NLTK, which can be used as follows:
```bash
python tools/preprocess_newsroom.py release/train.jsonl.gz newsroom_train.tsv
```

### New York Times
The New York Times Annotated Corpus can be acquired through [LDC](https://catalog.ldc.upenn.edu/LDC2008T19).
For preprocessing, we follow [Paulus et al. 2017](https://arxiv.org/abs/1705.04304), using a [script](tools/preprocess_nyt.py) supplied by the authors. Note that this requires a local [CoreNLP](https://github.com/stanfordnlp/CoreNLP) server and that the script takes a long time to run.

## Training
To train a model using one of the configuration files supplied, use the following command:
```bash
python train.py -cfg experiments/base.yaml
# config files can also have parameters overwritten on the fly
python train.py -cfg experiments/base.yaml --rnn_cell lstm
```

To resume a model that was cancelled/interrupted, use:
```bash
python train.py --resume_from log/model.tar
# optionally, some parameters can be changed when resuming
python train.py --resume_from log/model.tar --batch_size 32
```

To resume training a model that was trained without coverage, while converting it to use coverage, use:
```bash
# note that this is only tested with the default attention configuration
python train.py --resume_from log/model.tar --convert_to_coverage
```
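
For context, coverage as in See et al. 2017 maintains a running sum of past attention distributions and penalizes re-attending to already-covered source positions. A minimal PyTorch sketch of that loss term (illustrative only, not the code in this repository):
```python
# Illustrative sketch of the coverage loss from See et al. 2017.
import torch

def coverage_step(attention, coverage):
    """attention, coverage: (batch, src_len) tensors for one decoder step."""
    # penalize attention mass that overlaps what has already been attended to
    step_loss = torch.sum(torch.min(attention, coverage), dim=1)    # (batch,)
    new_coverage = coverage + attention                             # running sum of attention
    return step_loss, new_coverage

# usage: start from coverage = torch.zeros(batch, src_len), accumulate step_loss over decoder
# steps, and add it (scaled by a hyperparameter) to the negative log-likelihood loss.
```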

See [train.py](train.py) and [config.py](config.py) for all possible options.

## Evaluation
To evaluate a model on some test set using the official ROUGE 1.5.5 Perl implementation:
```bash
python evaluate.py log/model.tar path/to/test_set.tsv
```

To evaluate using [py-rouge](https://pypi.org/project/py-rouge/), a Python reimplementation with fewer dependencies, use:
```bash
# note that py-rouge does not produce ROUGE scores identical to the Perl implementation
python evaluate.py log/model.tar path/to/test_set.tsv --use_python
```
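
For reference, this is roughly how the py-rouge package is used on its own (a minimal sketch, independent of [evaluate.py](evaluate.py); the argument values are illustrative):
```python
# Minimal py-rouge sketch; the PyPI package py-rouge installs the `rouge` module.
import rouge

evaluator = rouge.Rouge(
    metrics=["rouge-n", "rouge-l"],   # report ROUGE-1/2 and ROUGE-L
    max_n=2,
    limit_length=False,
    apply_avg=True,                   # average scores over the whole corpus
)

hypotheses = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]
scores = evaluator.get_scores(hypotheses, references)
print(scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"])
```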

We support many different test-time parameters that can be passed to [evaluate.py](evaluate.py).
Refer to [config.py](config.py) and possibly [beam_search.py](beam_search.py) for all options.
Some example uses of said options follow:
```bash
python evaluate.py log/model.tar path/to/test_set.tsv --length_normalize wu --length_normalize_alpha 1.0
python evaluate.py log/model.tar path/to/test_set.tsv --beam_size 8
python evaluate.py log/model.tar path/to/test_set.tsv --block_ngram_repeat 3
python evaluate.py log/model.tar path/to/test_set.tsv --block_unknown

# save summaries and configuration to a json-file
python evaluate.py log/model.tar path/to/test_set.tsv --save
```
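
To make two of these options concrete: `--length_normalize wu` presumably refers to the length penalty of Wu et al. 2016, lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha, and `--block_ngram_repeat 3` disallows hypotheses that repeat a trigram. A minimal sketch of both ideas (illustrative only; the actual logic lives in [beam_search.py](beam_search.py)):
```python
# Illustrative sketch of Wu et al. (2016) length normalization and n-gram repeat blocking.
def wu_length_penalty(length, alpha=1.0):
    # lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha; a hypothesis' log-probability is divided by this
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def repeats_ngram(tokens, n=3):
    # True if the token sequence contains the same n-gram more than once
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i : i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

# during beam search, candidates for which repeats_ngram(...) is True would be pruned, and
# finished hypotheses ranked by log_prob / wu_length_penalty(len(hypothesis), alpha)
print(wu_length_penalty(20, alpha=1.0))             # ~4.17
print(repeats_ngram("a b c a b c".split(), n=3))    # True
```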

Note that [util.py](util.py) can be used to easily inspect the attributes of a model (see the module documentation for further information).

## License
**NOTE:** [preprocess_nyt.py](tools/preprocess_nyt.py), [plot.py](tools/plot.py) and [jsonl.py](tools/jsonl.py) all have separate licenses. See each file's header for specifics.
All other code is distributed under MIT:
___

MIT License

Copyright (c) 2019 Emil Lynegaard

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
