updates readme with cleanup changes
jeffreyling committed Feb 25, 2016
1 parent a1b99b7 commit 0ddcecb
Showing 2 changed files with 33 additions and 28 deletions.
56 changes: 30 additions & 26 deletions README.md
@@ -1,12 +1,12 @@
# Sentence Convolution Code in Torch

This code implements Kim (2014) sentence convolution code in torch with GPUs. It replicates his results on existing datasets, and allows training of models on arbitrary other text datasets.
This code implements the Kim (2014) sentence convolution model in Torch with GPUs. It replicates the results on existing datasets and allows training of models on arbitrary other text datasets.

## Quickstart

To make data in hdf5 format, run the following (with word2vec .bin path and choice of dataset):

python make_hdf5.py /path/to/word2vec.bin MR
python preprocess.py MR /path/to/word2vec.bin
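
To sanity-check the generated file, you can list its contents with h5py (already a dependency of the pipeline). This is only a sketch; the exact dataset names stored inside the `.hdf5` file are assumptions, not guaranteed by the script.

```python
# Quick look at what the preprocessing step wrote (dataset names may differ).
import h5py

with h5py.File("MR.hdf5", "r") as f:
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)
```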

To run training with GPUs:

@@ -20,18 +20,20 @@ The training pipeline requires Python hdf5 (the h5py module) and the following l
* hdf5
* cudnn

Training on word2vec architecture models requires downloading [word2vec](https://code.google.com/p/word2vec/) and unzipping.
Training the word2vec-based model architectures requires downloading [word2vec](https://code.google.com/p/word2vec/) and unzipping it. Simply run the script:

./get_word2vec.sh

## Creating datasets

We process the following datasets: `MR, SST1, SST2, Subj, TREC, CR, MPQA`.
We provide the following datasets: `MR, SST1, SST2, SUBJ, TREC, CR, MPQA`.
All raw training data is located in the `data/` directory. The `SST1, SST2` data have both test and dev sets, and TREC has a test set.

The preprocessing script loads the word2vec embeddings, builds the vocabulary, and outputs a data matrix of vocabulary indices for each sentence.
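
As a rough illustration of that step, the sketch below maps tokenized sentences to a zero-padded matrix of vocabulary indices. It is not the repository's actual preprocessing code; the padding index and the unknown-word token are assumptions.

```python
# Sketch: turn tokenized sentences into a padded matrix of vocabulary indices.
# The padding index (0) and the "<unk>" token are assumptions, not the
# repository's actual conventions.
import numpy as np

def build_vocab(sentences):
    vocab = {"<unk>": 1}          # reserve index 0 for padding
    for sent in sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def to_index_matrix(sentences, vocab):
    max_len = max(len(sent) for sent in sentences)
    mat = np.zeros((len(sentences), max_len), dtype=np.int64)
    for i, sent in enumerate(sentences):
        for j, word in enumerate(sent):
            mat[i, j] = vocab.get(word, vocab["<unk>"])
    return mat

sentences = [["a", "fine", "film"], ["not", "good"]]
vocab = build_vocab(sentences)
print(to_index_matrix(sentences, vocab))
```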

To create the hdf5 file, run the following with DATASET as one of the described datasets:

python make_hdf5.py /path/to/word2vec.bin DATASET
python preprocess.py DATASET /path/to/word2vec.bin

The script outputs:
* the `DATASET.hdf5` file with the data matrix and word2vec embeddings
@@ -47,11 +49,13 @@ Example line:

Then run:

python make_hdf5.py /path/to/word2vec.bin custom /path/to/train/data
python preprocess.py custom /path/to/word2vec.bin --train /path/to/train/data --test /path/to/test/data --dev /path/to/dev/data

The output file's name can be set with the flag `--custom_name` (the default is `custom`).
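
For example, assuming the flag simply changes the output file's base name, a hypothetical run might look like:

python preprocess.py custom /path/to/word2vec.bin --train /path/to/train/data --custom_name mydata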

## Running torch

Training is done with 10-fold cross-validation and 25 epochs. If the data set comes with a test set, we don't do cross validation (but split training data 90/10 for the dev set). If the data comes with the dev set, we don't do additional preprocessing.
Training is typically done with 10-fold cross-validation and 25 epochs. If the dataset comes with a test set, we don't do cross-validation (but split the training data 90/10 for the dev set). If the data comes with a dev set, we don't do the train/dev split.
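
As a minimal sketch of that splitting logic (illustrative Python, not the repository's Torch code), the 90/10 train/dev split could be generated like this:

```python
# Sketch of the 90/10 train/dev split used when a dataset has a test set but
# no dev set. Illustrative only; the repository does this inside the Torch code.
import numpy as np

def train_dev_split(num_examples, dev_fraction=0.1, seed=1234):
    rng = np.random.RandomState(seed)
    order = rng.permutation(num_examples)
    dev_size = int(num_examples * dev_fraction)
    return order[dev_size:], order[:dev_size]   # train indices, dev indices

train_idx, dev_idx = train_dev_split(1000)
print(len(train_idx), len(dev_idx))             # 900 100
```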

There are four main model architectures we implemented, as described in Kim (2014): `rand, static, nonstatic, multichannel`.
* `rand` initializes the word embeddings randomly and learns them.
@@ -71,43 +75,43 @@ A few modifications were made to the model architecture as experiments.

Results from these experiments are described below in the Results section.

### Output

When training is complete, the code outputs a file whose name is given by `-savefile` (default `TIMESTAMP_results.t7`).

The following are saved as a table:
* `dev_scores` with dev scores,
* `test_scores` with test scores,
* `opt` with model parameters,
* `model` with best model (as determined by cross-validation)
* `embeddings` with the updated word embeddings

### Parameters

The following parameters are allowed by the torch code.
* `cudnn`: Use GPUs if set to 1, otherwise set to 0
* `num_epochs`: Number of training epochs.
The following is a complete list of the parameters allowed by the torch code.
* `model_type`: Model architecture, as described above. Options: rand, static, nonstatic, multichannel
* `data`: Training dataset to use, including word2vec data. This should be a `.hdf5` file made with `make_hdf5.py`.
* `data`: Training dataset to use, including word2vec data. This should be a `.hdf5` file made with `preprocess.py`.
* `cudnn`: Use GPUs if set to 1, otherwise set to 0
* `seed`: Random seed, set to -1 for actual randomness
* `folds`: Number of folds for cross-validation.
* `has_test`: Set 1 if data has test set
* `has_dev`: Set 1 if data has dev set
* `zero_indexing`: Set 1 if data is zero indexed
* `debug`: Print debugging info including timing and confusions
* `savefile`: Name of output `.t7` file, which will hold the trained model. Default is `TIMESTAMP_results`
* `zero_indexing`: Set 1 if data is zero indexed

Training parameters:
* `num_epochs`: Number of training epochs.
* `optim_method`: Gradient descent method. Options: adadelta, adam
* `L2s`: Set the L2 norm of the final linear layer weights to this value (see the sketch after this list).
* `batch_size`: Batch size for training.
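
As a rough illustration of the `L2s` constraint (a sketch with made-up shapes, not the repository's implementation), rescaling a weight matrix to a target L2 norm looks like:

```python
# Sketch: rescale the final linear layer's weights whenever their L2 norm
# exceeds L2s. Shapes and values are illustrative.
import numpy as np

def renorm_weights(weights, max_l2=3.0):
    norm = np.linalg.norm(weights)
    if norm > max_l2:
        weights = weights * (max_l2 / norm)
    return weights

w = np.random.randn(2, 300)                     # e.g. num_classes x hidden size
print(np.linalg.norm(renorm_weights(w)))        # <= 3.0
```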

Model parameters:
* `num_feat_maps`: Number of convolution feature maps (see the sketch after this list).
* `kernel1`, `kernel2`, `kernel3`: Kernel size of different convolutions.
* `kernels`: Kernel sizes of different convolutions.
* `dropout_p`: Dropout probability.
* `num_classes`: Number of prediction classes.
* `highway_mlp`: Number of highway MLP layers (0 for none)
* `highway_conv_layers`: Number of highway convolutional layers (0 for none)
* `skip_kernel`: Set 1 to use skip kernels
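
To make the convolution parameters concrete, the following numpy sketch shows the convolution and max-over-time pooling step from Kim (2014), using `kernels` and `num_feat_maps` over a sentence embedding matrix. It illustrates the architecture only; it is not the repository's Torch code, and the shapes are assumptions.

```python
# Sketch of convolution + max-over-time pooling over a sentence matrix.
# Illustrates num_feat_maps and kernels; not the repository's Torch code.
import numpy as np

def conv_max_pool(sent_emb, kernels=(3, 4, 5), num_feat_maps=100, seed=0):
    rng = np.random.RandomState(seed)
    n, d = sent_emb.shape                       # sentence length x embedding dim
    features = []
    for k in kernels:
        filters = rng.randn(num_feat_maps, k, d) * 0.01
        # One value per window position, then ReLU and max over positions.
        maps = np.array([[np.sum(f * sent_emb[i:i + k]) for i in range(n - k + 1)]
                         for f in filters])
        features.append(np.maximum(maps, 0).max(axis=1))
    return np.concatenate(features)             # num_feat_maps * len(kernels) values

sent = np.random.randn(20, 300)                 # 20 words, 300-dim embeddings
print(conv_max_pool(sent).shape)                # (300,)
```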

### Output

When training is complete, the code outputs the following table into a file `TIMESTAMP_results.t7`:
* `dev_scores` with dev scores,
* `test_scores` with test scores,
* `opts` with model parameters,
* `model` with best model (as determined by cross-validation)
* `embeddings` with the updated word embeddings (if the model type is nonstatic)

## Results

The following results were collected with the same training setup as in Kim (2014) (same parameters, 10-fold cross validation if data has no test set, 25 epochs).
@@ -141,7 +145,7 @@ From these results, we see that using GPUs achieves almost a 50x speedup on training.

## Relevant publications

This code is based on Kim (2014) and its corresponding Theano [code](https://github.com/yoonkim/CNN_sentence/).
This code is based on Kim (2014) and the corresponding Theano [code](https://github.com/yoonkim/CNN_sentence/).

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Doha, Qatar. Association for Computational Linguistics.

5 changes: 3 additions & 2 deletions data/README.md
@@ -22,10 +22,11 @@ The following datasets are included in this directory:

## Data files

Dataset | Files
--- | ---
MR | rt-polarity.all
SST-1 | stsa.fine.\*
SST-2 | stsa.binary.\*
SST-1 | stsa.fine.\* (use phrases for train)
SST-2 | stsa.binary.\* (use phrases for train)
Subj | subj.all
TREC | TREC.\*
CR | custrev.all
