There are two implemented models (`WordLanguageModel`, `CharCompLanguageModel`), based on these two papers:
- Recurrent Neural Network Regularization (Zaremba, Vinyals, Sutskever) (2014)
- Character-Aware Neural Language Models (Kim, Jernite, Sontag, Rush) (2015)
To run the Zaremba model with their "medium regularized LSTM" configuration, early stopping, and pre-trained word vectors:
    python trainer.py --config config/ptb-med.json
The "medium regularized LSTM" above (Word Med below) has a lower perplexity than the original paper (even the large model). As noted above, the run above differs in that it uses pre-trained word vectors.
| Model | Framework | Dev PPL | Test PPL |
| --- | --- | --- | --- |
| Word Med (Zaremba) | TensorFlow | 80.168 | 77.2213 |
TODO: Add LSTM Char Small Configuration results
The loss that is optimized is the total loss divided by the total number of tokens in the mini-batch (token-level loss). This is different from how the loss is calculated in the TensorFlow tutorial, but it is how the loss is calculated in AWD-LSTM (Merity et al., 2017), ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018).
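As a minimal sketch of the token-level loss described above (not the trainer's actual code), assuming hypothetical tensors `logits` of shape `[batch, time, vocab]`, integer `targets` of shape `[batch, time]`, and a float `mask` that is 1 for real tokens and 0 for padding:

```python
import tensorflow as tf

def token_level_loss(logits, targets, mask):
    """Total cross-entropy over the mini-batch divided by the
    total number of (non-padded) tokens in that mini-batch."""
    # Per-token cross-entropy, shape [batch, time]
    per_token = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=logits)
    per_token = per_token * mask           # zero out padding positions
    total_loss = tf.reduce_sum(per_token)  # sum over batch and time
    total_toks = tf.reduce_sum(mask)       # number of real tokens
    return total_loss / total_toks         # token-level loss to optimize
```

Dividing by the token count (rather than, say, the batch size) is what makes the optimized loss directly comparable to the token-level perplexities reported below.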
When the loss is reported every `nsteps`, it is the total loss divided by the total number of tokens in the last `nsteps` mini-batches, and the reported perplexity is e raised to this loss.
The epoch loss is the total loss divided by the total number of tokens in the whole epoch, with the perplexity again being e raised to this loss. This yields token-level perplexity, which is the standard way of reporting in the literature.
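A minimal sketch of this reporting scheme, under assumed names: `batches` yields a `(batch_loss_sum, batch_num_toks)` pair per mini-batch and `nsteps` is the reporting interval (both hypothetical, not the trainer's actual variables):

```python
import math

nsteps = 100                            # assumed reporting interval
running_loss, running_toks = 0.0, 0     # accumulators for the last nsteps batches
epoch_loss, epoch_toks = 0.0, 0         # accumulators for the whole epoch

for step, (batch_loss_sum, batch_num_toks) in enumerate(batches, 1):
    running_loss += batch_loss_sum
    running_toks += batch_num_toks
    epoch_loss += batch_loss_sum
    epoch_toks += batch_num_toks

    if step % nsteps == 0:
        avg = running_loss / running_toks          # loss over last nsteps batches
        print(f'step {step}: loss {avg:.4f} ppl {math.exp(avg):.3f}')
        running_loss, running_toks = 0.0, 0

avg = epoch_loss / epoch_toks                      # token-level epoch loss
print(f'epoch: loss {avg:.4f} ppl {math.exp(avg):.3f}')  # token-level perplexity
```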