# Language Modeling with Recurrent Neural Networks

There are two implemented models (WordLanguageModel, CharCompLanguageModel) based on these two papers:

To run the Zaremba model with their "medium regularized LSTM" configuration, early stopping, and pre-trained word vectors:

```bash
python trainer.py --config config/ptb-med.json
```

## Status

The "medium regularized LSTM" above (Word Med below) has a lower perplexity than the original paper (even the large model). As noted above, the run above differs in that it uses pre-trained word vectors.

| Model              | Framework  | Dev    | Test    |
| ------------------ | ---------- | ------ | ------- |
| Word Med (Zaremba) | TensorFlow | 80.168 | 77.2213 |

TODO: Add LSTM Char Small Configuration results

## Losses and Reporting

The loss that is optimized is the total loss divided by the total number of tokens in the mini-batch (a token-level loss). This differs from how the loss is calculated in the TensorFlow tutorial, but it matches how the loss is calculated in awd-lm (Merity et al., 2017), ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018).
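
As a rough sketch of what "token-level loss" means here (this is illustrative NumPy, not the code in this repository; the function name and array shapes are assumptions):

```python
import numpy as np

def batch_token_loss(logits, targets):
    """Token-level loss for one mini-batch (illustrative sketch).

    logits  -- [batch, time, vocab] unnormalized scores
    targets -- [batch, time] integer token ids
    """
    # numerically stable log-softmax over the vocabulary dimension
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each target token
    b, t = np.indices(targets.shape)
    token_nll = -log_probs[b, t, targets]  # [batch, time]
    # total loss divided by the total number of tokens in the mini-batch
    return token_nll.sum() / targets.size
```

The key point is the denominator: the summed cross-entropy is divided by the number of tokens in the mini-batch, not by the number of sequences.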

When the loss is reported every nsteps, it is the total loss divided by the total number of tokens in the last nsteps mini-batches. The reported perplexity is e raised to this loss.

The epoch loss is the total loss divided by the total number of tokens in the whole epoch, and the epoch perplexity is e raised to this loss. This yields token-level perplexity, which is the standard way of reporting in the literature.
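
Both the per-nsteps and per-epoch numbers reduce to the same computation: exponentiate the accumulated loss divided by the accumulated token count. A minimal sketch (the function name and the example numbers are assumptions, not values from the code):

```python
import math

def perplexity(total_loss, total_tokens):
    """Perplexity is e raised to the average per-token loss."""
    return math.exp(total_loss / total_tokens)

# every nsteps: pass the totals accumulated over the last nsteps mini-batches
# end of epoch: pass the totals accumulated over the whole epoch
print(perplexity(total_loss=35000.0, total_tokens=8000))  # illustrative numbers only
```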