ULMFit Telugu Encodings Projections
- Datasets are in parquet format; use pandas to read and transform them.
- Download the Telugu Wikipedia dataset (around 90K articles)
- Download the Telugu Eenaadu news dataset (around 20K articles)
- SentencePiece tokenizer trained on nearly 90K Telugu Wikipedia articles.
- Telugu SentencePiece tokenizer
- Telugu SentencePiece vocab
- ULMFiT trained on nearly 90K Telugu Wikipedia articles, and its 400-dimensional word-vector encodings for the 25,000 most frequent vocabulary tokens (vocab.json).
- Telugu language model (fastai-awdlstm)
- For other intermediate training files, refer to the Kaggle kernel Telugu LM (fastai-awdlstm)
- Created the Eenaadu newspaper dataset of around 20K articles in 5 classes. With the fine-tuned language model encodings, classification accuracy is 95%.
- Classification model on Eenaadu news data (5 classes)
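
Since the datasets ship as parquet files, a minimal sketch of reading and tidying one with pandas might look like the following. The column name `text` and the file name in the usage comment are assumptions for illustration, not names confirmed by the datasets; inspect `df.columns` after loading. `pd.read_parquet` also needs a parquet engine such as pyarrow or fastparquet installed.

```python
import pandas as pd


def load_articles(parquet_path):
    """Read one of the parquet datasets into a DataFrame.

    Requires a parquet engine (pyarrow or fastparquet) alongside pandas.
    """
    return pd.read_parquet(parquet_path)


def clean_text_column(df, text_col="text"):
    """Strip whitespace and drop missing or empty rows in a text column.

    `text_col` is a hypothetical column name; adjust it to the real schema.
    """
    out = df.dropna(subset=[text_col]).copy()
    out[text_col] = out[text_col].str.strip()
    return out[out[text_col] != ""].reset_index(drop=True)


# Usage (hypothetical file name):
# df = clean_text_column(load_articles("telugu_wikipedia.parquet"))
```

Keeping the cleaning step separate from the loading step makes it easy to apply the same transform to both the Wikipedia and the news data.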
Code was tested on the following versions, but should work with PyTorch v1.0+, fastai v1.0+, and Python v3.5+:
python v3.6
pytorch==1.3.0
fastai==1.0.59
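
To compare an environment against the minimum versions stated above, a small helper along these lines could be used. This is only a sketch: `version_tuple`, `check_versions`, and the `MINIMUM` table are hypothetical names introduced here, and the comparison looks only at the leading numeric parts of each version string.

```python
import re

# Minimum versions taken from the note above.
MINIMUM = {"torch": "1.0", "fastai": "1.0"}


def version_tuple(version):
    """Turn '1.3.0' or '1.3.0+cu100' into (1, 3, 0) for comparison."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def check_versions(installed):
    """Return the package names whose installed version is below the minimum."""
    return [
        name
        for name, minimum in MINIMUM.items()
        if version_tuple(installed[name]) < version_tuple(minimum)
    ]


# Usage, assuming torch and fastai are importable:
# import torch, fastai
# check_versions({"torch": torch.__version__, "fastai": fastai.__version__})
```

An empty list means both packages meet the stated minimums.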