GitHub - Asrst/telugu_nlp: Tokenizers, Language Models & other NLP utils for Telugu Language

NLP for Telugu Language

ULMFiT Telugu Encodings Projections

Features:

Datasets

Datasets are in Parquet format; use pandas to read and transform them.
Download the Telugu Wikipedia dataset (around 90K articles)
Download the Telugu Eenaadu news dataset (around 20K articles)
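
Since the datasets ship as Parquet files, reading them with pandas might look like the minimal sketch below. The file name in the usage comment is hypothetical, and the column names are not stated in this README, so the helper only reports what it finds rather than assuming a schema. `pd.read_parquet` needs a Parquet engine such as pyarrow or fastparquet installed.

```python
import pandas as pd


def load_articles(path):
    """Read one of the Parquet dataset files into a DataFrame.

    pandas needs a Parquet engine (pyarrow or fastparquet) for this call.
    """
    return pd.read_parquet(path)


def summarize(df):
    """Return basic shape information without assuming any column names."""
    return {"rows": len(df), "columns": list(df.columns)}


# Example usage (hypothetical file name; use the path of the file you downloaded):
# articles = load_articles("telugu_wikipedia.parquet")
# print(summarize(articles))
```

Inspecting the column list first is a reasonable way to discover which field holds the article text before transforming it.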

Tokenizers

SentencePiece tokenizer trained on Telugu Wikipedia data of nearly 90K articles.
Telugu SentencePiece tokenizer
Telugu SentencePiece vocab
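
As a rough sketch, the trained tokenizer could be loaded with the `sentencepiece` Python package as shown here. The model file name in the usage comment is an assumption, not the repository's actual file name, and the early path check is an illustrative convenience rather than part of the library.

```python
import os


def load_tokenizer(model_path):
    """Load a SentencePiece model from disk, failing early if the file is missing."""
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"SentencePiece model not found: {model_path}")
    # Imported lazily so the path check works even without the package installed.
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor()
    processor.Load(model_path)
    return processor


def tokenize(processor, text):
    """Split text into subword pieces using the loaded model."""
    return processor.EncodeAsPieces(text)


# Example usage (hypothetical file name; point it at the downloaded model):
# sp = load_tokenizer("telugu_tok.model")
# print(tokenize(sp, "తెలుగు భాష"))
```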

Language Models

ULMFiT trained on Telugu Wikipedia data of nearly 90K articles, along with its 400-dimensional word vector encodings for the 25,000 most frequent tokens (vocab.json).
Telugu language model (fastai-awdlstm)
For other intermediate training files, refer to the Kaggle kernel: Telugu LM (fastai-awdlstm)
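
One common way to explore word vector encodings like these is a cosine-similarity nearest-neighbour lookup. The sketch below assumes the encodings have already been loaded into a NumPy array with one row per vocabulary entry; the on-disk format of the encodings is not described here, so the loading step is left out and toy data stands in for the real 25,000 x 400 matrix.

```python
import numpy as np


def nearest_tokens(vectors, vocab, query, k=3):
    """Return the k tokens whose vectors are most cosine-similar to query's."""
    index = vocab.index(query)
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    scores = unit @ unit[index]
    # Sort by descending similarity and drop the query token itself.
    order = [i for i in np.argsort(-scores) if i != index]
    return [vocab[i] for i in order[:k]]


# Toy stand-in data (hypothetical); the real matrix would be 25000 x 400.
toy_vocab = ["a", "b", "c", "d"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(nearest_tokens(toy_vectors, toy_vocab, "a", k=2))  # ['b', 'c']
```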

Classification

Created an Eenaadu newspaper dataset of around 20K articles spanning 5 classes. With fine-tuned language model encodings, the classification accuracy is 95%.
Classification model on Eenaadu news data (5 classes)

Requirements:

The code was tested on the following versions; however, it should work with PyTorch v1.0+, fastai v1.0+, and Python v3.5+.

python v3.6
pytorch==1.3.0
fastai==1.0.59
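
To check an environment against the stated minimums, a small version comparison like the following sketch may help. The helper names are hypothetical, and it only tests a lower bound; it does not confirm that any particular newer release is compatible.

```python
import re


def parse_version(text):
    """Turn '1.3.0' or '1.3.0+cu100' into a 3-part tuple such as (1, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions like '1.0' with zeros so comparisons are consistent.
    return parts + (0,) * (3 - len(parts))


def meets_minimum(installed, minimum):
    """True when the installed version is at least the stated minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Example usage (requires torch and fastai to be installed):
# import torch, fastai
# print(meets_minimum(torch.__version__, "1.0"))
# print(meets_minimum(fastai.__version__, "1.0"))
```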
