Models

Named Entity Recognition

ParsBERT-NER - A NER model fine-tuned from ParsBERT (a monolingual Persian language model) on the PEYMA, ARMAN, and combined PEYMA+ARMAN datasets.
ALBERT-NER - A NER model fine-tuned from the ALBERT language model on the PEYMA and ARMAN datasets.

Text Classification

Sentiment Analysis

Summarization

BERT2BERT - A ParsBERT-based summarization model trained on the Wiki Summary dataset, described as the first pre-trained summarization model of its kind.

Question Answering

Multiple-Choice QA

mT5 trained on ParsiNLU-MCQA

Reading Comprehension

mT5 trained on ParsiNLU-RC

Translation

mT5 trained on ParsiNLU-MT

Textual Entailment

mT5 trained on ParsiNLU-TE

Query Paraphrasing

mT5 trained on ParsiNLU-QP

Embeddings

Farsi Poem word2vec model - A word2vec model developed on a corpus of 48 Persian poets. The corpus consists of 1,216,286 mesras of Farsi poems and 8,102,119 words, of which 148,588 are unique.
Sentence Transformers - ST is a collection of models that produce vector representations of sentences and paragraphs (also known as sentence embeddings). The models are built on transformer networks such as ParsBERT, with ALBERT planned. They are tuned on Textual Thematic Similarity datasets so that sentences with similar meanings lie close together in vector space.
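
The idea that "similar meanings are close in vector space" is usually measured with cosine similarity. The following is a minimal sketch of that comparison only, not the API of any model listed here: the four-dimensional vectors and their labels are made up for illustration, whereas real sentence embeddings are produced by a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for sentence embeddings.
# Two sentences about the weather and one about football.
embeddings = {
    "weather_a": [0.9, 0.1, 0.0, 0.2],
    "weather_b": [0.8, 0.2, 0.1, 0.3],
    "football": [0.0, 0.9, 0.7, 0.1],
}


def most_similar(query_key, table):
    """Return the key whose vector has the highest cosine similarity to the query."""
    query = table[query_key]
    scores = {
        key: cosine_similarity(query, vector)
        for key, vector in table.items()
        if key != query_key
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(most_similar("weather_a", embeddings))
```

With these toy values the two weather vectors score far higher against each other than either does against the football vector, which is the behaviour the tuning on similarity datasets aims for.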

Language Model

ParsBERT: Transformer-based Model for Persian Language Understanding - A monolingual Persian language model based on Google's BERT architecture. It is pre-trained on a large Persian corpus of more than 2M documents covering various writing styles and subjects (e.g., scientific, novels, news). A large subset of this corpus was crawled manually.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language - The first attempt at an ALBERT model for Persian. It was trained following Google's ALBERT BASE Version 2.0 on more than 3.9M documents, 73M sentences, and 1.3B words covering various writing styles and subjects (e.g., scientific, novels, news), in the same way as ParsBERT.

Grapheme to Phoneme

g2p_fa - A Persian grapheme-to-phoneme model using an LSTM, implemented in PyTorch.
Persian_g2p - A seq-to-seq model for Persian (Farsi) grapheme-to-phoneme mapping.
G2P - An attention-based grapheme-to-phoneme model.