An open, broad-coverage corpus for informal Persian named entity recognition collected from Twitter.
Version 1.0:
- zip package: ParsTwiNER-v1.0.zip
Recommended. This is the first complete, stable release of the corpus and the version used in the experiments reported in our paper.
The corpus data is provided in CoNLL-like format in the following files:
- twitter_data/persian-ner-twitter-data/train.txt: training data
- twitter_data/persian-ner-twitter-data/dev.txt: development data
- twitter_data/persian-ner-twitter-data/test.txt: test data
These files are in a simple two-column, tab-separated format with IOB2 tags, as in the following example (roughly: "This Tajeddin is the one whose daughter was the representative of Isfahan last term"):
این O
تاجالدین B-PER
همونه O
که O
دخترش O
دور O
قبل O
نماینده O
اصفهان B-LOC
بود O
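
For reference, the Python sketch below shows one way to load these files; it assumes the usual CoNLL convention that a blank line separates tweets, which should be checked against the actual release.

# Minimal loader for the two-column, tab-separated IOB2 files.
# Assumes (CoNLL convention) that blank lines separate tweets.
def read_conll(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                   # blank line ends a tweet
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split("\t")  # two columns: token, IOB2 tag
            tokens.append(token)
            tags.append(tag)
    if tokens:                             # final tweet without a trailing blank line
        sentences.append((tokens, tags))
    return sentences

train = read_conll("twitter_data/persian-ner-twitter-data/train.txt")
print(len(train), "tweets loaded")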
The corpus annotation marks mentions of person (PER), organization (ORG), location (LOC), nation (NAT), political group (POG), and event (EVENT) names.
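
Because the tags follow the IOB2 scheme, entity mentions can be recovered by grouping each B- tag with the same-type I- tags that immediately follow it. A minimal sketch, assuming well-formed IOB2 input:

def iob2_spans(tags):
    # Yield (entity_type, start, end) spans, with end exclusive.
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):           # a B- tag always opens a new span
            if start is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue                       # same-type I- tag extends the open span
        else:                              # "O" (or a stray I-) closes any open span
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans

# Applied to the tag sequence of the example above:
tags = ["O", "B-PER", "O", "O", "O", "O", "O", "O", "B-LOC", "O"]
print(iob2_spans(tags))  # [('PER', 1, 2), ('LOC', 8, 9)]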
The ParsTwiNER annotation guidelines are available in Markdown format.
Please cite the following paper if you use ParsTwiNER in your research:
@inproceedings{aghajani-etal-2021-parstwiner,
title = "{P}ars{T}wi{NER}: A Corpus for Named Entity Recognition at Informal {P}ersian",
author = "Aghajani, MohammadMahdi and
Badri, AliAkbar and
Beigy, Hamid",
booktitle = "Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)",
month = nov,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.wnut-1.16",
pages = "131--136",
abstract = "As a result of unstructured sentences and some misspellings and errors, finding named entities in a noisy environment such as social media takes much more effort. ParsTwiNER contains about 250k tokens, based on standard instructions like MUC-6 or CoNLL 2003, gathered from Persian Twitter. Using Cohen{'}s Kappa coefficient, the consistency of annotators is 0.95, a high score. In this study, we demonstrate that some state-of-the-art models degrade on these corpora, and trained a new model using parallel transfer learning based on the BERT architecture. Experimental results show that the model works well in informal Persian as well as in formal Persian.",
}