We use several libraries and tools to clean and preprocess the poem dataset (link to dataset).
The tools are:
- `haraai_clean.py`: ParsiNorm
- hazm normalizer (see the sketch below)
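As a minimal sketch of how the hazm normalizer is typically applied (the sample verse is only illustrative):

```python
from hazm import Normalizer

normalizer = Normalizer()
# Unify Arabic/Persian characters and fix spacing/half-spaces.
print(normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند'))
```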
To use the `clean` function:

```python
from main_clean import clean

# Read a sample poem file (one verse per line).
with open('/content/anvari.txt', encoding='utf-8') as fp:
    texts = fp.readlines()

# Clean the first 15 lines and print the result.
list_clean = clean(texts[0:15])
print(list_clean)
```
We have also collected several Persian and multilingual (Persian-supporting) models for tokenization and text-classification tasks.
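As an illustration of how such a model can be used for tokenization, the sketch below loads ParsBERT through the Hugging Face `transformers` API; the model id here is an assumption, not necessarily one of the collected models:

```python
from transformers import AutoTokenizer

# ParsBERT tokenizer (illustrative choice; any collected model id works the same way).
tokenizer = AutoTokenizer.from_pretrained('HooshvareLab/bert-base-parsbert-uncased')
tokens = tokenizer.tokenize('ای ساربان آهسته ران کارام جانم می‌رود')
print(tokens)
```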
The data was collected from the following sources:
- Ganjoor: Ganjoor Text Corpus
- Kaggle: Large Corpus of Farsi Poems
- amnghd: Persian Poems Corpus