We use several libraries and tools to clean and preprocess the poem dataset (link to dataset).
The tools are:
- `haraai_clean.py`: ParsiNorm
- hazm normalizer (see the sketch below)
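As a minimal sketch of how the hazm normalizer is typically applied (the sample verse is only illustrative):

```python
from hazm import Normalizer

normalizer = Normalizer()
# Unify Arabic/Persian characters and fix spacing/half-spaces.
print(normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند'))
```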
To use the `clean` function:

```python
from main_clean import clean

# Read a sample poem file (one verse per line).
with open('/content/anvari.txt', encoding='utf-8') as fp:
    texts = fp.readlines()

# Clean the first 15 lines and print the result.
list_clean = clean(texts[0:15])
print(list_clean)
```
We have also collected several Persian and multilingual (Persian-supporting) models for tokenization and text-classification tasks.
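As an illustration of how such a model can be used for tokenization, the sketch below loads ParsBERT through the Hugging Face `transformers` API; the model id here is an assumption, not necessarily one of the collected models:

```python
from transformers import AutoTokenizer

# ParsBERT tokenizer (illustrative choice; any collected model id works the same way).
tokenizer = AutoTokenizer.from_pretrained('HooshvareLab/bert-base-parsbert-uncased')
tokens = tokenizer.tokenize('ای ساربان آهسته ران کارام جانم می‌رود')
print(tokens)
```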
The data was collected from the following sources:
- Ganjoor: Ganjoor Text Corpus
- Kaggle: Large Corpus of Farsi Poems
- amnghd: Persian Poems Corpus