Prediction-Sociolinguistic-Data-Based-on-the-Diaries-Texts-of-the-Prozhito-Project

This repository contains notebooks for collecting and preprocessing a corpus of diary entries and for experiments with deep learning (DL) and machine learning (ML) models that predict the authors' gender and age group and the time period in which a text was written. The work was carried out as part of a master's thesis titled "Automatic identification of sociolinguistic data based on the texts of diaries of the 'Prozhito' project".

This study is dedicated to creating a sociolinguistic profile of text authors by predicting hidden demographic attributes, such as gender and age, together with the time period of text creation, using machine learning and deep learning methods. The research material consists of diary entries from the "Prozhito" project, a digital archive of personal documents. The goal of the study is to select the most accurate algorithms for predicting gender, age group, and time of text creation from diary entries.

The study combined modeling for algorithm development, experiments to test model efficacy, comparison of different approaches, statistical and sociolinguistic analysis, and scientific description. It identified significant correlations between linguistic features and the authors' demographic attributes and demonstrated high model accuracy, especially with logistic regression and with recurrent neural networks combined with a 1D convolutional (CNN1D) architecture. The practical significance of the work lies in models for predicting demographic attributes that can be applied in fields ranging from sociology to marketing and forensic examination. The results are also relevant to programs aimed at preserving historical and cultural texts and contribute to a deeper understanding of linguistic variation and social differences.
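As a rough illustration of the logistic regression approach mentioned above, here is a minimal sketch of a bag-of-words logistic regression classifier for a two-class time period task, written in plain Python. It is not the code from the notebooks: the toy sentences, the whitespace tokenizer, the word-count features, and all function names (`tokenize`, `build_vocab`, `vectorize`, `train`, `predict_proba`) are assumptions made for the example, and the data is invented rather than taken from the Prozhito corpus.

```python
import math
from collections import Counter

# Invented toy data (NOT from the Prozhito corpus).
# Label 0 = "earlier period", label 1 = "later period" (hypothetical classes).
TRAIN = [
    ("sent a telegram and rode in the carriage", 0),
    ("the carriage was late so I wrote a telegram", 0),
    ("telegram arrived by carriage at noon", 0),
    ("checked email on the laptop", 1),
    ("the laptop died while I read email", 1),
    ("wrote an email from my laptop at noon", 1),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocab(texts):
    """Map every token seen in the training texts to a feature index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocab):
    """Sparse bag-of-words vector: {feature index: count}; unknown tokens are dropped."""
    counts = Counter(tokenize(text))
    return {vocab[tok]: c for tok, c in counts.items() if tok in vocab}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, weights, bias):
    """Probability of class 1 for a sparse feature vector x."""
    z = bias + sum(weights[i] * v for i, v in x.items())
    return sigmoid(z)


def train(data, vocab, epochs=200, lr=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    rows = [(vectorize(text, vocab), label) for text, label in data]
    for _ in range(epochs):
        for x, label in rows:
            error = predict_proba(x, weights, bias) - label  # gradient of log loss w.r.t. z
            for i, v in x.items():
                weights[i] -= lr * error * v
            bias -= lr * error
    return weights, bias


VOCAB = build_vocab(text for text, _ in TRAIN)
W, B = train(TRAIN, VOCAB)

# Score an unseen sentence; values above 0.5 mean "later period".
print(predict_proba(vectorize("read email on the laptop", VOCAB), W, B))
```

In practice a library implementation with regularization and a held-out test set would be the usual choice; the hand-written version only makes the mechanics visible: count features, a weighted sum, and a sigmoid.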

README.md
age_prediction.ipynb
clean_corpus.ipynb
gender_prediction.ipynb
get_full_dataset.ipynb
get_tokens_with_past_verbs.ipynb
time_period_text_creation_prediction.ipynb

vlada-pv/Prediction-Sociolinguistic-Data-Based-on-the-Diaries-Texts-of-the-Prozhito-Project

Folders and files

Latest commit

History

Repository files navigation

Prediction-Sociolinguistic-Data-Based-on-the-Diaries-Texts-of-the-Prozhito-Project

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages