This repo contains a pipeline script that collects resume and job vacancy data from the open data portal Jobs in Russia and loads it into a Postgres database in a quasi-normalized form.
The main file is pipline.py, which has two parameters:
- monthly=True - collect the archive only for the last day of each available month.
- remove_gz=True - delete all downloaded archives afterwards.
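As an illustration of what monthly=True implies, here is a minimal sketch of the selection step. It is not taken from pipline.py: the function name last_day_per_month and the sample dates are hypothetical, and it assumes "last day" means the latest date for which an archive is available in a given month.

```python
from datetime import date
from itertools import groupby


def last_day_per_month(archive_dates):
    """Keep only the latest available date within each (year, month)."""
    ordered = sorted(archive_dates)
    # After sorting, dates of the same month are adjacent, so groupby
    # yields exactly one group per (year, month).
    return [
        max(group)
        for _, group in groupby(ordered, key=lambda d: (d.year, d.month))
    ]


# Hypothetical list of dates for which archives exist on the portal.
available = [
    date(2023, 1, 15),
    date(2023, 1, 31),
    date(2023, 2, 10),
    date(2023, 2, 27),
    date(2023, 3, 1),
]
selected = last_day_per_month(available)
```

With monthly=False the pipeline would presumably process every date in such a list rather than the reduced one.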
Access to the database should be configured in /src/config_to_bd.yml.
The name of the temporary folder, e.g. workdir, must be specified in the working_directory: "./workdir" line in all_tables_names.yml.
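The cleanup that remove_gz=True refers to can be pictured with the sketch below. It is an assumption-laden illustration, not the pipeline's own code: the helper remove_gz_archives is hypothetical, and it assumes the downloaded archives are .gz files stored directly in the working directory.

```python
import tempfile
from pathlib import Path


def remove_gz_archives(working_directory):
    """Delete every *.gz file directly inside the folder; return their names."""
    workdir = Path(working_directory)
    removed = []
    for archive in sorted(workdir.glob("*.gz")):
        archive.unlink()
        removed.append(archive.name)
    return removed


# Demonstration on a throwaway folder instead of a real ./workdir.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.xml.gz", "b.xml.gz", "keep.csv"):
        Path(tmp, name).touch()
    removed = remove_gz_archives(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

Only files matching *.gz are touched, so anything else left in the working directory survives the cleanup.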
There are two data preparation SQL/py scripts located in the datasets folder that create cleaned aggregates for:
- A dataset on educational and career trajectories
- A dataset on the activity of unemployed candidates
The execution order:
- execute dataset1.sql and dataset2.sql to create a raw data table
- manually export the tables to dataset1.csv and dataset2.csv with headers
- execute dataset1.py and dataset2.py to get the cleaned versions, namely dataset1.csv.clean.csv and dataset2.csv.clean.csv
- execute dataset1.check.py and dataset2.check.py to finalize datasets
- execute dataset2.clusters.py to add the cluster column
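The Python part of this order could be scripted roughly as follows. This is only a sketch: build_commands and run_all are hypothetical helpers, and it assumes the SQL step and the manual CSV export are already done, that the scripts take no command-line arguments, and that they are run from inside the datasets folder where the CSV files sit.

```python
import subprocess
import sys

# Python steps in the order listed above.
PYTHON_STEPS = [
    "dataset1.py",
    "dataset2.py",
    "dataset1.check.py",
    "dataset2.check.py",
    "dataset2.clusters.py",
]


def build_commands(python=sys.executable):
    """Return one command line per script, in execution order."""
    return [[python, script] for script in PYTHON_STEPS]


def run_all(commands, cwd="datasets"):
    """Run the commands one by one and stop at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


commands = build_commands()
# To actually execute the steps from the repository root:
# run_all(commands)
```

Passing check=True makes a failing script raise an exception, so later steps never run on incomplete output.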
We adapted a pipeline that was originally developed by researchers from CPUR/INID.
All raw data available in the source are distributed in the public domain and are completely free of charge based on the principles of use by the Russian Government. The data produced by our scripts are distributed under CC-BY-4.0. All code is provided under the MIT license and can also be used freely.
Valko, D., Vasilevskaia, M., Bunina, M., Kozlova, M., Filippova, A. M., & Rud, D. (2024). Educational and Career Trajectories in Russia: Introducing a New Source and Datasets with a High Granularity. Research Data Journal for the Humanities and Social Sciences. https://doi.org/10.1163/24523666-bja10046
@article{Valkoetal2024,
author = "Danila Valko and Mariia Vasilevskaia and Maria Bunina and Mariia Kozlova and Anna Maria Filippova and Daria Rud",
title = "Educational and Career Trajectories in Russia: Introducing a New Source and Datasets with a High Granularity",
journal = "Research Data Journal for the Humanities and Social Sciences",
year = "2024",
publisher = "Brill",
address = "Leiden, The Netherlands",
doi = "10.1163/24523666-bja10046",
pages = "1 - 14",
url = "https://brill.com/view/journals/rdj/aop/article-10.1163-24523666-bja10046/article-10.1163-24523666-bja10046.xml"
}
- Valko, D., Vasilevskaia, M., Bunina, M., Kozlova, M., Filippova, A. M., & Rud, D. (2024). Educational and Career Trajectories in Russia: Introducing a New Source and Datasets with a High Granularity. Research Data Journal for the Humanities and Social Sciences. https://doi.org/10.1163/24523666-bja10046
- Valko, D., Vasilevskaia, M., Bunina, M., Kozlova, M., Filippova, A. M., & Rud, D. (2024). Educational and Career Trajectories in Russia: Open Datasets. Zenodo. https://doi.org/10.5281/zenodo.10913325