
🪪 Synthetic Employee Dataset: SQL, PySpark & ML Pipeline

SQL Mock Data

  • This project generates synthetic employee data using Python and Faker, stores it in a PostgreSQL database, and performs analytics and machine learning modeling using PySpark and Scikit-learn. It's designed for data engineering and data science practice, focusing on realistic HR-style datasets and workflows.

  • Key features include:

    • Synthetic data generation with customizable logic
    • PostgreSQL integration
    • PySpark data processing and transformations
    • Predictive modeling for employee attrition

🚀 Getting Started

1. Clone the repository

git clone https://github.com/CamilaJaviera91/sql-mock-data.git

2. Navigate to the project folder

cd your/route/sql-mock-data

3. Create a file named requirements.txt with the following content:

pandas
numpy
faker
psycopg2-binary
pyspark
scikit-learn
matplotlib
seaborn

4. Create a virtual environment and install dependencies

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

5. Set up the PostgreSQL database

  1. Create a new database called employees (see the sketch below for doing this from Python).

  2. Generate the mock data.

  3. Insert the mock data into the new schema (steps 2 and 3 are covered by the commands in step 6).
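
If you prefer to create the database from Python instead of the psql shell, here is a minimal sketch using psycopg2 (the user, password, and host values are placeholders; adjust them to your setup):

```python
# Minimal sketch: create the "employees" database with psycopg2.
# Assumes a local PostgreSQL server; credentials below are placeholders.
import psycopg2

# Connect to the default "postgres" database first; CREATE DATABASE
# cannot run inside a transaction, so enable autocommit.
conn = psycopg2.connect(dbname="postgres", user="postgres",
                        password="your_password", host="localhost")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE DATABASE employees;")

conn.close()
print("Database 'employees' created.")
```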

6. Generate and insert mock data into the database

python your/route/sql-mock-data/sql_mock_data.py

python your/route/sql-mock-data/insert.py

📚 Data Dictionary

| Column | Description | Type |
| --- | --- | --- |
| id | Unique identifier | Integer |
| name | Full name of the employee | Text |
| date_birth | Date of birth of the employee | Date |
| department | Department where the employee works | Text |
| email | Employee work email | Text |
| phonenumber | Work phone number of the employee | Text |
| yearly_salary | Yearly salary in USD | Integer |
| city | City where the employee lives | Text |
| hire_date | Date when the employee was hired | Date |
| termination_date | Date when the employee was terminated | Date |

๐Ÿ“ Project Structure

sql-mock-data/
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ *.csv                  # Synthetic employee data files
โ”œโ”€โ”€ images/
โ”‚   โ””โ”€โ”€ pic*.png               # Visualizations and example outputs
โ”œโ”€โ”€ python/
โ”‚   โ”œโ”€โ”€ sql_mock_data.py       # Script to generate synthetic data
โ”‚   โ”œโ”€โ”€ insert.py              # Script to insert data into PostgreSQL
โ”‚   โ”œโ”€โ”€ analysis.py            # Data analysis using PySpark
โ”‚   โ”œโ”€โ”€ queries.py             # SQL queries for data retrieval
โ”‚   โ”œโ”€โ”€ show_results.py        # Visualization of query results
โ”‚   โ””โ”€โ”€ connection.py          # Database connection setup
โ”œโ”€โ”€ sql/
โ”‚   โ””โ”€โ”€ schema.sql             # SQL schema definitions
โ”œโ”€โ”€ .gitignore                 # Specifies files to ignore in Git
โ””โ”€โ”€ README.md                  # Project documentation

🔥 Introduction to PySpark

  • PySpark is the Python API for Apache Spark, which lets you use Spark's distributed engine from Python.

🔑 Key Features:

  1. Distributed Computing: Processes large datasets across a cluster of computers for scalability.

  2. In-Memory Processing: Speeds up computation by reducing disk I/O.

  3. Lazy Evaluation: Operations are only executed when an action is triggered, optimizing performance.

  4. Rich Libraries:

    • Spark SQL: Structured data processing (like SQL operations).
    • MLlib: Machine learning library for scalable algorithms.
    • GraphX: Graph processing (via RDD API).
    • Spark Streaming: Real-time stream processing.
  5. Compatibility: Works with Hadoop, HDFS, Hive, Cassandra, etc.

  6. Resilient Distributed Datasets (RDDs): Low-level API for distributed data handling.

  7. DataFrames & Datasets: High-level APIs for structured data with SQL-like operations (a short example follows below).
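
A minimal sketch of these ideas in practice, assuming PySpark is installed locally (the file path is illustrative; the column names come from the data dictionary above):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session.
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# Read a CSV into a DataFrame (path is illustrative).
df = spark.read.csv("data/*.csv", header=True, inferSchema=True)

# Transformations are lazy: nothing runs until an action is called.
by_dept = (df.groupBy("department")
             .agg(F.count(F.lit(1)).alias("employees"),
                  F.avg("yearly_salary").alias("avg_salary")))

# Spark SQL over the same data.
df.createOrReplaceTempView("employees")
top_cities = spark.sql(
    "SELECT city, COUNT(*) AS n FROM employees GROUP BY city ORDER BY n DESC")

# Actions trigger the actual computation.
by_dept.show(5)
top_cities.show(5)
spark.stop()
```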

✅ Pros / ❌ Cons

| Pros | Cons |
| --- | --- |
| Handles massive datasets efficiently. | Can be memory-intensive. |
| Compatible with many tools (Hadoop, Cassandra, etc.). | Complex configuration for cluster environments. |
| Built-in libraries for SQL and machine learning. | |

🔧 Install pyspark

  1. Install via pip:

pip install pyspark

  2. Verify the installation:

python3 -c "import pyspark; print(pyspark.__version__)"

๐Ÿ—ƒ๏ธ Introduction to SQL (Structured Query Language)

  • SQL is how we read, write, and manage data stored in databases.

๐Ÿ”‘ Key Features:

  1. Data Querying: You can retrieve exactly the data you need using the SELECT statement.

SELECT * FROM employees WHERE department = 'HR';

  2. Data Manipulation: SQL lets you insert, update, or delete records.

    • INSERT
    • UPDATE
    • DELETE
  3. Data Definition: You can create or change the structure of tables and databases.

    • CREATE
    • ALTER
    • DROP
  4. Data Control: SQL allows you to control access to the data.

    • GRANT
    • REVOKE
  5. Transaction Control: Manage multiple steps as a single unit.

    • BEGIN
    • COMMIT
    • ROLLBACK
  6. Filtering and Sorting:

    • WHERE
    • ORDER BY
    • GROUP BY
    • HAVING
  7. Joins: Combine data from multiple tables.

  8. Built-in Functions: SQL includes powerful functions for calculations, text handling, dates, etc.

  9. Standardized Language: SQL is used across most relational database systems (like PostgreSQL, MySQL, SQL Server, etc.), with only slight differences.

  10. Declarative Nature: You tell SQL what you want, not how to do it. The database figures out the best way. (A short Python example tying several of these features to this project's employees table follows below.)
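
A hedged sketch that exercises querying, manipulation, and transaction control from Python with psycopg2 (the connection details are placeholders, and the column names are taken from the data dictionary above):

```python
import psycopg2

# Connection parameters are placeholders; adjust to your setup.
conn = psycopg2.connect(dbname="employees", user="postgres",
                        password="your_password", host="localhost")

with conn.cursor() as cur:
    # Data querying: parameterized SELECT with filtering and sorting.
    cur.execute(
        "SELECT name, city, yearly_salary FROM employees "
        "WHERE department = %s ORDER BY yearly_salary DESC LIMIT 5",
        ("HR",))
    for row in cur.fetchall():
        print(row)

    # Data manipulation inside a transaction: psycopg2 opens one implicitly,
    # and we roll it back below so this demo leaves no trace.
    cur.execute(
        "UPDATE employees SET yearly_salary = yearly_salary * 1.05 "
        "WHERE department = %s", ("HR",))

conn.rollback()  # transaction control: undo the UPDATE
conn.close()
```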

✅ Pros / ❌ Cons

| Pros | Cons |
| --- | --- |
| Easy to Learn and Use. | Not Ideal for Complex Logic. |
| Efficient Data Management. | Different Dialects. |
| Powerful Querying Capabilities. | Can Get Complicated. |
| Standardized Language. | Limited for Unstructured Data. |
| Scalable. | Performance Tuning Required. |
| Secure. | |
| Supports Transactions. | |

๐Ÿณ Introduction to Docker

  • Docker is a tool that lets you package your app with everything it needs, so it can run anywhere, without problems.

  • It does this using something called containers, which are like small, lightweight virtual machines.

🔑 Key Features:

  1. Containers: Run apps in isolated environments.

  2. Images: Blueprints for containers (created using a Dockerfile).

  3. Portability: Works the same on any system with Docker.

  4. Speed: Starts apps quickly.

  5. Docker Hub: A place to share and download app images.

✅ Pros / ❌ Cons

| Pros | Cons |
| --- | --- |
| Works the same everywhere. | Takes some time to learn. |
| Fast and lightweight. | Not ideal for apps that need a full operating system. |
| Easy to share apps. | Security risks if not set up properly. |
| Good for automating deployments. | Managing data storage can be tricky. |
| Great for teams working together. | |

🔧 Install Docker on Fedora

  1. Update the system:

sudo dnf update -y

  2. Install necessary packages for using HTTPS repositories:

sudo dnf install dnf-plugins-core -y

  3. Add the official Docker repository:

sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo

  4. Install Docker Engine:

sudo dnf install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y

  5. Enable and start the Docker service:

sudo systemctl enable docker
sudo systemctl start docker

  6. Verify that Docker is running:

sudo docker run hello-world

  7. (Optional) Run Docker without sudo:

sudo usermod -aG docker $USER

Then, log out and log back in (or reboot your system) for the change to take effect.


๐Ÿ› ๏ธ Code Explanation

๐Ÿ‘ฉโ€๐Ÿ’ป Script 1: sql_mock_data.py โ€” Generate Mock Data

๐Ÿ”ง Libraries that we are going to need:

Library Description
PySpark Apache Spark Python API (for big data).
Faker Fake data generator (used for names, etc.).
unidecode Removes accents from characters (e.g., รฉ โ†’ e).
random For generating random numbers, probabilities, selections, etc.
os For cross-platform file handling and directory management.
shutil For managing file system operations in automation scripts.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Creates 1 million fake employee records, each with realistic personal and job data.

    • Saves them across 12 cleanly named CSV files.

    • Makes sure names and phone numbers are unique.

    • Can easily be scaled up or reused for testing, demos, or training (a simplified sketch follows below).
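
The real script lives in python/sql_mock_data.py; the following is only a simplified sketch of the approach (the record count, file name, and value ranges here are illustrative, not the script's exact logic):

```python
import csv
import random
from faker import Faker
from unidecode import unidecode

fake = Faker()
DEPARTMENTS = ["Sales", "IT", "HR", "Finance", "Marketing"]

def make_employee(emp_id):
    """Build one synthetic employee row (simplified version of the real script)."""
    name = unidecode(fake.name())
    hired = fake.date_between(start_date="-10y", end_date="today")
    return {
        "id": emp_id,
        "name": name,
        "date_birth": fake.date_of_birth(minimum_age=20, maximum_age=65),
        "department": random.choice(DEPARTMENTS),
        "email": f"{name.lower().replace(' ', '.')}@company.com",
        "phonenumber": fake.phone_number(),
        "yearly_salary": random.randint(30000, 150000),
        "city": fake.city(),
        "hire_date": hired,
        # Roughly 20% of employees get a termination date after their hire date.
        "termination_date": fake.date_between(start_date=hired) if random.random() < 0.2 else "",
    }

rows = [make_employee(i) for i in range(1, 1001)]  # 1,000 rows for this sketch
with open("data/employees_sample.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```

The real script additionally deduplicates names and phone numbers and splits the output into 12 CSV files.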

✅ Example Output:

mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 2: edit_data.py โ€” Edit Mock Data

๐Ÿ”ง Libraries that we are going to need:

Library Description
pandas For working with CSVs and DataFrames.
os For cross-platform file handling and directory management.
random For generate random numbers, shuffle data, and make random selections.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Reads all .csv files from a folder called data and saves enriched versions to data_enriched.

    • Reads a list of known female first names from a text file (female_names.txt) to help determine gender.

    • Provides a list of 20 possible job titles per department (Sales, IT, HR, etc.) to assign randomly.

    • For every CSV:

      • Adds a status column (Active or Inactive, depending on termination_date).
      • Adds a gender column based on the first name.
      • Adds a job_title column based on the department.
    • Writes the enriched data to a new CSV in the data_enriched folder and prints a confirmation (see the sketch below).
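
A simplified sketch of the enrichment step (the real logic is in python/edit_data.py; the job-title lists and file paths here are illustrative):

```python
import os
import random
import pandas as pd

# Illustrative job titles per department (the real script uses 20 per department).
JOB_TITLES = {
    "Sales": ["Account Executive", "Sales Analyst"],
    "IT": ["Software Engineer", "DevOps Engineer"],
    "HR": ["Recruiter", "HR Generalist"],
}

# Female first names loaded from a helper file, used to infer gender.
with open("female_names.txt") as f:
    female_names = {line.strip() for line in f if line.strip()}

os.makedirs("data_enriched", exist_ok=True)

for filename in os.listdir("data"):
    if not filename.endswith(".csv"):
        continue
    df = pd.read_csv(os.path.join("data", filename))

    # status: Active when there is no termination_date, Inactive otherwise.
    df["status"] = df["termination_date"].notna().map({True: "Inactive", False: "Active"})

    # gender: inferred from the first name (simplified heuristic).
    df["gender"] = df["name"].str.split().str[0].apply(
        lambda first: "Female" if first in female_names else "Male")

    # job_title: picked at random from the department's title list.
    df["job_title"] = df["department"].apply(
        lambda d: random.choice(JOB_TITLES.get(d, ["Employee"])))

    df.to_csv(os.path.join("data_enriched", filename), index=False)
    print(f"Enriched {filename}")
```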

✅ Example Output:

mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 3: insert.py โ€” Insert data into postgres

๐Ÿ”ง Libraries that we are going to need:

Library Description
pandas For working with CSVs and DataFrames.
sqlalchemy Python SQL toolkit and ORM.
psycopg2 PostgreSQL driver required by SQLAlchemy.
python-dotenv helps you load environment variables from .env file.
glob Standard library for file pattern matching.
os For cross-platform file handling and directory management.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Finds all CSV files in the ./data/ folder using glob.

    • Reads and combines all the CSVs into a single pandas DataFrame.

    • Creates a connection to a PostgreSQL database using SQLAlchemy.

    • Uploads the combined data to the employees table in the database (a minimal sketch follows below).
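
A minimal sketch of the load step (the database credentials and environment-variable names are assumptions; the real script reads its settings from a .env file via python-dotenv):

```python
import glob
import os
import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()  # expects DB credentials in a .env file (variable names assumed)

# Gather every CSV in ./data/ and concatenate into one DataFrame.
files = glob.glob("./data/*.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Build the SQLAlchemy engine from environment variables.
user = os.getenv("DB_USER", "postgres")
password = os.getenv("DB_PASSWORD", "")
host = os.getenv("DB_HOST", "localhost")
engine = create_engine(f"postgresql+psycopg2://{user}:{password}@{host}:5432/employees")

# Upload to the employees table (append keeps previously loaded batches).
df.to_sql("employees", engine, if_exists="append", index=False)
print(f"Inserted {len(df)} rows from {len(files)} files.")
```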

✅ Example Output:

mock_data

---

๐Ÿ‘ฉโ€๐Ÿ’ป Script 4: analysis.py โ€” First analysis of the data

๐Ÿ”ง Libraries that we are going to need:

Library Description
PySpark Apache Spark Python API (for big data).
matplotlib.pyplot To create visualizations (histograms and bar charts).
logging To track execution flow and info messages.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Reads multiple CSV files using PySpark and combines them into a single DataFrame.

    • Calculates the age of each employee based on their date of birth and shows basic statistics.

    • Generates age distribution plots using matplotlib (histogram + bar chart with labels).

    • Performs department and city analysis, including counts and turnover (employees who left).

    • Logs activity and minimizes Spark output verbosity for clarity (a condensed sketch follows below).
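
A condensed sketch of the PySpark analysis (the column names come from the data dictionary; the plotting details of the real script are omitted here):

```python
import logging
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("analysis")

spark = SparkSession.builder.appName("employee-analysis").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  # keep Spark's own output quiet

# Read and combine all CSVs into one DataFrame.
df = spark.read.csv("data/*.csv", header=True, inferSchema=True)
log.info("Loaded %d rows", df.count())

# Age from date of birth, then basic statistics.
df = df.withColumn(
    "age", F.floor(F.datediff(F.current_date(), F.to_date("date_birth")) / 365.25))
df.select("age").summary("min", "mean", "max").show()

# Department analysis: headcount and turnover (rows with a termination date).
(df.groupBy("department")
   .agg(F.count(F.lit(1)).alias("headcount"),
        F.count("termination_date").alias("left_company"))
   .orderBy(F.desc("headcount"))
   .show())

spark.stop()
```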

✅ Example Output:

mock_data


mock_data


mock_data


mock_data


mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 5: queries.py โ€” Create SQL queries

๐Ÿ”ง Libraries that we are going to need:

Library Description
psycopg2 PostgreSQL driver required by SQLAlchemy.
pandas For working with CSVs and DataFrames.
connection Custom local module to establish DB connection.
locale Built-in module for localization/formatting.
sys Built-in module to modify the system path.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Uses a custom connection() function to establish a PostgreSQL connection.

    • Tries to set locale to Spanish (es_ES.UTF-8) for formatting purposes.

    • Runs SQL queries using run_query(), returning results as a pandas DataFrame.

    • Includes several analysis functions (more to be added) by city, department, and age, calculating turnover rates and salaries for active employees.

    • Executes all analyses and prints them when the script is run directly (a minimal sketch of the query helper follows below).
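
A minimal sketch of the query helper and one analysis function (the actual SQL in queries.py may differ; the connection import path is an assumption based on the project's connection.py module):

```python
import pandas as pd
from connection import connection  # project-local module returning a DB connection (assumed import path)

def run_query(sql: str) -> pd.DataFrame:
    """Execute a SQL query and return the result as a pandas DataFrame."""
    conn = connection()
    try:
        return pd.read_sql_query(sql, conn)
    finally:
        conn.close()

def by_city() -> pd.DataFrame:
    """Active employees per city - a simplified version of the real query."""
    return run_query("""
        SELECT city, COUNT(*) AS active_employees
        FROM employees
        WHERE termination_date IS NULL
        GROUP BY city
        ORDER BY active_employees DESC;
    """)

if __name__ == "__main__":
    print(by_city().head(10))
```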

✅ Example Output:

  • by_city()

mock_data

  • by_department()

mock_data

  • by_age()

mock_data

  • salary_by_city()

mock_data

  • salary_by_department()

mock_data

  • salary_by_age()

mock_data

  • hired_and_terminated()

mock_data

  • hired_and_terminated_department()

mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 6: show_results.py โ€” Plot SQL queries

๐Ÿ”ง Libraries that we are going to need:

Library Description
matplotlib.pyplot To create visualizations (histograms and bar charts).
seaborn For making nice statistical plots easily.
queries Custom local module to establish DB connection.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Imports data from predefined SQL queries (like by_city, by_age, etc.) using custom functions.

    • Creates charts with Seaborn and Matplotlib to visualize employee data.

    • Plots bar charts for active employees and salaries by city and department.

    • Plots a line chart showing turnover rate by age, with value labels.

    • Plots a line chart showing yearly hires and terminations, including count labels (a short sketch of one plot follows below).
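
A short sketch of one of the plots (it assumes by_city() returns city and active_employees columns, as in the sketch from the previous section):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from queries import by_city  # project-local module with the SQL query functions

def plot_by_city():
    """Bar chart of active employees per city (simplified version)."""
    df = by_city()
    plt.figure(figsize=(10, 5))
    ax = sns.barplot(data=df, x="city", y="active_employees")
    ax.set_title("Active employees by city")
    ax.set_xlabel("City")
    ax.set_ylabel("Active employees")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    plot_by_city()
```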

✅ Example Output:

  • plot_by_city()

mock_data

  • plot_by_department()

mock_data

  • plot_by_age()

mock_data

  • plot_salary_by_city()

mock_data

  • plot_salary_by_department()

mock_data

  • plot_hired_and_terminated()

mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 7: prediction.py โ€” Predict employees hired and terminated

๐Ÿ”ง Libraries that we are going to need:

Library Description
sys Built-in module to modify the system path.
connection Custom local module to establish DB connection.
queries Custom local module to establish DB connection.
psycopg2 PostgreSQL driver required by SQLAlchemy.
pandas For working with CSVs and DataFrames.
locale Built-in module for localization/formatting.
matplotlib.pyplot To create visualizations (histograms and bar charts).
numpy For working with numerical data, especially arrays/matrices.
sklearn.linear_model To predict future values.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Connects to a database and gets data about how many people were hired and fired each year.

    • Learns the trend using machine learning (linear regression).

    • Predicts how many people will be hired and fired in the next 3 years.

    • Shows the results in a table.

    • Draws a chart to compare real and predicted numbers (a condensed sketch follows below).
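
A condensed sketch of the forecasting step (the real query behind hired_and_terminated() lives in queries.py; the column names year, hired, and terminated are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from queries import hired_and_terminated  # returns yearly hire/termination counts

df = hired_and_terminated()  # assumed columns: year, hired, terminated
X = df[["year"]].values
future_years = np.arange(df["year"].max() + 1, df["year"].max() + 4).reshape(-1, 1)

predictions = {"year": future_years.ravel()}
for target in ["hired", "terminated"]:
    model = LinearRegression().fit(X, df[target].values)  # learn the yearly trend
    predictions[target] = model.predict(future_years).round().astype(int)

forecast = pd.DataFrame(predictions)
print(forecast)  # table of predicted hires/terminations for the next 3 years

# Compare history with the forecast in one chart.
for target, color in [("hired", "tab:blue"), ("terminated", "tab:red")]:
    plt.plot(df["year"], df[target], marker="o", color=color, label=f"{target} (actual)")
    plt.plot(forecast["year"], forecast[target], marker="x", linestyle="--",
             color=color, label=f"{target} (predicted)")
plt.legend()
plt.title("Hires and terminations: history vs. 3-year forecast")
plt.tight_layout()
plt.show()
```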

✅ Example Output:

mock_data

mock_data


🔮 Future Enhancements

  • Add DBT models for transformation and documentation.
  • Streamline data generation for large-scale datasets.
  • Add Airflow DAG for orchestration.
  • Deploy insights via Looker Studio or Power BI dashboard.
