
🪪 Synthetic Employee Dataset: SQL, PySpark & ML Pipeline

SQL Mock Data

  • This project generates synthetic employee data using Python and Faker, stores it in a PostgreSQL database, and performs analytics and machine learning modeling using PySpark and Scikit-learn. It's designed for data engineering and data science practice, focusing on realistic HR-style datasets and workflows.

  • Key features include:

    • Synthetic data generation with customizable logic
    • PostgreSQL integration
    • PySpark data processing and transformations
    • Predictive modeling for employee attrition

🚀 Getting Started

1. Clone the repository

git clone https://github.com/CamilaJaviera91/sql-mock-data.git

2. Navigate to the project folder

cd your/route/sql-mock-data

3. Create a file named requirements.txt with the following content:

pandas
numpy
faker
psycopg2-binary
pyspark
scikit-learn
matplotlib
seaborn

4. Create a virtual environment and install dependencies

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

5. Set up the PostgreSQL database

  1. Create a new database called employees (see the sketch below for doing this from Python).

  2. Generate the mock data.

  3. Insert the mock data into the new schema (steps 2 and 3 are covered by the commands in step 6).
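
If you prefer to create the database from Python instead of the psql shell, here is a minimal sketch using psycopg2 (the user, password, and host values are placeholders; adjust them to your setup):

```python
# Minimal sketch: create the "employees" database with psycopg2.
# Assumes a local PostgreSQL server; credentials below are placeholders.
import psycopg2

# Connect to the default "postgres" database first; CREATE DATABASE
# cannot run inside a transaction, so enable autocommit.
conn = psycopg2.connect(dbname="postgres", user="postgres",
                        password="your_password", host="localhost")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE DATABASE employees;")

conn.close()
print("Database 'employees' created.")
```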

6. Generate and insert mock data into the database

python your/route/sql-mock-data/sql_mock_data.py

python your/route/sql-mock-data/insert.py

📚 Data Dictionary

| Column | Description | Type |
| --- | --- | --- |
| id | Unique identifier | Integer |
| name | Full name of the employee | Text |
| date_birth | Date of birth of the employee | Date |
| department | Department where the employee works | Text |
| email | Employee work email | Text |
| phonenumber | Work phone number of the employee | Text |
| yearly_salary | Yearly salary in USD | Integer |
| city | City where the employee lives | Text |
| hire_date | Date when the employee was hired | Date |
| termination_date | Date when the employee was terminated | Date |

๐Ÿ“ Project Structure

sql-mock-data/
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ *.csv                  # Synthetic employee data files
โ”œโ”€โ”€ images/
โ”‚   โ””โ”€โ”€ pic*.png               # Visualizations and example outputs
โ”œโ”€โ”€ python/
โ”‚   โ”œโ”€โ”€ sql_mock_data.py       # Script to generate synthetic data
โ”‚   โ”œโ”€โ”€ insert.py              # Script to insert data into PostgreSQL
โ”‚   โ”œโ”€โ”€ analysis.py            # Data analysis using PySpark
โ”‚   โ”œโ”€โ”€ queries.py             # SQL queries for data retrieval
โ”‚   โ”œโ”€โ”€ show_results.py        # Visualization of query results
โ”‚   โ””โ”€โ”€ connection.py          # Database connection setup
โ”œโ”€โ”€ sql/
โ”‚   โ””โ”€โ”€ schema.sql             # SQL schema definitions
โ”œโ”€โ”€ .gitignore                 # Specifies files to ignore in Git
โ””โ”€โ”€ README.md                  # Project documentation

🔥 Introduction to PySpark

  • PySpark is the Python API for Apache Spark, which lets you use Spark's distributed engine from Python.

🔑 Key Features:

  1. Distributed Computing: Processes large datasets across a cluster of computers for scalability.

  2. In-Memory Processing: Speeds up computation by reducing disk I/O.

  3. Lazy Evaluation: Operations are only executed when an action is triggered, optimizing performance.

  4. Rich Libraries:

    • Spark SQL: Structured data processing (like SQL operations).
    • MLlib: Machine learning library for scalable algorithms.
    • GraphX: Graph processing (via RDD API).
    • Spark Streaming: Real-time stream processing.
  5. Compatibility: Works with Hadoop, HDFS, Hive, Cassandra, etc.

  6. Resilient Distributed Datasets (RDDs): Low-level API for distributed data handling.

  7. DataFrames & Datasets: High-level APIs for structured data with SQL-like operations (a short example follows below).
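
A minimal sketch of these ideas in practice, assuming PySpark is installed locally (the file path is illustrative; the column names come from the data dictionary above):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session.
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# Read a CSV into a DataFrame (path is illustrative).
df = spark.read.csv("data/*.csv", header=True, inferSchema=True)

# Transformations are lazy: nothing runs until an action is called.
by_dept = (df.groupBy("department")
             .agg(F.count(F.lit(1)).alias("employees"),
                  F.avg("yearly_salary").alias("avg_salary")))

# Spark SQL over the same data.
df.createOrReplaceTempView("employees")
top_cities = spark.sql(
    "SELECT city, COUNT(*) AS n FROM employees GROUP BY city ORDER BY n DESC")

# Actions trigger the actual computation.
by_dept.show(5)
top_cities.show(5)
spark.stop()
```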

✅ Pros / ❌ Cons

| Pros | Cons |
| --- | --- |
| Handles massive datasets efficiently. | Can be memory-intensive. |
| Compatible with many tools (Hadoop, Cassandra, etc.). | Complex configuration for cluster environments. |
| Built-in libraries for SQL and machine learning. | |

🔧 Install pyspark

  1. Install via pip:

pip install pyspark

  2. Verify the installation:

python3 -c "import pyspark; print(pyspark.__version__)"

๐Ÿ—ƒ๏ธ Introduction to SQL (Structured Query Language)

  • SQL is how we read, write, and manage data stored in databases.

๐Ÿ”‘ Key Features:

  1. Data Querying: You can retrieve exactly the data you need using the SELECT statement.

SELECT * FROM employees WHERE department = 'HR';

  2. Data Manipulation: SQL lets you insert, update, or delete records.

    • INSERT
    • UPDATE
    • DELETE
  3. Data Definition: You can create or change the structure of tables and databases.

    • CREATE
    • ALTER
    • DROP
  4. Data Control: SQL allows you to control access to the data.

    • GRANT
    • REVOKE
  5. Transaction Control: Manage multiple steps as a single unit.

    • BEGIN
    • COMMIT
    • ROLLBACK
  6. Filtering and Sorting:

    • WHERE
    • ORDER BY
    • GROUP BY
    • HAVING
  7. Joins: Combine data from multiple tables.

  8. Built-in Functions: SQL includes powerful functions for calculations, text handling, dates, etc.

  9. Standardized Language: SQL is used across most relational database systems (like PostgreSQL, MySQL, SQL Server, etc.), with only slight differences.

  10. Declarative Nature: You tell SQL what you want, not how to do it. The database figures out the best way. (A short Python example tying several of these features to this project's employees table follows below.)
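
A hedged sketch that exercises querying, manipulation, and transaction control from Python with psycopg2 (the connection details are placeholders, and the column names are taken from the data dictionary above):

```python
import psycopg2

# Connection parameters are placeholders; adjust to your setup.
conn = psycopg2.connect(dbname="employees", user="postgres",
                        password="your_password", host="localhost")

with conn.cursor() as cur:
    # Data querying: parameterized SELECT with filtering and sorting.
    cur.execute(
        "SELECT name, city, yearly_salary FROM employees "
        "WHERE department = %s ORDER BY yearly_salary DESC LIMIT 5",
        ("HR",))
    for row in cur.fetchall():
        print(row)

    # Data manipulation inside a transaction: psycopg2 opens one implicitly,
    # and we roll it back below so this demo leaves no trace.
    cur.execute(
        "UPDATE employees SET yearly_salary = yearly_salary * 1.05 "
        "WHERE department = %s", ("HR",))

conn.rollback()  # transaction control: undo the UPDATE
conn.close()
```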

✅ Pros / ❌ Cons

| Pros | Cons |
| --- | --- |
| Easy to Learn and Use. | Not Ideal for Complex Logic. |
| Efficient Data Management. | Different Dialects. |
| Powerful Querying Capabilities. | Can Get Complicated. |
| Standardized Language. | Limited for Unstructured Data. |
| Scalable. | Performance Tuning Required. |
| Secure. | |
| Supports Transactions. | |

๐Ÿณ Introduction to Docker

  • Docker is a tool that lets you package your app with everything it needs, so it can run anywhere, without problems.

  • It does this using something called containers, which are like small, lightweight virtual machines.

🔑 Key Features:

  1. Containers: Run apps in isolated environments.

  2. Images: Blueprints for containers (created using a Dockerfile).

  3. Portability: Works the same on any system with Docker.

  4. Speed: Starts apps quickly.

  5. Docker Hub: A place to share and download app images.

✅ Pros / ❌ Cons

| Pros | Cons |
| --- | --- |
| Works the same everywhere. | Takes some time to learn. |
| Fast and lightweight. | Not ideal for apps that need a full operating system. |
| Easy to share apps. | Security risks if not set up properly. |
| Good for automating deployments. | Managing data storage can be tricky. |
| Great for teams working together. | |

🔧 Install Docker on Fedora

  1. Update the system:

sudo dnf update -y

  2. Install necessary packages for using HTTPS repositories:

sudo dnf install dnf-plugins-core -y

  3. Add the official Docker repository:

sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo

  4. Install Docker Engine:

sudo dnf install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y

  5. Enable and start the Docker service:

sudo systemctl enable docker
sudo systemctl start docker

  6. Verify that Docker is running:

sudo docker run hello-world

  7. (Optional) Run Docker without sudo:

sudo usermod -aG docker $USER

Then, log out and log back in (or reboot your system) for the change to take effect.


๐Ÿ› ๏ธ Code Explanation

๐Ÿ‘ฉโ€๐Ÿ’ป Script 1: sql_mock_data.py โ€” Generate Mock Data

๐Ÿ”ง Libraries that we are going to need:

Library Description
PySpark Apache Spark Python API (for big data).
Faker Fake data generator (used for names, etc.).
unidecode Removes accents from characters (e.g., รฉ โ†’ e).
random For generating random numbers, probabilities, selections, etc.
os For cross-platform file handling and directory management.
shutil For managing file system operations in automation scripts.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Creates 1 million fake employee records, each with realistic personal and job data.

    • Saves them across 12 cleanly named CSV files.

    • Makes sure names and phone numbers are unique.

    • Can easily be scaled up or reused for testing, demos, or training (a simplified sketch follows below).
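
The real script lives in python/sql_mock_data.py; the following is only a simplified sketch of the approach (the record count, file name, and value ranges here are illustrative, not the script's exact logic):

```python
import csv
import random
from faker import Faker
from unidecode import unidecode

fake = Faker()
DEPARTMENTS = ["Sales", "IT", "HR", "Finance", "Marketing"]

def make_employee(emp_id):
    """Build one synthetic employee row (simplified version of the real script)."""
    name = unidecode(fake.name())
    hired = fake.date_between(start_date="-10y", end_date="today")
    return {
        "id": emp_id,
        "name": name,
        "date_birth": fake.date_of_birth(minimum_age=20, maximum_age=65),
        "department": random.choice(DEPARTMENTS),
        "email": f"{name.lower().replace(' ', '.')}@company.com",
        "phonenumber": fake.phone_number(),
        "yearly_salary": random.randint(30000, 150000),
        "city": fake.city(),
        "hire_date": hired,
        # Roughly 20% of employees get a termination date after their hire date.
        "termination_date": fake.date_between(start_date=hired) if random.random() < 0.2 else "",
    }

rows = [make_employee(i) for i in range(1, 1001)]  # 1,000 rows for this sketch
with open("data/employees_sample.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```

The real script additionally deduplicates names and phone numbers and splits the output into 12 CSV files.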

✅ Example Output:

mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 2: edit_data.py โ€” Edit Mock Data

๐Ÿ”ง Libraries that we are going to need:

Library Description
pandas For working with CSVs and DataFrames.
os For cross-platform file handling and directory management.
random For generate random numbers, shuffle data, and make random selections.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Reads all .csv files from a folder called data and saves enriched versions to data_enriched.

    • Reads a list of known female first names from a text file (female_names.txt) to help determine gender.

    • Provides a list of 20 possible job titles per department (Sales, IT, HR, etc.) to assign randomly.

    • For every CSV:

      • Adds a status column (Active or Inactive, depending on termination_date).
      • Adds a gender column based on the first name.
      • Adds a job_title column based on the department.
    • Writes the enriched data to a new CSV in the data_enriched folder and prints a confirmation (see the sketch below).
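
A simplified sketch of the enrichment step (the real logic is in python/edit_data.py; the job-title lists and file paths here are illustrative):

```python
import os
import random
import pandas as pd

# Illustrative job titles per department (the real script uses 20 per department).
JOB_TITLES = {
    "Sales": ["Account Executive", "Sales Analyst"],
    "IT": ["Software Engineer", "DevOps Engineer"],
    "HR": ["Recruiter", "HR Generalist"],
}

# Female first names loaded from a helper file, used to infer gender.
with open("female_names.txt") as f:
    female_names = {line.strip() for line in f if line.strip()}

os.makedirs("data_enriched", exist_ok=True)

for filename in os.listdir("data"):
    if not filename.endswith(".csv"):
        continue
    df = pd.read_csv(os.path.join("data", filename))

    # status: Active when there is no termination_date, Inactive otherwise.
    df["status"] = df["termination_date"].notna().map({True: "Inactive", False: "Active"})

    # gender: inferred from the first name (simplified heuristic).
    df["gender"] = df["name"].str.split().str[0].apply(
        lambda first: "Female" if first in female_names else "Male")

    # job_title: picked at random from the department's title list.
    df["job_title"] = df["department"].apply(
        lambda d: random.choice(JOB_TITLES.get(d, ["Employee"])))

    df.to_csv(os.path.join("data_enriched", filename), index=False)
    print(f"Enriched {filename}")
```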

✅ Example Output:

mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 3: insert.py โ€” Insert data into postgres

๐Ÿ”ง Libraries that we are going to need:

Library Description
pandas For working with CSVs and DataFrames.
sqlalchemy Python SQL toolkit and ORM.
psycopg2 PostgreSQL driver required by SQLAlchemy.
python-dotenv helps you load environment variables from .env file.
glob Standard library for file pattern matching.
os For cross-platform file handling and directory management.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Finds all CSV files in the ./data/ folder using glob.

    • Reads and combines all the CSVs into a single pandas DataFrame.

    • Creates a connection to a PostgreSQL database using SQLAlchemy.

    • Uploads the combined data to the employees table in the database (a minimal sketch follows below).
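
A minimal sketch of the load step (the database credentials and environment-variable names are assumptions; the real script reads its settings from a .env file via python-dotenv):

```python
import glob
import os
import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()  # expects DB credentials in a .env file (variable names assumed)

# Gather every CSV in ./data/ and concatenate into one DataFrame.
files = glob.glob("./data/*.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Build the SQLAlchemy engine from environment variables.
user = os.getenv("DB_USER", "postgres")
password = os.getenv("DB_PASSWORD", "")
host = os.getenv("DB_HOST", "localhost")
engine = create_engine(f"postgresql+psycopg2://{user}:{password}@{host}:5432/employees")

# Upload to the employees table (append keeps previously loaded batches).
df.to_sql("employees", engine, if_exists="append", index=False)
print(f"Inserted {len(df)} rows from {len(files)} files.")
```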

✅ Example Output:

mock_data

---

๐Ÿ‘ฉโ€๐Ÿ’ป Script 4: analysis.py โ€” First analysis of the data

๐Ÿ”ง Libraries that we are going to need:

Library Description
PySpark Apache Spark Python API (for big data).
matplotlib.pyplot To create visualizations (histograms and bar charts).
logging To track execution flow and info messages.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Reads multiple CSV files using PySpark and combines them into a single DataFrame.

    • Calculates the age of each employee based on their date of birth and shows basic statistics.

    • Generates age distribution plots using matplotlib (histogram + bar chart with labels).

    • Performs department and city analysis, including counts and turnover (employees who left).

    • Logs activity and minimizes Spark output verbosity for clarity (a condensed sketch follows below).
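
A condensed sketch of the PySpark analysis (the column names come from the data dictionary; the plotting details of the real script are omitted here):

```python
import logging
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("analysis")

spark = SparkSession.builder.appName("employee-analysis").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  # keep Spark's own output quiet

# Read and combine all CSVs into one DataFrame.
df = spark.read.csv("data/*.csv", header=True, inferSchema=True)
log.info("Loaded %d rows", df.count())

# Age from date of birth, then basic statistics.
df = df.withColumn(
    "age", F.floor(F.datediff(F.current_date(), F.to_date("date_birth")) / 365.25))
df.select("age").summary("min", "mean", "max").show()

# Department analysis: headcount and turnover (rows with a termination date).
(df.groupBy("department")
   .agg(F.count(F.lit(1)).alias("headcount"),
        F.count("termination_date").alias("left_company"))
   .orderBy(F.desc("headcount"))
   .show())

spark.stop()
```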

✅ Example Output:

mock_data


mock_data


mock_data


mock_data


mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 5: queries.py โ€” Create SQL queries

๐Ÿ”ง Libraries that we are going to need:

Library Description
psycopg2 PostgreSQL driver required by SQLAlchemy.
pandas For working with CSVs and DataFrames.
connection Custom local module to establish DB connection.
locale Built-in module for localization/formatting.
sys Built-in module to modify the system path.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Uses a custom connection() function to establish a PostgreSQL connection.

    • Tries to set locale to Spanish (es_ES.UTF-8) for formatting purposes.

    • Runs SQL queries using run_query(), returning results as a pandas DataFrame.

    • Includes several analysis functions (more to be added) by city, department, and age, calculating turnover rates and salaries for active employees.

    • Executes all analyses and prints them when the script is run directly (a minimal sketch of the query helper follows below).
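
A minimal sketch of the query helper and one analysis function (the actual SQL in queries.py may differ; the connection import path is an assumption based on the project's connection.py module):

```python
import pandas as pd
from connection import connection  # project-local module returning a DB connection (assumed import path)

def run_query(sql: str) -> pd.DataFrame:
    """Execute a SQL query and return the result as a pandas DataFrame."""
    conn = connection()
    try:
        return pd.read_sql_query(sql, conn)
    finally:
        conn.close()

def by_city() -> pd.DataFrame:
    """Active employees per city - a simplified version of the real query."""
    return run_query("""
        SELECT city, COUNT(*) AS active_employees
        FROM employees
        WHERE termination_date IS NULL
        GROUP BY city
        ORDER BY active_employees DESC;
    """)

if __name__ == "__main__":
    print(by_city().head(10))
```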

✅ Example Output:

  • by_city()

mock_data

  • by_department()

mock_data

  • by_age()

mock_data

  • salary_by_city()

mock_data

  • salary_by_department()

mock_data

  • salary_by_age()

mock_data

  • hired_and_terminated()

mock_data

  • hired_and_terminated_department()

mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 6: show_results.py โ€” Plot SQL queries

๐Ÿ”ง Libraries that we are going to need:

Library Description
matplotlib.pyplot To create visualizations (histograms and bar charts).
seaborn For making nice statistical plots easily.
queries Custom local module to establish DB connection.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Imports data from predefined SQL queries (like by_city, by_age, etc.) using custom functions.

    • Creates charts with Seaborn and Matplotlib to visualize employee data.

    • Plots bar charts for active employees and salaries by city and department.

    • Plots a line chart showing turnover rate by age, with value labels.

    • Plots a line chart showing yearly hires and terminations, including count labels (a short sketch of one plot follows below).
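
A short sketch of one of the plots (it assumes by_city() returns city and active_employees columns, as in the sketch from the previous section):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from queries import by_city  # project-local module with the SQL query functions

def plot_by_city():
    """Bar chart of active employees per city (simplified version)."""
    df = by_city()
    plt.figure(figsize=(10, 5))
    ax = sns.barplot(data=df, x="city", y="active_employees")
    ax.set_title("Active employees by city")
    ax.set_xlabel("City")
    ax.set_ylabel("Active employees")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    plot_by_city()
```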

✅ Example Output:

  • plot_by_city()

mock_data

  • plot_by_department()

mock_data

  • plot_by_age()

mock_data

  • plot_salary_by_city()

mock_data

  • plot_salary_by_department()

mock_data

  • plot_hired_and_terminated()

mock_data


๐Ÿ‘ฉโ€๐Ÿ’ป Script 7: prediction.py โ€” Predict employees hired and terminated

๐Ÿ”ง Libraries that we are going to need:

Library Description
sys Built-in module to modify the system path.
connection Custom local module to establish DB connection.
queries Custom local module to establish DB connection.
psycopg2 PostgreSQL driver required by SQLAlchemy.
pandas For working with CSVs and DataFrames.
locale Built-in module for localization/formatting.
matplotlib.pyplot To create visualizations (histograms and bar charts).
numpy For working with numerical data, especially arrays/matrices.
sklearn.linear_model To predict future values.

๐Ÿ“– Explanation of the Code:

  • This script:

    • Connects to a database and gets data about how many people were hired and fired each year.

    • Learns the trend using machine learning (linear regression).

    • Predicts how many people will be hired and fired in the next 3 years.

    • Shows the results in a table.

    • Draws a chart to compare real and predicted numbers (a condensed sketch follows below).
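
A condensed sketch of the forecasting step (the real query behind hired_and_terminated() lives in queries.py; the column names year, hired, and terminated are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from queries import hired_and_terminated  # returns yearly hire/termination counts

df = hired_and_terminated()  # assumed columns: year, hired, terminated
X = df[["year"]].values
future_years = np.arange(df["year"].max() + 1, df["year"].max() + 4).reshape(-1, 1)

predictions = {"year": future_years.ravel()}
for target in ["hired", "terminated"]:
    model = LinearRegression().fit(X, df[target].values)  # learn the yearly trend
    predictions[target] = model.predict(future_years).round().astype(int)

forecast = pd.DataFrame(predictions)
print(forecast)  # table of predicted hires/terminations for the next 3 years

# Compare history with the forecast in one chart.
for target, color in [("hired", "tab:blue"), ("terminated", "tab:red")]:
    plt.plot(df["year"], df[target], marker="o", color=color, label=f"{target} (actual)")
    plt.plot(forecast["year"], forecast[target], marker="x", linestyle="--",
             color=color, label=f"{target} (predicted)")
plt.legend()
plt.title("Hires and terminations: history vs. 3-year forecast")
plt.tight_layout()
plt.show()
```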

✅ Example Output:

mock_data

mock_data


🔮 Future Enhancements

  • Add DBT models for transformation and documentation.
  • Streamline data generation for large-scale datasets.
  • Add Airflow DAG for orchestration.
  • Deploy insights via Looker Studio or Power BI dashboard.
