This project generates synthetic employee data using Python and Faker, stores it in a PostgreSQL database, and performs analytics and machine learning modeling using PySpark and Scikit-learn. It's designed for data engineering and data science practice, focusing on realistic HR-style datasets and workflows.
Key features include:
- Synthetic data generation with customizable logic
- PostgreSQL integration
- PySpark data processing and transformations
- Predictive modeling for employee attrition
git clone https://github.com/CamilaJaviera91/sql-mock-data.git
cd your/route/sql-mock-data
pandas
numpy
faker
psycopg2-binary
pyspark
scikit-learn
matplotlib
seaborn
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Create a new database called employees.
- Generate mock data.
- Insert mock data into our new schema.
python your/route/sql-mock-data/sql_mock_data.py
python your/route/sql-mock-data/insert.py
Column | Description | Type |
---|---|---|
id | Unique identifier | Integer |
name | Full name of the employee | Text |
date_birth | Date of birth of the employee | Date |
department | Department where the employee works | Text |
email | Employee work email | Text |
phonenumber | Work phone number of the employee | Text |
yearly_salary | Yearly salary in USD | Integer |
city | City where the employee lives | Text |
hire_date | Date when the employee was hired | Date |
termination_date | Date when the employee was terminated | Date |
sql-mock-data/
├── data/
│   └── *.csv              # Synthetic employee data files
├── images/
│   └── pic*.png           # Visualizations and example outputs
├── python/
│   ├── sql_mock_data.py   # Script to generate synthetic data
│   ├── insert.py          # Script to insert data into PostgreSQL
│   ├── analysis.py        # Data analysis using PySpark
│   ├── queries.py         # SQL queries for data retrieval
│   ├── show_results.py    # Visualization of query results
│   └── connection.py      # Database connection setup
├── sql/
│   └── schema.sql         # SQL schema definitions
├── .gitignore             # Specifies files to ignore in Git
└── README.md              # Project documentation
- PySpark is the Python API for Apache Spark, enabling the use of Spark with Python.
- Distributed Computing: Processes large datasets across a cluster of computers for scalability.
- In-Memory Processing: Speeds up computation by reducing disk I/O.
- Lazy Evaluation: Operations are only executed when an action is triggered, optimizing performance.
- Rich Libraries:
  - Spark SQL: Structured data processing (like SQL operations).
  - MLlib: Machine learning library for scalable algorithms.
  - GraphX: Graph processing (via RDD API).
  - Spark Streaming: Real-time stream processing.
- Compatibility: Works with Hadoop, HDFS, Hive, Cassandra, etc.
- Resilient Distributed Datasets (RDDs): Low-level API for distributed data handling.
- DataFrames & Datasets: High-level APIs for structured data with SQL-like operations (a short example follows this list).
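Lazy evaluation and the DataFrame API are easiest to see in a small example. The sketch below assumes the CSV layout and columns described in this README (data/*.csv with department and yearly_salary); it is illustrative only, not one of the repository's scripts.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations (filter, groupBy, agg) only build a logical plan;
# they are not executed until an action runs.
employees = spark.read.csv("data/*.csv", header=True, inferSchema=True)
avg_salary = (
    employees
    .filter(F.col("yearly_salary") > 50000)
    .groupBy("department")
    .agg(F.avg("yearly_salary").alias("avg_salary"))
)

# The action (show) triggers execution of the whole plan at once.
avg_salary.show()
spark.stop()
```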
Pros | Cons |
---|---|
Handles massive datasets efficiently. | Can be memory-intensive. |
Compatible with many tools (Hadoop, Cassandra, etc.). | Complex configuration for cluster environments. |
Built-in libraries for SQL, Machine Learning. | |
- Install via pip
pip install pyspark
- Verify installation
python3 -c "import pyspark; print(pyspark.__version__)"
- SQL is how we read, write, and manage data stored in databases.
- Data Querying: You can retrieve exactly the data you need using the SELECT statement.
SELECT * FROM employees WHERE department = 'HR';
- Data Manipulation: SQL lets you insert, update, or delete records.
  - INSERT
  - UPDATE
  - DELETE
- Data Definition: You can create or change the structure of tables and databases.
  - CREATE
  - ALTER
  - DROP
- Data Control: SQL allows you to control access to the data.
  - GRANT
  - REVOKE
- Transaction Control: Manage multiple steps as a single unit (see the sketch after this list).
  - BEGIN
  - COMMIT
  - ROLLBACK
- Filtering and Sorting:
  - WHERE
  - ORDER BY
  - GROUP BY
  - HAVING
- Joins: Combine data from multiple tables.
- Built-in Functions: SQL includes powerful functions for calculations, text handling, dates, etc.
- Standardized Language: SQL is used across most relational database systems (like PostgreSQL, MySQL, SQL Server, etc.), with only slight differences.
- Declarative Nature: You tell SQL what you want, not how to do it. The database figures out the best way.
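To make the data-manipulation and transaction-control categories concrete, here is a minimal sketch using psycopg2 (already listed in the requirements). The connection parameters are placeholders, and it assumes the employees table described earlier with an auto-generated id; adapt both to your own setup.

```python
import psycopg2

conn = psycopg2.connect(
    dbname="employees", user="postgres", password="postgres", host="localhost"
)
try:
    with conn.cursor() as cur:
        # Data Manipulation: INSERT and UPDATE as part of one transaction.
        cur.execute(
            "INSERT INTO employees (name, department, yearly_salary) VALUES (%s, %s, %s)",
            ("Jane Doe", "HR", 60000),
        )
        cur.execute(
            "UPDATE employees SET yearly_salary = yearly_salary + 1000 WHERE department = %s",
            ("HR",),
        )
    conn.commit()      # Transaction Control: make both changes permanent together.
except Exception:
    conn.rollback()    # If anything fails, undo the whole transaction.
    raise
finally:
    conn.close()
```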
Pros | Cons |
---|---|
Easy to Learn and Use. | Not Ideal for Complex Logic. |
Efficient Data Management. | Different Dialects. |
Powerful Querying Capabilities. | Can Get Complicated. |
Standardized Language. | Limited for Unstructured Data. |
Scalable. | Performance Tuning Required. |
Secure. | |
Supports Transactions. | |
- Docker is a tool that lets you package your app with everything it needs, so it can run anywhere, without problems.
- It does this using something called containers, which are like small, lightweight virtual machines.
- Containers: Run apps in isolated environments.
- Images: Blueprints for containers (created using a Dockerfile).
- Portability: Works the same on any system with Docker.
- Speed: Starts apps quickly.
- Docker Hub: A place to share and download app images.
Pros | Cons |
---|---|
Works the same everywhere. | Takes some time to learn. |
Fast and lightweight. | Not ideal for apps that need a full operating system. |
Easy to share apps. | Security risks if not set up properly. |
Good for automating deployments. | Managing data storage can be tricky. |
Great for teams working together. | |
- Update the system:
sudo dnf update -y
- Install necessary packages for using HTTPS repositories:
sudo dnf install dnf-plugins-core -y
- Add the official Docker repository:
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
- Install Docker Engine:
sudo dnf install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
- Enable and start the Docker service:
sudo systemctl enable docker
sudo systemctl start docker
- Verify that Docker is running:
sudo docker run hello-world
- (Optional) Run Docker without sudo. If you want to use Docker without typing sudo every time:
sudo usermod -aG docker $USER
Then, log out and log back in (or reboot your system) for the change to take effect.
Library | Description |
---|---|
PySpark | Apache Spark Python API (for big data). |
Faker | Fake data generator (used for names, etc.). |
unidecode | Removes accents from characters (e.g., é → e). |
random | For generating random numbers, probabilities, selections, etc. |
os | For cross-platform file handling and directory management. |
shutil | For managing file system operations in automation scripts. |
This script:
- Creates 1 million fake employee records, each with realistic personal and job data.
- Saves them across 12 cleanly named CSV files.
- Makes sure names and phone numbers are unique.
- Can be scaled easily or reused for testing, demos, or training (a minimal sketch follows).
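The sketch below shows the general pattern the generator follows (Faker for personal data, random for departments and salaries). The field names match the table above, but the row count, department list, and output file name are simplified placeholders rather than the real script's values.

```python
import os
import random
import pandas as pd
from faker import Faker

fake = Faker()
rows = []
for i in range(1, 1001):                       # the real script targets ~1,000,000 rows
    rows.append({
        "id": i,
        "name": fake.unique.name(),            # Faker's unique proxy avoids duplicate names
        "date_birth": fake.date_of_birth(minimum_age=20, maximum_age=65),
        "department": random.choice(["Sales", "IT", "HR", "Finance"]),
        "phonenumber": fake.unique.phone_number(),
        "yearly_salary": random.randint(30000, 120000),
        "city": fake.city(),
        "hire_date": fake.date_between(start_date="-10y", end_date="today"),
    })

os.makedirs("data", exist_ok=True)
pd.DataFrame(rows).to_csv("data/employees_part_01.csv", index=False)
```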
Library | Description |
---|---|
pandas | For working with CSVs and DataFrames. |
os | For cross-platform file handling and directory management. |
random | For generating random numbers, shuffling data, and making random selections. |
This script:
- Reads all .csv files from a folder called data and saves enriched versions to data_enriched.
- Reads a list of known female first names from a text file (female_names.txt) to help determine gender.
- Provides a list of 20 possible job titles for each department (Sales, IT, HR, etc.) to assign randomly.
- For every CSV:
  - Adds a status column (Active or Inactive, depending on termination_date).
  - Adds a gender column using the first name.
  - Adds a job_title column based on the department.
- Writes the enriched data to a new CSV in the data_enriched folder and prints a confirmation (see the sketch below).
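A hedged sketch of the enrichment pass: the output columns (status, gender, job_title) come from the description above, but the job-title pools, the gender values, and the file handling are simplified assumptions, not copies of the real script.

```python
import os
import random
import pandas as pd

# Female first names used to infer gender (one name per line).
with open("female_names.txt") as f:
    female_names = {line.strip().lower() for line in f if line.strip()}

# Simplified job-title pools; the real script has 20 titles per department.
job_titles = {
    "Sales": ["Account Executive", "Sales Analyst"],
    "IT": ["Developer", "Data Engineer"],
    "HR": ["Recruiter", "HR Analyst"],
}

os.makedirs("data_enriched", exist_ok=True)
for file_name in os.listdir("data"):
    if not file_name.endswith(".csv"):
        continue
    df = pd.read_csv(os.path.join("data", file_name))
    df["status"] = df["termination_date"].isna().map({True: "Active", False: "Inactive"})
    first_names = df["name"].str.split().str[0].str.lower()
    df["gender"] = first_names.map(lambda n: "Female" if n in female_names else "Male")
    df["job_title"] = df["department"].map(lambda d: random.choice(job_titles.get(d, ["Employee"])))
    df.to_csv(os.path.join("data_enriched", file_name), index=False)
    print(f"Enriched {file_name}")
```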
Library | Description |
---|---|
pandas | For working with CSVs and DataFrames. |
sqlalchemy | Python SQL toolkit and ORM. |
psycopg2 | PostgreSQL driver required by SQLAlchemy. |
python-dotenv | Loads environment variables from a .env file. |
glob | Standard library for file pattern matching. |
os | For cross-platform file handling and directory management. |
This script:
- Finds all CSV files in the ./data/ folder using glob.
- Reads and combines all the CSVs into a single pandas DataFrame.
- Creates a connection to a PostgreSQL database using SQLAlchemy.
- Uploads the combined data to the employees table in the database (see the sketch below).
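A rough sketch of that load step. The DATABASE_URL variable name and the fallback connection string are placeholders, not necessarily what the project's .env file defines.

```python
import glob
import os
import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()  # reads variables from a local .env file
engine = create_engine(
    os.getenv("DATABASE_URL", "postgresql+psycopg2://postgres:postgres@localhost:5432/employees")
)

# Find every CSV in ./data/, combine them, and push the result to PostgreSQL.
frames = [pd.read_csv(path) for path in glob.glob("./data/*.csv")]
combined = pd.concat(frames, ignore_index=True)
combined.to_sql("employees", engine, if_exists="append", index=False)
print(f"Inserted {len(combined)} rows into employees")
```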
Library | Description |
---|---|
PySpark | Apache Spark Python API (for big data). |
matplotlib.pyplot | To create visualizations (histograms and bar charts). |
logging | To track execution flow and info messages. |
This script:
- Reads multiple CSV files using PySpark and combines them into a single DataFrame.
- Calculates the age of each employee based on their date of birth and shows basic statistics.
- Generates age distribution plots using matplotlib (histogram + bar chart with labels).
- Performs department and city analysis, including counts and turnover (employees who left).
- Logs activity and minimizes Spark output verbosity for clarity (a condensed sketch follows).
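A condensed sketch of the age and turnover calculations described above; plotting and logging are omitted, and the exact column expressions are assumptions rather than copies of analysis.py.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("employee-analysis").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")   # keep Spark output quiet

df = spark.read.csv("data/*.csv", header=True, inferSchema=True)

# Approximate age in whole years from the date of birth.
df = df.withColumn(
    "age",
    F.floor(F.datediff(F.current_date(), F.to_date("date_birth")) / 365.25)
)
df.select(F.min("age"), F.avg("age"), F.max("age")).show()

# Headcount and turnover (rows with a termination date) per department.
df.groupBy("department").agg(
    F.count("*").alias("employees"),
    F.count("termination_date").alias("departed"),
).show()
spark.stop()
```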
Library | Description |
---|---|
psycopg2 | PostgreSQL database driver used to connect to the database. |
pandas | For working with CSVs and DataFrames. |
connection | Custom local module to establish DB connection. |
locale | Built-in module for localization/formatting. |
sys | Built-in module to modify the system path. |
This script:
- Uses a custom connection() function to establish a PostgreSQL connection.
- Tries to set the locale to Spanish (es_ES.UTF-8) for formatting purposes.
- Runs SQL queries using run_query(), returning results as a pandas DataFrame.
- Includes several analysis functions (more to be added) by city, department, and age, calculating turnover rates and salaries for active employees.
- Executes all analyses and prints them when the script is run directly (a sketch follows the function list below).

Functions:
- by_city()
- by_department()
- by_age()
- salary_by_city()
- salary_by_department()
- salary_by_age()
- hired_and_terminated()
- hired_and_terminated_department()
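A hedged sketch of how run_query() and one of the listed helpers might be structured. The function names come from this README, but their exact signatures, the SQL text, and the way connection() is imported are assumptions.

```python
import pandas as pd
from connection import connection  # custom local module (assumed layout)

def run_query(sql: str) -> pd.DataFrame:
    """Run a SQL statement and return the result as a pandas DataFrame."""
    conn = connection()
    try:
        return pd.read_sql_query(sql, conn)
    finally:
        conn.close()

def by_city() -> pd.DataFrame:
    # Active employees per city; assumes a NULL termination_date means active.
    return run_query("""
        SELECT city, COUNT(*) AS active_employees
        FROM employees
        WHERE termination_date IS NULL
        GROUP BY city
        ORDER BY active_employees DESC;
    """)

if __name__ == "__main__":
    print(by_city().head())
```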
Library | Description |
---|---|
matplotlib.pyplot | To create visualizations (histograms and bar charts). |
seaborn | For making nice statistical plots easily. |
queries | Custom local module containing the SQL query functions. |
This script:
- Imports data from predefined SQL queries (like by_city, by_age, etc.) using custom functions.
- Creates charts with Seaborn and Matplotlib to visualize employee data.
- Plots bar charts for active employees and salaries by city and department.
- Plots a line chart showing turnover rate by age, with value labels.
- Plots a line chart showing yearly hires and terminations, including count labels (a sketch follows the function list below).

Functions:
- plot_by_city()
- plot_by_department()
- plot_by_age()
- plot_salary_by_city()
- plot_salary_by_department()
- plot_hired_and_terminated()
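An illustrative version of one plotting helper such as plot_by_city(). The column names returned by the query are assumptions, and the real functions add value labels and styling beyond this sketch.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from queries import by_city  # returns a pandas DataFrame, as described above

def plot_by_city():
    df = by_city()
    plt.figure(figsize=(10, 5))
    ax = sns.barplot(data=df, x="city", y="active_employees")
    ax.set_title("Active employees by city")
    ax.set_xlabel("City")
    ax.set_ylabel("Active employees")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    plot_by_city()
```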
Library | Description |
---|---|
sys | Built-in module to modify the system path. |
connection | Custom local module to establish DB connection. |
queries | Custom local module containing the SQL query functions. |
psycopg2 | PostgreSQL database driver used to connect to the database. |
pandas | For working with CSVs and DataFrames. |
locale | Built-in module for localization/formatting. |
matplotlib.pyplot | To create visualizations (histograms and bar charts). |
numpy | For working with numerical data, especially arrays/matrices. |
sklearn.linear_model | Linear regression model used to predict future values. |
This script:
- Connects to a database and gets data about how many people were hired and fired each year.
- Learns the trend using machine learning (linear regression).
- Predicts how many people will be hired and fired in the next 3 years.
- Shows the results in a table.
- Draws a chart to compare real and predicted numbers (a simplified sketch follows).
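The forecasting idea boils down to fitting a line to the yearly counts and extrapolating three years ahead. In the sketch below the history is a hard-coded placeholder instead of the hired_and_terminated() query result, and the column names are assumptions; plotting is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Placeholder history; the real script pulls these counts from the database.
history = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022, 2023],
    "hired": [120, 135, 150, 160, 170],
    "terminated": [40, 55, 60, 70, 75],
})

X = history[["year"]].to_numpy()
future_years = np.arange(history["year"].max() + 1, history["year"].max() + 4).reshape(-1, 1)

predictions = {"year": future_years.ravel()}
for target in ("hired", "terminated"):
    model = LinearRegression().fit(X, history[target])          # learn the linear trend
    predictions[target] = model.predict(future_years).round().astype(int)

print(pd.DataFrame(predictions))   # forecast table for the next 3 years
```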
Planned improvements:
- Add DBT models for transformation and documentation.
- Streamline data generation for large-scale datasets.
- Add Airflow DAG for orchestration.
- Deploy insights via Looker Studio or Power BI dashboard.