This project classifies individuals' obesity levels from health and lifestyle attributes such as age, gender, height, weight, and BMI (Body Mass Index). Using machine learning algorithms, it predicts obesity levels to help assess the potential health risks associated with obesity.
- Dataset
- Data Preprocessing
- Exploratory Data Analysis (EDA)
- Machine Learning Algorithms
- Model Evaluation
- Usage
- Project Structure
- Dependencies
- Contributing
- License
The dataset used in this project comes from the Obesity Classification Dataset on Kaggle. It includes attributes related to personal characteristics and health status; a short loading sketch follows the attribute list.
- ID: Unique identifier for each individual
- Age: Age in years
- Gender: Male or Female
- Height: Height in centimeters
- Weight: Weight in kilograms
- BMI: Body Mass Index
- Label: Obesity classification (e.g., Underweight, Normal Weight, Overweight, Obese)
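A minimal loading sketch with Pandas, assuming the CSV lives at data/obesity_classification.csv (the actual file name under data/ may differ):

```python
import pandas as pd

# Load the Kaggle CSV; the path below is an assumption, not the guaranteed file name.
df = pd.read_csv("data/obesity_classification.csv")

# Quick sanity checks: shape, column types, and class balance of the target.
print(df.shape)
print(df.dtypes)
print(df["Label"].value_counts())
```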
Data preprocessing steps include the following (a code sketch follows the list):
- Handling Missing Values: Ensuring no missing entries in the dataset.
- Encoding Categorical Variables: Converting categorical features into numerical formats.
- Normalizing Numerical Features: Standardizing numerical values to improve model performance.
- Splitting the Dataset: Dividing data into training and testing sets for model validation.
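A minimal sketch of these steps with scikit-learn, continuing from the loading sketch above and assuming the column names listed in the dataset section; the notebook's exact choices (encoder, scaler, split ratio) may differ:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Drop any rows with missing values (the dataset is expected to contain none).
df = df.dropna()

# Encode the categorical Gender column and the Label target as integers.
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])
y = LabelEncoder().fit_transform(df["Label"])

# Split before scaling so the scaler is fit on training data only.
features = ["Age", "Gender", "Height", "Weight", "BMI"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], y, test_size=0.2, random_state=42, stratify=y
)

# Standardize numerical features to zero mean and unit variance.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```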
EDA was performed to understand feature distributions and their relationships with the target label (a plotting sketch follows the list):
- Visualizations: Histograms, box plots, and correlation matrices were used to explore the data.
- Summary Statistics: Mean, median, and distribution checks were conducted.
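A brief sketch of the kinds of plots and summaries used, with Matplotlib and Seaborn (the exact figures in the notebook and images/ may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histograms of the numerical features.
df[["Age", "Height", "Weight", "BMI"]].hist(bins=20, figsize=(10, 6))
plt.tight_layout()
plt.show()

# Box plot of BMI per obesity class to spot separation and outliers.
sns.boxplot(data=df, x="Label", y="BMI")
plt.show()

# Correlation matrix of the numerical features.
sns.heatmap(df[["Age", "Height", "Weight", "BMI"]].corr(), annot=True, cmap="coolwarm")
plt.show()

# Summary statistics: mean, median, quartiles, and spread.
print(df.describe())
```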
The following models were evaluated for obesity classification (a comparison sketch follows the list):
- Linear Support Vector Classifier (SVC): Efficient for linearly separable classes, providing quick results for binary or multiclass classification.
- K-Nearest Neighbors (KNN): A simple instance-based classifier, suitable for smaller datasets and capturing local data patterns.
- Random Forest Classifier: An ensemble approach that reduces overfitting, effectively handling complex relationships.
- HistGradientBoosting Classifier: A sequential boosting model that refines errors, often outperforming simpler classifiers in complex cases.
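A minimal comparison loop over these four models with default hyperparameters, using the train/test split from the preprocessing sketch above (the notebook's settings may differ):

```python
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier

models = {
    "Linear SVC": LinearSVC(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "HistGradientBoosting": HistGradientBoostingClassifier(random_state=42),
}

# Fit each model on the training split and report its accuracy on the test split.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```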
The Random Forest and HistGradientBoosting classifiers were further optimized with hyperparameter tuning to achieve the best results on this dataset.
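A sketch of how such tuning can be done with GridSearchCV for the Random Forest; the parameter grid is an illustrative assumption, not the search space used in the notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the notebook's actual search space may differ.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training split
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
best_model = search.best_estimator_
```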
Evaluation metrics include the following (a short sketch follows the list):
- Accuracy: The primary metric for overall correctness.
- Precision, Recall, and F1-Score: Used to understand model performance on each class.
- Confusion Matrix: Provides a detailed view of classification performance across obesity classes.
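A short sketch of computing these metrics with scikit-learn, assuming best_model (or any fitted classifier) and the held-out test split from the earlier sketches:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = best_model.predict(X_test)

# Overall correctness.
print("Accuracy:", accuracy_score(y_test, y_pred))

# Per-class precision, recall, and F1-score.
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
```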
To replicate the analysis and model training:
- Clone the repository: `git clone https://github.com/otuemre/obesity-classification.git`
- Navigate to the project directory: `cd obesity-classification`
- Install the required dependencies: `pip install -r requirements.txt`
- Run the Jupyter Notebook: `jupyter notebook notebooks/obesity-classification.ipynb`
- data/: Folder containing the dataset.
- notebooks/: Contains Jupyter Notebook(s) for data analysis, feature engineering, and model training.
- images/: Folder containing visualization graphs generated during data analysis.
- models/: Folder where final models are saved as .pkl and .joblib files (see the save/load sketch after this list).
- README.md: Project documentation.
- LICENSE.md: License information.
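A short sketch of persisting a trained model with either serialization library; the file names under models/ are assumptions, and best_model comes from the tuning sketch above:

```python
import pickle

import joblib

# Save the tuned model in both formats used by this project.
joblib.dump(best_model, "models/random_forest.joblib")
with open("models/random_forest.pkl", "wb") as f:
    pickle.dump(best_model, f)

# Reload either artifact later for inference.
loaded_model = joblib.load("models/random_forest.joblib")
```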
This project relies on the following Python libraries:
- NumPy: For numerical operations and array handling.
- Pandas: For data manipulation and analysis.
- Matplotlib and Seaborn: For creating visualizations and plots.
- Scikit-Learn: For implementing machine learning algorithms.
- Joblib: For saving and loading trained models as .joblib files.
- Pickle: For serializing trained models to .pkl files.
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
This project is licensed under the MIT License. See the LICENSE.md file for details.
Note: This project uses the Obesity Classification Dataset from Kaggle. Ensure compliance with the dataset's license and terms of use.