This research uses machine learning (ML) models to identify the most important biomarkers for diagnosing cancer. It works with an extremely high-dimensional dataset of cancer biomarker measurements.
The dataset contains a large number of samples and features, covering gene expression and protein concentration measurements for various cancer biomarkers. The data was obtained from a publicly available database.
This study employed several classes of ML models, including linear models, non-linear models, and ensembles, to identify the most important biomarkers for cancer diagnosis. The models were compared using standard evaluation metrics: accuracy, precision, recall, and F1 score (macro and micro).
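The difference between macro and micro F1 matters when classes are imbalanced, which is common in diagnostic data. The following is a minimal pure-Python sketch of both averages; the function name `f1_scores` and the toy labels are illustrative assumptions, not taken from the project notebooks.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so every sample counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

For single-label problems, micro F1 equals accuracy, while macro F1 drops when a minority class is predicted poorly.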
Feature selection techniques of both filter and wrapper types were applied, including a novel feature selection approach. The goal of feature selection was to identify the most relevant features, that is, those that contribute most to the accuracy of the models. The results of the different methods are discussed in the paper.
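As a rough illustration of the two families, the sketch below compares a filter method (univariate ANOVA F-test via `SelectKBest`) with a wrapper method (recursive feature elimination via `RFE`) using scikit-learn. The synthetic data, the choice of 10 features, and the logistic regression classifier are all assumptions made for the example; they are not the project's dataset, settings, or novel method.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a high-dimensional biomarker matrix (assumption).
X, y = make_classification(
    n_samples=120, n_features=200, n_informative=10,
    n_redundant=0, random_state=0,
)


def make_clf():
    return LogisticRegression(max_iter=1000)


# Filter: score each feature independently of any model, keep the top 10.
filter_pipe = make_pipeline(
    StandardScaler(), SelectKBest(f_classif, k=10), make_clf()
)
# Wrapper: repeatedly fit a model and drop the weakest 20% of features.
wrapper_pipe = make_pipeline(
    StandardScaler(),
    RFE(make_clf(), n_features_to_select=10, step=0.2),
    make_clf(),
)

# Selection sits inside the pipeline so it is refit on each training fold,
# which avoids leaking test-fold information into the selected features.
scores = {}
for name, pipe in {"filter": filter_pipe, "wrapper": wrapper_pipe}.items():
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="f1_macro").mean()
    print(f"{name}: macro F1 = {scores[name]:.3f}")

# Indices of the features kept by the filter when fit on all the data.
filter_pipe.fit(X, y)
filter_idx = filter_pipe.named_steps["selectkbest"].get_support(indices=True)
```

Filters are cheap because they never train the classifier during selection; wrappers cost more but account for how features behave together in the model.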
To run the project, follow these steps:
- Clone the repository:
git clone https://github.com/Adeyeha/Cancer-Biomarkers-ML.git
- Install Python 3.x
- Install the required dependencies:
- pandas:
pip install pandas
- scikit-learn:
pip install scikit-learn
- lazypredict:
pip install lazypredict
- seaborn:
pip install seaborn
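After installing, a quick check that the packages can be found may save a failed notebook run. This is a small sketch using only the standard library; the helper name `missing_packages` is an assumption for illustration, and `sklearn` is the import name of scikit-learn.

```python
import importlib.util

# Import names of the dependencies listed above.
REQUIRED = ["pandas", "sklearn", "lazypredict", "seaborn"]


def missing_packages(names):
    """Return the top-level packages that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies found.")
```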
This repository contains a series of Jupyter notebooks demonstrating the process of identifying critical biomarkers for cancer diagnosis using machine learning techniques. The notebooks cover various stages, including data preprocessing, feature selection, model training, and evaluation.
- Description: This notebook delves into the dataset, conducts data cleaning, and visualizes key insights using Matplotlib and Seaborn. It establishes baseline models on the processed dataset, which serve as benchmarks for subsequent experiments.
- Link: Notebook 1 - Baseline Models.
- Description: This notebook emphasizes the implementation of feature selection through filter methods and evaluates these methods in comparison to the established baseline.
- Link: Notebook 2 - Filter Methods.
- Description: This notebook focuses on the practical application of feature selection using wrapper methods. It assesses the performance of these methods relative to the baseline.
- Link: Notebook 3 - Wrapper Methods.
- Description: This notebook concentrates on feature selection through embedded methods and evaluates their effectiveness compared to the baseline.
- Link: Notebook 4 - Embedded Methods.
- Description: This notebook showcases the implementation of Sequential Feature Selection.
- Description: This notebook provides insights into Recursive Feature Elimination with Stability Selection.
- Description: These notebooks comprehensively compare all the aforementioned feature selection methods.
- Link: Notebook 8 - Final Output.
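To make two of the less familiar ideas above concrete, here is a hedged sketch of an embedded method (random forest importances with `SelectFromModel`) and of recursive feature elimination combined with a simple form of stability selection (repeating RFE on random subsamples and keeping features that are chosen often). The synthetic data, the 0.6 frequency threshold, the number of rounds, and the estimators are assumptions for illustration; the notebooks may implement these steps differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the biomarker matrix (assumption).
X, y = make_classification(
    n_samples=150, n_features=100, n_informative=8,
    n_redundant=0, random_state=0,
)
X = StandardScaler().fit_transform(X)

# Embedded: selection falls out of model training itself. Here the 10
# features with the highest random forest importance are kept.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
embedded = SelectFromModel(forest, max_features=10, threshold=-np.inf)
embedded_idx = embedded.fit(X, y).get_support(indices=True)

# RFE with stability selection: run RFE on many half-size subsamples and
# record how often each feature survives.
rng = np.random.default_rng(0)
n_rounds, k = 20, 10
counts = np.zeros(X.shape[1])
for _ in range(n_rounds):
    idx = rng.choice(len(y), size=len(y) // 2, replace=False)
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k, step=0.2)
    rfe.fit(X[idx], y[idx])
    counts += rfe.support_

selection_freq = counts / n_rounds
# Keep features selected in at least 60% of rounds (threshold is an assumption).
stable_idx = np.flatnonzero(selection_freq >= 0.6)

print("Embedded selection:", embedded_idx)
print("Stable RFE selection:", stable_idx)
```

Features that survive across most subsamples are less likely to be artifacts of one particular train split, which is the motivation for adding a stability step on top of RFE.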
This project is licensed under the MIT License.
If you want to contribute to this project, please create a pull request with a detailed description of your changes.
- Temitope Adeyeha
- Bikram Sahoo