This repository contains the code of our work published in the IEEE Journal of Biomedical and Health Informatics: "Synthetic Patient Data Generation and Evaluation in Disease Prediction Using Small and Imbalanced Datasets". The main objective of this work was to demonstrate the feasibility of employing synthetic data to train Machine Learning models, validated with real medical tabular data, for classification tasks, without harming the statistical properties of the original data. With this goal, an in-depth analysis of the relationship between the amount of synthetic data samples, classification performance, and statistical similarity metrics was performed.
There are 8 folders, one for each database. Inside each folder there are two files: `database_utils.py` and `database_main.py` (where `database` is the corresponding name of each database). The former contains constants and custom functions developed specifically for that database. The latter contains the most important part of this work: the script with the framework itself. One script has been developed for each database due to the heterogeneity and particularities of the databases. The arguments/parameters of every script are at the top of it. Notice that, with the default parameters (`bal_iterations = 100`, `aug_iterations = 10`) and the current grid of Machine Learning model parameters (see the `svm_params`, `rf_params`, `xgb_params`, and `knn_params` variables), execution can take from around 6 hours to nearly a day, depending on the database. Reducing the iterations and/or the grid parameters will reduce the execution time.
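The parameter grids mentioned above follow the usual scikit-learn `GridSearchCV` convention. A minimal sketch with illustrative values (the actual grids live at the top of each `database_main.py` script and may differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid -- not the repository's actual values.
svm_params = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}

# Toy data standing in for one of the medical tabular datasets.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Exhaustive search over the grid with 3-fold cross-validation.
grid = GridSearchCV(SVC(), svm_params, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Shrinking these grids (fewer candidate values) is the most direct way to cut the execution time noted above.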
The `results` folders contain the most relevant results, most of them published in our work. Further executions of this code will overwrite the original results unless the folder or file names are properly changed within the code. The `EDA` folders have not been included yet, even though they are generated when executing this code, since some errors arise when dealing with categorical variables. With the `PIMA` and `SACardio` databases the Exploratory Data Analysis (EDA) functions work, because these datasets do not contain categorical variables.
The obtained results demonstrate that, using the CTGAN and Gaussian Copula models available in the SDV library, classification performance can be maintained, and even improved in some cases. Further research must be done along this line, yet the results presented in our work are promising.
Please cite our paper if this framework helped you in your research and/or development work, or if you used this piece of code:
A. J. Rodriguez-Almeida et al., "Synthetic Patient Data Generation and Evaluation in Disease Prediction Using Small and Imbalanced Datasets," in IEEE Journal of Biomedical and Health Informatics, 2022, doi: 10.1109/JBHI.2022.3196697.
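For convenience, the citation above as a BibTeX entry (the entry key and the `and others` abbreviation are our own choices):

```bibtex
@article{rodriguez2022synthetic,
  author  = {Rodriguez-Almeida, A. J. and others},
  title   = {Synthetic Patient Data Generation and Evaluation in Disease
             Prediction Using Small and Imbalanced Datasets},
  journal = {IEEE Journal of Biomedical and Health Informatics},
  year    = {2022},
  doi     = {10.1109/JBHI.2022.3196697}
}
```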
8 different databases have been used to test this framework. Most of them (6) are publicly available. The rest (2) are available upon request to the authors. Aiming at the replicability of this experiment, links to the databases (or to the reference, when data is not freely available) are provided below:
- MNCD
- MNCD-Reduced (version with more patients and fewer features than MNCD)
- Bangladesh
- Early Diabetes Mellitus
- Heart Disease
- Kidney Chronic Disease
- Diabetes PIMA Database
- South Africa Cardio
- Install the conda package manager.
- Clone this repository.
- Enter the `synthetic_data_generation` directory.
- Create the environment with the proper Python version by running:

  ```shell
  conda create -n SDG python=3.8.13
  ```

- Activate the newly created environment by running:

  ```shell
  conda activate SDG
  ```

- Install the required packages. Installing from a `requirements.txt` or an `environment.yml` has been tested, but the `sdv` package shows conflicts due to cross-dependencies with other libraries, so the packages should be installed manually. Notice that the installed versions are not the most recent ones, but the ones that have been employed to develop and test this framework. In your terminal, run (one line at a time):

  ```shell
  conda install scikit-learn=1.0.2
  conda install pandas=1.1.3
  conda install numpy=1.21.5
  conda install -c conda-forge imbalanced-learn=0.7.0
  conda install matplotlib=3.5.2
  conda install -c pytorch -c conda-forge sdv=0.14.1
  conda install openpyxl=3.0.9
  ```
- Download the databases and set the `DATASET_PATH` variable in all `datasetname_main.py` files according to your own path. Check also that the `filename` variable contains the actual file name of the database. Finally, set the `DICT_PATH` variable in all `datasetname_utils.py` files to properly store the dictionaries that contain the results. Here, `datasetname` corresponds to the above-mentioned dataset names.
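The constants to adapt look roughly like the following sketch; the paths and file name below are hypothetical placeholders, and the real assignments live at the top of each `datasetname_main.py` / `datasetname_utils.py` file:

```python
# Hypothetical values -- adapt them to your own filesystem.
DATASET_PATH = "/home/user/data/PIMA"    # folder where the database was downloaded
filename = "diabetes.csv"                # actual file name of the database
DICT_PATH = "/home/user/results/PIMA"    # where the result dictionaries are stored
```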
To execute the whole experiment with the default settings, these are the lines you must type in your terminal. Changes will be introduced in the code so that parameters can be input from the terminal. From the `synthetic_data_generation` folder:
```shell
cd DATASET_FOLDER
python DATASET_NAME_main.py
```
where `DATASET_FOLDER` must be replaced by the folder corresponding to the dataset (e.g., `PIMA`) and `DATASET_NAME` must be replaced by one of the 8 used databases (e.g., `PIMA`). Notice that these lines must be executed once per database to obtain all the results.
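Since the step above has to be repeated for every database, it can also be scripted as a loop. A dry-run sketch that only prints the commands (the two folder names are illustrative; substitute the eight real dataset folders):

```shell
# Print the command to run for each database folder.
# Folder names are illustrative -- check the repository for the real ones.
for db in PIMA SACardio; do
  echo "cd $db && python ${db}_main.py"
done
```

Dropping the `echo` turns the dry run into the actual batch execution; keep in mind that each script can take hours, so the full loop may run for several days.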
Once the results have been generated, some figures can be visualized by loading the obtained results, without re-executing everything. If you do not have LaTeX installed on your PC, please do one of the following:
- Install it.
- Comment every line that contains `plt.style.use(['science','ieee'])` within the code.
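As an alternative to commenting out every call, the style line can be guarded so that plotting degrades gracefully when the `science`/`ieee` styles (provided by the `scienceplots` package) are not registered. A minimal sketch, not the repository's actual code:

```python
import matplotlib.pyplot as plt

# Fall back to matplotlib's default style when the 'science'/'ieee'
# styles are not available (matplotlib raises OSError for unknown styles).
try:
    plt.style.use(["science", "ieee"])
except OSError:
    plt.style.use("default")
```

Note that even with the styles registered, rendering with them still requires a working LaTeX toolchain, since they enable `text.usetex`.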
The choice of the dataset to be analyzed must be made inside the `gen_and_save.py` file for now, as indicated in the comments of that file. Specifically, the `STUDIED_DATABASE` variable must be properly set. Changes will be introduced to input the parameters from the terminal. Afterwards, execute this line from the main folder:
```shell
python gen_and_save_figs.py
```
As previously outlined, due to their particularities, each database has its own script. The execution of each of them generates an `EDA` and a `results` folder, containing the initial EDA and the results after data augmentation, respectively. Results are stored as figures, as `.pkl` files, and/or as `.txt` files containing the numerical values of the analysed metrics. Please refer to our paper for further information regarding the studied metrics and obtained results.
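The `.pkl` result dictionaries can be reloaded later for custom analysis with the standard `pickle` module. A self-contained round-trip sketch (the dictionary contents and file name here are illustrative, not the framework's actual output):

```python
import os
import pickle
import tempfile

# Illustrative result dictionary -- the real ones are produced by the scripts.
metrics = {"accuracy": 0.81, "f1": 0.77}

# Store the dictionary the same way the framework does (pickle serialization).
path = os.path.join(tempfile.mkdtemp(), "metrics_dict.pkl")
with open(path, "wb") as f:
    pickle.dump(metrics, f)

# Reload it later without re-running the experiment.
with open(path, "rb") as f:
    results = pickle.load(f)
print(results)  # {'accuracy': 0.81, 'f1': 0.77}
```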
For any other questions related to the code or the synthetic data framework itself, you can open an issue in this repository or contact me via email.