In this project, credit risk is treated as an unbalanced classification problem because good loans outnumber risky loans. To make the analysis more accurate, we apply different techniques to train and evaluate models with unbalanced classes. A dataset from LendingClub, a peer-to-peer lending services company, is used with the following techniques:
- Naive Random OverSampler
- SMOTE Oversampling
- Cluster Centroids Undersampling
- SMOTEENN Combination (Over and Under) Sampling
- Balanced Random Forest Classifier
- Easy Ensemble AdaBoost Classifier
- Data Source: LoanStats_2019Q1.csv
- Tools: Jupyter Notebook, MS Excel
- Language: Python
- Python Dependencies: pandas, pathlib, numpy, scikit-learn, imbalanced-learn
Testing a combined over- and under-sampling algorithm. Below is the result of resampling the data using the SMOTEENN algorithm.
Various machine learning models were evaluated to find the most effective one for predicting credit risk. In this analysis, the accuracy, precision, and sensitivity (recall) were reviewed for each model. The confusion matrix for each model is consistent with its accuracy, precision, and sensitivity results.
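
To show how accuracy, precision, and sensitivity follow from a confusion matrix, here is a small worked example. The label vectors are made up for illustration (1 stands for a high-risk loan) and are not taken from the LendingClub data.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = high-risk loan, 0 = low-risk loan.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted high-risk that is truly high-risk
sensitivity = tp / (tp + fn)                # share of actual high-risk loans that were caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} sensitivity={sensitivity:.2f}")
# prints: accuracy=0.80 precision=0.75 sensitivity=0.75

# The manual values match scikit-learn's own metric functions.
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(sensitivity - recall_score(y_true, y_pred)) < 1e-12
```

On unbalanced data, accuracy can stay high even when few high-risk loans are detected, which is why sensitivity is reviewed alongside it.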
Each model's results differ from the others. The precision score for all the models looks overly optimistic, so it should be interpreted together with the recall and accuracy scores. The recommended model for credit risk analysis is the Easy Ensemble AdaBoost Classifier.