This project aims to predict loan defaults by analyzing historical customer data from Home Credit. The goal is to identify key factors influencing loan repayment behavior and build a predictive model to classify eligible vs. ineligible borrowers. The project includes:
- Data Analysis: Comprehensive exploration, cleaning, and feature engineering across 7 datasets.
- Predictive Modeling: Development and comparison of multiple machine learning models (XGBoost, LightGBM, CatBoost, Random Forest).
- Model Deployment: Selection of the best-performing model for production use.
- Data Analysis Report: Assess data quality, handle missing values/outliers, and derive actionable insights.
- Predictive Model: Identify customer segments eligible for loans using engineered features.
- Model Comparison: Evaluate performance metrics (accuracy, recall, F1-score) to select the optimal model.
- Challenges Report: Document technical hurdles and solutions during data processing and modeling.
The data comprises 7 interconnected CSV files with loan application history, credit bureau records, and transactional behavior:
| Dataset | Description | Rows | Key Features |
|---|---|---|---|
| `application_train` | Main dataset with loan status (Target: 1 = Default, 0 = Non-Default) | 307,511 | `DAYS_BIRTH`, `AMT_CREDIT`, `EXT_SOURCE_1/2/3` |
| `bureau` | Clients' previous credits from other institutions | 1,716,428 | `DAYS_CREDIT`, `AMT_CREDIT_SUM` |
| `bureau_balance` | Monthly balances of previous credits | 27,299,925 | `MONTHS_BALANCE`, `STATUS` |
| `POS_CASH_balance` | Monthly POS/cash loan snapshots | 10,001,358 | `SK_DPD`, `CNT_INSTALMENT` |
| `credit_card_balance` | Monthly credit card balance history | 3,840,312 | `AMT_BALANCE`, `CREDIT_UTILIZATION_RATIO` |
| `previous_application` | Clients' prior loan applications | 1,670,214 | `AMT_ANNUITY`, `NAME_CONTRACT_TYPE` |
| `installments_payments` | Repayment history for previous credits | 13,605,401 | `AMT_INSTALMENT`, `DAYS_ENTRY_PAYMENT` |
Download Link: PRCP-1006-HomeLoanDef.zip
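For orientation, a simplified loading-and-aggregation sketch is shown below. It assumes the standard Kaggle file names, the `SK_ID_CURR` join key, and a utilization definition of balance over credit limit; the notebook's actual feature pipeline may differ.

```python
# Simplified loading/aggregation sketch (assumed file names, join key, and
# feature definitions; the notebook's actual pipeline may differ).
import pandas as pd

RAW = "data/1.1 raw"

app = pd.read_csv(f"{RAW}/application_train.csv")      # one row per applicant
bureau = pd.read_csv(f"{RAW}/bureau.csv")              # many credit records per applicant
cc = pd.read_csv(f"{RAW}/credit_card_balance.csv")     # monthly card snapshots

# Collapse bureau records to one row per applicant before merging.
bureau_agg = (bureau.groupby("SK_ID_CURR")
                    .agg(BUREAU_CREDIT_COUNT=("SK_ID_BUREAU", "count"),
                         BUREAU_CREDIT_SUM=("AMT_CREDIT_SUM", "sum"))
                    .reset_index())

# Example engineered feature: per-snapshot utilization, averaged per applicant
# (assumed definition: AMT_BALANCE / AMT_CREDIT_LIMIT_ACTUAL).
cc["CREDIT_UTILIZATION_RATIO"] = cc["AMT_BALANCE"] / cc["AMT_CREDIT_LIMIT_ACTUAL"]
cc_agg = (cc.groupby("SK_ID_CURR")["CREDIT_UTILIZATION_RATIO"]
            .mean().reset_index())

final = (app.merge(bureau_agg, on="SK_ID_CURR", how="left")
            .merge(cc_agg, on="SK_ID_CURR", how="left"))
final.to_csv("data/1.2 processed/final_data.csv", index=False)
```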
```
├── data/                              # Raw and processed data
│   ├── 1.1 raw/                       # Original CSV files
│   └── 1.2 processed/                 # Cleaned and aggregated data (final_data.csv)
├── docs/                              # Problem statement and dataset documentation
├── notebooks/                         # Jupyter notebooks for analysis and modeling
│   └── PRCP-1006-HomeLoanDef.ipynb    # Main workflow
├── reports/                           # Final reports (Markdown/PDF)
├── results/                           # Analysis outputs, figures, and saved models
│   ├── 365csv pre-anlysis/            # Statistical summaries and visualizations
│   ├── figures/                       # Charts and graphs
│   └── models/                        # Serialized models (e.g., final_xgb_model.pkl)
├── scripts/                           # Utility functions and helper scripts
└── requirements.txt                   # Python dependencies
```
- Clone the Repository:

  ```bash
  git clone https://github.com/dhaneshbb/HomeLoanDef.git
  cd HomeLoanDef
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download Data:
  Extract `PRCP-1006-HomeLoanDef.zip` into `data/1.1 raw/`.
- Run the Jupyter Notebook:
  - Execute `notebooks/PRCP-1006-HomeLoanDef.ipynb` for end-to-end analysis and modeling.
  - Key steps:
    - Data preprocessing (missing value imputation, outlier handling).
    - Feature engineering (e.g., `CREDIT_UTILIZATION_RATIO`, `WEIGHTED_EXT_SOURCE`).
    - Model training/evaluation and threshold optimization.
- Generate Reports:
  - The notebook exports:
    - `reports/Final Report.md`: Detailed analysis and model comparison.
    - `results/figures/`: Visualizations (e.g., ROC curves, feature importance plots).
- Use the Trained Model:
  Load the saved XGBoost model for predictions (see the scoring sketch below):

  ```python
  import joblib

  model = joblib.load('results/models/final_xgb_model.pkl')
  ```
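The snippet below is a minimal scoring sketch, not the notebook's exact inference code: it assumes the processed table `data/1.2 processed/final_data.csv` contains a `TARGET` column plus the engineered features the model was trained on, and applies the 0.60 decision threshold reported in the results.

```python
# Minimal scoring sketch (assumptions: final_data.csv holds the engineered,
# numeric features used in training plus a TARGET column; adapt the path and
# columns to the actual notebook outputs).
import joblib
import pandas as pd

model = joblib.load("results/models/final_xgb_model.pkl")

features = pd.read_csv("data/1.2 processed/final_data.csv").drop(columns=["TARGET"])

# Probability of default (class 1), then the report's 0.60 decision threshold.
default_proba = model.predict_proba(features)[:, 1]
predicted_default = (default_proba >= 0.60).astype(int)
```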
| Model | Accuracy | Recall | ROC AUC | Training Time (s) |
|---|---|---|---|---|
| XGBoost | 83% | 53% | 0.785 | 98.10 |
| LightGBM | 74% | 68% | 0.787 | 88.67 |
| CatBoost | 79% | 59% | 0.776 | 147.94 |
Optimal Threshold: 0.60 (balances risk detection vs. false positives).
- Top Features: External credit scores (`EXT_SOURCE`), demographic factors, and credit utilization ratios.
- Business Impact: At a threshold of 0.60, the model detects 53% of defaulters while minimizing false positives.
- Full details: Final Report.
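The 0.60 cut-off balances risk detection against false positives. As an illustration only (this is not the notebook's tuning code), a threshold sweep on a held-out validation split `X_val`, `y_val` with the fitted `model` might look like:

```python
# Illustrative threshold sweep; X_val, y_val, and `model` are assumed to be a
# held-out validation split and the already-fitted classifier.
import numpy as np
from sklearn.metrics import precision_score, recall_score

proba = model.predict_proba(X_val)[:, 1]

for threshold in np.arange(0.30, 0.75, 0.05):
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold:.2f}  "
          f"recall={recall_score(y_val, preds):.3f}  "
          f"precision={precision_score(y_val, preds, zero_division=0):.3f}")
```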
| Challenge | Solution |
|---|---|
| High Memory Usage | Downcasted numerical dtypes |
| Class Imbalance (8% defaults) | Adjusted `scale_pos_weight` in XGBoost |
| Data Leakage | Removed post-application features (e.g., `DAYS_LAST_PHONE_CHANGE`) |
| Multicollinearity | VIF analysis + dropped correlated pairs (e.g., `AMT_CREDIT` vs `AMT_GOODS_PRICE`) |
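For illustration, the first two mitigations could look like the sketch below. This is an assumed, simplified version of the approach; the notebook's exact downcasting helper and class-weight value may differ.

```python
# Illustrative sketch of two mitigations from the table above; assumes
# final_data.csv contains only numeric engineered features plus TARGET.
import pandas as pd
from xgboost import XGBClassifier

df = pd.read_csv("data/1.2 processed/final_data.csv")

# High memory usage: downcast 64-bit numeric columns to smaller dtypes.
for col in df.select_dtypes(include="float64").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")
for col in df.select_dtypes(include="int64").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")

# Class imbalance (~8% defaults): weight positives by the negative/positive ratio.
y = df["TARGET"]
X = df.drop(columns=["TARGET"])
ratio = (y == 0).sum() / (y == 1).sum()   # roughly 11.5 at an 8% default rate

clf = XGBClassifier(scale_pos_weight=ratio, eval_metric="auc")
clf.fit(X, y)
```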
- Dataset Source: Home Credit Default Risk
- Custom Library: insightfulpy
- Technical Guides: Memory optimization, feature engineering, and model tuning best practices.
This project is licensed under the MIT License. See LICENSE for details.
This README provides a structured overview of the project, linking to key files and highlighting critical insights. It balances technical detail with accessibility for collaborators.