This repository contains a project that analyzes and predicts Customer Churn using the Credit Card Customers dataset from Kaggle.
Data analysis, modeling and inference are implemented in the project to end up with an interpretable model that is also able to predict customer churn. However, the focus of the project is neither the business case nor the analysis and modeling techniques; instead, the goal is to provide a boilerplate that shows how to transform a research notebook into development/production environment code using as few additional tools as possible. Concretely, the research notebook `churn_notebook.ipynb` is transformed into a Python package contained in `customer_churn`.
Note that not all aspects necessary in a ML workflow are covered:
- The Exploratory Data Analysis (EDA) and Feature Engineering (FE) done during the data processing are very simple.
- The generated artifacts are not tracked or versioned (e.g., model-pipelines, processed data, etc.)
- No thorough deployment nor API are created, although Docker containerization is introduced.
- etc.
In other words, you can re-use the code of this repository as a template, but for small projects only. If you have something bigger on your hands and are interested in some of the aforementioned topics, you can visit my other boilerplate projects listed in the section Interesting Links. However, let me reiterate the leitmotiv of this repository: production-level data science code with as few tools as possible.
The starter code comes from a Udacity Machine Learning DevOps Engineer Nanodegree project, although the initial status was substantially changed.
Overview of contents:
In the project, credit card customers that are most likely to churn are identified.
A first version of the data analysis and modeling is provided in the notebook `churn_notebook.ipynb`. As mentioned, the main goal of the project is to transform that notebook content into a production-ready package, applying clean code principles and tools:
- Readable, simple, concise code
- PEP8 conventions applied, checked with `pylint` and `autopep8`
- Modular and efficient code, with Object-Oriented patterns
- Documentation provided at different stages: code, `README` files, etc.
- Error/exception handling
- Execution and data testing with `pytest`
- Logging implemented during production execution and testing
- Installable Python package
- Basic containerization with Docker
The Credit Card Customers dataset from Kaggle has the following properties:
- Original dataset size: 22 columns, 10127 data points.
- It consists of categorical and numerical features of clients which churn (cease to be clients) or not; that churn status is the binary classification target used for training, encoded as the feature `Attrition_Flag` (see the loading sketch below).
- Number of features after data processing: 19.
- Modeling used on the data: logistic regression, support vector machines and random forests with grid search.

More details on the dataset can be found in `data/README.md`.
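For orientation, here is a minimal sketch of loading the raw data and deriving a binary target with pandas; the `Attrition_Flag` label value is assumed from the original Kaggle dataset, and the actual encoding in `churn_library.py` may differ:

```python
import pandas as pd

# Load the raw Kaggle dataset (22 columns, 10127 rows)
df = pd.read_csv("data/bank_data.csv")

# Attrition_Flag is the binary classification target; "Attrited Customer"
# marks churned clients (label value assumed from the Kaggle dataset)
df["Churn"] = (df["Attrition_Flag"] == "Attrited Customer").astype(int)

print(df.shape)
print(df["Churn"].value_counts())
```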
The repository contains the following files and folders:
```
.
├── Dockerfile                   # Docker image definition
├── Instructions.md              # Original instructions from Udacity (irrelevant here)
├── Makefile                     # Simple Makefile for common chores: clean, etc.
├── README.md                    # This file
├── churn_notebook.ipynb         # Research notebook
├── config.yaml                  # Configuration file for production code
├── customer_churn               # Production library folder, package
│   ├── __init__.py
│   ├── churn_library.py         # Production library
│   └── transformations.py       # Utilities for library
├── data
│   ├── README.md                # Explanation of dataset columns
│   └── bank_data.csv            # Dataset
├── main.py                      # Executable of production code
├── pics
│   ├── pipeline_diagram.png     # Pipeline diagram
│   └── sequencediagram_org.jpeg # Original/old sequence diagram from Udacity (irrelevant here)
├── requirements.txt             # Dependencies
├── setup.py
└── tests                        # Pytest testing scripts
    ├── __init__.py
    ├── conftest.py              # Pytest fixtures
    └── test_churn_library.py    # Tests of the module churn_library.py
```
All the research work of the project is contained in the notebook `churn_notebook.ipynb`; in particular, simplified implementations of the typical data processing and modeling tasks are performed:
- Dataset loading
- Exploratory Data Analysis (EDA) and Data Cleaning
- Feature Engineering (FE)
- Training: Logistic Regression, Support Vector Machine and Random Forest models are fit and persisted as pickles (see the sketch after this list).
- Generation of classification report plots.
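As an illustration of the training step, the following is a hedged sketch of fitting one model with a grid search and persisting it as a pickle; the parameter grid, paths and the synthetic data stand-in are assumptions, not the notebook's exact code:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the processed churn features (19 columns after processing)
X, y = make_classification(n_samples=1000, n_features=19, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Random forest with a small, illustrative parameter grid
param_grid = {"n_estimators": [200, 500], "max_depth": [5, 10]}
cv_rfc = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=param_grid, cv=5)
cv_rfc.fit(X_train, y_train)
print(cv_rfc.best_params_, cv_rfc.score(X_test, y_test))

# Persist the best estimator as a pickle for later inference
joblib.dump(cv_rfc.best_estimator_, "rfc_model.pkl")
```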
The code from `churn_notebook.ipynb` has been transformed to create the package `customer_churn`, which contains two files:

- `churn_library.py`: this file contains most of the refactored and modified code from the notebook.
- `transformations.py`: definition of auxiliary transformations used in the data processing.

Additionally, a `tests` folder is provided, which contains `test_churn_library.py`. This script performs unit tests on the different functions of `churn_library.py` using `pytest`.
The executable or `main` function is provided in `main.py`; this script imports the package `customer_churn` and runs three functions from `churn_library.py` (a minimal usage sketch follows this list):

- `run_setup()`: the configuration file `config.yaml` is loaded and auxiliary folders are created, if not there yet:
  - `images`: it will contain the images of the EDA and the model evaluation.
  - `models`: it will contain the inference models/pipelines as serialized pickles.
  - `artifacts`: it will contain the data processing parameters created during the training and required for the inference, serialized as pickles.
- `run_training()`: it performs the EDA, the data processing and the data modeling, and it generates the inference artifacts (the model/pipeline).
- `run_inference()`: it shows how the inference artifacts need to be used to perform a prediction; an exemplary dataset sample created during the training is used.
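A minimal sketch of how `main.py` wires these calls together; the call signatures shown (no arguments) are an assumption for illustration, and the real script additionally handles the `--produce_images` argument described below:

```python
# main.py (simplified sketch, not the actual script)
from customer_churn import churn_library as cl

if __name__ == "__main__":
    cl.run_setup()      # load config.yaml, create images/, models/, artifacts/
    cl.run_training()   # EDA + data processing + modeling; inference artifacts persisted
    cl.run_inference()  # exemplary prediction using the persisted artifacts
```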
The following diagram shows the workflow (see `pics/pipeline_diagram.png`):
A central property of the package is that `run_training()` and `run_inference()` share the function `perform_data_processing()`. When `perform_data_processing()` is executed in `run_training()`, it generates the processing parameters and stores them to disk. In contrast, when it is executed in `run_inference()`, it loads those stored parameters to perform the data processing for the inference.
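The pattern can be summarized with the following sketch; the signature, the parameter dictionary and the pickle path are illustrative assumptions, not the actual interface of `churn_library.py`:

```python
import pickle

import pandas as pd

def perform_data_processing(df: pd.DataFrame, training: bool,
                            params_path: str = "artifacts/processing_params.pkl") -> pd.DataFrame:
    """Shared by run_training() (training=True) and run_inference() (training=False)."""
    if training:
        # Training: derive the processing parameters from the data ...
        params = {"mean_values": df.mean(numeric_only=True)}
        # ... and persist them for later inference runs
        with open(params_path, "wb") as f:
            pickle.dump(params, f)
    else:
        # Inference: load the parameters created during training
        with open(params_path, "rb") as f:
            params = pickle.load(f)

    # Apply the same transformations in both modes
    return df.fillna(params["mean_values"])
```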
Note that the implemented `run_inference()` is exemplary and needs to be adapted:

- Currently, it is triggered manually and it scores a sample dataset from a `CSV` file offline; instead, we should wait for external requests that feed new data to be scored.
- The data processing parameters and the model should be loaded once in the beginning (hence, the dashed box in the diagram) and used every time new data is scored.

Those loose ends are to be tied up when deploying the model, since their final form depends on the deployment technologies adopted. Even though containerization with Docker is briefly shown in the section Docker Image, deployment is not really in the scope of this repository.
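As a rough illustration of that adaptation, the artifacts could be loaded once at start-up and reused for every incoming batch; the function names below are taken from this README, but their exact signatures are assumptions:

```python
import pandas as pd

from customer_churn import churn_library as cl

# Load the inference artifacts once at start-up (the dashed box in the diagram),
# not on every call; exact signatures in churn_library.py may differ
processing_params = cl.load_processing_parameters()
model_pipeline = cl.load_model_pipeline()

def score(new_data: pd.DataFrame) -> pd.Series:
    """Called for every batch of new data, e.g., fed by an external request."""
    # Process with the parameters loaded above, then predict
    X = cl.perform_data_processing(new_data, processing_params)
    return pd.Series(model_pipeline.predict(X), index=new_data.index)
```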
First, create an environment (e.g., with conda) and install the dependencies:
```bash
cd /path/to/repository/folder
# Create the environment
conda create -n my-env python=3.8
conda activate my-env
conda install pip
# Check that the pip we use points to our environment
which pip
# Install dependencies
pip install -r requirements.txt
```
Then, we execute the main file:
```bash
python main.py
```
... and all the magic happens, as explained in the previous section.
Note that the EDA and the model evaluation are optional and can be controlled with the argument `--produce_images` in `main.py`; by default, they are active. If we'd like to switch them off, we need to run:
```bash
python main.py --produce_images 0
```
Switching off EDA and evaluation is important for the Docker container: saving matplotlib figures in a container requires a more sophisticated setup than the one explained in the section Docker Image.
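A hedged sketch of how such a flag can be parsed in `main.py`; the actual argument handling in the repository may look different:

```python
import argparse

parser = argparse.ArgumentParser(description="Customer churn training/inference")
# 1 (default): produce EDA and evaluation images; 0: skip them (e.g., inside Docker)
parser.add_argument("--produce_images", type=int, default=1)
args = parser.parse_args()

produce_images = bool(args.produce_images)
```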
Optionally, you can install the package on your environment:
```bash
cd /path/to/repository/folder
conda activate my-env
# The first installation
pip install .
# If we change things in the package
pip install --upgrade .
```
Once the package is installed, we can work from any folder, as long as we have the dataset and the configuration file:
```bash
cd /path/to/new/folder
cp -r /path/to/repository/folder/data .
cp /path/to/repository/folder/main.py .
cp /path/to/repository/folder/config.yaml .
python main.py
```
To test the functions from `churn_library.py` using `pytest`:
```bash
cd /path/to/repository/folder
python tests/test_churn_library.py
```
This command not only tests the package, but it also fully runs the training and inference pipelines, generating models and other artifacts.
Note that, since the `logging` module is used, it is not enough to simply run `pytest` in the terminal; we need to run the command above to test the package.
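This is typically because the test script configures logging and launches `pytest` itself when executed directly; the following is a structural sketch, i.e., an assumption about how `tests/test_churn_library.py` is organized, with an illustrative log path and fixture name:

```python
# tests/test_churn_library.py (structural sketch, not the actual file)
import logging

import pytest

logging.basicConfig(
    filename="churn_library_tests.log",  # illustrative log file
    level=logging.INFO,
    format="%(name)s - %(levelname)s - %(message)s")

def test_import_data(config):
    """Example test; `config` would be a fixture defined in conftest.py."""
    logging.info("Testing import_data...")
    ...

if __name__ == "__main__":
    # Running the file directly applies the logging configuration above
    # and then triggers pytest on this module
    pytest.main(["-s", __file__])
```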
If you would like to check the PEP8 conformity of the files:
```bash
pylint customer_churn/churn_library.py # should yield 8.07/10
pylint tests/test_churn_library.py # should yield 8.49/10
```
Tip: If you'd like to automatically edit and improve the score of a file that already has a good score, you can try `autopep8`:
```bash
autopep8 --in-place --aggressive --aggressive customer_churn/churn_library.py
```
Because the C-Universe already had most of the required tools.
```bash
# If you would like to delete the models, artifacts, folders, etc.
# created when the training pipelines are run
make clean
```
The Dockerfile defines a simple image based on Python 3.8, which can be built and run as a container:
```bash
# Build image
docker image build -t customer_churn_app .
# Run with shell access and remove container when finished
docker container run -it --rm --name customer_churn_app customer_churn_app
```
As explained in Running the Package, the EDA and the model evaluation are switched off for the Docker image, because saving matplotlib images requires a more sophisticated setup than the one provided.
- No tracking of datasets / models / experiments / artifacts is done. Doing that is fundamental if we want to scale, and it implies using additional tools, such as Weights and Biases. I have another boilerplate where that is done: music_genre_classification.
- This boilerplate works for small datasets/projects, which are not that uncommon in small/medium enterprises; in my experience, its structure is easy to understand, implement and adapt. However, as the complexity increases, it is recommended to apply these changes to the architecture:
  - All data processing steps should be written in an Object-Oriented style and packed into a Scikit-Learn `Pipeline` (or similar), as done with `transformations.py` (see the sketch after this list).
  - Any data processing that must be applied to new data should be integrated in the inference pipeline generated in `train_models()`; that means that we should integrate most of the content in `perform_data_processing()` as a `Pipeline` in `train_models()`. Thus, `perform_data_processing()` would be reduced to basic tasks related to cleaning (e.g., duplicate removal) and checking.
- No serving via an API is discussed.
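A hedged sketch of what such a refactoring could look like, with a custom transformer in the style of `transformations.py`; the class and step names are illustrative, not the repository's actual code:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class MeanImputer(BaseEstimator, TransformerMixin):
    """Illustrative transformer: learns column means on fit, fills NaNs on transform."""

    def fit(self, X, y=None):
        self.means_ = X.mean(numeric_only=True)
        return self

    def transform(self, X):
        return X.fillna(self.means_)

# Processing and model packed into a single inference pipeline:
# new data then only needs pipeline.predict(new_df)
pipeline = Pipeline([
    ("impute", MeanImputer()),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])
```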
Please, visit the section Interesting Links, where I will list further resources related to those points.
- Add dependencies and libraries to `README.md`.
- Re-factor data processing functions to work as transformer classes.
- Create two separate simplified pipelines: training and inference.
- Re-factor the data processing function to be able to run it during training or inference.
- Add logging to pipeline executions, too, not only to the testing module.
- Create a `conftest.py` file for testing.
- Move function parameters to a YAML configuration file.
- Create a Python package.
- Update `README.md` with new contents: new files, new execution commands, etc.
- Draw a new sequence diagram.
- Create a Docker image.
- Explain the columns from the dataset in `data/README.md`.
- Add a function `load_processing_parameters()` for `run_inference()` before `perform_data_processing()`; that way, we pass the parameters to the processing function, the same way we pass the model-pipeline to `predict()`.
- Address the fact that `load_processing_parameters()` and `load_model_pipeline()` should be run only once at the beginning.
- Further parametrize all hard-coded variables from `conftest.py` and include them in `config.yaml`; a `config` object should be passed to the functions in `churn_library.py` to avoid hard-coding.
- Especially for classification problems, the training/inference pipeline should be able to deal with imbalanced datasets (e.g., performing trainings with over-/undersampling variations in addition to the parameter search); that aspect is not covered yet (see the sketch after this list).
- Feature selection could be added as a step; however, that probably means re-defining the spot where the `PolynomialFeatures` are applied now.
- Further re-factor functions; e.g., some EDA plot generations contain repeated computations that can be parametrized.
- Work towards a `pylint` score of 10/10. However, note that some variable names were chosen to be non-PEP8-conform due to their popular use in the field (e.g., `X_train`).
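For the class-imbalance point above, a minimal option that requires no extra dependencies is to let the existing grid search explore class weighting; this is a hedged suggestion, not something the repository currently implements:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Extend the parameter grid with class weighting to counteract the imbalance
# between existing and attrited customers
param_grid = {
    "n_estimators": [200, 500],
    "class_weight": [None, "balanced"],
}
cv_rfc = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=param_grid, cv=5, scoring="f1")
# cv_rfc.fit(X_train, y_train)  # X_train / y_train from the training pipeline
```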
- Guide on EDA, Data Cleaning and Feature Engineering: eda_fe_summary.
- A boilerplate for reproducible (tracked) ML pipelines with MLflow and Weights & Biases which uses a music genre classification dataset from Spotify: music_genre_classification.
- If you are interested in the automated deployment of production-ready ML pipelines packaged in APIs, check my example repository census_model_deployment_fastapi.
- My notes on the Udacity Machine Learning DevOps Engineering Nanodegree, which features more MLOps content: mlops_udacity.
Mikel Sagardia, 2022.
No guarantees.
The original content of this project was taken from the Udacity Machine Learning DevOps Engineer Nanodegree, but it has been modified and extended to the present state.