This is a midterm project for ML Zoomcamp, based on data from the Kaggle competition "How Bitter is the Beer".
The International Bittering Units (IBU) scale is used to quantify the bitterness of a beer. In this project, I built a machine learning model and a service that predict the IBU of a beer given its other attributes, including color, alcohol content, and more.
An exploratory data analysis of the beer data has been done and published there.
Four different types of models were trained and evaluated for this task:
- Ridge (linear least squares with L2 regularization)
- RandomForestRegressor (ensemble of decision trees fitted on various sub-samples)
- ExtraTreesRegressor (ensemble of randomized decision trees, a.k.a. extra-trees, fitted on various sub-samples)
- CatBoostRegressor (algorithm for gradient boosting on decision trees)
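As an illustration of the comparison, the three scikit-learn models can be trained and scored side by side. This is a minimal sketch on synthetic features, not the actual training pipeline, and CatBoost is omitted here:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the beer features (color, ABV, ...) and IBU target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "extra_trees": ExtraTreesRegressor(n_estimators=50, random_state=42),
}

# Fit each model and compute validation RMSE.
rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse[name] = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
```

On this linear toy data the Ridge model wins easily; on the real beer data the comparison is done in the notebooks listed below.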
To get the best predictions, I used three ways of encoding categorical features:
- OneHotEncoding for Ridge, RandomForestRegressor and ExtraTreesRegressor (OneHotEncoding.ipynb)
- TargetEncoding for Ridge, RandomForestRegressor and ExtraTreesRegressor (TargetEncoding.ipynb)
- Ordered TargetEncoding for CatBoostRegressor (Catboost.ipynb)
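To illustrate the difference between the first two encodings, here is a toy example in pandas. The column names are made up for the sketch, not the competition's actual schema:

```python
import pandas as pd

# Toy data: one categorical feature and the target (hypothetical columns).
df = pd.DataFrame({
    "style": ["IPA", "Stout", "IPA", "Lager"],
    "ibu": [60.0, 35.0, 70.0, 15.0],
})

# One-hot encoding: one binary indicator column per category.
onehot = pd.get_dummies(df["style"], prefix="style")

# Target encoding: replace each category with the mean target value
# observed for that category. This is the naive, unordered variant;
# CatBoost's ordered variant only uses preceding rows for each estimate,
# which reduces target leakage.
target_means = df.groupby("style")["ibu"].mean()
df["style_te"] = df["style"].map(target_means)
```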
As a tool for dependency management and packaging in Python, I chose Poetry.
All project dependencies can be found in pyproject.toml and poetry.lock.
Python code is linted and formatted with flake8 and black.
All scripts use the click package for their command-line interfaces.
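As a sketch of how such a click entry point can be wired, here is a stripped-down version of a `train` command. The option names mirror the usage shown later in this README, but the body is a placeholder, not the real training code:

```python
import click


@click.command()
@click.option("-d", "--dataset-path", required=True,
              help="Path to the CSV with training data.")
@click.option("-m", "--model-path", required=True,
              help="Where to save the trained model.")
@click.option("-p", "--params-path", default=None,
              help="Path to the model parameters file.")
def train(dataset_path, model_path, params_path):
    """Train a model on the beer dataset (placeholder body)."""
    click.echo(f"Training on {dataset_path}, saving model to {model_path}")
```

Registering `train` under `[tool.poetry.scripts]` in pyproject.toml is what makes `poetry run train ...` work.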
├── README.md <- The top-level README for developers using this project.
│
├── config
│   ├── model_params.pkl <- Parameters for CatBoostRegressor.
│   └── test_inst.json <- A sample beer item for testing the service.
│
├── Docker
│   └── Dockerfile <- Dockerfile for building the service image.
│
├── data
│   ├── beer_submission.csv <- Final predictions for submission to Kaggle.
│   ├── beer_test.csv <- Test data from the Kaggle competition.
│   └── beer_train.csv <- Train data from the Kaggle competition.
│
├── models
│   └── model.cbn <- The trained and serialized model.
│
├── src
│   └── beerbitterregressor
│       ├── app
│       │   ├── app.py <- Flask application.
│       │   └── test_app.py <- Script for testing the service.
│       ├── notebooks
│       │   ├── Catboost.ipynb <- Evaluating the CatBoost model.
│       │   ├── EDA.ipynb <- Exploratory data analysis.
│       │   ├── OneHotEncoding.ipynb <- Evaluating models using one-hot encoding.
│       │   └── TargetEncoding.ipynb <- Evaluating models using target encoding.
│       │
│       ├── predict.py <- Predict the bitterness of a beer.
│       ├── preprocessing.py <- Functions for preprocessing raw data.
│       └── train.py <- Train a CatBoost model on the beer dataset.
├── .gitignore <- List of files git should ignore.
├── poetry.lock <- Locked versions of all project dependencies.
└── pyproject.toml <- Project metadata and dependency specifiers.
This package allows you to train a model for predicting the bitterness of beer and to make predictions with the fitted model. You can also run the prediction service in a Docker container.
Run this and the following commands in a terminal, from the root of the cloned repository.
- Clone this repository to your machine.
- Make sure Python 3.9 and Poetry are installed on your machine (this project uses Poetry 1.2.2).
- Install the project dependencies:
poetry install --without dev
- Install Docker
- Run train with the following command:
poetry run train -d <path to csv with data> -m <path to save trained model> -p <path to model params>
- Run predict with the following command:
poetry run predict -d <path to csv with data> -s <path to save result of prediction> -m <path of model>
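Behind the service described next sits the Flask application in src/beerbitterregressor/app/app.py. A minimal sketch of such an endpoint is shown below; the route name and payload handling here are assumptions, and the real app runs the serialized CatBoost model instead of the placeholder:

```python
from flask import Flask, jsonify, request

app = Flask("beer_bitter_service")


@app.route("/predict", methods=["POST"])
def predict():
    # Read the beer attributes posted as JSON.
    beer = request.get_json()
    # Placeholder: the real service would load the trained CatBoost model
    # and predict IBU from the posted features instead of returning 0.0.
    ibu = 0.0
    return jsonify({"ibu": ibu, "features_received": sorted(beer)})
```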
The model has been deployed as a Flask web service. One way to create a WSGI server for it is to use gunicorn. The project is packed into a Docker container, so you can run it on any machine. To run the service, install Docker, then build the image:
docker build -t beer_bitter_service -f Docker/Dockerfile .
Now let's run the container:
docker run -d -p 9696:9696 beer_bitter_service:latest
As a result, you can test the service by using the script src/beerbitterregressor/app/test_app.py:
poetry run test_app -d <path to .json with data>
You can find a link to an example of running this project there.