This project is from Kaggle. You can upload your results to the Kaggle competition and have fun!
- Creative feature engineering
- Advanced regression techniques like random forest and gradient boosting
- You can download the training data, "train.csv", from this repo
- "test.csv" is the Kaggle test set: use its information to predict sale prices and upload your predictions to Kaggle. My own results are in "test_results.csv" for your reference.
- load the dataset into a dataframe to get a general idea of it
- explore the data types
- plot a graph to check the target value distribution: it is skewed, so the target should be log-transformed
- cleaning data: change the column names and data fields to lower case, replace spaces with "_"
- translate some column values to their actual names, not just a representative number
- check the numerical values for extreme values like 9999999999999
- make it easier to check all the values without scrolling
- remove columns that are not useful for the project
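The name and value normalization described above could look roughly like this with pandas. This is a minimal sketch, not the notebook's actual code: `clean_frame` is a hypothetical helper, and it assumes that every text column should be lower-cased.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case column names and text values, replacing spaces with "_"."""
    df = df.copy()  # leave the caller's dataframe untouched
    df.columns = df.columns.str.lower().str.replace(" ", "_", regex=False)
    for col in df.columns:
        # Only touch text columns (assumed to hold strings); numbers stay as-is.
        if pd.api.types.is_string_dtype(df[col].dtype):
            df[col] = df[col].str.lower().str.replace(" ", "_", regex=False)
    return df
```

Calling `clean_frame(df)` on the raw dataframe returns a cleaned copy, so the original stays available for comparison.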
- split the data into train, validation and test sets (60%, 20%, 20%)
- get the final target values and convert them to log values
- reset the index
- remove the target value column
- fill NaN with 0
- convert the dataframe records to dictionaries
- apply one-hot encoding to the categorical values
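The preparation steps above can be sketched with scikit-learn as follows. Several things here are assumptions rather than the notebook's code: the helper names `to_records` and `prepare`, the target column name `saleprice`, the random seed, and the use of `np.log1p` (the original may have used `np.log`).

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split


def to_records(df: pd.DataFrame) -> list:
    """Turn rows into dicts, replacing NaN with 0 (same intent as fillna(0))."""
    return [
        {key: (0 if pd.isna(value) else value) for key, value in row.items()}
        for row in df.to_dict(orient="records")
    ]


def prepare(df: pd.DataFrame, target: str = "saleprice", seed: int = 1):
    # 20% test first, then 25% of the remaining 80% as validation: 60/20/20.
    df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
    df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=seed)

    dv = DictVectorizer(sparse=False)  # one-hot encodes the string values
    parts = {}
    for name, part in [("train", df_train), ("val", df_val), ("test", df_test)]:
        part = part.reset_index(drop=True)
        y = np.log1p(part[target].values)  # log-transform the target
        records = to_records(part.drop(columns=[target]))
        # Fit the encoder on the training part only, then reuse it.
        X = dv.fit_transform(records) if name == "train" else dv.transform(records)
        parts[name] = (X, y)
    return dv, parts
```

Fitting the `DictVectorizer` on the training part only keeps the validation and test sets from leaking category information into the feature matrix.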
- We don't know which model fits best. We will train different models and compare their rmse. The model with the lowest rmse wins.
- use a decision tree model first
- compare the predictions on the training data with `y_train`: it shows `rmse=0`
- compare the predictions `y_pred` on the validation data with the validation target: it shows `rmse=0.2256`
- The model is overfitting the training data.
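The overfitting check above can be reproduced on synthetic data. This is only an illustration under stated assumptions: the data below is random stand-in data (not the Kaggle dataset), and `rmse` is a small helper written for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic stand-in data: a noisy linear target.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] + rng.normal(0, 1, size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# A tree with no depth limit can memorize every training row.
dt = DecisionTreeRegressor(random_state=1)
dt.fit(X_train, y_train)

train_rmse = rmse(y_train, dt.predict(X_train))
val_rmse = rmse(y_val, dt.predict(X_val))
print(f"train rmse={train_rmse:.4f}, val rmse={val_rmse:.4f}")
```

A training rmse of zero next to a clearly larger validation rmse is the signature of a model that has memorized the training rows.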
- random forest: `from sklearn.ensemble import RandomForestRegressor`
- tuning `n_estimators`: pick a range from 10 to 200 to train the model, turn the results into a dataframe and plot them (`n_estimators = 160` is the best), but we don't fix it yet
- tuning `max_depth` (how deep each tree can grow): range [20, 30, 40, 50, 60, 70]
- tuning `min_samples_leaf` (the minimum number of samples in each leaf): range [1, 5, 10, 15, 20]
- use `n_estimators=160, max_depth=20, min_samples_leaf=1` to train the model
- the rmse improves compared to the decision tree model
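The `n_estimators` search above might be written like this. It is a sketch under assumptions: `tune_n_estimators` is a hypothetical helper, and the random seed is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def tune_n_estimators(X_train, y_train, X_val, y_val, values, seed=1):
    """Train one forest per candidate value and record its validation rmse."""
    scores = []
    for n in values:
        rf = RandomForestRegressor(n_estimators=n, random_state=seed)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        score = float(np.sqrt(np.mean((np.asarray(y_val) - y_pred) ** 2)))
        scores.append((n, score))
    return pd.DataFrame(scores, columns=["n_estimators", "rmse"])
```

Calling it with `range(10, 201, 10)` (the step of 10 is an assumption) and then `result.plot(x="n_estimators", y="rmse")` gives the kind of curve used to pick a value. The same loop pattern applies to `max_depth` and `min_samples_leaf`.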
```python
import xgboost as xgb

# build the DMatrix inputs used to train the model
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)
```
- `xgb_output(output)` function to capture the output (iteration number, train_rmse, val_rmse)
```python
from IPython.utils.capture import capture_output
import sys
```
- save the results using a loop
- tuning `eta`: `eta` is the learning rate of the model. XGBoost fits the model with a gradient-descent-like procedure: each boosting round adds a tree that moves the predictions toward a lower loss. The learning rate scales the size of each of these steps, so tuning it tells the model how fast to move toward the minimum. A larger value learns faster but can overshoot; a smaller value needs more rounds. `eta=0.3` is the best (faster and more accurate)
- tuning `max_depth`: how deep can each tree grow? `max_depth=6` is the best
- tuning `min_child_weight`: how small can a leaf be? `min_child_weight=10` is the best
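The captured training log mentioned above can be parsed with a few lines of plain Python. This is a sketch, not the author's `xgb_output`: `parse_xgb_log` is a hypothetical name, and it assumes the evaluation sets were named `train` and `val`, so that XGBoost prints lines such as `[0]	train-rmse:...	val-rmse:...`.

```python
import re

# One evaluation line looks like: "[5]\ttrain-rmse:1.38983\tval-rmse:1.40007"
LINE_RE = re.compile(
    r"\[(\d+)\]\s+train-rmse:([\d.eE+-]+)\s+val-rmse:([\d.eE+-]+)"
)


def parse_xgb_log(text: str) -> list:
    """Return (iteration, train_rmse, val_rmse) tuples from captured output."""
    rows = []
    for line in text.strip().splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip anything that is not an evaluation line
        rows.append(
            (int(match.group(1)), float(match.group(2)), float(match.group(3)))
        )
    return rows
```

The resulting list can be passed to `pandas.DataFrame` with three column names, which makes it easy to plot one curve per candidate parameter value.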
- use the three trained models above to compare the rmse
- xgboost is the winner 💥
- repeat the previous steps to clean the data, get the feature matrix, train the model and test it with the test dataset
- check whether you are happy with the result
- save the model
- the model is saved in `model.bin`
- In the terminal, start your server with Python. Depending on your setup, you might just need `python`
- open another terminal to run your test file. Again, you might just need `python`. If it shows `{'price': xxxx}`, the model and server are working. A sample of house information is in the test file.
- `Ctrl + c` to end the server
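Saving a model to a file and loading it again for a prediction can be sketched as below. Everything here is an assumption made for illustration: the use of `pickle`, the tiny stand-in training data, the decision tree standing in for the XGBoost model, and the helper name `predict_price`. Only the `{'price': ...}` response shape comes from the steps above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeRegressor

# Tiny stand-in training data (hypothetical feature names and prices).
houses = [
    {"lot_area": 8450, "house_style": "one_story"},
    {"lot_area": 9600, "house_style": "two_story"},
    {"lot_area": 11250, "house_style": "two_story"},
]
prices = np.array([208500.0, 181500.0, 223500.0])

dv = DictVectorizer(sparse=False)
model = DecisionTreeRegressor(random_state=1)  # stand-in for the real model
model.fit(dv.fit_transform(houses), np.log1p(prices))

# Save the encoder and the model together so they stay in sync.
model_path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(model_path, "wb") as f_out:
    pickle.dump((dv, model), f_out)


def predict_price(house: dict, path: str = model_path) -> dict:
    """Load the saved pair and return the price on the original scale."""
    with open(path, "rb") as f_in:
        dv_loaded, model_loaded = pickle.load(f_in)
    y_log = model_loaded.predict(dv_loaded.transform([house]))[0]
    return {"price": float(np.expm1(y_log))}  # undo the log1p transform


print(predict_price(houses[0]))
```

A web service would wrap a function like `predict_price` in a request handler and return the dictionary as JSON.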
- show how to generate the final results in csv format
- method: read the data into a dataframe and apply the same cleaning process
- get the ids and the saleprice array
- write the final results to csv
- upload your csv and check your score and ranking 😊 My first 2 tries scored the same: 0.146; it makes no difference whether you round the numbers or not.
- it definitely needs more tuning if you want to rank higher
- maybe start by using fewer features as a first step

You should be proud of yourself for completing this competition.
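The CSV export steps above could look like the following sketch. The helper name `write_submission` and the column names `Id` and `SalePrice` are assumptions (check the competition's sample submission), and `np.expm1` assumes the model was trained on a `np.log1p` target.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_submission(ids, y_pred_log, path):
    """Convert log-scale predictions back to prices and write them to csv."""
    submission = pd.DataFrame({
        "Id": ids,
        # Undo the log transform applied to the target before training.
        "SalePrice": np.expm1(np.asarray(y_pred_log, dtype=float)),
    })
    submission.to_csv(path, index=False)  # no index column in the file
    return submission
```

If the target was transformed with `np.log` instead, `np.exp` is the matching inverse.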
- to build a virtual environment, first install pipenv: `pip install pipenv`
- install the packages: `pipenv install numpy scikit-learn==1.3.1 flask gunicorn xgboost requests`
- now we have `Pipfile` and `Pipfile.lock`
- next time, when we run on a different machine, run `pipenv install` to install all required packages
- run `pipenv shell` to enter the virtual environment
- isolate the environment from the host machine
- You can find the docker image here
- I have chosen `python:3.10` to match my Python version; choose the one that matches yours
- run `docker run -it --rm --entrypoint=bash python:3.10` to download the docker image and open a shell in it (`-it`: access to the terminal; `--rm`: remove the container when it exits; `--entrypoint=bash`: communicate with the terminal using `bash` in the image)
- create a `Dockerfile`
```dockerfile
# use the official Python 3.10 image as the base
FROM python:3.10
# install pipenv
RUN pip install pipenv
# create and go to the directory
# copy the Pipfile and Pipfile.lock to the current directory
COPY ["Pipfile", "Pipfile.lock", "./"]
# install the packages system-wide from the lock file
RUN pipenv install --system --deploy
# copy the file and model to the current directory
COPY ["", "model.bin", "./"]
# open port
# execute the service, bind the port host to 9696
ENTRYPOINT [ "gunicorn", "--bind=", "predict:app" ]
```
- create an AWS account
- install the EB CLI as a dev dependency: `pipenv install awsebcli --dev`
- go to the virtual environment: `pipenv shell`
- initialize eb: `eb init -p "Docker running on 64bit Amazon Linux 2" predict-house-price`
- `ls -a` to check whether there is an `.elasticbeanstalk` folder
- `ls .elasticbeanstalk/` to check the doc inside the folder: `config.yml`
- run locally to test: `eb local run --port 9696`
- in another terminal, run the test file to test
- implement in the cloud: create a cloud environment -> `eb create predict-house-price-env`
- copy the service link into the test file to update our url
- run the test file to test

I have terminated this service to avoid generating extra fees.
- install streamlit: `pipenv install streamlit`
- import the other packages in your app file
- write a title
- create a sidebar header
- build a function that includes the most important features in your model
- create sidebar sliders in the function
- display the parameters as a table
- call the function and show the result
- run `streamlit run` with your app file to start the server
- register an account in streamlit
- push your code to github
- click the "Deploy" button at the top-right corner when viewing your app
- confirm your github folder and the name of your main app file
- 🎊Yay!!! Congrats! Your app is live🎊: House Price Prediction