About the Data

The dataset comprises categorical and numerical features used to predict whether employees will leave the company in the future. These attributes provide valuable insights for HR decision-making and employee retention strategies.
Dataset Details:
- Education: Education level of the employee.
- JoiningYear: Year in which the employee joined the company.
- City: City where the employee's office is located.
- PaymentTier: Payment tier categorization (1: Highest, 2: Mid Level, 3: Lowest).
- Age: Current age of the employee.
- Gender: Gender of the employee.
- EverBenched: Whether the employee has ever been kept out of projects for 1 month or more.
- ExperienceInCurrentDomain: The number of years of experience employees have in their current field.
- Label:
- LeaveOrNot: Binary label indicating whether the employee leaves the company in the future (1: Leaves, 0: Stays).
Dataset Source: The dataset can be found on Kaggle.
The primary objective is to develop a predictive model that accurately forecasts employee turnover in the future. The model's insights will empower HR teams to proactively address retention challenges and implement strategic initiatives to enhance employee satisfaction, ultimately reducing attrition.
Analysis of Categorical Variables (distribution plots for each feature):
- Education Distribution
- City Distribution
- Gender Distribution
- Ever Benched Distribution
- Joining Year Distribution
- PaymentTier Distribution
Analysis of Numerical Variables:

Mean value within each group (Leave vs. Stay):

Age:
- Leave: 29.052500
- Stay: 29.571896

Experience in Current Domain:
- Leave: 2.840000
- Stay: 2.940059

Leave rate (mean of the LeaveOrNot label) within each category:

Education:
- Bachelors: 0.313524
- Masters: 0.487973
- PHD: 0.251397

City:
- Bangalore: 0.267056
- New Delhi: 0.316335
- Pune: 0.503943

Gender:
- Female: 0.471467
- Male: 0.257739

Ever Benched:
- No: 0.331257
- Yes: 0.453975

LeaveOrNot:
- 0: 0.0
- 1: 1.0

Joining Year:
- 2012: 0.216270
- 2013: 0.334828
- 2014: 0.247496
- 2015: 0.407170
- 2016: 0.222857
- 2017: 0.268051
- 2018: 0.986376

PaymentTier:
- 1: 0.366255
- 2: 0.599129
- 3: 0.275200
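The per-category values above are the mean of the 0/1 label within each group, i.e. the leave rate for that category. With pandas this is a one-line groupby; the tiny frame below is a hypothetical stand-in with the dataset's column names:

```python
import pandas as pd

# Tiny hypothetical frame; the real dataset uses the same column names.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "LeaveOrNot": [1, 0, 1, 1, 0],
})

# Mean of the binary label per category = leave rate for that category.
leave_rate = df.groupby("Gender")["LeaveOrNot"].mean()
```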
Data Preparation:
- Extract only those rows in the column `LeaveOrNot` where the value is either `Stay` or `Leave`.
- Split the data into train, validation, and test sets in a two-step process, resulting in a distribution of 60% train, 20% validation, and 20% test. Use a random seed of `11`.
- Prepare the target variable `LeaveOrNot` by converting it from categorical to binary, where 0 represents `Stay` and 1 represents `Leave`.
- Delete the target variable from the train, validation, and test dataframes.
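The two-step split and target encoding above can be sketched as follows; the tiny frame here is a hypothetical stand-in for the Kaggle dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical minimal frame standing in for the real dataset.
df = pd.DataFrame({
    "Age": range(30, 60),
    "LeaveOrNot": ["Stay", "Leave"] * 15,
})

# Step 1: 60% train, 40% temp; step 2: split temp in half -> 20% val, 20% test.
df_train, df_temp = train_test_split(df, test_size=0.4, random_state=11)
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=11)

# Convert the categorical target to binary: 0 = Stay, 1 = Leave.
y_train = (df_train["LeaveOrNot"] == "Leave").astype(int).values
y_val = (df_val["LeaveOrNot"] == "Leave").astype(int).values
y_test = (df_test["LeaveOrNot"] == "Leave").astype(int).values

# Drop the target from the feature frames.
for frame in (df_train, df_val, df_test):
    del frame["LeaveOrNot"]
```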
We aim to identify the most suitable model by training several models and evaluating their performance with `roc_auc_score`. The model with the highest score will be considered the most effective.
Decision Tree:
- Use `DecisionTreeClassifier()`.
- Comparing `y_pred` (predicted from `X_train`) against `y_train` gives `roc_auc_score = 0.9855092934065315`.
- Comparing `y_pred` (predicted from `X_val`) against `y_val` gives `roc_auc_score = 0.7861772633103365`.
- The model seems to be overfitting the training data.
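The overfitting check can be sketched as below, using a synthetic matrix in place of the real features; the exact scores differ from the real ones, but the train/validation gap shows the same pattern:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the employee feature matrix.
X, y = make_classification(n_samples=1000, n_features=8, random_state=11)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=11)

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Score with predicted probabilities of the positive class.
auc_train = roc_auc_score(y_train, dt.predict_proba(X_train)[:, 1])
auc_val = roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1])

# An unconstrained tree fits the training set almost perfectly,
# while the validation AUC lags behind: a sign of overfitting.
```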
Tuning:
- `max_depth` (how deep the tree grows): try a range of depths and keep those with relatively high roc_auc_score values; the result can vary.
- `min_samples_leaf` (minimum number of samples required at a leaf): using the promising `max_depth` range from the last step, loop through a group of `min_samples_leaf` values at each depth.
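The two-parameter sweep described above can be sketched as a nested loop; the grids below are illustrative, not the ones used in the write-up:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real train/validation matrices.
X, y = make_classification(n_samples=600, n_features=8, random_state=11)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=11)

scores = []
# Illustrative grids: narrow max_depth first, then sweep min_samples_leaf.
for depth, leaf in product([4, 6, 10, None], [1, 5, 15, 100]):
    dt = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=leaf, random_state=11)
    dt.fit(X_train, y_train)
    auc = roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1])
    scores.append((depth, leaf, auc))

# Keep the combination with the best validation AUC.
best = max(scores, key=lambda t: t[2])
```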
Random Forest:
- `from sklearn.ensemble import RandomForestClassifier`
- Pick a range of `n_estimators` from 10 to 200 and train the model at each value.
- Turn the scores into a dataframe and plot them (`n_estimators = 180` seems to be the best), but we don't fix it yet.
Tuning:
- `max_depth`: range [5, 10, 15].
- `min_samples_leaf` (minimum number of samples required at a leaf): range [1, 3, 5, 10, 50].
- Use `n_estimators=180, max_depth=10, min_samples_leaf=3` to train the model.
- The roc_auc_score result has improved compared to the decision tree model.
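Training with the selected parameters can be sketched as follows; the data here is synthetic, so the score will not match the one from the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrices.
X, y = make_classification(n_samples=600, n_features=8, random_state=11)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=11)

# The parameter combination selected in the tuning step above.
rf = RandomForestClassifier(n_estimators=180, max_depth=10,
                            min_samples_leaf=3, random_state=11)
rf.fit(X_train, y_train)
auc_val = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])
```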
XGBoost:
- `import xgboost as xgb`
- Train the model:

```python
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)
```

- Use an `xgb_output(output)` function to capture the training output.
- Plot the graph.
Tuning:
XGBoost has many tunable parameters, but the three most important ones are:
- `eta` (default=0.3): also called `learning_rate`; prevents overfitting by shrinking the feature weights in each boosting step. Range: [0, 1].
- `max_depth` (default=6): maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. Range: [0, inf].
- `min_child_weight` (default=1): minimum number of samples in a leaf node. Range: [0, inf].
Final Model:
- XGBoost was selected as the final model, as it yielded the best results.
- Use the full train dataset to train the model again, then test.
- Repeat the previous steps to obtain the feature matrix, train the model, and evaluate the final model's performance on the test dataset.
- Review the results to ensure the model's performance meets the desired criteria.
Save the Model:
- Utilize the `pickle` library to save the trained XGBoost model.
- Create a script named `train.py`.
- Save the trained model in a file named `model.bin`.
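The save/load round trip can be sketched as below; the dict stands in for the trained booster, since any picklable Python object round-trips the same way:

```python
import pickle

# Hypothetical stand-in for the trained XGBoost model object.
model = {"name": "xgboost-model", "num_boost_round": 50}

# Save the model (as train.py would do).
with open("model.bin", "wb") as f_out:
    pickle.dump(model, f_out)

# Load it back (as predict.py would do).
with open("model.bin", "rb") as f_in:
    loaded = pickle.load(f_in)
```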
Serve the Model:
- Create a file named `predict.py`.
- Open a terminal and run `python predict.py` to start the server.
- Open another terminal and run `python predict-test.py`. If it displays `{'turnover': True or False, 'turnover_probability': xxxxx}`, the model and server are functioning. The `predict-test.py` file contains sample employee information.
- Use `Ctrl + C` to stop the server.
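A minimal sketch of what `predict.py` might look like; the `score` function here is a hypothetical stand-in for loading `model.bin` and scoring with the real XGBoost model:

```python
from flask import Flask, jsonify, request

app = Flask("turnover")

def score(employee: dict) -> float:
    # Hypothetical scoring rule; predict.py would instead unpickle
    # model.bin and compute the real turnover probability.
    return 0.8 if employee.get("everbenched") == "Yes" else 0.2

@app.route("/predict", methods=["POST"])
def predict():
    employee = request.get_json()
    prob = score(employee)
    # Same response shape as the one predict-test.py prints.
    return jsonify({"turnover": bool(prob >= 0.5), "turnover_probability": prob})
```

`predict-test.py` would then POST a sample employee as JSON to `http://localhost:9696/predict` with the `requests` library and print the response.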
Virtual Environment:
- To build a virtual environment, run `pip install pipenv`.
- Install the required packages with `pipenv install numpy scikit-learn==1.3.1 flask waitress xgboost requests`.
- Use `pipenv shell` to enter the virtual environment.
Containerize with Docker:
To isolate the environment from the host machine, follow these steps:
- Choose a Python version for your Docker image. You can find Docker images on Docker Hub.
- Create a file named `Dockerfile` with the following content:

```dockerfile
# Install Python
FROM python:3.10.13

# Install pipenv
RUN pip install pipenv

# Create and go to the directory
WORKDIR /app

# Copy dependency files to the current directory
COPY ["Pipfile", "Pipfile.lock", "./"]

# Install packages and deploy them
RUN pipenv install --system --deploy

# Copy the service files to the current directory
COPY ["predict.py", "model.bin", "./"]

# Open port
EXPOSE 9696

# Execute the service, bind the host port to 9696
ENTRYPOINT ["waitress-serve", "--listen=0.0.0.0:9696", "predict:app"]
```
- Build the Docker image using the following command:
  `docker build -t your_image_name .`
  Replace `your_image_name` with a name of your choice.
- Run the Docker container:
  `docker run -it --rm -p 9696:9696 your_image_name`
  - `-it`: gives access to an interactive terminal.
  - `--rm`: removes the container after stopping it.
  - `-p`: binds the host port to the container port.
- Push your image to Docker Hub by executing the following commands in your terminal:

  ```shell
  docker login
  docker tag your_image_name YOUR_DOCKERHUB_NAME/image_name
  docker push YOUR_DOCKERHUB_NAME/image_name
  ```

  Replace `your_image_name` with the name of your Docker image, and `YOUR_DOCKERHUB_NAME/image_name` with your Docker Hub username and the name you want for your image.
- Deploy on Render:
  - Open Render in your web browser.
  - Create a new web service by selecting "Deploy an existing image from a registry."
  - Enter the image URL in the format "YOUR_DOCKERHUB_NAME/image_name."
  - Complete the configuration and initiate the web service.
- Configure the Predict Cloud File:
  - Set the host in `predict-cloud.py` to the URL provided by Render for your deployed web service.
- Run the Cloud Service:
  - Open a new terminal and run the following command: `python predict-cloud.py`
- View the Final Result:
  - Check the result in your Render dashboard or visit the provided URL. The final result should be displayed.
Now, deploy your web service, which hosts the model, to the cloud and access it through the provided URL.
Note:
- You can deploy the application on various cloud providers, such as Google Cloud, Heroku, or AWS.
- I have tested the deployment on both AWS Elastic Beanstalk and Render.com.