
Decision Tree

  • Supervised learning model

  • Used for both classification and regression problems (mainly for classification)

  • Gives a graphical representation of all possible solutions to a problem based on given conditions

  • The tree is split into subtrees based on the answer to each condition (Yes/No)

Figure: Example of a binary decision tree

Terminology

  • Root Node: the node from which the tree starts
  • Leaf Node: a terminal node that holds the final output and is not split further
  • Parent/Child Node: a node that is split becomes the parent of the resulting sub-nodes, which are its children
  • Splitting: dividing a node into sub-nodes based on a given condition
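
To make these terms concrete, a purely hypothetical two-level tree can be written out by hand as nested Yes/No questions (the feature names and thresholds here are invented for illustration):

def toy_tree(age, chol):
    # Root node: first Yes/No question (threshold is invented)
    if age > 54:
        # Child node: second question (threshold is invented)
        if chol > 245:
            return 'More Chance'    # leaf node
        return 'Less Chance'        # leaf node
    return 'Less Chance'            # leaf node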

We can understand this concept better with the help of a heart-attack dataset. It can be found on Kaggle: https://www.kaggle.com/datasets/nareshbhat/health-care-data-set-on-heart-attack-possibility

Importing dependencies

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree

The dataset looks like this:
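
A minimal loading-and-preview sketch (the file name heart.csv is an assumption; adjust it to wherever the Kaggle download is saved):

df = pd.read_csv('heart.csv')   # file name is an assumption
df.head()                       # preview the first five rows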

It consists of 303 observations and 14 features, with no null values.

target = 1 means the patient is at risk of a heart attack, and target = 0 means they are not.


Let's find out the number of observations in each category by plotting a countplot with seaborn:

ax = sns.countplot(data = df, x = 'target', palette = 'hls')   # one bar per target class
ax.bar_label(ax.containers[0])                                 # annotate each bar with its count


Range of data and Outliers

Next, we plot a boxenplot, which shows the range of each feature and reveals any outliers.

plt.figure(figsize = (12,6))
sns.boxenplot(data = df.drop(columns = 'target'))
plt.xticks(rotation = 30)
plt.show()

This plot shows that the ranges of the features are quite uneven, and that there is an outlier in the cholesterol feature. However, in the case of medical records, removing an outlier observation is not a good idea.


Handling the outlier

Decision trees are not sensitive to noisy data or outliers, so we don't need to remove the outlier.


Scaling the data

Decision trees also do not require feature scaling, as they are not sensitive to the variance in the data.


Splitting data into training and testing sets

X = df.drop(columns = 'target')   # features
y = df['target']                  # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, stratify = y, random_state = 63)
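
Because stratify = y is passed, both splits keep roughly the same proportion of each target class. A quick sanity check (not part of the original walkthrough):

# Class proportions should be nearly identical in both splits
print(y_train.value_counts(normalize = True))
print(y_test.value_counts(normalize = True))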

Model Fitting and Training

model = DecisionTreeClassifier(criterion = 'entropy')   # split quality measured by information gain
model.fit(X_train, y_train)
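
criterion='entropy' means every split is chosen to maximize information gain, i.e. the reduction in label entropy. A minimal sketch of that impurity measure (illustrative only; scikit-learn computes it internally):

# Entropy of a label array: H = -sum(p_i * log2(p_i))
def entropy(labels):
    probs = np.bincount(labels) / len(labels)
    probs = probs[probs > 0]              # drop empty classes to avoid log2(0)
    return -np.sum(probs * np.log2(probs))

print(entropy(y_train.to_numpy()))        # impurity of the training labels before any split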

Checking the training and testing scores:

print(f'Training accuracy score : {model.score(X_train, y_train) * 100:.1f}')
print(f'Testing accuracy score : {model.score(X_test, y_test) * 100:.1f}')

Training accuracy score : 100.0
Testing accuracy score : 83.6

The perfect training score shows that the fully grown tree has memorized the training data, i.e. it is overfitting, which is why the testing score is noticeably lower.
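
A common way to reduce this overfitting is to limit how deep the tree can grow. A sketch (max_depth = 4 is an arbitrary illustrative choice, not a tuned value):

# A shallower tree generalizes better at the cost of some training accuracy
pruned = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state = 63)
pruned.fit(X_train, y_train)
print(pruned.score(X_train, y_train), pruned.score(X_test, y_test))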

Plotting the Tree

plt.figure(figsize = (25, 18))
plot_tree(model, filled = True, rounded = True, class_names = ['Less Chance', 'More Chance'], feature_names = X.columns)
plt.show()

Classification Report

print(classification_report(y_test, model.predict(X_test)))

Confusion matrix

cm = confusion_matrix(y_test, model.predict(X_test))   # rows = actual class, columns = predicted class
sns.heatmap(cm, annot = True, cmap = 'Blues')          # annotated count per cell
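
For a binary problem, the four cells can also be unpacked directly into their named counts:

# Unpack the 2x2 matrix into its four counts
tn, fp, fn, tp = cm.ravel()
print(f'TN = {tn}, FP = {fp}, FN = {fn}, TP = {tp}')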


Summary

  • A decision tree does not require normalization or scaling of the data.
  • A decision tree model is very intuitive and easy to explain.
  • A small change in the data can cause a large change in the structure of the tree, making it unstable.
  • Training a decision tree is relatively expensive, as its complexity and training time are higher than for simpler models.