- Supervised learning model
- Used for both classification and regression
- Predominantly used for binary classification
SVM maps the training data points in space so as to maximize the gap (margin) between the categories. The boundary that best separates the data points into classes is called a hyperplane. New points are classified based on which side of the hyperplane they fall. The points closest to the hyperplane are called support vectors, and the separation gap between the classes, measured at these closest points, is called the margin.
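To make these terms concrete, here is a minimal sketch on a toy 2D dataset (not the heart data) that fits a linear SVC and inspects its support vectors and margin width:
import numpy as np
from sklearn.svm import SVC
# Two small, linearly separable clusters (illustrative values only)
X_toy = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
clf = SVC(kernel = 'linear')   # a linear kernel keeps the hyperplane easy to inspect
clf.fit(X_toy, y_toy)
print(clf.support_vectors_)             # the points closest to the hyperplane
print(2 / np.linalg.norm(clf.coef_))    # margin width = 2 / ||w|| for a linear SVM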
We can understand this concept better with the help of a heart-related dataset, which can be found on Kaggle: https://www.kaggle.com/datasets/nareshbhat/health-care-data-set-on-heart-attack-possibility
But first we need to import the dependencies:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
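With the dependencies imported, the CSV downloaded from the Kaggle link above can be loaded into a DataFrame (the filename heart.csv is an assumption; use whatever name your downloaded file has):
# Load the heart-attack dataset; the filename 'heart.csv' is an assumption
df = pd.read_csv('heart.csv')
df.head()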
The dataset looks like this:
It consists of 303 observations and 14 features, and there are no null values.
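This can be verified quickly:
df.shape            # (303, 14)
df.isnull().sum()   # all zeros, i.e. no missing values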
target = 1 represents that the patient is at risk of a heart attack, and target = 0 represents that they are not.
Let's find out the number of observations in each category by plotting a seaborn countplot:
ax = sns.countplot(data = df, x = 'target', palette = 'hls')
ax.bar_label(ax.containers[0])
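The same class counts can also be read off directly, without a plot:
# Number of observations per class
df['target'].value_counts()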
Next, we plot a boxenplot, which shows the range of each feature and whether there are any outliers.
plt.figure(figsize = (12,6))
sns.boxenplot(data = df.drop(columns = 'target'))
plt.xticks(rotation = 30)
plt.show()
This plot shows that the ranges of the features in this dataset are quite uneven, and there is an outlier in the cholesterol (chol) feature. Removing this outlier can improve the accuracy of the model.
We can inspect and remove the observation containing this outlier with the following code:
df.loc[df['chol'] == df['chol'].max()]   # inspect the row with the maximum cholesterol value
df.drop(85, axis = 0, inplace = True)    # drop that observation (index 85 in this dataset)
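Instead of hard-coding the row index, the same row can be dropped by looking it up programmatically (an alternative to the drop above, not an additional step):
# Drop the observation with the maximum cholesterol value, without hard-coding its index
df.drop(df['chol'].idxmax(), axis = 0, inplace = True)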
SVM tries to maximize the distance between the separating hyperplane and the support vectors. If one feature (i.e. one dimension in this space) has very large values, it will dominate the other features when calculating the distance. If we rescale all features to a comparable range, they all have the same influence on the distance metric, which can improve the accuracy of the algorithm dramatically.
X = df.drop(columns = 'target')   # features
y = df['target']                  # labels
X_std = StandardScaler().fit_transform(X)   # rescale each feature to zero mean and unit variance
X_std = pd.DataFrame(X_std, columns = list(X.columns))
Now our boxenplot looks like this:
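The plot of the standardized features can be reproduced with the same boxenplot call as before, applied to X_std:
plt.figure(figsize = (12,6))
sns.boxenplot(data = X_std)
plt.xticks(rotation = 30)
plt.show()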
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size = 0.2, random_state = 13, stratify = y)
model = SVC()
model.fit(X_train, y_train)
print(model.score(X_train, y_train), model.score(X_test, y_test))
Training accuracy score: 90.0%
Testing accuracy score: 95.1%
print(classification_report(y_test, model.predict(X_test)))
cm = confusion_matrix(y_test, model.predict(X_test))
sns.heatmap(cm, annot = True, cmap = 'Blues')   # visualize the confusion matrix
plt.show()
- SVM is a simple and accurate model, given a good-quality dataset.
- Not suitable for large datasets due to increased training time.
- Works well with datasets that show a clear margin of separation between classes.
- Not suitable for noisy datasets with overlapping classes.
- It tends to be sensitive to the scale of the data.
- Works well with high-dimensional data.