Skip to content

Data-analysis-FUM/KNN

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

KNN

Objectives of this training

  • Understand the working of KNN and how it operates in python and R.
  • Get to know how to choose the right value of k for KNN.
  • Understand the difference between training error rate and validation error rate.

When Do We Use the KNN Algorithm?

KNN can be used for both classification and regression predictive problems. However, it is more widely used in classification problems in the industry. To evaluate any technique, we generally look at 3 important aspects:

  1. Ease of interpreting output
  2. Calculation time
  3. Predictive power

How Does the KNN Algorithm Work?

How Do We Choose the Factor K?

To find the best K value in KNN algorithm, you can use a method called cross-validation.

In cross-validation, the data is split into "folds" (subsets), and for each fold, the model is trained on the remaining data and evaluated on the held-out fold. This process is repeated for each fold, and the average performance is computed.

To find the optimal K value using cross-validation, you can try different values of K (e.g., 1, 3, 5, 7, etc.) and for each value of K, perform cross-validation. You can then choose the K that gives the best performance.

One way to implement this in Python is to use the scikit-learn library. Here's an example code snippet:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# load your data and split it into features and labels

k_values = [1, 3, 5, 7, 9] # try different K values
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy') # 10-fold cross-validation
    cv_scores.append(scores.mean())

# plot the results to help visualize the best K value
import matplotlib.pyplot as plt
plt.plot(k_values, cv_scores)
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.show()

best_k = k_values[cv_scores.index(max(cv_scores))] # choose the K with the highest accuracy

Try it...