9. k-Nearest-Neighbor Classifier with sklearn
By Bernd Klein. Last modified: 16 Jun 2023.
Introduction
The underlying concepts of the k-nearest-neighbor classifier (kNN) can be found in the chapter k-Nearest-Neighbor Classifier of our Machine Learning Tutorial. In that chapter we also showed simple functions written in Python to demonstrate the fundamental principles.
Even though those functions produced impressive results, we recommend using the functionality of the sklearn module instead. We have already used sklearn in our previous chapters.
Using sklearn for kNN
sklearn.neighbors is a package of the sklearn module that provides functionality for nearest-neighbor classifiers for both unsupervised and supervised learning.
The classes in sklearn.neighbors can handle both NumPy arrays and scipy.sparse matrices as input. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.
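As a quick illustration of the sparse-input support mentioned above, the following minimal sketch (a toy example of our own, not taken from the chapter) fits a KNeighborsClassifier directly on a scipy.sparse CSR matrix:
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import KNeighborsClassifier

# tiny toy dataset; most entries are zero, so a sparse matrix is a natural fit
X_dense = np.array([[0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.2],
                    [3.0, 0.0, 0.0],
                    [2.8, 0.0, 0.0]])
y = [0, 0, 1, 1]

X_sparse = csr_matrix(X_dense)             # the same data, stored as a sparse matrix

clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(X_sparse, y)                       # sparse input is accepted directly
print(clf.predict(csr_matrix([[0.0, 0.0, 0.9]])))   # the two nearest samples belong to class 0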
scikit-learn implements two different nearest neighbors classifiers:
- KNeighborsClassifier is based on the k nearest neighbors of the sample that has to be classified. The number 'k' is an integer value specified by the user. It is the more frequently used of the two classifiers.
- RadiusNeighborsClassifier is based on the neighbors within a fixed radius r around each sample that has to be classified. 'r' is a float value specified by the user. This classifier is used less often (a short sketch contrasting the two follows below).
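The following minimal sketch, using toy data and variable names of our own choosing, fits both classifiers on the same points just to make the difference in their decision rules tangible:
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# two well separated groups of points
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)      # vote among the 3 closest samples
rnc = RadiusNeighborsClassifier(radius=2.0)    # vote among all samples within radius 2
knn.fit(X, y)
rnc.fit(X, y)

print(knn.predict([[0.5, 0.5]]))   # the 3 nearest samples all belong to class 0
print(rnc.predict([[0.5, 0.5]]))   # only class-0 samples lie within radius 2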
KNeighborsClassifier
We will artificially create a dataset with three classes to test the k-nearest neighbor classifier 'KNeighborsClassifier' from 'sklearn.neighbors'. We described this approach in our chapter Data Set Creation for Machine Learning.
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
centers = [[2, 3], [5, 5], [1, 8]]
n_classes = len(centers)
data, labels = make_blobs(n_samples=150,
                          centers=np.array(centers),
                          random_state=1)
Let us visualize what we have created:
import matplotlib.pyplot as plt
colours = ('green', 'red', 'blue')
n_classes = 3
fig, ax = plt.subplots()
for n_class in range(0, n_classes):
    ax.scatter(data[labels==n_class, 0], data[labels==n_class, 1],
               c=colours[n_class], s=10, label=str(n_class))
ax.legend(loc='upper right');
Now we have to split the data into a training and a test set.
from sklearn.model_selection import train_test_split
res = train_test_split(data, labels,
                       train_size=0.8,
                       test_size=0.2,
                       random_state=1)
train_data, test_data, train_labels, test_labels = res
Now we are ready to perform the classification with the KNeighborsClassifier:
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(train_data, train_labels)
predicted = knn.predict(test_data)
print("Predictions from the classifier:")
print(predicted)
print("Target values:")
print(test_labels)
OUTPUT:
Predictions from the classifier:
[2 2 2 0 0 1 1 2 2 1 0 1 0 0 2 0 0 0 1 0 0 1 1 2 0 0 0 1 2 1]
Target values:
[2 2 2 0 0 1 1 2 2 1 0 1 0 0 2 0 0 0 1 0 0 1 1 2 0 0 0 1 2 1]
To evaluate the result, we will use accuracy_score from the module sklearn.metrics. To see how accuracy_score works, we will use a simple example with pseudo predictions and labels:
from sklearn.metrics import accuracy_score
example_predictions = [0, 2, 1, 3, 2, 0, 1]
example_labels = [0, 1, 2, 3, 2, 1, 1]
print(accuracy_score(example_predictions, example_labels))
OUTPUT:
0.5714285714285714
The return value corresponds to the quotient of the number of correctly classified items and the total number of predictions.
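We can verify this with a one-liner of our own (using NumPy): four of the seven positions agree, so the accuracy is 4/7.
import numpy as np

# fraction of positions where prediction and label agree: 4 out of 7
print(np.mean(np.array(example_predictions) == np.array(example_labels)))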
If you are only interested in the number of correctly classified items, you can set the parameter normalize to False. The default value is True.
print(accuracy_score(example_predictions,
example_labels,
normalize=False))
OUTPUT:
4
Now we are ready to evaluate the results of our previous classification example:
print(accuracy_score(predicted, test_labels))
OUTPUT:
1.0
You may have noticed that we instantiated the k-nearest neighbor classifier in our previous example by calling it without any arguments, i.e. KNeighborsClassifier().
In the following, we instantiate it with some possible keyword parameters:
knn = KNeighborsClassifier(algorithm='auto',
                           leaf_size=30,
                           metric='minkowski',
                           p=2,
                           metric_params=None,
                           n_jobs=1,
                           n_neighbors=5,
                           weights='uniform')
The parameter metric is 'minkowski' by default. We explained the Minkowski distance in our chapter k-Nearest-Neighbor Classifier. The parameter p is the p of the Minkowski formula: when p is set to 1, this is equivalent to using the manhattan_distance, and the euclidean_distance will be used if p is set to 2.
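To make this concrete, the following short sketch (our own addition; knn_p is a name we chose for illustration) fits the classifier on the blob data from above once with p=1 (Manhattan) and once with p=2 (Euclidean) and compares the test accuracies:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

for p in (1, 2):   # p=1: Manhattan distance, p=2: Euclidean distance
    knn_p = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=p)
    knn_p.fit(train_data, train_labels)
    acc = accuracy_score(test_labels, knn_p.predict(test_data))
    print(f"p={p}: accuracy {acc:.3f}")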
The parameter 'algorithm' determines which algorithm will be used, i.e.
- ball_tree will use BallTree
- kd_tree will use KDTree
- brute will use a brute-force search.
We set the parameter to 'auto', which will attempt to decide the most appropriate algorithm based on the values passed to the fit method.
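As a quick sanity check of our own, we can fit the same classifier with each algorithm setting and confirm that only the way the neighbors are searched changes, not the resulting predictions:
from sklearn.metrics import accuracy_score

for algorithm in ('ball_tree', 'kd_tree', 'brute', 'auto'):
    knn_a = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn_a.fit(train_data, train_labels)
    acc = accuracy_score(test_labels, knn_a.predict(test_data))
    print(f"{algorithm:>9}: accuracy {acc:.3f}")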
The parameter leaf_size is needed by BallTree and KDTree. It can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
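The following rough sketch (our own; the timings depend on the machine and are practically negligible on such a small dataset) shows how one could compare different leaf_size values; the predictions themselves do not change:
import time
from sklearn.neighbors import KNeighborsClassifier

for leaf_size in (5, 30, 100):
    start = time.perf_counter()
    knn_l = KNeighborsClassifier(n_neighbors=5,
                                 algorithm='ball_tree',
                                 leaf_size=leaf_size)
    knn_l.fit(train_data, train_labels)
    knn_l.predict(test_data)
    elapsed = time.perf_counter() - start
    print(f"leaf_size={leaf_size:3}: {elapsed * 1000:.2f} ms")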
Using the Iris Data
In the following example we will use the Iris data set:
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
data, labels = iris.data, iris.target
res = train_test_split(data, labels,
                       train_size=0.8,
                       test_size=0.2,
                       random_state=12)
train_data, test_data, train_labels, test_labels = res
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
# classifier "out of the box", no parameters
knn = KNeighborsClassifier()
knn.fit(train_data, train_labels)
print("Predictions from the classifier:")
test_data_predicted = knn.predict(test_data)
print(test_data_predicted)
print("Target values:")
print(test_labels)
OUTPUT:
Predictions from the classifier:
[0 2 0 1 2 2 2 0 2 0 1 0 0 0 1 2 2 1 0 2 0 1 2 1 0 2 1 1 0 0]
Target values:
[0 2 0 1 2 2 2 0 2 0 1 0 0 0 1 2 2 1 0 1 0 1 2 1 0 2 1 1 0 0]
print(accuracy_score(test_data_predicted, test_labels))
OUTPUT:
0.9666666666666667
print("Predictions from the classifier:")
learn_data_predicted = knn.predict(train_data)
print(learn_data_predicted)
print("Target values:")
print(train_labels)
print(accuracy_score(learn_data_predicted, train_labels))
OUTPUT:
Predictions from the classifier:
[0 1 2 0 2 0 1 1 0 1 1 0 0 0 0 0 0 0 2 0 2 1 1 1 0 2 1 1 2 0 2 0 2 1 2 2 1 1 1 2 2 0 2 2 0 1 0 2 2 0 1 1 0 0 1 1 1 1 2 1 2 0 0 1 1 2 0 2 1 0 2 2 1 2 2 0 0 2 1 1 2 0 1 1 0 1 1 2 2 1 0 2 0 2 0 0 1 2 2 1 2 2 0 1 1 0 2 2 2 1 2 2 2 0 0 1 0 2 2 1]
Target values:
[0 1 2 0 2 0 1 1 0 1 1 0 0 0 0 0 0 0 2 0 2 1 1 1 0 2 1 1 2 0 2 0 2 2 2 2 1 1 1 1 2 0 2 2 0 1 0 2 2 0 1 1 0 0 1 1 1 1 2 1 2 0 0 1 1 1 0 2 1 0 2 2 1 2 2 0 0 2 1 1 2 0 1 1 0 1 1 2 2 1 0 2 0 2 0 0 1 2 2 1 2 2 0 1 1 0 2 2 2 1 2 2 2 0 0 1 0 2 2 1]
0.975
knn2 = KNeighborsClassifier(algorithm='auto',
                            leaf_size=30,
                            metric='minkowski',
                            p=2, # p=2 is equivalent to the Euclidean distance
                            metric_params=None,
                            n_jobs=1,
                            n_neighbors=5,
                            weights='uniform')
knn2.fit(train_data, train_labels)
test_data_predicted = knn2.predict(test_data)
accuracy_score(test_data_predicted, test_labels)
OUTPUT:
0.9666666666666667
RadiusNeighborsClassifier
The k-nearest-neighbor classifier works by growing a circle around the unknown sample (i.e. the item that needs to be classified) until the circle contains exactly k items. The radius neighbors classifier, in contrast, uses a circle of fixed radius: it locates all items of the training dataset that lie within the circle of the given radius around the item that has to be classified. As a consequence of this fixed-radius approach, dense regions of the feature distribution provide more information, while sparse regions contribute less.
from sklearn.neighbors import RadiusNeighborsClassifier
X = [[0, 1], [0.5, 1], [3, 1], [3, 2], [1.3, 0.8], [2.5, 2.5], [2.4, 2.6]]
y = [0, 0, 1, 1, 0, 1, 1]
neigh = RadiusNeighborsClassifier(radius=1.0)
neigh.fit(X, y)
print(neigh.predict([[1.5, 1.2]]))
print(neigh.predict([[3.1, 2.1]]))
OUTPUT:
[0]
[1]
If we try to make a prediction on a data point like [30, 20], the algorithm cannot find any neighbors within the radius 1.0, so it will raise an exception with the following text:
ValueError: No neighbors found for test samples array([0]), you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.
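If you want to reproduce this without aborting the program, you can catch the exception (a small sketch of our own):
try:
    print(neigh.predict([[30, 20]]))
except ValueError as err:
    print("Caught a ValueError:", err)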
There is a parameter for setting the label for outliers, i.e. outlier_label.
There are three ways to use it:
- Manual label: a str or int label (it should be of the same type as the labels we are using in our data), or a list of manual labels if multi-output is used.
- It can be set to the value 'most_frequent'. This will assign the most frequently occurring label of the dataset to outliers.
- If it is set to None (the default), a ValueError will be raised when an outlier is detected.
Let's do it again with 'most_frequent':
neigh = RadiusNeighborsClassifier(radius=1.0,
                                  outlier_label='most_frequent')
neigh.fit(X, y)
print(neigh.predict([[1.5, 1.2]]))
# the following is the previously mentioned outlier:
print(neigh.predict([[30, 20]]))
OUTPUT:
[0]
[1]
Alternatively, we set the outlier class to 2. We add one outlier element to our learnset:
from sklearn.neighbors import RadiusNeighborsClassifier
X = [[0, 1], [0.5, 1], [3, 1], [3, 2], [1.3, 0.8], [2.5, 2.5], [2.4, 2.6], [10000, -2321]]
y = [0, 0, 1, 1, 0, 1, 1, 2]
neigh = RadiusNeighborsClassifier(radius=1.0,
                                  outlier_label=2)
neigh.fit(X, y)
print(neigh.predict([[1.5, 1.2]]))
print(neigh.predict([[30, 20]]))
OUTPUT:
[0]
[2]
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
centers = [[2, 3], [9, 2], [7, 9]]
n_classes = len(centers)
data, labels = make_blobs(n_samples=255,
                          centers=np.array(centers),
                          cluster_std=1.3,
                          random_state=1)
data[:5]
OUTPUT:
array([[10.88685804,  1.1965521 ],
       [ 9.67101133,  9.0694324 ],
       [ 4.56489073, 10.19679965],
       [ 8.99754107,  0.18439345],
       [ 1.10084102,  2.48422042]])
import matplotlib.pyplot as plt
colours = ('green', 'red', 'blue')
n_classes = 3 # not using the outlier 'class'
fig, ax = plt.subplots()
for n_class in range(0, n_classes):
    ax.scatter(data[labels==n_class, 0], data[labels==n_class, 1],
               c=colours[n_class], s=10, label=str(n_class))
res = train_test_split(data, labels,
                       train_size=0.8,
                       test_size=0.2,
                       random_state=1)
train_data, test_data, train_labels, test_labels = res
Let's add one row to the end of train_data that contains outlier data, i.e. a point not belonging to any of the classes:
outlier = [4242.2, 4242.2]
train_data = np.vstack([train_data, outlier])
train_data[-3:]
OUTPUT:
array([[   8.42869523,    7.82787516],
       [   8.01064497,    8.84559748],
       [4242.2       , 4242.2       ]])
Now we have to add an outlier label to the labels.
outlier_label = len(np.unique(labels))
train_labels = np.append(train_labels, outlier_label)
train_labels[-10:]
OUTPUT:
array([0, 0, 0, 1, 0, 0, 0, 2, 2, 3])
np.unique(train_labels)
OUTPUT:
array([0, 1, 2, 3])
rnn = RadiusNeighborsClassifier(radius=1)
rnn.fit(train_data, train_labels)
OUTPUT:
RadiusNeighborsClassifier(radius=1)
predicted = rnn.predict(test_data)
print(accuracy_score(predicted, test_labels))
OUTPUT:
1.0
Let's shrink the radius:
rnn = RadiusNeighborsClassifier(radius=0.9,
                                outlier_label=outlier_label)
rnn.fit(train_data, train_labels)
predicted = rnn.predict(test_data)
print(accuracy_score(predicted, test_labels))
OUTPUT:
0.9803921568627451
Let's create some outliers and test them:
centers = [[100, 300]]
data_outliers, labels_outliers = make_blobs(n_samples=10,
                                            centers=np.array(centers),
                                            random_state=1)
predicted = rnn.predict(data_outliers)
predicted
OUTPUT:
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
A good value for k is the square root of the number of samples in the training set:
k = int(len(labels) ** 0.5)
# make this value odd:
if k % 2 == 0:
    k += 1
k
OUTPUT:
15
Let us compare this with a k nearest neighbor classifier:
knn = KNeighborsClassifier(algorithm='auto',
                           leaf_size=30,
                           metric='minkowski',
                           metric_params=None,
                           n_jobs=1,
                           n_neighbors=k, # default is 5
                           p=2, # p=2 is equivalent to the Euclidean distance
                           weights='uniform')
knn.fit(data, labels)
OUTPUT:
KNeighborsClassifier(n_jobs=1, n_neighbors=15)
predicted = knn.predict(test_data)
print(accuracy_score(predicted, test_labels))
OUTPUT:
1.0
from sklearn.metrics import confusion_matrix
# Evaluate Model
cm = confusion_matrix(predicted, test_labels)
print(cm)
OUTPUT:
[[24  0  0]
 [ 0 18  0]
 [ 0  0  9]]
predicted = knn.predict(data_outliers)
predicted
OUTPUT:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
We can see that all the outliers have been wrongly classified as class 2, because this is the closest existing class to the outliers. In the following, we create three clusters of outliers:
centers = [[100, 300], [10, -10], [-200, -200]]
data_outliers2, labels_outliers2 = make_blobs(n_samples=30,
                                              centers=np.array(centers),
                                              random_state=1)
predicted = knn.predict(data_outliers2)
predicted
OUTPUT:
array([2, 2, 2, 1, 0, 2, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 2, 0, 2, 0, 1, 1, 2, 2, 1, 2, 0, 1, 0, 0])
The outliers are assigned to the existing clusters even though they are far away from them. The RadiusNeighborsClassifier, on the other hand, will recognize them as outliers:
rnn = RadiusNeighborsClassifier(radius=0.9,
                                outlier_label=outlier_label)
rnn.fit(train_data, train_labels)
predicted = rnn.predict(data_outliers2)
predicted
OUTPUT:
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
Determining the Optimal k Value
As we have written above, the optimal value for k is usually the square root of n, where n is the total number of samples of our dataset. We can also determine a value for k by plotting the accuracy values for different k values:
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

n_classes = 6
data, labels = make_blobs(n_samples=1000,
                          centers=n_classes,
                          cluster_std=1.3,
                          random_state=1)
colours = ('green', 'red', 'blue', 'magenta', 'yellow', 'pink')
fig, ax = plt.subplots()
for n_class in range(0, n_classes):
    ax.scatter(data[labels==n_class, 0], data[labels==n_class, 1],
               c=colours[n_class], s=10, label=str(n_class))
res = train_test_split(data, labels,
                       train_size=0.7,
                       test_size=0.3,
                       random_state=1)
train_data, test_data, train_labels, test_labels = res
print(len(train_data), len(test_data), len(train_labels))
X, Y = [], []
for k in range(1, 25):
    classifier = KNeighborsClassifier(n_neighbors=k,
                                      p=2,    # p=2 corresponds to the Euclidean distance
                                      metric="minkowski")
    classifier.fit(train_data, train_labels)
    predictions = classifier.predict(test_data)
    score = accuracy_score(test_labels, predictions)
    X.append(k)
    Y.append(score)
fig, ax = plt.subplots()
ax.set_xlabel('k')
ax.set_ylabel('accuracy')
ax.plot(X, Y, "go")
OUTPUT:
700 300 700
[<matplotlib.lines.Line2D at 0x7fc1c3ce6850>]
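Instead of reading the best k off the plot by eye, we can also pick it programmatically (a small sketch of our own; in the case of ties, argmax simply returns the smallest such k):
import numpy as np

best_index = int(np.argmax(Y))   # index of the highest accuracy in the list Y
print("best k:", X[best_index], "with accuracy", Y[best_index])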
Exercises
Exercise 1
Classify the data in "strange_flowers.txt" with a k nearest neighbor classifier.
Solutions
Solution to Exercise 1
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # scaling prevents features with large values from dominating the distance
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
dataset = pd.read_csv("data/strange_flowers.txt",
                      header=None,
                      names=["red", "green", "blue", "size", "label"],
                      sep=" ")
dataset
|     | red   | green | blue  | size | label |
|-----|-------|-------|-------|------|-------|
| 0   | 252.0 | 96.0  | 10.0  | 3.63 | 1.0   |
| 1   | 249.0 | 115.0 | 10.0  | 3.59 | 1.0   |
| 2   | 235.0 | 107.0 | 0.0   | 3.81 | 1.0   |
| 3   | 255.0 | 110.0 | 6.0   | 3.91 | 1.0   |
| 4   | 247.0 | 104.0 | 8.0   | 3.41 | 1.0   |
| ... | ...   | ...   | ...   | ...  | ...   |
| 790 | 197.0 | 250.0 | 108.0 | 2.69 | 4.0   |
| 791 | 197.0 | 250.0 | 107.0 | 3.05 | 4.0   |
| 792 | 197.0 | 241.0 | 109.0 | 3.23 | 4.0   |
| 793 | 197.0 | 243.0 | 92.0  | 3.00 | 4.0   |
| 794 | 197.0 | 252.0 | 96.0  | 3.06 | 4.0   |
795 rows × 5 columns
Instead of using Pandas to read in the 'strange_flowers.txt' data, we could use 'loadtxt' from numpy:
# alternative way to read and extract the data
import numpy as np
raw_data = np.loadtxt("data/strange_flowers.txt")
data = raw_data[:,:-1]
labels = raw_data[:,-1]
We will continue now with the Pandas DataFrame object 'dataset', which we read in with 'read_csv':
data = dataset.drop('label', axis=1)
labels = dataset.label
X_train, X_test, y_train, y_test = train_test_split(data,
                                                     labels,
                                                     random_state=0,
                                                     test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # fit the scaler on the training data and transform it
X_test = scaler.transform(X_test)       # transform the test data with the same scaler
X_train
OUTPUT:
array([[ 1.0031888 , -0.39408598, -0.38229346, -0.06392483],
       [-1.1023726 ,  1.9321053 ,  1.79682762, -1.61096419],
       [ 1.30398328, -0.51208119, -0.48484033,  0.94680755],
       ...,
       [-1.1023726 ,  1.83096655,  2.1813784 , -1.63159138],
       [-1.57504965, -0.39408598, -0.66429736,  1.48311452],
       [-1.1023726 ,  1.79725363,  2.00192137, -0.70336777]])
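To see why the scaling step matters for a distance-based classifier, the following sketch (our own addition; it uses the default n_neighbors, since k is only computed further below) trains the same kind of model on the unscaled features for comparison. With the colour channels ranging up to about 255 and 'size' only up to about 4, the colour features would otherwise dominate the Euclidean distance:
# hypothetical comparison: the same kind of classifier on unscaled vs. scaled features
X_train_raw, X_test_raw, _, _ = train_test_split(data, labels,
                                                 random_state=0,
                                                 test_size=0.2)
clf_raw = KNeighborsClassifier()     # default n_neighbors=5, just for this comparison
clf_raw.fit(X_train_raw, y_train)    # same split as above, thanks to random_state=0
print("accuracy without scaling:", accuracy_score(y_test, clf_raw.predict(X_test_raw)))

clf_scaled = KNeighborsClassifier()
clf_scaled.fit(X_train, y_train)     # X_train / X_test are the scaled arrays from above
print("accuracy with scaling:   ", accuracy_score(y_test, clf_scaled.predict(X_test)))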
We set k to the square root of the size of the training set:
k = int(len(X_train) ** 0.5)
k
OUTPUT:
25
# Define the model
classifier = KNeighborsClassifier(n_neighbors=k,
                                  metric="minkowski",
                                  p=2) # p=2 corresponds to the Euclidean distance
classifier.fit(X_train, y_train)
OUTPUT:
KNeighborsClassifier(n_neighbors=25)
y_pred = classifier.predict(X_test)
y_pred
OUTPUT:
array([3., 1., 3., 4., 3., 3., 1., 4., 3., 3., 4., 1., 3., 1., 2., 2., 2., 3., 1., 4., 2., 3., 4., 2., 3., 3., 4., 4., 1., 2., 1., 2., 2., 3., 1., 3., 3., 2., 2., 2., 3., 3., 4., 1., 4., 2., 3., 2., 3., 2., 2., 3., 1., 3., 4., 1., 2., 4., 2., 3., 3., 4., 3., 4., 3., 2., 1., 2., 1., 3., 3., 1., 4., 2., 2., 3., 2., 4., 2., 4., 1., 3., 4., 2., 4., 3., 2., 2., 2., 3., 1., 2., 3., 3., 1., 4., 2., 2., 2., 2., 1., 1., 4., 3., 3., 3., 2., 1., 1., 4., 2., 3., 3., 1., 2., 4., 3., 1., 1., 2., 1., 4., 3., 4., 2., 2., 3., 2., 4., 1., 4., 2., 4., 4., 4., 4., 4., 2., 4., 4., 4., 2., 3., 2., 1., 2., 2., 3., 1., 1., 3., 1., 2., 4., 2., 4., 1., 3., 1.])
# Evaluate Model
cm = confusion_matrix(y_test, y_pred)
print(cm)
OUTPUT:
[[28  4  0  0]
 [ 4 43  0  0]
 [ 0  0 44  0]
 [ 0  0  0 36]]
print(accuracy_score(y_test, y_pred))
OUTPUT:
0.949685534591195