K Nearest Neighbor – A data driven Machine Learning Algorithm

Most Supervised Machine Learning algorithms are model based, i.e., given a set of training points we find a function that maps input points to output labels. This means that once we have used the training points to find the model function, we can throw away the data points and keep only the function parameters!

Today we will look past this model-driven approach and work on a data-driven machine learning algorithm – K Nearest Neighbor (KNN). What’s more, we will go full Super Developer Mode and build it from scratch! I love scikit-learn too, but sometimes it’s fun to code. 😉

KNN Step-by-Step

KNN works on two principles: majority wins and similarity matters. What do I mean by that? Let’s walk through how KNN works!

  1. Instantiate a KNN object with a natural number k.
  2. Provide the KNN object with the training data via .fit(X, y).
  3. The .fit() method does only one thing: it creates a copy of all the training points and stores it in the KNN object.
  4. Prediction:
    1. A given test point is compared to every single point in the stored training set using a similarity metric.
    2. A similarity metric is a mathematical function that takes two points (a test point and a training point) and returns a real number, which is larger when the two points are more similar and smaller when they are less similar.
    3. Collect the K highest similarity values and the labels of the associated training points.
    4. Find the majority label among these K points.
    5. This label is our prediction.
    6. Repeat steps 4.1 to 4.5 for every test point.

Remember: in mathematical terms, the similarity metric here is the inverse of a distance measure between two points.
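
In symbols: similarity(x, y) = 1 / d(x, y), so a smaller distance means a higher similarity.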

Before going ahead, let’s define a class that captures our requirements up to this point!

import numpy as np

class KNN:
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.X = np.copy(X)
        self.y = np.copy(y)

    def distance(self, x_test):
        # Placeholder: distance measures are implemented in the next section
        pass

    def predict(self, x_test):
        preds = []
        for index in range(x_test.shape[0]):
            # compute distances to every training point
            distances = self.distance(x_test[index])
            # Sort labels in ascending order of distances
            sorted_labels = self.y[np.argsort(distances)]
            # Take first k labels
            k_sorted_labels = sorted_labels[:self.k]
            # get unique labels and their counts
            unique_labels, counts = np.unique(k_sorted_labels, return_counts=True)
            # prediction is the label with the maximum count
            pred = unique_labels[np.argmax(counts)]
            preds.append(pred)
        return np.array(preds)

Now that we have the basic structure of our class ready, let’s learn about the distance measures that play a major role in the similarity metric.

Distance Measures for KNN

Let’s look at 3 well-known distance measures:

  • Manhattan Distance
  • Euclidean Distance
  • Minkowski Distance

Manhattan Distance

Use this when features have vastly different ranges. It is the sum of the absolute differences between the features of the two points.
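
In symbols, for points x = (x₁, …, xₙ) and y = (y₁, …, yₙ):

d(x, y) = |x₁ − y₁| + |x₂ − y₂| + … + |xₙ − yₙ|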

Euclidean Distance

This is the most frequently used distance function. Make sure to standardize the data so that each feature has a roughly similar range. It is the square root of the sum of squared differences between the features of the two points.
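
In symbols:

d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)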

Minkowski Distance

A distance function that generalizes Manhattan and Euclidean distance. It is the pth root of the sum of the pth powers of the absolute differences between the features of the two points.
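
In symbols:

d(x, y) = (|x₁ − y₁|ᵖ + |x₂ − y₂|ᵖ + … + |xₙ − yₙ|ᵖ)^(1/p)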

For,

  • p = 1, it is Manhattan distance
  • p = 2, it is Euclidean distance

We know that Euclidean distance gets distorted when feature ranges vary widely, while Manhattan distance tries to resolve this. Using this information, we can tweak p to get something in between, or use a search over p values to find one that works well for a given dataset.

Let’s add this new knowledge to our already defined class.

import numpy as np

class KNN:
    def __init__(self, k, d_metric='eucd', p=None):
        # Default to p=1 (Manhattan) if Minkowski distance is chosen without a p value
        if d_metric == 'mind' and p is None:
            self.p = 1
        else:
            self.p = p
        self.k = k
        self.d_metric = d_metric
        self.d_metric_to_fn = {
            'eucd': self._euclidean_dist,
            'mand': self._manhattan_dist,
            'mind': self._minkowski_dist
        }

    def fit(self, X, y):
        self.X = np.copy(X)
        self.y = np.copy(y)

    def _manhattan_dist(self, x_test):
        return np.sum(np.abs(self.X - x_test), axis=-1)

    def _euclidean_dist(self, x_test):
        sq_diff = (self.X - x_test) ** 2
        return np.sqrt(np.sum(sq_diff, axis=-1))

    def _minkowski_dist(self, x_test):
        abs_diff = np.abs(self.X - x_test)
        sum_p_diff = np.sum(abs_diff ** self.p, axis=-1)
        pth_root = sum_p_diff ** (1 / self.p)

        return pth_root

    def distance(self, x_test):
        return self.d_metric_to_fn[self.d_metric](x_test)

    def predict(self, x_test):
        preds = []
        for index in range(x_test.shape[0]):
            # compute distances to every training point
            distances = self.distance(x_test[index])
            # Sort labels in ascending order of distances
            sorted_labels = self.y[np.argsort(distances)]
            # Take first k labels
            k_sorted_labels = sorted_labels[:self.k]
            # get unique labels and their counts
            unique_labels, counts = np.unique(k_sorted_labels, return_counts=True)
            # prediction is the label with the maximum count
            pred = unique_labels[np.argmax(counts)]
            preds.append(pred)
        return np.array(preds)
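
Before moving on, here’s a quick sanity check of the class on a tiny hand-made dataset (the points and labels below are made up purely for illustration):

import numpy as np

# Two well-separated clusters labelled 0 and 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn = KNN(k=3)
knn.fit(X_toy, y_toy)

# Points near each cluster should pick up that cluster's label
print(knn.predict(np.array([[0.1, 0.1], [1.0, 0.95]])))  # expected: [0 1]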

Now that we’ve defined the KNN class, let’s visually compare how these distance functions behave. But what would make a good dataset? How about one with a circular decision boundary? That’s fun. Let’s create it:

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

X, y = make_circles(n_samples=400, factor=.3, noise=.2)

plt.figure(dpi=300, figsize=(3, 3))
plt.scatter(X[y==1, 0], X[y==1, 1], marker='s', label='Inner Circle')
plt.scatter(X[y==0, 0], X[y==0, 1], marker='^', label='Outer Circle')
plt.legend()
plt.show()
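
Next, let’s write a small helper that plots a classifier’s decision regions over a grid of points:
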
def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)


# 2 KNN objects - Euclidean Distance and Manhattan Distance
eucd_knn = KNN(k=5)
mand_knn = KNN(k=5, d_metric='mand')

# Fit data to both objects
eucd_knn.fit(X, y)
mand_knn.fit(X, y)

# Plot predictions for both
plt.figure(figsize=(10, 4))
plt.subplot(121)
plot_predictions(eucd_knn, [-1.75, 1.75, -1.75, 1.75])
plt.scatter(X[y==1, 0], X[y==1, 1], marker='s', label='Inner Circle')
plt.scatter(X[y==0, 0], X[y==0, 1], marker='^', label='Outer Circle')
plt.title('Euclidean Distance')
plt.legend()

plt.subplot(122)
plot_predictions(mand_knn, [-1.75, 1.75, -1.75, 1.75])
plt.scatter(X[y==1, 0], X[y==1, 1], marker='s', label='Inner Circle')
plt.scatter(X[y==0, 0], X[y==0, 1], marker='^', label='Outer Circle')
plt.title('Manhattan Distance')
plt.legend()
plt.show()

We can clearly see a block of negative region around the point (-0.5, 0.48) in the first image. Although it should have been a region of positive labels, the model was fooled by the presence of nearby outliers. This is resolved in the second image through the use of Manhattan distance.

Now, let’s see how Minkowski distance performs for different values of p.

def get_clf(p):
    # A wrapper function to get KNN with different
    # p values
    return KNN(k=5, d_metric='mind', p=p)

def scatter_plot(X, y):
    # scatter plot of original data
    plt.scatter(X[y==1, 0], X[y==1, 1], marker='s', label='Inner Circle')
    plt.scatter(X[y==0, 0], X[y==0, 1], marker='^', label='Outer Circle')

# We will test p values from 0.25 to 2,
# increasing in steps of 0.25
ps = np.arange(0.25, 2.25, 0.25)

plt.figure(dpi=300, figsize=(10, 14))

for i, p in enumerate(ps):
    # get a KNN clf
    clf = get_clf(p)
    # Fit data
    clf.fit(X, y)
    # plot it
    plt.subplot(4, 2, i+1)
    plot_predictions(clf, [-1.75, 1.75, -1.75, 1.75])
    scatter_plot(X, y)
    plt.title('p={}'.format(p))
    plt.legend()

plt.show()

After looking at these distance functions, you might have a question. If not, then let me ask one for you: what if none of the pre-defined distance functions suits our problem? Naturally, we might look into creating our own distance measure. But to do so, we must take into account the properties that a distance measure should satisfy!

Properties of Distance Measure for KNN

There are many well-known distance measures, but you can certainly define your own. While defining a distance measure, remember these necessary properties that it should follow (Deza & Deza, 2009):

Note: When using distance measures, what matters is the property itself, not its name. So just relax and focus on what each property means.

  • Non-Negativity: The distance between two points is always greater than or equal to zero.

d(x, y) ≥ 0

  • Identity of indiscernibles: The distance between x and y is equal to zero if and only if x is equal to y.

d(x, y) = 0   iff  x = y

  • Symmetry: The distance between x and y is equal to the distance between y and x.

d(x, y) = d(y, x)

  • Triangle inequality: For any third point z, the distance between x and y is always less than or equal to the sum of the distance between x and z and the distance between z and y.

d(x, y) ≤ d(x, z) + d(z, y)

That concludes our discussion of distance measures. But we aren’t there yet!

A KNN classifier has two major players in the classification process: the distance measure and the hyperparameter k. We have gone over the former a bit; now we need to work our way through the latter.

Significance of k in KNN

The k nearest data points to the point under observation have a huge role to play. Let’s see why by considering two scenarios:

Scenario 1: Majority Wins

Out of the k closest data points, the class with the most points among them decides the label for the point under observation.

  • If k = 1 (aka Nearest Neighbor), the classification can go wrong if the closest point is an outlier.
  • If k is a multiple of n, where n = number of classes, a tie with no majority can occur. Example: for 2-class classification, what if we have k = 8 and 4 points belong to each class?

Advice:

  • Try to keep k = (a non-zero multiple of n) + 1, where n = number of classes. This gives us a tie-breaker.
  • Keep a validation set and a range of k values, and pick the k at which your performance metric starts to saturate on the validation set (a minimal sketch follows this list).
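
Here is a minimal sketch of the second piece of advice, assuming the KNN class defined above and the X, y we generated with make_circles earlier (train_test_split and accuracy_score from scikit-learn are used purely for convenience):

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out a validation set from the make_circles data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Try a range of odd k values (a multiple of 2 classes, plus 1)
# and watch where validation accuracy saturates
for k in range(1, 20, 2):
    knn = KNN(k=k)
    knn.fit(X_train, y_train)
    acc = accuracy_score(y_val, knn.predict(X_val))
    print('k={:2d} -> validation accuracy: {:.3f}'.format(k, acc))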

Scenario 2: Distance Weighted Classification

Suppose k = 9 for a 2-class classification problem, out of which 4 points say True and 5 say False. According to Scenario 1, the prediction must be False! But what if the 4 closest points are all True? Should our prediction still be False? That’s where distance-weighted KNN comes into play!

The idea of distance weighted classification is a 3 step process:

  • Convert the distances of the k closest points to similarities. The closer the point, the higher the similarity.

similarity(x, y) = 1 / d(x, y)

  • Find total similarity for each label in the closest k values.
  • The label with the highest total similarity will be our prediction.

Let’s try this with an example. We will work with Euclidean distance.

Point under observation:

| x1 | x2 |
|----|----|
| 1  | 3  |

Top k (= 5) points:

| S No. | x1  | x2   | Distance | Similarity (= 1/distance) | Label |
|-------|-----|------|----------|---------------------------|-------|
| 1     | 1.3 | 2.8  | 0.3606   | 2.7735                    | False |
| 2     | 1   | 2.5  | 0.5      | 2                         | True  |
| 3     | 1.1 | 3.5  | 0.5099   | 1.9612                    | False |
| 4     | 1   | 3.75 | 0.75     | 1.3333                    | True  |
| 5     | 1.9 | 2.9  | 0.9055   | 1.1043                    | True  |

  • Total similarity for the True label = 2 + 1.3333 + 1.1043 = 4.4376
  • Total similarity for the False label = 2.7735 + 1.9612 = 4.7347

Since the total similarity of the False label is higher, our prediction will be False, even though the True labels hold the simple majority.
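
As a quick check, here is a small NumPy sketch (independent of the KNN class above) that reproduces the weighted vote for the five points in the table:

import numpy as np

# The point under observation and the 5 closest training points from the table
x = np.array([1, 3])
neighbors = np.array([[1.3, 2.8], [1, 2.5], [1.1, 3.5], [1, 3.75], [1.9, 2.9]])
labels = np.array([False, True, False, True, True])

# Euclidean distances and their inverses (similarities)
distances = np.sqrt(np.sum((neighbors - x) ** 2, axis=-1))
similarities = 1 / distances

# Total similarity per label
for label in (True, False):
    print(label, round(similarities[labels == label].sum(), 4))
# False wins the weighted vote (about 4.73 vs 4.44),
# even though True has the simple majority (3 of 5)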

This should give you a sense of how important the hyperparameter k is in the KNN algorithm.

With all said and done, you might still be wary of using KNN in production. Let’s learn why!

Run Time Analysis of KNN

Note: To simplify things, let’s forget about matrix multiplication and just say we have 1 feature.

For model-based approaches, training (.fit()) usually takes on the order of O(n × m) time, where n is the number of examples/batches and m is the number of iterations, while a prediction takes O(1) time. The high training time comes from optimization algorithms like Gradient Descent, while the constant prediction time comes from a simple weighted sum of the input features.

In contrast to the model-based approach, data-driven KNN takes O(1) time for training and O(n) time for a prediction. The former is due to the simple assignment operation in the fit() method; the latter is due to computing the distance between the test point and every training point.
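
To make this concrete, here is a rough timing sketch, assuming the KNN class defined above (the exact numbers depend on your hardware, but fit time should stay flat while prediction time grows roughly linearly with the training-set size):

import time
import numpy as np

rng = np.random.default_rng(0)

for n in (1_000, 10_000, 100_000):
    X_train = rng.normal(size=(n, 1))          # 1 feature, as in the note above
    y_train = (X_train[:, 0] > 0).astype(int)  # arbitrary labels
    X_test = rng.normal(size=(100, 1))

    knn = KNN(k=5)

    start = time.perf_counter()
    knn.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    knn.predict(X_test)
    pred_time = time.perf_counter() - start

    print('n={}: fit {:.6f}s, predict {:.4f}s'.format(n, fit_time, pred_time))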

At deployment time, our major focus is on prediction time. Thus, KNN raises the major issue of increased prediction time.

We can mitigate this with tricks like parallelization, but such tricks require more computing equipment, making this an expensive approach.

This brings us to the end of the blog post. KNN is a powerful technique and can work marvels, but it requires more computation at prediction time. Hope to see you next time. 🙂
