Introducing KNN: K Nearest Neighbors

Posted on Sep 25th, 2024 by Felipe Aníbal

Last updated on Sep 26th, 2024

K Nearest Neighbors, or simply KNN, is a family of classification algorithms that is conceptually easy to understand and implement, and can nonetheless achieve great results. In this post we will build an intuition behind KNN, give a more formal description of it, work through an example, and look at some data preparation techniques that can help us build a better classifier.

Gaining intuition

A classifier is essentially an algorithm that assigns a class to a given input. An algorithm that decides whether to approve or deny a loan to a person is an example of a classifier. Given an input — the information about a person — it assigns that person a class, such as "deserving credit" or "not deserving credit". This kind of decision is usually made from data collected in the past. If many people with given characteristics received a loan and were able to repay it, then someone with similar characteristics should likely be able to repay the loan as well. Likewise, if many people with certain characteristics were not able to repay the loan, then someone in similar conditions might also struggle to repay it.

The reasoning above is precisely the idea behind KNN: given a new input, we assign it the same label as its neighbors, that is, the same label as similar inputs!

The idea above should make intuitive sense, but it is challenging to translate into a mathematical or programmable form. What do we mean by "similar characteristics"? How can we quantify similarity?

A more formal definition

To translate the above idea into a more formal definition, we can use a concept from linear algebra. We can think of each input with N attributes as an N-dimensional vector. Better yet, we can think of it as a point in N-dimensional space. That way, given a new point, we assign it the most frequent label among its nearest neighbors. For instance, we can look at the 3 points that are closest to the new point and assign the new point the label that appears most frequently among them. That would be a 3-NN (3-nearest neighbors) classifier. We could also look at the 30 nearest points, and that would be 30-NN. We can vary the value of K to adjust our classifier.
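As a minimal sketch of this definition (the function and toy data below are purely illustrative, not part of the dataset used later), a KNN prediction can be written directly as "compute distances, take the K closest points, take a majority vote":

import numpy as np

def knn_predict(train_points, train_labels, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(train_points - new_point, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    neighbor_labels = [train_labels[i] for i in nearest]
    return max(set(neighbor_labels), key=neighbor_labels.count)

# Toy data: four points with two attributes each, and two classes
points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels = ["A", "A", "B", "B"]
print(knn_predict(points, labels, np.array([1.1, 0.9]), k=3))  # prints "A"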

A problem that arises when thinking of the input as a vector is that we can only calculate the distance between two numerical vectors. How could we measure the distance between values like "Married" and "Single"? This is a problem we will address in a moment. For now, let's implement the above idea in an example using only numerical inputs.

KNN hands-on: the Adult dataset

The University of California, Irvine, has published a dataset with information about approximately 32,000 people from the 1994 US Census. This information includes the number of years a person has studied, their capital gain and capital loss, their age, the number of hours they work per week, and their income. The income is recorded only as whether the person earns more than $50,000 a year. Given all of this information about a new person except their income, can we guess whether they earn more than $50,000?

The code below uses the scikit-learn library to implement, in Python, a KNN classifier that tries to make this prediction. The data used for this example can be found at the UCI Machine Learning Repository.

# Import pandas library
import pandas as pd

# Read the training data and mark the "?" as missing data
adult = pd.read_csv("train_data.csv", na_values="?")

# Select only the columns with numerical values
xAdult = adult[["age", "education.num", "capital.gain", "capital.loss", "hours.per.week"]]

# The label we want to predict is the income column
yAdult = list(adult["income"])

# Import KNN from the scikit-learn library
from sklearn.neighbors import KNeighborsClassifier

# We will use K=30
knn = KNeighborsClassifier(n_neighbors=30)

# Use cross validation to have an idea of our model's accuracy
from sklearn.model_selection import cross_val_score

scores = cross_val_score(knn, xAdult, yAdult, cv=10)
avg = sum(scores)/10

Using the model above on the test data (also available at the UCI Machine Learning Repository), we achieved an accuracy of around 82%.
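That evaluation step is not shown in the snippet above. A minimal sketch of it could look like the following, assuming the test file is named test_data.csv and contains the same columns (including "income") as the training file:

# Assumed file name and columns: adjust them to match the actual test file
test = pd.read_csv("test_data.csv", na_values="?")
xTest = test[["age", "education.num", "capital.gain", "capital.loss", "hours.per.week"]]
yTest = list(test["income"])

# Fit on the full training data and measure accuracy on the test data
from sklearn.metrics import accuracy_score
knn.fit(xAdult, yAdult)
print(accuracy_score(yTest, knn.predict(xTest)))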

Dealing with categorical features

We saw above that our classifier makes a correct guess around 82% of the time. Let's see if we can use more information from the dataset to improve our prediction. We will turn our categorical features — the ones that don't have numerical values — into numerical features. There are two very common ways of doing that: label encoding and one-hot encoding.

Label encoding consists of assigning a number to each value of a categorical feature, in the order the values appear. For example, if we look at the feature "relationship," we can assign 0 to Husband, 1 to Wife, 2 to Own-child, 3 to Not-in-family, 4 to Other-relative, and 5 to Unmarried. We can do the same for each of the other categorical features and then feed them all to our classifier.
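As a quick illustration (this snippet is only a sketch; the hands-on example below uses one-hot encoding instead), pandas can do this kind of encoding with factorize, which assigns an integer to each value in the order it first appears:

# Label-encode the "relationship" column: one integer per category,
# in the order each category first appears in the data
codes, categories = pd.factorize(adult["relationship"])
print(categories)  # the distinct values, in the order they were assigned
print(codes[:5])   # the first five rows encoded as integers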

Another strategy, known as one-hot encoding, is to encode each value in a categorical feature as a new binary feature. That means we can create new columns in our dataset like Husband, Wife, Own-child, Not-in-family, Other-relative, and Unmarried, each with values of 1 if a person fits that category and 0 if they don't.
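Here is a toy sketch of what this looks like with pandas' get_dummies, the same function used in the hands-on example below:

# One new binary column is created per category of the original column
toy = pd.DataFrame({"relationship": ["Husband", "Wife", "Own-child"]})
print(pd.get_dummies(toy, columns=["relationship"]))
# Columns: relationship_Husband, relationship_Own-child, relationship_Wife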

The second strategy has the advantage of not creating arbitrary magnitude differences and an arbitrary order between the values. What does it mean, for instance, that "Other-relative" is twice "Own-child", as the label encoding implies? Or why is "Husband" further from "Other-relative" than from "Wife"? On the other hand, one-hot encoding can dramatically increase the size of our dataset: in our example, encoding every categorical feature this way takes us from around 15 columns to more than 100.

Let's see if including categorical data using one-hot encoding can improve the accuracy of our classifier.

KNN hands-on (part II): one-hot encoding

# Pandas has a function called get_dummies to do one-hot encoding
encodedAdult = pd.get_dummies(adult, columns=["workclass", "education", "marital.status",
"relationship", "occupation", "race", "sex"], dummy_na=True)

# Drop the label ("income") and the columns we will not use as features
# ("fnlwgt", "Id", and "native.country"). You can change that and see if it improves accuracy!
encodedAdultX = encodedAdult.drop(columns=["fnlwgt", "income", "Id", "native.country"])

# Import KNN from the scikit-learn library
from sklearn.neighbors import KNeighborsClassifier

# We will use K=30
knn = KNeighborsClassifier(n_neighbors=30)

# Use cross validation to have an idea of our model's accuracy
from sklearn.model_selection import cross_val_score

scores = cross_val_score(knn, encodedAdultX, yAdult, cv=10)
avg = sum(scores)/10

With the model above, we get an accuracy of around 85% on the test data.

Conclusion

KNN is a simple but powerful algorithm for classification. We can couple it with data preparation strategies like one-hot encoding and label encoding and use it to quickly build a classifier. One disadvantage of KNN compared with other algorithms, such as neural networks, is the cost of making a prediction: to classify a new value, the KNN classifier has to compute the distance to every single point already in the dataset and sort them, which can take a very long time for large datasets.

The complete code with the examples mentioned here is available at this repository.