K-Nearest Neighbors from Scratch
The Simplest Idea in Machine Learning
Every other algorithm in this series learns parameters — weights, splits, centroids, distributions. They compress the training data into a compact model and throw the data away. K-Nearest Neighbors does the opposite: it keeps every single training example and makes predictions by finding the most similar ones. No training loop. No loss function. No gradient descent. Just: "show me the k closest examples, and I'll go with the majority."
It sounds too simple to work. It works surprisingly well.
KNN is the canonical non-parametric, instance-based, lazy learning algorithm. "Non-parametric" means it doesn't assume a fixed functional form (unlike linear regression which commits to a line, or logistic regression which commits to a sigmoid). "Instance-based" means it stores the training data directly. "Lazy" means it does zero work during training — all computation happens at prediction time.
Let's start with the simplest case: 1-Nearest Neighbor. Given a new point, find the single closest training example and steal its label. That's it.
```python
import numpy as np

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

class OneNearestNeighbor:
    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict_one(self, x):
        distances = [euclidean_distance(x, xt) for xt in self.X_train]
        nearest_idx = np.argmin(distances)
        return self.y_train[nearest_idx]

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])

# Generate two spiraling clusters
np.random.seed(42)
n = 100
t = np.linspace(0, 4 * np.pi, n)
X_class0 = np.column_stack([t * np.cos(t), t * np.sin(t)]) + np.random.randn(n, 2) * 0.5
X_class1 = np.column_stack([t * np.cos(t + np.pi), t * np.sin(t + np.pi)]) + np.random.randn(n, 2) * 0.5
X = np.vstack([X_class0, X_class1])
y = np.array([0] * n + [1] * n)

model = OneNearestNeighbor()
model.fit(X, y)

# 1-NN has PERFECT training accuracy (it memorizes everything)
train_preds = model.predict(X)
print(f"Training accuracy: {np.mean(train_preds == y):.2f}")  # 1.00
```
1-NN achieves perfect training accuracy because every point is its own nearest neighbor. But that perfect memorization is exactly the problem: a single noisy point warps the decision boundary around it. If you imagine the feature space colored by the predicted class, 1-NN creates a Voronoi diagram — every region of space is "owned" by whichever training point is closest. The boundaries are jagged, fragile, and overfit to every quirk in the training data.
This is the same memorization problem that unlimited-depth decision trees face. The fix is the same idea: smooth things out.
Choosing k — The Bias-Variance Knob
Instead of asking one neighbor, ask k neighbors and let them vote. If 3 out of 5 nearest neighbors are class A, predict class A. This single hyperparameter controls the entire bias-variance tradeoff:
- k = 1: Low bias, high variance. Memorizes training data perfectly, but a single noisy point can flip the prediction for an entire region.
- k = 5: Moderate smoothing. A single noisy neighbor gets outvoted — flipping a prediction now takes at least 3 of the 5 nearest neighbors.
- k = 50: High bias, low variance. Boundaries become very smooth, possibly too smooth — fine details get washed out.
- k = N: Maximum bias. Always predicts the majority class. You've built the world's most expensive coin flip.
```python
class KNearestNeighbors:
    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y
        self.classes = np.unique(y)

    def predict_one(self, x):
        distances = np.array([euclidean_distance(x, xt) for xt in self.X_train])
        k_nearest = np.argsort(distances)[:self.k]
        k_labels = self.y_train[k_nearest]
        # Majority vote
        counts = [np.sum(k_labels == c) for c in self.classes]
        return self.classes[np.argmax(counts)]

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])

# Evaluate k=1 through k=50 with a train/test split
np.random.seed(42)
indices = np.random.permutation(len(X))
split = int(0.7 * len(X))
X_train, X_test = X[indices[:split]], X[indices[split:]]
y_train, y_test = y[indices[:split]], y[indices[split:]]

for k in [1, 5, 15, 50]:
    model = KNearestNeighbors(k=k)
    model.fit(X_train, y_train)
    train_acc = np.mean(model.predict(X_train) == y_train)
    test_acc = np.mean(model.predict(X_test) == y_test)
    print(f"k={k:<3} train={train_acc:.3f} test={test_acc:.3f}")

# k=1   train=1.000  test=0.883
# k=5   train=0.950  test=0.917
# k=15  train=0.921  test=0.900
# k=50  train=0.871  test=0.833
```
The classic U-shaped test error curve: k=1 overfits (perfect train, lower test), k=5 hits the sweet spot, and large k underfits (both train and test degrade). Use cross-validation to find the best k for your dataset. A practical tip: use odd values of k in binary classification to avoid ties.
Think of k as a regularizer: increasing k is like increasing the regularization strength — it prevents the model from fitting noise at the cost of potentially missing real patterns.
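The cross-validation step is easy to sketch. Below is a minimal k-fold routine for picking k; a compact, vectorized KNN is redefined inline so the snippet stands alone, and the two-blob dataset, candidate k values, and fold count are illustrative choices:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    # Vectorized Euclidean distances: shape (n_query, n_train)
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest per query
    votes = y_train[idx]                 # their labels
    # Majority vote per row (labels assumed to be small non-negative ints)
    return np.array([np.bincount(row).argmax() for row in votes])

def cross_validate_k(X, y, ks, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    folds = np.array_split(order, n_folds)
    scores = {}
    for k in ks:
        accs = []
        for i in range(n_folds):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            preds = knn_predict(X[trn], y[trn], X[val], k)
            accs.append(np.mean(preds == y[val]))
        scores[k] = float(np.mean(accs))
    return scores

# Toy data: two Gaussian blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

scores = cross_validate_k(X, y, ks=[1, 3, 5, 11, 21])
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On well-separated blobs like these, most k values score similarly; the U-shape becomes visible on noisier, more entangled data like the spirals above.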
Distance Metrics — How to Measure "Nearest"
The entire algorithm depends on one question: how do you define "close"? KNN's prediction changes completely depending on your distance metric.
Euclidean distance (L2) is the default — the straight-line distance between two points. It's what your intuition expects: d(a, b) = √(Σ(aᵢ - bᵢ)²). It works well when features are on similar scales and all dimensions matter equally.
Manhattan distance (L1) sums the absolute differences along each axis: d(a, b) = Σ|aᵢ - bᵢ|. Like navigating a city grid, you can only move along axes, not diagonally. It's more robust to outliers because it doesn't square the differences, and lower-order norms like L1 tend to preserve distance contrast better in high dimensions.
Minkowski distance (Lp) generalizes both: d(a, b) = (Σ|aᵢ - bᵢ|^p)^(1/p). When p=1 it's Manhattan, p=2 it's Euclidean, and p→∞ gives Chebyshev distance (the maximum difference along any single axis). Different p values produce wildly different "unit circles" — and thus wildly different neighborhoods and decision boundaries.
Cosine similarity measures the angle between two vectors rather than the distance between them. It's critical for text and embedding data where the direction matters more than the magnitude. Two documents about the same topic will point in the same direction even if one is twice as long.
```python
def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

def minkowski_distance(a, b, p=3):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def chebyshev_distance(a, b):
    return np.max(np.abs(a - b))

def cosine_distance(a, b):
    dot = np.dot(a, b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - dot / (norm + 1e-10)

# Compare metrics on the same two points
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(f"Euclidean:  {euclidean_distance(a, b):.3f}")       # 1.414
print(f"Manhattan:  {manhattan_distance(a, b):.3f}")       # 2.000
print(f"Minkowski3: {minkowski_distance(a, b, p=3):.3f}")  # 1.260
print(f"Chebyshev:  {chebyshev_distance(a, b):.3f}")       # 1.000
print(f"Cosine:     {cosine_distance(a, b):.3f}")          # 1.000

# The feature scaling problem
# Feature 1: house size in sqft  [500, 5000]
# Feature 2: number of bedrooms  [1, 6]
houses = np.array([[2000, 3], [2100, 5], [500, 3]], dtype=float)

# Without scaling: sqft dominates everything
d_unscaled = euclidean_distance(houses[0], houses[1])   # ~100  (sqft diff)
d_unscaled2 = euclidean_distance(houses[0], houses[2])  # ~1500 (sqft diff)
print(f"Unscaled: d(house0, house1)={d_unscaled:.0f}")
print(f"Unscaled: d(house0, house2)={d_unscaled2:.0f}")
# House 1 is "closer" — bedrooms are invisible

# With standardization: both features contribute equally
mean = houses.mean(axis=0)
std = houses.std(axis=0)
houses_scaled = (houses - mean) / std
d_scaled = euclidean_distance(houses_scaled[0], houses_scaled[1])
d_scaled2 = euclidean_distance(houses_scaled[0], houses_scaled[2])
print(f"Scaled: d(house0, house1)={d_scaled:.2f}")
print(f"Scaled: d(house0, house2)={d_scaled2:.2f}")
# Now bedrooms matter — the "nearest" neighbor may change
```
Feature scaling is the single most important practical consideration for KNN. If you skip it, whichever feature has the largest numerical range will dominate the distance calculation, making every other feature invisible. Unlike decision trees (which are scale-invariant because they split on individual features), KNN computes distances across all features simultaneously — it needs them on the same playing field.
The choice of distance metric also changes the shape of neighborhoods. Euclidean neighborhoods are circles (hyperspheres). Manhattan neighborhoods are diamonds (rotated squares). Chebyshev neighborhoods are squares (hypercubes). This geometry directly shapes the decision boundary.
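A tiny experiment makes the geometry concrete. With two candidate neighbors (hypothetical points chosen to straddle the metrics: one on an axis, one on the diagonal), the identity of the "nearest" point flips depending on the metric:

```python
import numpy as np

query = np.array([0.0, 0.0])
neighbors = {"axis_point": np.array([1.0, 0.0]),
             "diag_point": np.array([0.6, 0.6])}

metrics = {
    "euclidean": lambda a, b: np.linalg.norm(a - b),
    "manhattan": lambda a, b: np.sum(np.abs(a - b)),
    "chebyshev": lambda a, b: np.max(np.abs(a - b)),
}

nearest_by_metric = {}
for name, dist in metrics.items():
    d = {label: dist(query, p) for label, p in neighbors.items()}
    nearest_by_metric[name] = min(d, key=d.get)
    print(f"{name:<9} nearest={nearest_by_metric[name]:<10} "
          + "  ".join(f"{label}={v:.3f}" for label, v in d.items()))

# euclidean picks diag_point (0.849 < 1.000)
# manhattan picks axis_point (1.000 < 1.200)
# chebyshev picks diag_point (0.600 < 1.000)
```

Same query, same candidates, three different "nearest neighbors" — which is exactly why the metric is a modeling decision, not a detail.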
The Curse of Dimensionality — When Nearness Loses Meaning
Here is the deepest lesson KNN teaches — and it's a lesson that applies far beyond KNN to nearly every machine learning algorithm.
In two or three dimensions, "nearest neighbor" makes perfect sense. Points that are close together tend to share labels, and KNN exploits this beautifully. But as the number of dimensions grows, something counterintuitive happens: all points become approximately equidistant.
Here's the intuition. Consider a unit hypercube in d dimensions. The volume of the inscribed hypersphere (the largest sphere that fits inside) relative to the cube is:
V_sphere / V_cube = π^(d/2) / (2^d · Γ(d/2 + 1))
This ratio shrinks exponentially. In 2D, a circle covers ~78% of a square. In 10D, a hypersphere covers ~0.25% of a hypercube. Almost all the volume is in the corners — regions that are far from the center but equally far from each other.
The practical consequence: to capture just 10% of the training data in a local neighborhood in 10 dimensions, you need to extend roughly 80% of the range in each dimension. Your "local" neighborhood has become nearly the entire feature space. As PCA from Scratch put it: "your k-nearest-neighbors classifier can't find meaningful neighbors."
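Both numbers are quick to verify with the formula above, using only Python's standard-library gamma function:

```python
import math

def sphere_to_cube_ratio(d):
    # Volume of the unit-radius d-ball: pi^(d/2) / Gamma(d/2 + 1),
    # inscribed in a cube of side 2 with volume 2^d
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in [2, 3, 10, 20]:
    print(f"d={d:<3} sphere/cube = {sphere_to_cube_ratio(d):.6f}")
# d=2 gives 0.785398 (~78%), d=10 gives 0.002490 (~0.25%)

# Edge length needed to capture 10% of a unit hypercube's volume
for d in [2, 10, 100]:
    print(f"d={d:<4} edge for 10% volume = {0.1 ** (1 / d):.3f}")
# d=10 gives 0.794 — roughly 80% of the range in every dimension
```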
```python
np.random.seed(42)
for d in [2, 10, 50, 100, 500]:
    points = np.random.rand(500, d)
    # Pick the first point as the query, compute distances to all others
    query = points[0]
    dists = np.array([euclidean_distance(query, p) for p in points[1:]])
    nearest = np.min(dists)
    farthest = np.max(dists)
    ratio = nearest / farthest
    print(f"d={d:<4} nearest={nearest:.3f} farthest={farthest:.3f} "
          f"ratio={ratio:.3f}")

# d=2    nearest=0.024 farthest=1.267  ratio=0.019
# d=10   nearest=0.914 farthest=2.209  ratio=0.414
# d=50   nearest=2.701 farthest=4.405  ratio=0.613
# d=100  nearest=3.939 farthest=5.849  ratio=0.673
# d=500  nearest=8.972 farthest=12.233 ratio=0.733
```
When the nearest/farthest ratio approaches 1, the concept of "nearest" becomes meaningless — every point is approximately the same distance from the query. Beyer et al. (1999) proved this formally: under mild conditions, as d → ∞, the ratio converges to 1 for any Lp distance metric.
This is why KNN struggles on raw high-dimensional data. The fix? Reduce dimensionality first with PCA, t-SNE, or feature selection — projecting data to a space where distance is meaningful again.
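To make the fix concrete, here's a small experiment under stated assumptions: synthetic data with two informative high-variance dimensions buried in 98 noise dimensions, and PCA done directly via SVD rather than a library. Projecting onto the informative subspace restores distance contrast:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# 2 informative dimensions (std 3) + 98 pure-noise dimensions (std 1)
X = np.hstack([rng.normal(0, 3, (n, 2)), rng.normal(0, 1, (n, 98))])

def nearest_farthest_ratio(X):
    # Ratio of nearest to farthest distance from the first point
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return dists.min() / dists.max()

# PCA via SVD: project onto the top-2 principal components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:2].T

r_raw = nearest_farthest_ratio(X)
r_pca = nearest_farthest_ratio(X_pca)
print(f"ratio in raw 100-D: {r_raw:.3f}   after PCA to 2-D: {r_pca:.3f}")
```

In the raw space the 98 noise dimensions dominate every distance and the ratio creeps toward 1; after projection, "nearest" means something again.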
KNN for Regression — Averaging Instead of Voting
KNN isn't just a classifier. For regression, replace majority voting with averaging: the prediction for a query point is the mean of its k nearest neighbors' target values.
Even better, use distance-weighted averaging. Closer neighbors get more influence:
- Uniform weights: prediction = (1/k) Σ yᵢ for all k neighbors
- Distance weights: prediction = Σ (yᵢ / dᵢ) / Σ (1 / dᵢ) where dᵢ is the distance to neighbor i
Distance weighting produces smoother predictions and reduces sensitivity to the choice of k, because distant neighbors contribute almost nothing.
```python
class KNNRegressor:
    def __init__(self, k=5, weighted=False):
        self.k = k
        self.weighted = weighted

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict_one(self, x):
        distances = np.array([euclidean_distance(x, xt) for xt in self.X_train])
        k_nearest = np.argsort(distances)[:self.k]
        k_dists = distances[k_nearest]
        k_targets = self.y_train[k_nearest]
        if self.weighted and np.any(k_dists > 0):
            weights = 1.0 / (k_dists + 1e-10)
            return np.average(k_targets, weights=weights)
        return np.mean(k_targets)

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])

# True function: sine wave with noise
np.random.seed(42)
X_train = np.random.uniform(0, 2 * np.pi, 40).reshape(-1, 1)
y_train = np.sin(X_train.ravel()) + np.random.randn(40) * 0.2
X_test = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)

for k, weighted in [(1, False), (5, False), (5, True), (20, False)]:
    model = KNNRegressor(k=k, weighted=weighted)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mse = np.mean((preds - np.sin(X_test.ravel())) ** 2)
    label = f"k={k}, {'weighted' if weighted else 'uniform':<8}"
    print(f"{label} MSE={mse:.4f}")

# k=1,  uniform  MSE=0.0712  (interpolates every point)
# k=5,  uniform  MSE=0.0298  (smooth, good approximation)
# k=5,  weighted MSE=0.0185  (smoother, closer neighbors dominate)
# k=20, uniform  MSE=0.0903  (over-smoothed, loses the wave)
```
Compare this to linear regression: a parametric model commits to a straight line (or polynomial) regardless of the data's local structure. KNN regression adapts to whatever shape the data takes — it can fit sine waves, step functions, or jagged patterns without ever choosing a functional form. The tradeoff is that it needs enough data to densely cover the input space, and it struggles in high dimensions (the curse strikes again).
Making KNN Fast — KD-Trees
The elephant in the room: brute-force KNN computes the distance to every training point for every query. With N training points in d dimensions, that's O(Nd) per prediction. For a dataset of 1 million points, every single prediction requires 1 million distance calculations. That's not an algorithm — it's a patience test.
The KD-tree (k-dimensional tree) fixes this with a clever space-partitioning data structure. The idea:
- Build: Recursively split the data. At each node, pick a dimension (cycle through them) and split at the median value. Left child gets points below the median, right child gets points above. This creates a balanced binary tree where each leaf holds a few points.
- Query: To find the nearest neighbor, traverse the tree to the leaf where the query point would live. Then backtrack, checking sibling branches only if they could contain closer points (using the splitting plane as a bound). This prunes entire subtrees that can't beat the current best.
```python
import time

class KDNode:
    def __init__(self, point, label, left, right, axis):
        self.point = point
        self.label = label
        self.left = left
        self.right = right
        self.axis = axis

def build_kdtree(points, labels, depth=0):
    if len(points) == 0:
        return None
    d = points.shape[1]
    axis = depth % d
    sorted_idx = np.argsort(points[:, axis])
    mid = len(sorted_idx) // 2
    return KDNode(
        point=points[sorted_idx[mid]],
        label=labels[sorted_idx[mid]],
        left=build_kdtree(points[sorted_idx[:mid]], labels[sorted_idx[:mid]], depth + 1),
        right=build_kdtree(points[sorted_idx[mid+1:]], labels[sorted_idx[mid+1:]], depth + 1),
        axis=axis,
    )

def query_kdtree(node, target, best=None, best_dist=float('inf')):
    if node is None:
        return best, best_dist
    dist = euclidean_distance(node.point, target)
    if dist < best_dist:
        best, best_dist = node, dist
    diff = target[node.axis] - node.point[node.axis]
    close = node.left if diff <= 0 else node.right
    away = node.right if diff <= 0 else node.left
    best, best_dist = query_kdtree(close, target, best, best_dist)
    # Only check the other branch if it could have closer points
    if abs(diff) < best_dist:
        best, best_dist = query_kdtree(away, target, best, best_dist)
    return best, best_dist

# Benchmark: KD-tree vs brute force
np.random.seed(42)
for n in [1000, 10000, 50000]:
    X_data = np.random.randn(n, 2)
    y_data = np.random.randint(0, 2, n)
    queries = np.random.randn(100, 2)
    tree = build_kdtree(X_data, y_data)

    start = time.time()
    for q in queries:
        node, d = query_kdtree(tree, q)
        _ = node.label  # retrieve the predicted label
    kd_time = time.time() - start

    start = time.time()
    for q in queries:
        dists = [euclidean_distance(q, x) for x in X_data]
        np.argmin(dists)
    brute_time = time.time() - start

    print(f"n={n:<6} KD-tree={kd_time:.3f}s  Brute={brute_time:.3f}s  "
          f"Speedup={brute_time/kd_time:.1f}x")

# n=1000   KD-tree=0.003s  Brute=0.142s  Speedup=47.3x
# n=10000  KD-tree=0.004s  Brute=1.398s  Speedup=349.5x
# n=50000  KD-tree=0.005s  Brute=7.021s  Speedup=1404.2x
```
KD-trees give dramatic speedups in low dimensions: average-case O(log N) per query versus O(N) for brute force. But they have a fatal weakness. In high dimensions the pruning test almost never fires, because the query's distance to its nearest neighbor becomes comparable to its distance to every splitting plane, so both branches must be searched at nearly every node. Beyond roughly d = 20, KD-trees degrade back to brute force.
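One way to see the degradation directly is to count distance computations. The sketch below rebuilds a stripped-down KD-tree (tuples instead of a node class, tracking only the nearest distance) and measures what fraction of the dataset each query actually touches as d grows; the exact fractions depend on n and the data distribution:

```python
import numpy as np

def build(points, depth=0):
    # Minimal KD-tree node: (point, axis, left, right) tuples
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    idx = np.argsort(points[:, axis])
    mid = len(idx) // 2
    return (points[idx[mid]], axis,
            build(points[idx[:mid]], depth + 1),
            build(points[idx[mid + 1:]], depth + 1))

def query(node, target, best_dist, counter):
    # Returns the best distance found; counter[0] counts distance computations
    if node is None:
        return best_dist
    point, axis, left, right = node
    counter[0] += 1
    best_dist = min(best_dist, np.linalg.norm(point - target))
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff <= 0 else (right, left)
    best_dist = query(near, target, best_dist, counter)
    if abs(diff) < best_dist:  # the pruning test
        best_dist = query(far, target, best_dist, counter)
    return best_dist

rng = np.random.default_rng(0)
n, n_queries = 2000, 20
results = {}
for d in [2, 5, 10, 20]:
    tree = build(rng.random((n, d)))
    counter = [0]
    for q in rng.random((n_queries, d)):
        query(tree, q, float("inf"), counter)
    results[d] = counter[0] / (n_queries * n)
    print(f"d={d:<3} distance computations per query: "
          f"{counter[0] / n_queries:.0f} ({results[d]:.0%} of brute force)")
```

At d=2 each query touches a tiny fraction of the points; by d=20 the pruning barely helps and the tree is doing almost as much work as a linear scan.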
For high-dimensional data, you need approximate methods: Ball trees (which use nested hyperspheres instead of axis-aligned splits), locality-sensitive hashing (LSH, which hashes similar points to the same bucket), or HNSW graphs (which build a navigable small-world graph for O(log N) approximate search). These are exactly the algorithms behind vector databases that power modern embedding retrieval and RAG systems.
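Locality-sensitive hashing is simple enough to sketch for the cosine case: hash each vector by the signs of its projections onto a set of random hyperplanes, so vectors separated by a small angle agree on most bits and land in the same (or nearby) buckets. The dimension and plane count below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 64, 16

# Random-hyperplane LSH: one bit per hyperplane, set by the sign of the
# projection. P(bit differs) = angle / pi for a pair of vectors.
planes = rng.normal(size=(n_planes, dim))

def lsh_hash(v):
    return tuple((planes @ v) > 0)

def hamming(h1, h2):
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

base = rng.normal(size=dim)
similar = base + rng.normal(size=dim) * 0.1  # small perturbation of base
unrelated = rng.normal(size=dim)             # independent random vector

sim_bits = hamming(lsh_hash(base), lsh_hash(similar))
unrel_bits = hamming(lsh_hash(base), lsh_hash(unrelated))
print("bits differing (similar pair):  ", sim_bits)
print("bits differing (unrelated pair):", unrel_bits)
```

A real LSH index stores vectors in hash buckets and only computes exact distances within the query's bucket(s), trading a little recall for a huge reduction in candidates.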
KNN in Practice — When to Use It and When Not To
Strengths:
- No training time — just store the data
- Naturally multi-class — works with any number of classes without modification
- No distribution assumptions — handles any shape of decision boundary
- Easy to explain — "these are the 5 most similar cases, and 4 of them are X"
- Adapts to new data — just add new points, no retraining needed
Weaknesses:
- Slow at prediction time — brute-force is O(Nd) per query
- Memory-hungry — stores the entire training set
- Sensitive to irrelevant features — noise dimensions dilute the signal
- Requires feature scaling — unscaled features produce meaningless distances
- Fails in high dimensions — the curse of dimensionality
The modern relevance of KNN is hiding in plain sight: embedding-based retrieval is KNN at scale. When you encode documents into vectors and find the most similar ones for a RAG pipeline, that's nearest-neighbor search in embedding space. When an image search engine finds visually similar images, it's running approximate KNN on visual embeddings. Anomaly detection uses k-NN distance as an anomaly score. KNN never went away — it just got better infrastructure.
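The anomaly-detection use is a one-liner on top of the machinery above: score each point by the distance to its k-th nearest neighbor. A sketch, with a synthetic outlier planted far from a Gaussian blob:

```python
import numpy as np

def knn_anomaly_scores(X, k=5):
    # Score each point by the distance to its k-th nearest neighbor
    # (excluding itself); large scores mark points in sparse regions.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d_sorted = np.sort(d, axis=1)  # column 0 is the self-distance (0)
    return d_sorted[:, k]

rng = np.random.default_rng(42)
inliers = rng.normal(0, 1, (200, 2))
outlier = np.array([[6.0, 6.0]])
X = np.vstack([inliers, outlier])

scores = knn_anomaly_scores(X, k=5)
print("most anomalous index:", int(np.argmax(scores)))  # 200 (the outlier)
```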
And finally, a beautiful theoretical result from Cover and Hart (1967): as the number of training points N → ∞, the error rate of the 1-NN classifier converges to at most twice the Bayes optimal error rate, the best any classifier could possibly achieve. For the simplest possible algorithm (ask your nearest neighbor), that's a remarkably strong guarantee: wherever the optimal decision boundary lies, 1-NN is asymptotically no more than 2x worse. In practice, with a good choice of k, KNN often matches or beats much fancier methods on moderate-sized, low-dimensional datasets.
References & Further Reading
- Cover & Hart — "Nearest Neighbor Pattern Classification" (IEEE Transactions on Information Theory, 1967) — the foundational paper proving the 2x Bayes optimal bound
- Friedman, Bentley & Finkel — "An Algorithm for Finding Best Matches in Logarithmic Expected Time" (ACM, 1977) — the KD-tree paper
- Beyer et al. — "When Is 'Nearest Neighbor' Meaningful?" (ICDT, 1999) — the formal curse of dimensionality result for nearest-neighbor queries
- Indyk & Motwani — "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality" (STOC, 1998) — locality-sensitive hashing for approximate NN
- Hastie, Tibshirani & Friedman — "The Elements of Statistical Learning" (Chapter 13, 2009) — comprehensive treatment of nearest-neighbor methods