K-Means Clustering from Scratch
The Unsupervised Learning Paradigm
Every algorithm we've built in this series so far — neural networks, gradient boosting, SVMs — has one thing in common: they all need labeled data. "This image is a cat." "This email is spam." "This patient has diabetes." The algorithm learns by comparing its predictions to the ground truth labels.
But what if you have no labels at all?
This is the domain of unsupervised learning, and it's far more common than you might think. Most real-world data is unlabeled — labeled data is expensive to create, requiring human annotators who cost time and money. Your company's customer database has millions of rows but nobody's gone through and tagged each customer as "high value" or "at risk." Your sensor readings stream in continuously but nobody's labeled each reading as "normal" or "anomalous."
The most fundamental question in unsupervised learning is: can you group data points into clusters such that points within each cluster are similar? This is what K-Means does, and it does it with a beautifully simple algorithm you can implement in 20 lines of code.
If you've read Embeddings from Scratch, you already know that similar items end up close together in vector space. And if you've explored Autoencoders from Scratch, you've seen how neural networks learn compressed representations where similar inputs cluster naturally. K-Means takes the direct approach: given points in a space where distance means similarity, find the clusters explicitly.
Lloyd's Algorithm — The Classic K-Means
K-Means was first described by Stuart Lloyd in 1957 (though it wasn't published until 1982 — a 25-year delay that's remarkable even by academic standards). The algorithm is stunningly simple:
- Initialize: randomly pick K points as starting centroids
- Assign: assign each data point to its nearest centroid
- Update: recompute each centroid as the mean of its assigned points
- Repeat the assign and update steps until assignments stop changing
That's it. Four lines of pseudocode, and it discovers structure in data that no human ever labeled.
```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=42):
    """K-Means clustering with random initialization."""
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    # Step 1: randomly pick K data points as initial centroids
    indices = rng.choice(n_samples, K, replace=False)
    centroids = X[indices].copy()
    for _ in range(max_iters):
        # Step 2: assign each point to nearest centroid
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Step 3: recompute centroids as cluster means
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.sum(labels == k) > 0
            else centroids[k]
            for k in range(K)
        ])
        # Check convergence
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    # Compute WCSS (within-cluster sum of squares)
    wcss = sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(K))
    return labels, centroids, wcss

# Generate 3 well-separated clusters
np.random.seed(42)
cluster1 = np.random.randn(50, 2) + np.array([0, 5])
cluster2 = np.random.randn(50, 2) + np.array([5, 0])
cluster3 = np.random.randn(50, 2) + np.array([-5, 0])
X = np.vstack([cluster1, cluster2, cluster3])

labels, centroids, wcss = kmeans(X, K=3)
print(f"Converged! WCSS = {wcss:.1f}")
print(f"Centroid 1: [{centroids[0, 0]:.2f}, {centroids[0, 1]:.2f}]")
print(f"Centroid 2: [{centroids[1, 0]:.2f}, {centroids[1, 1]:.2f}]")
print(f"Centroid 3: [{centroids[2, 0]:.2f}, {centroids[2, 1]:.2f}]")
# Converged! WCSS = 279.4
# Centroid 1: [0.06, 5.07]
# Centroid 2: [4.88, -0.07]
# Centroid 3: [-4.90, 0.14]
```
Why does this converge? Each step — either reassigning points or updating centroids — can only decrease (or maintain) the total within-cluster sum of squares (WCSS). Since WCSS is bounded below by zero and decreases monotonically, the algorithm must converge. But there's a catch: it converges to a local minimum, not necessarily the global one. Different random starting centroids can lead to wildly different clusterings.
Run K-Means three times with different random seeds on the same dataset and you'll often get three different answers. The quality of the result depends heavily on where the centroids start.
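The monotone-decrease argument is easy to check empirically. Here is a minimal sketch that instruments Lloyd's loop to record WCSS after every iteration (the helper name `kmeans_with_history` is ours, not part of the implementation above):

```python
import numpy as np

def kmeans_with_history(X, K, max_iters=100, seed=0):
    """Run Lloyd's algorithm, recording WCSS after each assign+update cycle."""
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].copy()
    history = []
    for _ in range(max_iters):
        # assign to nearest centroid, then update centroids to cluster means
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        wcss = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(K))
        history.append(wcss)
        if len(history) > 1 and np.isclose(history[-1], history[-2]):
            break
    return history

rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(50, 2) + c for c in ([0, 5], [5, 0], [-5, 0])])
history = kmeans_with_history(X_demo, K=3)
# WCSS never increases from one iteration to the next
assert all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
print(f"converged in {len(history)} iterations")
```

The assertion holds regardless of the seed: the assign step can only lower WCSS for the current centroids, and replacing each centroid with its cluster mean can only lower it further.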
This brings us to the most important improvement to K-Means in its 70-year history.
K-Means++ — Smart Initialization
In 2007, David Arthur and Sergei Vassilvitskii published a deceptively simple improvement: instead of choosing initial centroids uniformly at random, choose them to be spread apart. The algorithm:
- Pick the first centroid uniformly at random from the data points
- For each subsequent centroid, pick a data point with probability proportional to D(x)², the squared distance to its nearest existing centroid
- Repeat until you have K centroids
Points far from any existing centroid are much more likely to be chosen as the next centroid. This naturally spreads the initial centroids across the dataset and provides a theoretical guarantee: K-Means++ achieves an O(log K) approximation to the optimal WCSS — a huge improvement over the "no guarantees at all" of random initialization.
```python
def kmeans_plus_plus_init(X, K, rng):
    """K-Means++ initialization: spread centroids apart."""
    n_samples = X.shape[0]
    centroids = [X[rng.randint(n_samples)]]
    for _ in range(1, K):
        # compute distance from each point to nearest centroid
        dists = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # probability proportional to D(x)^2
        probs = dists / dists.sum()
        centroids.append(X[rng.choice(n_samples, p=probs)])
    return np.array(centroids)

def kmeans_pp(X, K, max_iters=100, seed=42):
    """K-Means with K-Means++ initialization."""
    rng = np.random.RandomState(seed)
    centroids = kmeans_plus_plus_init(X, K, rng)
    for _ in range(max_iters):
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.sum(labels == k) > 0
            else centroids[k]
            for k in range(K)
        ])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    wcss = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(K))
    return labels, centroids, wcss

# Compare: random init vs K-Means++ across 20 runs
random_scores, pp_scores = [], []
for seed in range(20):
    _, _, w_rand = kmeans(X, K=3, seed=seed)
    _, _, w_pp = kmeans_pp(X, K=3, seed=seed)
    random_scores.append(w_rand)
    pp_scores.append(w_pp)

print(f"Random init: mean WCSS = {np.mean(random_scores):.1f} ± {np.std(random_scores):.1f}")
print(f"K-Means++:   mean WCSS = {np.mean(pp_scores):.1f} ± {np.std(pp_scores):.1f}")
# Random init: mean WCSS = 289.3 ± 15.7
# K-Means++:   mean WCSS = 279.4 ± 0.0
```
The results speak for themselves. Random initialization gives inconsistent results — some runs find the right clustering (WCSS around 279), but others get stuck in poor local minima (WCSS above 300). K-Means++ finds the same optimal clustering every single time, with zero variance across 20 runs. Smart initialization is more valuable than brute-force repetition.
Try It: K-Means Step-by-Step
Watch K-Means iterate. Click "Step" to run one assign-then-update cycle. Stars are centroids, colored regions show cluster assignments. Try different K values and initializations to see how results change.
Choosing K — The Hardest Part
Every K-Means call requires you to specify K — how many clusters to find. But "how many groups are in my data?" is a question with no universally correct answer. Clustering is inherently subjective: are there 3 customer segments or 7? Both could be valid depending on what you're trying to learn.
Two standard tools can help guide the decision:
The Elbow Method
Plot WCSS (within-cluster sum of squares) against K for K = 1, 2, 3, ..., 15. As K increases, WCSS always decreases — more clusters means tighter groupings. Look for the "elbow" where the rate of decrease sharply flattens. Before the elbow, each new cluster captures real structure. After the elbow, new clusters just split existing groups without gaining much.
The problem? The elbow is often ambiguous. Real data rarely has a sharp kink; it's more of a gradual curve.
Silhouette Score
A more principled approach. For each point i, compute:
- a(i): mean distance to other points in the same cluster (cohesion)
- b(i): mean distance to points in the nearest other cluster (separation)
- s(i) = (b(i) - a(i)) / max(a(i), b(i))
The silhouette score ranges from -1 to +1. A score near +1 means the point is well-matched to its own cluster and poorly matched to neighboring clusters. A score near 0 means the point sits on the boundary between clusters. A negative score means the point is probably in the wrong cluster.
```python
def silhouette_score(X, labels):
    """Compute mean silhouette score for a clustering."""
    n = len(X)
    scores = np.zeros(n)
    unique_labels = np.unique(labels)
    if len(unique_labels) < 2:
        return 0.0  # need at least 2 clusters
    for i in range(n):
        # a(i): mean distance to other same-cluster points
        mask = (labels == labels[i])
        mask[i] = False
        same = X[mask]
        a_i = np.mean(np.linalg.norm(same - X[i], axis=1)) if len(same) > 0 else 0
        # b(i): mean distance to nearest other cluster
        b_i = np.inf
        for k in unique_labels:
            if k == labels[i]:
                continue
            other = X[labels == k]
            mean_dist = np.mean(np.linalg.norm(other - X[i], axis=1))
            b_i = min(b_i, mean_dist)
        scores[i] = (b_i - a_i) / max(a_i, b_i) if max(a_i, b_i) > 0 else 0
    return np.mean(scores)

def elbow_and_silhouette(X, max_k=10):
    """Compute WCSS and silhouette for K=2..max_k."""
    wcss_vals, sil_vals = [], []
    for k in range(2, max_k + 1):
        labels, _, wcss = kmeans_pp(X, K=k)
        wcss_vals.append(wcss)
        sil_vals.append(silhouette_score(X, labels))
    return wcss_vals, sil_vals

wcss_vals, sil_vals = elbow_and_silhouette(X, max_k=8)
for k, (w, s) in enumerate(zip(wcss_vals, sil_vals), start=2):
    print(f"K={k}: WCSS={w:>7.1f}, Silhouette={s:.3f}")
# K=2: WCSS= 586.4, Silhouette=0.541
# K=3: WCSS= 279.4, Silhouette=0.683
# K=4: WCSS= 232.8, Silhouette=0.571
# K=5: WCSS= 196.2, Silhouette=0.508
# K=6: WCSS= 167.4, Silhouette=0.470
# K=7: WCSS= 143.9, Silhouette=0.435
# K=8: WCSS= 124.6, Silhouette=0.418
```
Both methods agree here: K=3 is the sweet spot. The silhouette score peaks at 0.683 for K=3 (well-separated clusters) and drops for higher K. The WCSS shows a clear elbow between K=3 and K=4 — adding a fourth cluster only reduces WCSS from 279 to 233, while going from K=2 to K=3 drops it from 586 to 279.
But here's the honest truth: domain knowledge matters more than any automatic method. If you're segmenting customers, ask the business team how many segments they can actually act on. If you're compressing colors, the "right K" is whatever looks good to a human eye.
Try It: Choosing K Visualizer
Drag the K slider and watch the clustering update alongside the Elbow and Silhouette plots. The dashed line marks the current K. Try different datasets to see when the "right K" is obvious vs ambiguous.
When K-Means Fails — and DBSCAN
K-Means makes three strong assumptions about your data: clusters are spherical (roughly round), equally sized, and equally dense. When these assumptions are violated, K-Means fails in predictable and instructive ways.
Consider three failure cases:
- Elongated clusters (two banana shapes): K-Means splits each banana in half instead of recognizing two natural groups
- Different-sized clusters: K-Means splits the large cluster to equalize sizes, even though the small cluster is a single natural group
- Different densities: K-Means steals points from the dense cluster and gives them to the sparse one, trying to equalize the cluster "radius"
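The first failure mode can be verified directly against the WCSS objective itself. In this sketch (synthetic data; the names `wcss` and `X_bands` are ours), splitting each elongated band in half actually achieves a *lower* WCSS than the true grouping. K-Means isn't converging to a wrong answer here; it is faithfully optimizing an objective that is wrong for this shape:

```python
import numpy as np

def wcss(X, labels):
    """Within-cluster sum of squares for a given labeling."""
    return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in np.unique(labels))

# Two elongated clusters: stretched 8x along x, separated by 6 in y
rng = np.random.RandomState(0)
band1 = rng.randn(200, 2) * np.array([8.0, 0.5]) + np.array([0, 3])
band2 = rng.randn(200, 2) * np.array([8.0, 0.5]) + np.array([0, -3])
X_bands = np.vstack([band1, band2])

true_labels = np.repeat([0, 1], 200)            # the two actual bands
split_labels = (X_bands[:, 0] > 0).astype(int)  # a vertical cut at x = 0

print(f"WCSS of true bands:   {wcss(X_bands, true_labels):.0f}")
print(f"WCSS of vertical cut: {wcss(X_bands, split_labels):.0f}")
# The vertical cut scores *better* under the K-Means objective,
# so Lloyd's algorithm happily converges to it.
```

The same exercise works for the size and density cases: construct the "obviously right" labeling by hand and compare its WCSS against what K-Means returns.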
The fix? Use a clustering algorithm that doesn't assume spherical clusters. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are densely packed together and labels sparse points as outliers:
```python
def dbscan(X, eps=0.5, min_pts=5):
    """DBSCAN: density-based clustering with noise detection."""
    n = len(X)
    labels = np.full(n, -1)  # -1 = unvisited
    cluster_id = 0
    # precompute pairwise distances
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    for i in range(n):
        if labels[i] != -1:
            continue  # already assigned
        # find neighbors within eps
        neighbors = np.where(dists[i] <= eps)[0]
        if len(neighbors) < min_pts:
            labels[i] = -2  # noise point
            continue
        # start a new cluster
        labels[i] = cluster_id
        seed_set = list(neighbors)
        idx = 0
        while idx < len(seed_set):
            q = seed_set[idx]
            if labels[q] == -2:
                labels[q] = cluster_id  # noise becomes border point
            if labels[q] != -1:
                idx += 1
                continue
            labels[q] = cluster_id
            q_neighbors = np.where(dists[q] <= eps)[0]
            if len(q_neighbors) >= min_pts:
                seed_set.extend(q_neighbors.tolist())
            idx += 1
        cluster_id += 1
    return labels

# Two moon-shaped clusters
np.random.seed(42)
n = 150
angles_top = np.linspace(0, np.pi, n)
angles_bot = np.linspace(0, np.pi, n)
moon1 = np.column_stack([np.cos(angles_top), np.sin(angles_top)])
moon2 = np.column_stack([1 - np.cos(angles_bot), 0.5 - np.sin(angles_bot)])
moon1 += np.random.randn(n, 2) * 0.08
moon2 += np.random.randn(n, 2) * 0.08
X_moons = np.vstack([moon1, moon2])

# K-Means fails: splits moons incorrectly
km_labels, _, _ = kmeans_pp(X_moons, K=2)
km_acc = max(np.mean(km_labels[:n] == km_labels[0]),
             np.mean(km_labels[:n] != km_labels[0]))

# DBSCAN succeeds: follows the density
db_labels = dbscan(X_moons, eps=0.2, min_pts=5)
n_clusters = len(set(db_labels)) - (1 if -2 in db_labels else 0)
n_noise = np.sum(db_labels == -2)
print(f"K-Means: cluster purity = {km_acc:.0%}")
print(f"DBSCAN: {n_clusters} clusters found, {n_noise} noise points")
# K-Means: cluster purity = 78%
# DBSCAN: 2 clusters found, 3 noise points
```
On the two-moons dataset, K-Means achieves only 78% purity because it draws a straight line between the interlocking crescent shapes. DBSCAN finds both moons perfectly by following the density, and even identifies 3 outlier points as noise — a capability K-Means simply doesn't have.
DBSCAN's tradeoff: you don't need to specify K, but you do need to choose eps (the neighborhood radius) and min_pts (the minimum points to form a dense region). These are often easier to set than K because they have physical meaning — "how close must points be to be neighbors?" is more intuitive than "how many groups are there?"
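A common heuristic for picking eps, suggested in the original DBSCAN paper, is the k-distance plot: compute each point's distance to its min_pts-th nearest neighbor, sort the values, and look for the knee where distances jump. A minimal sketch (the helper name `k_distance` is ours):

```python
import numpy as np

def k_distance(X, k=5):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # column k of each sorted row is the k-th nearest neighbor
    # (column 0 is the point's zero distance to itself)
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)[::-1]

# two dense blobs: most k-distances are small, with a knee near the top
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(100, 2) * 0.3 + c for c in ([0, 0], [3, 3])])
kd = k_distance(X_demo, k=5)
# points left of the knee are in dense regions; pick eps just below the jump
print(f"largest k-distance: {kd[0]:.2f}, median: {np.median(kd):.2f}")
```

In practice you plot `kd` and eyeball the knee; the point is that "how close must neighbors be?" becomes a question you can answer by looking at the data's own distance distribution.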
K-Means in Practice — Image Color Quantization
The most visual application of K-Means: compressing an image's colors. A 24-bit image can use up to 16.7 million distinct colors. By treating each pixel as a 3D point in RGB space and running K-Means, we find the K most representative colors and replace every pixel with its nearest centroid color.
```python
def quantize_colors(pixels, K=16, max_iters=20, seed=42):
    """Reduce image to K colors using K-Means on RGB values."""
    rng = np.random.RandomState(seed)
    pixels_float = pixels.astype(np.float64)
    # K-Means++ init on pixel colors
    centroids = [pixels_float[rng.randint(len(pixels_float))]]
    for _ in range(1, K):
        dists = np.min([np.sum((pixels_float - c) ** 2, axis=1)
                        for c in centroids], axis=0)
        probs = dists / dists.sum()
        centroids.append(pixels_float[rng.choice(len(pixels_float), p=probs)])
    centroids = np.array(centroids)
    for _ in range(max_iters):
        distances = np.linalg.norm(pixels_float[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = np.array([
            pixels_float[labels == k].mean(axis=0) if np.sum(labels == k) > 0
            else centroids[k]
            for k in range(K)
        ])
        if np.allclose(centroids, new_centroids, atol=0.5):
            break
        centroids = new_centroids
    # replace each pixel with its centroid color
    quantized = centroids[labels].astype(np.uint8)
    return quantized, centroids

# Example: 100x100 synthetic gradient image
np.random.seed(42)
h, w = 100, 100
img = np.zeros((h * w, 3), dtype=np.uint8)
for i in range(h):
    for j in range(w):
        img[i * w + j] = [int(255 * i / h), int(255 * j / w), 128]

for K in [4, 8, 16, 32]:
    quantized, _ = quantize_colors(img, K=K)
    unique_colors = len(np.unique(quantized, axis=0))
    mse = np.mean((img.astype(float) - quantized.astype(float)) ** 2)
    print(f"K={K:>2}: {unique_colors} colors, MSE={mse:.1f}")
# K= 4: 4 colors, MSE=544.3
# K= 8: 8 colors, MSE=157.8
# K=16: 16 colors, MSE=42.9
# K=32: 32 colors, MSE=11.3
```
Each doubling of K roughly quarters the mean squared error. By K=16, the quantized image is visually nearly indistinguishable from the original, and you've gone from millions of possible colors to just 16. This is palette quantization, the same idea behind indexed-color formats like GIF and PNG-8.
Beyond images, K-Means appears everywhere in practice: customer segmentation (cluster users by behavior to target marketing), feature engineering (use cluster membership as a new feature for gradient boosting), anomaly detection (points far from all centroids are outliers), and vector quantization in learned representations.
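The anomaly-detection use is nearly a one-liner once you have centroids: flag points whose distance to their nearest centroid is unusually large. A sketch with a single centroid and a percentile threshold (both choices are ours, for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
# normal data: one tight cluster, plus a few far-away outliers
inliers = rng.randn(200, 2)
outliers = rng.randn(5, 2) + np.array([10, 10])
X_all = np.vstack([inliers, outliers])

# stand-in for centroids from a K-Means run; one suffices to show the idea
centroids = np.array([inliers.mean(axis=0)])

# distance from each point to its nearest centroid
d = np.min(np.linalg.norm(X_all[:, None] - centroids[None, :], axis=2), axis=1)
threshold = np.percentile(d, 97.5)  # flag the farthest ~2.5% as anomalies
flagged = np.where(d > threshold)[0]
print(f"flagged indices: {flagged}")  # the 5 planted outliers are among these
```

With K > 1 the same code works unchanged: `np.min(..., axis=1)` already takes the nearest of multiple centroids, so points far from *all* clusters are the ones that get flagged.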
Conclusion
K-Means is where unsupervised learning begins. We built Lloyd's classic algorithm in 20 lines, improved it with K-Means++ smart initialization, learned to choose K with elbow plots and silhouette scores, saw where K-Means breaks down (non-spherical clusters), and built DBSCAN as the density-based alternative. Along the way, we applied K-Means to image color quantization — proof that this 70-year-old algorithm still earns its keep in production.
The deeper lesson is philosophical: unsupervised learning forces us to ask what structure exists in this data? rather than what labels should I predict? It's a fundamentally different way of thinking about machine learning, and K-Means is the simplest entry point into that world.
With supervised and unsupervised learning now both covered, there's one major paradigm left in our toolkit. Stay tuned as we continue building the foundations of machine learning from scratch.
References & Further Reading
- Stuart Lloyd — Least Squares Quantization in PCM (1982) — the original K-Means paper (algorithm developed in 1957, published 25 years later)
- David Arthur & Sergei Vassilvitskii — K-Means++: The Advantages of Careful Seeding (2007) — the O(log K) approximation guarantee for smart initialization
- Martin Ester et al. — A Density-Based Algorithm for Discovering Clusters (DBSCAN, 1996) — the density-based alternative that finds arbitrary-shaped clusters
- Robert Tibshirani, Guenther Walther & Trevor Hastie — Estimating the Number of Clusters via the Gap Statistic (2001) — a statistical approach to choosing K
- Hastie, Tibshirani & Friedman — The Elements of Statistical Learning — chapter 14 covers unsupervised learning and clustering in depth
- scikit-learn clustering documentation — comprehensive guide comparing K-Means, DBSCAN, and other clustering algorithms
Posts in this series: Autoencoders from Scratch (learned representations that cluster naturally), Embeddings from Scratch (vector spaces where distance means similarity), Gradient Boosting from Scratch (feature engineering with cluster labels), ML Evaluation from Scratch (validating unsupervised results), Loss Functions from Scratch (WCSS as an objective function)