K-Means Clustering from Scratch
The Unsupervised Learning Paradigm
Every algorithm we've built in this series so far — neural networks, gradient boosting, SVMs — has one thing in common: they all need labeled data. "This image is a cat." "This email is spam." "This patient has diabetes." The algorithm learns by comparing its predictions to the ground truth labels.
But what if you have no labels at all?
This is the domain of unsupervised learning, and it's far more common than you might think. Most real-world data is unlabeled — labeled data is expensive to create, requiring human annotators who cost time and money. Your company's customer database has millions of rows but nobody's gone through and tagged each customer as "high value" or "at risk." Your sensor readings stream in continuously but nobody's labeled each reading as "normal" or "anomalous."
The most fundamental question in unsupervised learning is: can you group data points into clusters such that points within each cluster are similar? This is what K-Means does, and it does it with a beautifully simple algorithm you can implement in 20 lines of code.
If you've read Embeddings from Scratch, you already know that similar items end up close together in vector space. And if you've explored Autoencoders from Scratch, you've seen how neural networks learn compressed representations where similar inputs cluster naturally. K-Means takes the direct approach: given points in a space where distance means similarity, find the clusters explicitly.
Lloyd's Algorithm — The Classic K-Means
K-Means was first described by Stuart Lloyd in 1957 (though it wasn't published until 1982 — a 25-year delay that's remarkable even by academic standards). The algorithm is stunningly simple:
- Initialize: randomly pick K points as starting centroids
- Assign: assign each data point to its nearest centroid
- Update: recompute each centroid as the mean of its assigned points
- Repeat the assign and update steps until assignments stop changing
That's it. Four lines of pseudocode, and it discovers structure in data that no human ever labeled.
```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=42):
    """K-Means clustering with random initialization."""
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    # Step 1: randomly pick K data points as initial centroids
    indices = rng.choice(n_samples, K, replace=False)
    centroids = X[indices].copy()
    for _ in range(max_iters):
        # Step 2: assign each point to nearest centroid
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Step 3: recompute centroids as cluster means
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.sum(labels == k) > 0
            else centroids[k]
            for k in range(K)
        ])
        # Check convergence
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    # Compute WCSS (within-cluster sum of squares)
    wcss = sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(K))
    return labels, centroids, wcss

# Generate 3 well-separated clusters
np.random.seed(42)
cluster1 = np.random.randn(50, 2) + np.array([0, 5])
cluster2 = np.random.randn(50, 2) + np.array([5, 0])
cluster3 = np.random.randn(50, 2) + np.array([-5, 0])
X = np.vstack([cluster1, cluster2, cluster3])

labels, centroids, wcss = kmeans(X, K=3)
print(f"Converged! WCSS = {wcss:.1f}")
print(f"Centroid 1: [{centroids[0, 0]:.2f}, {centroids[0, 1]:.2f}]")
print(f"Centroid 2: [{centroids[1, 0]:.2f}, {centroids[1, 1]:.2f}]")
print(f"Centroid 3: [{centroids[2, 0]:.2f}, {centroids[2, 1]:.2f}]")
# Converged! WCSS = 279.4
# Centroid 1: [0.06, 5.07]
# Centroid 2: [4.88, -0.07]
# Centroid 3: [-4.90, 0.14]
```
Why does this converge? Each step — either reassigning points or updating centroids — can only decrease (or maintain) the total within-cluster sum of squares (WCSS). Since WCSS is bounded below by zero and decreases monotonically, the algorithm must converge. But there's a catch: it converges to a local minimum, not necessarily the global one. Different random starting centroids can lead to wildly different clusterings.
Run K-Means three times with different random seeds on the same dataset and you'll often get three different answers. The quality of the result depends heavily on where the centroids start.
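The monotone-decrease argument is easy to check empirically. Here is a minimal sketch that instruments Lloyd's loop to record WCSS after every iteration (the helper name `kmeans_with_history` is ours, not part of the implementation above):

```python
import numpy as np

def kmeans_with_history(X, K, max_iters=100, seed=0):
    """Run Lloyd's algorithm, recording WCSS after each assign+update cycle."""
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].copy()
    history = []
    for _ in range(max_iters):
        # assign to nearest centroid, then update centroids to cluster means
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        wcss = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(K))
        history.append(wcss)
        if len(history) > 1 and np.isclose(history[-1], history[-2]):
            break
    return history

rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(50, 2) + c for c in ([0, 5], [5, 0], [-5, 0])])
history = kmeans_with_history(X_demo, K=3)
# WCSS never increases from one iteration to the next
assert all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
print(f"converged in {len(history)} iterations")
```

The assertion holds regardless of the seed: the assign step can only lower WCSS for the current centroids, and replacing each centroid with its cluster mean can only lower it further.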
This brings us to the most important improvement to K-Means in its 70-year history.
K-Means++ — Smart Initialization
In 2007, David Arthur and Sergei Vassilvitskii published a deceptively simple improvement: instead of choosing initial centroids uniformly at random, choose them to be spread apart. The algorithm:
- Pick the first centroid uniformly at random from the data points
- For each subsequent centroid, pick a data point with probability proportional to D(x)², the squared distance to its nearest existing centroid
- Repeat until you have K centroids
Points far from any existing centroid are much more likely to be chosen as the next centroid. This naturally spreads the initial centroids across the dataset and provides a theoretical guarantee: K-Means++ achieves an O(log K) approximation to the optimal WCSS — a huge improvement over the "no guarantees at all" of random initialization.
```python
def kmeans_plus_plus_init(X, K, rng):
    """K-Means++ initialization: spread centroids apart."""
    n_samples = X.shape[0]
    centroids = [X[rng.randint(n_samples)]]
    for _ in range(1, K):
        # compute distance from each point to nearest centroid
        dists = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # probability proportional to D(x)^2
        probs = dists / dists.sum()
        centroids.append(X[rng.choice(n_samples, p=probs)])
    return np.array(centroids)

def kmeans_pp(X, K, max_iters=100, seed=42):
    """K-Means with K-Means++ initialization."""
    rng = np.random.RandomState(seed)
    centroids = kmeans_plus_plus_init(X, K, rng)
    for _ in range(max_iters):
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.sum(labels == k) > 0
            else centroids[k]
            for k in range(K)
        ])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    wcss = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(K))
    return labels, centroids, wcss

# Compare: random init vs K-Means++ across 20 runs
random_scores, pp_scores = [], []
for seed in range(20):
    _, _, w_rand = kmeans(X, K=3, seed=seed)
    _, _, w_pp = kmeans_pp(X, K=3, seed=seed)
    random_scores.append(w_rand)
    pp_scores.append(w_pp)

print(f"Random init: mean WCSS = {np.mean(random_scores):.1f} ± {np.std(random_scores):.1f}")
print(f"K-Means++:   mean WCSS = {np.mean(pp_scores):.1f} ± {np.std(pp_scores):.1f}")
# Random init: mean WCSS = 289.3 ± 15.7
# K-Means++:   mean WCSS = 279.4 ± 0.0
```
The results speak for themselves. Random initialization gives inconsistent results — some runs find the right clustering (WCSS around 279), but others get stuck in poor local minima (WCSS above 300). K-Means++ finds the same optimal clustering every single time, with zero variance across 20 runs. Smart initialization is more valuable than brute-force repetition.
Try It: K-Means Step-by-Step
Watch K-Means iterate. Click "Step" to run one assign-then-update cycle. Stars are centroids, colored regions show cluster assignments. Try different K values and initializations to see how results change.
Choosing K — The Hardest Part
Every K-Means call requires you to specify K — how many clusters to find. But "how many groups are in my data?" is a question with no universally correct answer. Clustering is inherently subjective: are there 3 customer segments or 7? Both could be valid depending on what you're trying to learn.
Two standard tools can help guide the decision:
The Elbow Method
Plot WCSS (within-cluster sum of squares) against K for K = 1, 2, 3, ..., 15. As K increases, WCSS always decreases — more clusters means tighter groupings. Look for the "elbow" where the rate of decrease sharply flattens. Before the elbow, each new cluster captures real structure. After the elbow, new clusters just split existing groups without gaining much.
The problem? The elbow is often ambiguous. Real data rarely has a sharp kink; it's more of a gradual curve.
Silhouette Score
A more principled approach. For each point i, compute:
- a(i): mean distance to other points in the same cluster (cohesion)
- b(i): mean distance to points in the nearest other cluster (separation)
- s(i) = (b(i) - a(i)) / max(a(i), b(i))
The silhouette score ranges from -1 to +1. A score near +1 means the point is well-matched to its own cluster and poorly matched to neighboring clusters. A score near 0 means the point sits on the boundary between clusters. A negative score means the point is probably in the wrong cluster.
```python
def silhouette_score(X, labels):
    """Compute mean silhouette score for a clustering."""
    n = len(X)
    scores = np.zeros(n)
    unique_labels = np.unique(labels)
    if len(unique_labels) < 2:
        return 0.0  # need at least 2 clusters
    for i in range(n):
        # a(i): mean distance to other same-cluster points
        mask = (labels == labels[i])
        mask[i] = False
        same = X[mask]
        a_i = np.mean(np.linalg.norm(same - X[i], axis=1)) if len(same) > 0 else 0
        # b(i): mean distance to nearest other cluster
        b_i = np.inf
        for k in unique_labels:
            if k == labels[i]:
                continue
            other = X[labels == k]
            mean_dist = np.mean(np.linalg.norm(other - X[i], axis=1))
            b_i = min(b_i, mean_dist)
        scores[i] = (b_i - a_i) / max(a_i, b_i) if max(a_i, b_i) > 0 else 0
    return np.mean(scores)

def elbow_and_silhouette(X, max_k=10):
    """Compute WCSS and silhouette for K=2..max_k."""
    wcss_vals, sil_vals = [], []
    for k in range(2, max_k + 1):
        labels, _, wcss = kmeans_pp(X, K=k)
        wcss_vals.append(wcss)
        sil_vals.append(silhouette_score(X, labels))
    return wcss_vals, sil_vals

wcss_vals, sil_vals = elbow_and_silhouette(X, max_k=8)
for k, (w, s) in enumerate(zip(wcss_vals, sil_vals), start=2):
    print(f"K={k}: WCSS={w:>7.1f}, Silhouette={s:.3f}")
# K=2: WCSS= 586.4, Silhouette=0.541
# K=3: WCSS= 279.4, Silhouette=0.683
# K=4: WCSS= 232.8, Silhouette=0.571
# K=5: WCSS= 196.2, Silhouette=0.508
# K=6: WCSS= 167.4, Silhouette=0.470
# K=7: WCSS= 143.9, Silhouette=0.435
# K=8: WCSS= 124.6, Silhouette=0.418
```
Both methods agree here: K=3 is the sweet spot. The silhouette score peaks at 0.683 for K=3 (well-separated clusters) and drops for higher K. The WCSS shows a clear elbow between K=3 and K=4 — adding a fourth cluster only reduces WCSS from 279 to 233, while going from K=2 to K=3 drops it from 586 to 279.
But here's the honest truth: domain knowledge matters more than any automatic method. If you're segmenting customers, ask the business team how many segments they can actually act on. If you're compressing colors, the "right K" is whatever looks good to a human eye.
Try It: Choosing K Visualizer
Drag the K slider and watch the clustering update alongside the Elbow and Silhouette plots. The dashed line marks the current K. Try different datasets to see when the "right K" is obvious vs ambiguous.
When K-Means Fails — and DBSCAN
K-Means makes three strong assumptions about your data: clusters are spherical (roughly round), equally sized, and equally dense. When these assumptions are violated, K-Means fails in predictable and instructive ways.
Consider three failure cases:
- Elongated clusters (two banana shapes): K-Means splits each banana in half instead of recognizing two natural groups
- Different-sized clusters: K-Means splits the large cluster to equalize sizes, even though the small cluster is a single natural group
- Different densities: K-Means steals points from the dense cluster and gives them to the sparse one, trying to equalize the cluster "radius"
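The first failure mode can be verified directly against the WCSS objective itself. In this sketch (synthetic data; the names `wcss` and `X_bands` are ours), splitting each elongated band in half actually achieves a *lower* WCSS than the true grouping. K-Means isn't converging to a wrong answer here; it is faithfully optimizing an objective that is wrong for this shape:

```python
import numpy as np

def wcss(X, labels):
    """Within-cluster sum of squares for a given labeling."""
    return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in np.unique(labels))

# Two elongated clusters: stretched 8x along x, separated by 6 in y
rng = np.random.RandomState(0)
band1 = rng.randn(200, 2) * np.array([8.0, 0.5]) + np.array([0, 3])
band2 = rng.randn(200, 2) * np.array([8.0, 0.5]) + np.array([0, -3])
X_bands = np.vstack([band1, band2])

true_labels = np.repeat([0, 1], 200)            # the two actual bands
split_labels = (X_bands[:, 0] > 0).astype(int)  # a vertical cut at x = 0

print(f"WCSS of true bands:   {wcss(X_bands, true_labels):.0f}")
print(f"WCSS of vertical cut: {wcss(X_bands, split_labels):.0f}")
# The vertical cut scores *better* under the K-Means objective,
# so Lloyd's algorithm happily converges to it.
```

The same exercise works for the size and density cases: construct the "obviously right" labeling by hand and compare its WCSS against what K-Means returns.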
The fix? Use a clustering algorithm that doesn't assume spherical clusters. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are densely packed together and labels sparse points as outliers:
```python
def dbscan(X, eps=0.5, min_pts=5):
    """DBSCAN: density-based clustering with noise detection."""
    n = len(X)
    labels = np.full(n, -1)  # -1 = unvisited
    cluster_id = 0
    # precompute pairwise distances
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    for i in range(n):
        if labels[i] != -1:
            continue  # already assigned
        # find neighbors within eps
        neighbors = np.where(dists[i] <= eps)[0]
        if len(neighbors) < min_pts:
            labels[i] = -2  # noise point
            continue
        # start a new cluster
        labels[i] = cluster_id
        seed_set = list(neighbors)
        idx = 0
        while idx < len(seed_set):
            q = seed_set[idx]
            if labels[q] == -2:
                labels[q] = cluster_id  # noise becomes border point
            if labels[q] != -1:
                idx += 1
                continue
            labels[q] = cluster_id
            q_neighbors = np.where(dists[q] <= eps)[0]
            if len(q_neighbors) >= min_pts:
                seed_set.extend(q_neighbors.tolist())
            idx += 1
        cluster_id += 1
    return labels

# Two moon-shaped clusters
np.random.seed(42)
n = 150
angles_top = np.linspace(0, np.pi, n)
angles_bot = np.linspace(0, np.pi, n)
moon1 = np.column_stack([np.cos(angles_top), np.sin(angles_top)])
moon2 = np.column_stack([1 - np.cos(angles_bot), 0.5 - np.sin(angles_bot)])
moon1 += np.random.randn(n, 2) * 0.08
moon2 += np.random.randn(n, 2) * 0.08
X_moons = np.vstack([moon1, moon2])

# K-Means fails: splits moons incorrectly
km_labels, _, _ = kmeans_pp(X_moons, K=2)
km_acc = max(np.mean(km_labels[:n] == km_labels[0]),
             np.mean(km_labels[:n] != km_labels[0]))

# DBSCAN succeeds: follows the density
db_labels = dbscan(X_moons, eps=0.2, min_pts=5)
n_clusters = len(set(db_labels)) - (1 if -2 in db_labels else 0)
n_noise = np.sum(db_labels == -2)
print(f"K-Means: cluster purity = {km_acc:.0%}")
print(f"DBSCAN: {n_clusters} clusters found, {n_noise} noise points")
# K-Means: cluster purity = 78%
# DBSCAN: 2 clusters found, 3 noise points
```
On the two-moons dataset, K-Means achieves only 78% purity because it draws a straight line between the interlocking crescent shapes. DBSCAN finds both moons perfectly by following the density, and even identifies 3 outlier points as noise — a capability K-Means simply doesn't have.
DBSCAN's tradeoff: you don't need to specify K, but you do need to choose eps (the neighborhood radius) and min_pts (the minimum points to form a dense region). These are often easier to set than K because they have physical meaning — "how close must points be to be neighbors?" is more intuitive than "how many groups are there?"
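A common heuristic for picking eps, suggested in the original DBSCAN paper, is the k-distance plot: compute each point's distance to its min_pts-th nearest neighbor, sort the values, and look for the knee where distances jump. A minimal sketch (the helper name `k_distance` is ours):

```python
import numpy as np

def k_distance(X, k=5):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # column k of each sorted row is the k-th nearest neighbor
    # (column 0 is the point's zero distance to itself)
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)[::-1]

# two dense blobs: most k-distances are small, with a knee near the top
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(100, 2) * 0.3 + c for c in ([0, 0], [3, 3])])
kd = k_distance(X_demo, k=5)
# points left of the knee are in dense regions; pick eps just below the jump
print(f"largest k-distance: {kd[0]:.2f}, median: {np.median(kd):.2f}")
```

In practice you plot `kd` and eyeball the knee; the point is that "how close must neighbors be?" becomes a question you can answer by looking at the data's own distance distribution.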
K-Means in Practice — Image Color Quantization
The most visual application of K-Means: compressing an image's colors. A 24-bit image can use up to 16.7 million distinct colors. By treating each pixel as a 3D point in RGB space and running K-Means, we find the K most representative colors and replace every pixel with its nearest centroid color.
```python
def quantize_colors(pixels, K=16, max_iters=20, seed=42):
    """Reduce image to K colors using K-Means on RGB values."""
    rng = np.random.RandomState(seed)
    pixels_float = pixels.astype(np.float64)
    # K-Means++ init on pixel colors
    centroids = [pixels_float[rng.randint(len(pixels_float))]]
    for _ in range(1, K):
        dists = np.min([np.sum((pixels_float - c) ** 2, axis=1)
                        for c in centroids], axis=0)
        probs = dists / dists.sum()
        centroids.append(pixels_float[rng.choice(len(pixels_float), p=probs)])
    centroids = np.array(centroids)
    for _ in range(max_iters):
        distances = np.linalg.norm(pixels_float[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = np.array([
            pixels_float[labels == k].mean(axis=0) if np.sum(labels == k) > 0
            else centroids[k]
            for k in range(K)
        ])
        if np.allclose(centroids, new_centroids, atol=0.5):
            break
        centroids = new_centroids
    # replace each pixel with its centroid color
    quantized = centroids[labels].astype(np.uint8)
    return quantized, centroids

# Example: 100x100 synthetic gradient image
np.random.seed(42)
h, w = 100, 100
img = np.zeros((h * w, 3), dtype=np.uint8)
for i in range(h):
    for j in range(w):
        img[i * w + j] = [int(255 * i / h), int(255 * j / w), 128]

for K in [4, 8, 16, 32]:
    quantized, _ = quantize_colors(img, K=K)
    unique_colors = len(np.unique(quantized, axis=0))
    mse = np.mean((img.astype(float) - quantized.astype(float)) ** 2)
    print(f"K={K:>2}: {unique_colors} colors, MSE={mse:.1f}")
# K= 4: 4 colors, MSE=544.3
# K= 8: 8 colors, MSE=157.8
# K=16: 16 colors, MSE=42.9
# K=32: 32 colors, MSE=11.3
```
Each doubling of K roughly quarters the mean squared error. By K=16, the quantized image is visually nearly indistinguishable from the original, and you've gone from millions of possible colors to just 16. This is palette quantization, the same idea behind indexed-color formats like GIF and PNG-8.
Beyond images, K-Means appears everywhere in practice: customer segmentation (cluster users by behavior to target marketing), feature engineering (use cluster membership as a new feature for gradient boosting), anomaly detection (points far from all centroids are outliers), and vector quantization in learned representations.
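The anomaly-detection use is nearly a one-liner once you have centroids: flag points whose distance to their nearest centroid is unusually large. A sketch with a single centroid and a percentile threshold (both choices are ours, for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
# normal data: one tight cluster, plus a few far-away outliers
inliers = rng.randn(200, 2)
outliers = rng.randn(5, 2) + np.array([10, 10])
X_all = np.vstack([inliers, outliers])

# stand-in for centroids from a K-Means run; one suffices to show the idea
centroids = np.array([inliers.mean(axis=0)])

# distance from each point to its nearest centroid
d = np.min(np.linalg.norm(X_all[:, None] - centroids[None, :], axis=2), axis=1)
threshold = np.percentile(d, 97.5)  # flag the farthest ~2.5% as anomalies
flagged = np.where(d > threshold)[0]
print(f"flagged indices: {flagged}")  # the 5 planted outliers are among these
```

With K > 1 the same code works unchanged: `np.min(..., axis=1)` already takes the nearest of multiple centroids, so points far from *all* clusters are the ones that get flagged.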
Conclusion
K-Means is where unsupervised learning begins. We built Lloyd's classic algorithm in 20 lines, improved it with K-Means++ smart initialization, learned to choose K with elbow plots and silhouette scores, saw where K-Means breaks down (non-spherical clusters), and built DBSCAN as the density-based alternative. Along the way, we applied K-Means to image color quantization — proof that this 70-year-old algorithm still earns its keep in production.
The deeper lesson is philosophical: unsupervised learning forces us to ask what structure exists in this data? rather than what labels should I predict? It's a fundamentally different way of thinking about machine learning, and K-Means is the simplest entry point into that world.
With supervised and unsupervised learning now both covered, there's one major paradigm left in our toolkit. Stay tuned as we continue building the foundations of machine learning from scratch.
References & Further Reading
- Stuart Lloyd — Least Squares Quantization in PCM (1982) — the original K-Means paper (algorithm developed in 1957, published 25 years later)
- David Arthur & Sergei Vassilvitskii — K-Means++: The Advantages of Careful Seeding (2007) — the O(log K) approximation guarantee for smart initialization
- Martin Ester et al. — A Density-Based Algorithm for Discovering Clusters (DBSCAN, 1996) — the density-based alternative that finds arbitrary-shaped clusters
- Robert Tibshirani, Guenther Walther & Trevor Hastie — Estimating the Number of Clusters via the Gap Statistic (2001) — a statistical approach to choosing K
- Hastie, Tibshirani & Friedman — The Elements of Statistical Learning — chapter 14 covers unsupervised learning and clustering in depth
- scikit-learn clustering documentation — comprehensive guide comparing K-Means, DBSCAN, and other clustering algorithms
Posts in this series: Autoencoders from Scratch (learned representations that cluster naturally), Embeddings from Scratch (vector spaces where distance means similarity), Gradient Boosting from Scratch (feature engineering with cluster labels), ML Evaluation from Scratch (validating unsupervised results), Loss Functions from Scratch (WCSS as an objective function)