
Hierarchical Clustering from Scratch

Why Flat Clustering Isn't Enough

K-Means asks you to pick k before you've even looked at the data. DBSCAN discovers clusters automatically but can't tell you that two clusters are more similar to each other than to a third. What if you could see the entire hierarchy of similarity at once — every possible grouping from "everything is one cluster" to "every point is its own cluster" — in a single picture?

That's exactly what hierarchical clustering gives you. Instead of outputting a single flat partition, it produces a dendrogram — a binary tree of successive merges that encodes every level of granularity simultaneously. Cut the tree low and you get many fine-grained clusters. Cut it high and you get a few broad groups. The tree itself tells you which clusters are most similar, which merged late (meaning they're genuinely different), and where the natural boundaries lie.

This matters everywhere hierarchy exists naturally: biological taxonomy (species → genus → family → order), document topic organization, customer segmentation at multiple levels of detail, and gene expression analysis where groups of genes activate together at different scales.

There are two flavors: agglomerative (bottom-up, starting from individual points and merging) and divisive (top-down, starting from one cluster and splitting). Agglomerative dominates in practice because it's simpler and better understood, so that's what we'll build from scratch. By the end, you'll understand the algorithm, the four major ways to define "distance between clusters," and why the dendrogram is one of the most powerful visualizations in all of unsupervised learning.

Agglomerative Clustering — The Bottom-Up Algorithm

The algorithm is beautifully simple. Start with n singleton clusters (each data point is its own cluster). Compute all pairwise distances. Then repeat: find the two closest clusters, merge them, and update the distances. Keep going until everything is one big cluster. Record every merge along the way.

The result is a linkage matrix — a log of every merge event, recording which two clusters merged, at what distance, and the size of the resulting cluster. This matrix is all you need to draw the dendrogram and extract flat clusters at any granularity.

Here's the complete algorithm from scratch:

import numpy as np

def agglomerative_cluster(X):
    """Bottom-up hierarchical clustering. Returns scipy-style linkage matrix."""
    n = len(X)
    # Pairwise Euclidean distances
    dist = np.full((2 * n, 2 * n), np.inf)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            dist[i, j] = dist[j, i] = d

    active = set(range(n))       # clusters still alive
    sizes = {i: 1 for i in range(n)}
    linkage = []

    for step in range(n - 1):
        # Find closest pair among active clusters
        min_d, ci, cj = np.inf, -1, -1
        for i in active:
            for j in active:
                if i < j and dist[i, j] < min_d:
                    min_d, ci, cj = dist[i, j], i, j

        # Create new cluster with id = n + step
        new_id = n + step
        new_size = sizes[ci] + sizes[cj]
        linkage.append([ci, cj, min_d, new_size])
        sizes[new_id] = new_size

        # Update distances using single linkage (min)
        for k in active:
            if k != ci and k != cj:
                dist[new_id, k] = min(dist[ci, k], dist[cj, k])
                dist[k, new_id] = dist[new_id, k]

        active.discard(ci)
        active.discard(cj)
        active.add(new_id)

    return np.array(linkage)

# Test on simple data
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8],
              [1, 0.6], [9, 11], [8, 2], [10, 2]])
Z = agglomerative_cluster(X)
for row in Z:
    print(f"Merge {int(row[0]):2d} + {int(row[1]):2d}  "
          f"dist={row[2]:.2f}  size={int(row[3])}")

This O(n³) implementation is the naive version — we scan all pairs to find the minimum at each step. The key detail is the distance update line: after merging clusters ci and cj into new_id, we compute new_id's distance to every remaining cluster k. The formula used here is single linkage (min), but as we'll see next, this is where the real magic happens.

Linkage Criteria — The Heart of the Algorithm

The algorithm above uses min(dist[ci, k], dist[cj, k]) to update distances after a merge. This is single linkage — the distance between two clusters is the distance between their closest members. But this is just one of four major choices, and each produces dramatically different results.
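Before the recurrence, it helps to see the first three criteria computed directly from cluster members. Here's a small illustrative helper (not part of the main algorithm — and Ward's criterion is a variance change rather than a pairwise distance, so it's omitted):

```python
import numpy as np

def cluster_distance(A, B, method="single"):
    """Distance between clusters A and B, computed directly from members."""
    # All pairwise member-to-member Euclidean distances
    D = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))
    if method == "single":
        return D.min()    # closest pair
    if method == "complete":
        return D.max()    # farthest pair
    if method == "average":
        return D.mean()   # mean over all member pairs
    raise ValueError(f"unknown method: {method}")

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(cluster_distance(A, B, "single"))    # 3.0
print(cluster_distance(A, B, "complete"))  # 6.0
print(cluster_distance(A, B, "average"))   # 4.5
```

Computing these directly takes O(|A|·|B|) work per pair of clusters, which is exactly why the recurrence below matters: it updates cluster distances in O(1) per merge.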

The brilliant insight from Lance and Williams (1967) is that all four can be expressed as a single recurrence. When we merge clusters A and B into A∪B, the distance to any other cluster C is:

d(A∪B, C) = α_A·d(A,C) + α_B·d(B,C) + β·d(A,B) + γ·|d(A,C) − d(B,C)|

Different values of (α_A, α_B, β, γ) give different linkage criteria. This means we can write one generalized algorithm and swap linkages by changing four numbers:

def lance_williams_params(linkage, n_a, n_b, n_c):
    """Return (alpha_a, alpha_b, beta, gamma) for each linkage criterion."""
    if linkage == "single":
        return 0.5, 0.5, 0, -0.5
    elif linkage == "complete":
        return 0.5, 0.5, 0, 0.5
    elif linkage == "average":
        return n_a / (n_a + n_b), n_b / (n_a + n_b), 0, 0
    elif linkage == "ward":
        n_t = n_a + n_b + n_c
        return (n_a + n_c) / n_t, (n_b + n_c) / n_t, -n_c / n_t, 0
    raise ValueError(f"unknown linkage: {linkage}")

def hierarchical_cluster(X, linkage="ward"):
    """Agglomerative clustering with configurable linkage."""
    n = len(X)
    dist = np.full((2 * n, 2 * n), np.inf)
    for i in range(n):
        for j in range(i + 1, n):
            if linkage == "ward":
                d = np.sum((X[i] - X[j]) ** 2)  # squared Euclidean
            else:
                d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            dist[i, j] = dist[j, i] = d

    active = set(range(n))
    sizes = {i: 1 for i in range(n)}
    result = []

    for step in range(n - 1):
        min_d, ci, cj = np.inf, -1, -1
        for i in active:
            for j in active:
                if i < j and dist[i, j] < min_d:
                    min_d, ci, cj = dist[i, j], i, j

        new_id = n + step
        new_size = sizes[ci] + sizes[cj]
        merge_dist = np.sqrt(min_d) if linkage == "ward" else min_d
        result.append([ci, cj, merge_dist, new_size])
        sizes[new_id] = new_size

        for k in active:
            if k != ci and k != cj:
                a_a, a_b, b, g = lance_williams_params(
                    linkage, sizes[ci], sizes[cj], sizes[k])
                dist[new_id, k] = (a_a * dist[ci, k] + a_b * dist[cj, k]
                                   + b * dist[ci, cj]
                                   + g * abs(dist[ci, k] - dist[cj, k]))
                dist[k, new_id] = dist[new_id, k]

        active.discard(ci)
        active.discard(cj)
        active.add(new_id)

    return np.array(result)

# Compare linkages on the same data
X = np.array([[0, 0], [0.5, 0], [3, 0], [3.5, 0],
              [6, 0], [6.3, 0], [6.6, 0]])
for method in ["single", "complete", "average", "ward"]:
    Z = hierarchical_cluster(X, method)
    print(f"\n{method.upper()} linkage:")
    for row in Z:
        print(f"  {int(row[0]):2d}+{int(row[1]):2d} dist={row[2]:.3f}")

Notice how Ward's linkage uses squared Euclidean distance internally (the variance decomposition requires it), then we take the square root for the reported merge distance. The Lance-Williams parameters for Ward's depend on cluster sizes — larger clusters resist merging, producing balanced partitions.
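A quick way to convince yourself the single and complete rows are right: with γ = ∓0.5, the recurrence reduces to the algebraic identities min(x, y) = (x + y)/2 − |x − y|/2 and max(x, y) = (x + y)/2 + |x − y|/2. A numeric spot-check:

```python
import numpy as np

# Lance-Williams with (0.5, 0.5, 0, -0.5) must reproduce min();
# with (0.5, 0.5, 0, +0.5) it must reproduce max().
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.uniform(0, 10, size=2)
    assert np.isclose(0.5 * x + 0.5 * y - 0.5 * abs(x - y), min(x, y))
    assert np.isclose(0.5 * x + 0.5 * y + 0.5 * abs(x - y), max(x, y))
print("identities hold")
```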

Dendrograms — Reading the Tree of Merges

The linkage matrix encodes the full merge history, but a table of numbers isn't very illuminating. The dendrogram turns that table into a picture you can read at a glance.

To read cluster membership at any granularity, draw a horizontal line at height h. The number of vertical lines it crosses equals the number of clusters at that threshold. Long vertical gaps between consecutive merges signal natural cluster boundaries — the tree is saying "these groups are genuinely separate."
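That counting rule translates directly to code — every merge performed at or below the cut height removes exactly one cluster. A tiny helper, assuming the linkage-matrix format used throughout this post:

```python
import numpy as np

def n_clusters_at_height(Z, n, h):
    """Number of flat clusters when the dendrogram is cut at height h."""
    # Each merge at height <= h reduces the cluster count by one
    return n - int(np.sum(Z[:, 2] <= h))

# Toy linkage matrix for n = 4 points: merges at heights 0.5, 1.0, 3.0
Z = np.array([[0, 1, 0.5, 2],
              [2, 3, 1.0, 2],
              [4, 5, 3.0, 4]])
print(n_clusters_at_height(Z, 4, 2.0))  # 2 clusters between heights 1.0 and 3.0
```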

One subtlety: leaf ordering is not unique. Any internal node's children can be swapped (flipped) without changing the tree structure. Good implementations order leaves to minimize the distance between adjacent points, making the dendrogram easier to read.

We can also measure how faithfully the dendrogram preserves the original pairwise distances using the cophenetic correlation. The cophenetic distance between two points is the height at which they first join the same cluster. If the correlation between original distances and cophenetic distances is high (> 0.8), the dendrogram is a faithful representation.

import matplotlib.pyplot as plt

def get_leaf_order(Z, n):
    """Compute leaf ordering from linkage matrix (no crossing branches)."""
    order = []
    def traverse(node_id):
        if node_id < n:
            order.append(node_id)
        else:
            row = Z[int(node_id) - n]
            traverse(int(row[0]))
            traverse(int(row[1]))
    traverse(n + len(Z) - 1)  # start from root
    return order

def plot_dendrogram(Z, n, ax, labels=None):
    """Draw dendrogram from linkage matrix."""
    order = get_leaf_order(Z, n)
    pos = {node: i for i, node in enumerate(order)}

    for i, (c1, c2, d, _) in enumerate(Z):
        c1, c2 = int(c1), int(c2)
        x1, x2 = pos[c1], pos[c2]
        # Draw the U-shaped merge: two verticals + one horizontal
        ax.plot([x1, x1], [0 if c1 < n else Z[c1 - n][2], d], 'b-', lw=1.5)
        ax.plot([x2, x2], [0 if c2 < n else Z[c2 - n][2], d], 'b-', lw=1.5)
        ax.plot([x1, x2], [d, d], 'b-', lw=1.5)
        pos[n + i] = (x1 + x2) / 2

    if labels:
        ax.set_xticks(range(n))
        ax.set_xticklabels([labels[i] for i in order])
    ax.set_ylabel("Merge distance")

def cophenetic_corr(X, Z):
    """Correlation between original and cophenetic distances."""
    n = len(X)
    orig, coph = [], []
    # Build cluster membership timeline
    members = {i: {i} for i in range(n)}
    merge_dist = {}
    for i, (c1, c2, d, _) in enumerate(Z):
        c1, c2 = int(c1), int(c2)
        new_id = n + i
        for a in members[c1]:
            for b in members[c2]:
                merge_dist[(min(a, b), max(a, b))] = d
        members[new_id] = members[c1] | members[c2]
    for i in range(n):
        for j in range(i + 1, n):
            orig.append(np.sqrt(np.sum((X[i] - X[j]) ** 2)))
            coph.append(merge_dist[(i, j)])
    return np.corrcoef(orig, coph)[0, 1]

# Example
np.random.seed(42)
X = np.vstack([np.random.randn(10, 2) + [0, 0],
               np.random.randn(10, 2) + [5, 5],
               np.random.randn(10, 2) + [10, 0]])
Z = hierarchical_cluster(X, "ward")
fig, ax = plt.subplots(figsize=(10, 4))
plot_dendrogram(Z, len(X), ax)
ax.set_title(f"Ward's Dendrogram (cophenetic r = "
             f"{cophenetic_corr(X, Z):.3f})")
plt.tight_layout()
plt.show()

The cophenetic correlation tells you how well the tree approximates the true distance structure. Ward's and average linkage typically score highest (> 0.85); single linkage often scores lower because its chaining behavior compresses many different true distances into similar merge heights.

Cutting the Dendrogram — Choosing the Number of Clusters

The dendrogram gives you the full hierarchy, but often you need a flat partition — k discrete clusters. The question is: where do you cut?

The simplest approach is a static cut: draw a horizontal line at height h and count the clusters below. But how do you choose h? Three strategies:

1. Merge distance gap. Sort the merge heights and look for the largest jump. A big gap means the algorithm had to reach much farther to make the next merge — a sign of natural cluster separation. This is the hierarchical equivalent of the elbow method.

2. Inconsistency coefficient. For each merge, compare its height to the average height of recent merges below it. If a merge is much higher than its children, it's "inconsistent" — likely bridging two genuinely different clusters.

3. Mojena's rule. Cut where merge height exceeds mean + k·std of all merge heights, with k ≈ 1.25. Simple but surprisingly effective.
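Mojena's rule is only a few lines. A sketch, again assuming our linkage-matrix format:

```python
import numpy as np

def mojena_cut(Z, k=1.25):
    """Suggest a cluster count via Mojena's stopping rule."""
    heights = Z[:, 2]
    threshold = heights.mean() + k * heights.std()
    # Perform only the merges at or below the threshold; count what's left
    n = len(Z) + 1
    return n - int(np.sum(heights <= threshold))

# Toy example: three tight merges, then one big jump to height 9.0
Z = np.array([[0, 1, 0.4, 2],
              [2, 3, 0.5, 2],
              [4, 5, 0.6, 3],
              [6, 7, 9.0, 5]])
print(mojena_cut(Z))  # the 9.0 merge exceeds the threshold -> 2 clusters
```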

def find_clusters(Z, n, n_clusters):
    """Extract flat cluster labels by cutting dendrogram."""
    labels = np.zeros(n, dtype=int)
    members = {i: [i] for i in range(n)}
    for i, (c1, c2, d, _) in enumerate(Z):
        c1, c2 = int(c1), int(c2)
        members[n + i] = members.get(c1, [c1]) + members.get(c2, [c2])

    # Cut: use the last (n_clusters - 1) merges as boundaries
    root = n + len(Z) - 1
    clusters = [root]
    for _ in range(n_clusters - 1):
        # Split the cluster merged at highest distance
        highest = max(clusters, key=lambda c: Z[c - n][2] if c >= n else -1)
        if highest < n:
            break
        row = Z[highest - n]
        clusters.remove(highest)
        clusters.extend([int(row[0]), int(row[1])])

    for label, cid in enumerate(clusters):
        for pt in members.get(cid, [cid]):
            labels[pt] = label
    return labels

def inconsistency_cut(Z):
    """Suggest a cluster count via the largest gap in merge heights (strategy 1)."""
    heights = Z[:, 2]
    # Compute gaps between consecutive merge heights
    gaps = np.diff(heights)
    if len(gaps) == 0:
        return 1
    best_gap_idx = np.argmax(gaps)
    # Number of clusters = n - (index of gap) - 1
    n_clusters = len(Z) - best_gap_idx
    return n_clusters

# Demo: find natural clusters
np.random.seed(7)
X = np.vstack([np.random.randn(15, 2) * 0.5 + [0, 0],
               np.random.randn(15, 2) * 0.5 + [4, 4],
               np.random.randn(15, 2) * 0.5 + [8, 1]])
Z = hierarchical_cluster(X, "ward")
k = inconsistency_cut(Z)
labels = find_clusters(Z, len(X), k)
print(f"Suggested clusters: {k}")
print(f"Cluster sizes: {[np.sum(labels == i) for i in range(k)]}")

The merge distance gap method is the most intuitive: you're looking for the point where the algorithm has to "reach" significantly farther to make the next merge. On well-separated data, this produces a clear winner. On messier data, you might combine it with domain knowledge or use silhouette scores to validate.
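For that validation step, a bare-bones silhouette score is easy to write from scratch. This is a simplified sketch (it assumes at least two clusters; scikit-learn's silhouette_score is the battle-tested option):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Simplified mean silhouette coefficient (assumes >= 2 clusters)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # singleton cluster: define s = 0
            continue
        a = D[i, same].mean()  # mean intra-cluster distance
        b = min(D[i, labels == c].mean()  # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0, 0], [0, 0.2], [8, 0], [8, 0.2]], dtype=float)
print(mean_silhouette(X, np.array([0, 0, 1, 1])))  # close to 1: tight, separated
```

Scores near 1 mean points sit deep inside their clusters; scores near 0 or below mean the cut put points closer to a neighboring cluster than to their own.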

Single Linkage Equals the Minimum Spanning Tree

Here's one of the most elegant results in clustering theory: single-linkage hierarchical clustering produces exactly the same dendrogram as successively removing edges from the minimum spanning tree in decreasing weight order.

The minimum spanning tree (MST) connects all points with the minimum total edge weight and no cycles. Build it with Kruskal's algorithm: sort all pairwise edges by distance, then greedily add each edge unless it creates a cycle. The connection to single linkage is direct — each MST edge addition corresponds to a merge in the agglomerative algorithm, in the same order.

This equivalence has a practical consequence: building an MST takes O(n² log n) with Kruskal's algorithm (or O(n²) with Prim's on the complete graph), so you can compute single-linkage clustering much faster than the general O(n³) algorithm.

class UnionFind:
    """Disjoint-set with path compression and union by rank."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

def mst_single_linkage(X):
    """Single-linkage clustering via MST (Kruskal's algorithm)."""
    n = len(X)
    # All pairwise edges, sorted by distance
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            edges.append((d, i, j))
    edges.sort()

    uf = UnionFind(n)
    linkage = []
    cluster_id = {i: i for i in range(n)}
    next_id = n

    for d, i, j in edges:
        ri, rj = uf.find(i), uf.find(j)
        if ri != rj:
            ci, cj = cluster_id[ri], cluster_id[rj]
            uf.union(ri, rj)
            new_root = uf.find(ri)
            new_size = uf.size[new_root]
            linkage.append([ci, cj, d, new_size])
            cluster_id[new_root] = next_id
            next_id += 1
        if len(linkage) == n - 1:
            break

    return np.array(linkage)

# Verify equivalence
np.random.seed(99)
X = np.random.randn(8, 2)
Z_agg = hierarchical_cluster(X, "single")
Z_mst = mst_single_linkage(X)
print("Agglomerative merge distances:", Z_agg[:, 2].round(3))
print("MST merge distances:          ", Z_mst[:, 2].round(3))
print("Match:", np.allclose(sorted(Z_agg[:, 2]), sorted(Z_mst[:, 2])))

The merge distances come out identical (though the cluster IDs may differ). This MST connection also links single-linkage clustering to HDBSCAN — which builds a mutual reachability MST and extracts a density-based hierarchy from it. If you've read the DBSCAN from Scratch post, HDBSCAN is essentially single linkage on a transformed distance space.
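To sketch that connection: HDBSCAN's transformed space is the mutual reachability distance, which inflates distances involving points in sparse regions. An illustrative version (k here is an assumed smoothing parameter, not HDBSCAN's exact API):

```python
import numpy as np

def mutual_reachability(dist, k=1):
    """Mutual reachability transform of a pairwise distance matrix."""
    # Core distance: distance to the k-th nearest neighbor (each row sorted;
    # index 0 is the point itself at distance 0)
    core = np.sort(dist, axis=1)[:, k]
    # Pairwise max of the two core distances and the original distance
    return np.maximum(dist, np.maximum(core[:, None], core[None, :]))

# 1-D example: three close points and one outlier at x = 10
x = np.array([0.0, 1.0, 2.0, 10.0])
dist = np.abs(x[:, None] - x[None, :])
mr = mutual_reachability(dist, k=1)
print(mr[0, 1])  # 1.0: both points sit in a dense region
print(mr[2, 3])  # 8.0: the outlier's core distance dominates
```

Running single linkage (or an MST) on this matrix instead of raw distances is the essence of the HDBSCAN hierarchy.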

Scaling Up and When to Use What

The elephant in the room: our naive implementation is O(n³) time and O(n²) space. For 10,000 points, that's on the order of 10¹² operations and a distance matrix consuming ~800 MB. Not great.

Fortunately, there are faster algorithms. The nearest-neighbor chain algorithm achieves O(n²) for all four linkages we've covered (single, complete, average, Ward's) by exploiting a key fact: if A's nearest neighbor is B and B's nearest neighbor is A, the pair can be merged immediately without changing the final hierarchy — a property that holds for any "reducible" linkage, which all four are. Daniel Müllner's fastcluster library implements this in optimized C++.
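Here's a minimal sketch of the reciprocal-nearest-neighbor mechanic for single linkage (not fastcluster's implementation — just the core idea, with no handling of distance ties):

```python
import numpy as np

def nn_chain_single(X):
    """Single-linkage merge distances via the nearest-neighbor chain."""
    n = len(X)
    dist = np.full((2 * n, 2 * n), np.inf)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            dist[i, j] = dist[j, i] = d

    active = set(range(n))
    next_id, merges, chain = n, [], []
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))  # start a fresh chain anywhere
        a = chain[-1]
        # Follow the chain: nearest active neighbor of the current tip
        b = min((c for c in active if c != a), key=lambda c: dist[a, c])
        if len(chain) >= 2 and b == chain[-2]:
            # a and b are reciprocal nearest neighbors: safe to merge now
            merges.append(dist[a, b])
            for k in active:
                if k != a and k != b:
                    dist[next_id, k] = dist[k, next_id] = min(dist[a, k],
                                                              dist[b, k])
            active -= {a, b}
            active.add(next_id)
            next_id += 1
            chain.pop(); chain.pop()  # rest of the chain stays valid
        else:
            chain.append(b)
    return sorted(float(m) for m in merges)

X = np.array([[0.0], [1.0], [3.0], [7.0]])
print(nn_chain_single(X))  # [1.0, 2.0, 4.0] — same distances as the naive version
```

The payoff is that each step either extends the chain or performs a merge, avoiding the repeated global minimum scan of the naive algorithm.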

For truly large datasets, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) builds a compact summary tree from a single data pass, then applies hierarchical clustering to the summary nodes instead of individual points.

So when should you reach for hierarchical clustering over K-Means or DBSCAN?

Property       | K-Means              | DBSCAN                  | Hierarchical
---------------|----------------------|-------------------------|-----------------------------
Parameters     | k (# clusters)       | ε, minPts               | linkage + cut height
Cluster shapes | spherical            | arbitrary               | depends on linkage
Handles noise  | no                   | yes (noise label)       | no (but outliers merge last)
Output         | flat partition       | flat partition          | full hierarchy + flat cuts
Scalability    | O(nk) — fast         | O(n log n)              | O(n²) to O(n³)
Best for       | large n, known k     | arbitrary shapes, noise | exploration, natural hierarchy

To make that last row concrete, here's a quick benchmark of our from-scratch implementation against scipy's optimized linkage:

import time

def benchmark_clustering(max_n=500, step=100):
    """Time hierarchical clustering as n grows."""
    from scipy.cluster.hierarchy import linkage as scipy_linkage
    results = []
    for n in range(step, max_n + 1, step):
        X = np.random.randn(n, 2)
        # Our implementation
        t0 = time.time()
        hierarchical_cluster(X, "ward")
        t_ours = time.time() - t0
        # Scipy's optimized version
        t0 = time.time()
        scipy_linkage(X, method="ward")
        t_scipy = time.time() - t0
        results.append((n, t_ours, t_scipy))
        print(f"n={n:4d}  ours={t_ours:.3f}s  scipy={t_scipy:.4f}s  "
              f"ratio={t_ours / max(t_scipy, 1e-6):.0f}x")
    return results

# Uncomment to run:
# benchmark_clustering()

On 500 points, our Python implementation takes a few seconds; scipy's C implementation handles it in milliseconds. The O(n³) vs O(n²) gap is dramatic. But the educational value of the from-scratch version is clear: you understand exactly what happens at each merge, and you can modify the linkage criterion to experiment freely.

Try It: Dendrogram Builder

Click the canvas to place up to 15 points. Choose a linkage criterion, then click Run to watch agglomerative merging step by step. The dendrogram builds in real-time below the scatter plot. Use the Cut Height slider to extract flat clusters.

Try It: Linkage Showdown

Compare all four linkage criteria on the same dataset. The best linkage for each shape is highlighted in blue.

References & Further Reading