Hierarchical Clustering from Scratch
Why Flat Clustering Isn't Enough
K-Means asks you to pick k before you've even looked at the data. DBSCAN discovers clusters automatically but can't tell you that two clusters are more similar to each other than to a third. What if you could see the entire hierarchy of similarity at once — every possible grouping from "everything is one cluster" to "every point is its own cluster" — in a single picture?
That's exactly what hierarchical clustering gives you. Instead of outputting a single flat partition, it produces a dendrogram — a binary tree of successive merges that encodes every level of granularity simultaneously. Cut the tree low and you get many fine-grained clusters. Cut it high and you get a few broad groups. The tree itself tells you which clusters are most similar, which merged late (meaning they're genuinely different), and where the natural boundaries lie.
This matters everywhere hierarchy exists naturally: biological taxonomy (species → genus → family → order), document topic organization, customer segmentation at multiple levels of detail, and gene expression analysis where groups of genes activate together at different scales.
There are two flavors: agglomerative (bottom-up, starting from individual points and merging) and divisive (top-down, starting from one cluster and splitting). Agglomerative dominates in practice because it's simpler and better understood, so that's what we'll build from scratch. By the end, you'll understand the algorithm, the four major ways to define "distance between clusters," and why the dendrogram is one of the most powerful visualizations in all of unsupervised learning.
Agglomerative Clustering — The Bottom-Up Algorithm
The algorithm is beautifully simple. Start with n singleton clusters (each data point is its own cluster). Compute all pairwise distances. Then repeat: find the two closest clusters, merge them, and update the distances. Keep going until everything is one big cluster. Record every merge along the way.
The result is a linkage matrix — a log of every merge event, recording which two clusters merged, at what distance, and the size of the resulting cluster. This matrix is all you need to draw the dendrogram and extract flat clusters at any granularity.
Here's the complete algorithm from scratch:
```python
import numpy as np

def agglomerative_cluster(X):
    """Bottom-up hierarchical clustering. Returns scipy-style linkage matrix."""
    n = len(X)
    # Pairwise Euclidean distances (ids n..2n-2 are reserved for merged clusters)
    dist = np.full((2 * n, 2 * n), np.inf)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            dist[i, j] = dist[j, i] = d
    active = set(range(n))  # clusters still alive
    sizes = {i: 1 for i in range(n)}
    linkage = []
    for step in range(n - 1):
        # Find closest pair among active clusters
        min_d, ci, cj = np.inf, -1, -1
        for i in active:
            for j in active:
                if i < j and dist[i, j] < min_d:
                    min_d, ci, cj = dist[i, j], i, j
        # Create new cluster with id = n + step
        new_id = n + step
        new_size = sizes[ci] + sizes[cj]
        linkage.append([ci, cj, min_d, new_size])
        sizes[new_id] = new_size
        # Update distances using single linkage (min)
        for k in active:
            if k != ci and k != cj:
                dist[new_id, k] = min(dist[ci, k], dist[cj, k])
                dist[k, new_id] = dist[new_id, k]
        active.discard(ci)
        active.discard(cj)
        active.add(new_id)
    return np.array(linkage)

# Test on simple data
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8],
              [1, 0.6], [9, 11], [8, 2], [10, 2]])
Z = agglomerative_cluster(X)
for row in Z:
    print(f"Merge {int(row[0]):2d} + {int(row[1]):2d}  "
          f"dist={row[2]:.2f}  size={int(row[3])}")
```
This O(n³) implementation is the naive version — we scan all pairs to find the minimum at each step. The key detail is the distance update line: after merging clusters ci and cj into new_id, we compute new_id's distance to every remaining cluster k. The formula used here is single linkage (min), but as we'll see next, this is where the real magic happens.
Linkage Criteria — The Heart of the Algorithm
The algorithm above uses min(dist[ci, k], dist[cj, k]) to update distances after a merge. This is single linkage — the distance between two clusters is the distance between their closest members. But this is just one of four major choices, and each produces dramatically different results.
- Single linkage: d(A,B) = min d(a,b) — nearest-neighbor. Finds elongated, chain-like clusters. Great for non-convex shapes but susceptible to "chaining" on noisy data.
- Complete linkage: d(A,B) = max d(a,b) — farthest-neighbor. Produces compact, spherical clusters. Resists chaining but is sensitive to outliers.
- Average linkage (UPGMA): d(A,B) = mean of all pairwise d(a,b). A balanced compromise between single and complete.
- Ward's linkage: Δ(A,B) = (|A|·|B|)/(|A|+|B|) · ‖μA - μB‖². Merges the pair that least increases total within-cluster variance — the same objective K-Means optimizes.
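To make the four definitions concrete, here is a small brute-force sketch that evaluates each criterion directly on two point sets (the function name `cluster_distances` is ours, for illustration — the algorithm never actually recomputes from raw points):

```python
import numpy as np

def cluster_distances(A, B):
    """Evaluate the four linkage criteria between point sets A and B
    directly from their definitions (brute force, for intuition only)."""
    # All pairwise Euclidean distances between A's points and B's points
    pair = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    muA, muB = A.mean(0), B.mean(0)
    ward = (len(A) * len(B)) / (len(A) + len(B)) * ((muA - muB) ** 2).sum()
    return {"single": pair.min(), "complete": pair.max(),
            "average": pair.mean(), "ward": ward}

A = np.array([[0., 0.], [1., 0.]])
B = np.array([[3., 0.], [4., 0.]])
print(cluster_distances(A, B))
```

On this tiny example the closest cross-pair is 2 apart, the farthest 4 apart, so single and complete already disagree about how far apart the clusters are.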
The brilliant insight from Lance and Williams (1967) is that all four can be expressed as a single recurrence. When we merge clusters A and B into A∪B, the distance to any other cluster C is:
d(A∪B, C) = αA·d(A,C) + αB·d(B,C) + β·d(A,B) + γ·|d(A,C) - d(B,C)|
Different values of (αA, αB, β, γ) give different linkage criteria. This means we can write one generalized algorithm and swap linkages by changing four numbers:
```python
def lance_williams_params(linkage, n_a, n_b, n_c):
    """Return (alpha_a, alpha_b, beta, gamma) for each linkage."""
    if linkage == "single":
        return 0.5, 0.5, 0, -0.5
    elif linkage == "complete":
        return 0.5, 0.5, 0, 0.5
    elif linkage == "average":
        return n_a / (n_a + n_b), n_b / (n_a + n_b), 0, 0
    elif linkage == "ward":
        n_t = n_a + n_b + n_c
        return (n_a + n_c) / n_t, (n_b + n_c) / n_t, -n_c / n_t, 0
    raise ValueError(f"unknown linkage: {linkage}")

def hierarchical_cluster(X, linkage="ward"):
    """Agglomerative clustering with configurable linkage."""
    n = len(X)
    dist = np.full((2 * n, 2 * n), np.inf)
    for i in range(n):
        for j in range(i + 1, n):
            if linkage == "ward":
                d = np.sum((X[i] - X[j]) ** 2)  # squared Euclidean
            else:
                d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            dist[i, j] = dist[j, i] = d
    active = set(range(n))
    sizes = {i: 1 for i in range(n)}
    result = []
    for step in range(n - 1):
        min_d, ci, cj = np.inf, -1, -1
        for i in active:
            for j in active:
                if i < j and dist[i, j] < min_d:
                    min_d, ci, cj = dist[i, j], i, j
        new_id = n + step
        new_size = sizes[ci] + sizes[cj]
        merge_dist = np.sqrt(min_d) if linkage == "ward" else min_d
        result.append([ci, cj, merge_dist, new_size])
        sizes[new_id] = new_size
        for k in active:
            if k != ci and k != cj:
                a_a, a_b, b, g = lance_williams_params(
                    linkage, sizes[ci], sizes[cj], sizes[k])
                dist[new_id, k] = (a_a * dist[ci, k] + a_b * dist[cj, k]
                                   + b * dist[ci, cj]
                                   + g * abs(dist[ci, k] - dist[cj, k]))
                dist[k, new_id] = dist[new_id, k]
        active.discard(ci)
        active.discard(cj)
        active.add(new_id)
    return np.array(result)

# Compare linkages on the same data
X = np.array([[0, 0], [0.5, 0], [3, 0], [3.5, 0],
              [6, 0], [6.3, 0], [6.6, 0]])
for method in ["single", "complete", "average", "ward"]:
    Z = hierarchical_cluster(X, method)
    print(f"\n{method.upper()} linkage:")
    for row in Z:
        print(f"  {int(row[0]):2d}+{int(row[1]):2d}  dist={row[2]:.3f}")
```
Notice how Ward's linkage uses squared Euclidean distance internally (the variance decomposition requires it), then we take the square root for the reported merge distance. The Lance-Williams parameters for Ward's depend on cluster sizes — larger clusters resist merging, producing balanced partitions.
Dendrograms — Reading the Tree of Merges
The linkage matrix encodes the full merge history, but a table of numbers isn't very illuminating. The dendrogram turns that table into a picture you can read at a glance:
- x-axis: data points (leaves), arranged so no branches cross
- y-axis: merge distance (height of each horizontal bar)
- Horizontal bars: each merge event, connecting two subtrees
To read cluster membership at any granularity, draw a horizontal line at height h. The number of vertical lines it crosses equals the number of clusters at that threshold. Long vertical gaps between consecutive merges signal natural cluster boundaries — the tree is saying "these groups are genuinely separate."
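This horizontal-line cut is exactly what scipy's `fcluster` does with `criterion="distance"`. A quick sketch (the two-blob dataset here is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(0)
# Two clumps of 8 points, offset by (6, 6)
X = np.vstack([np.random.randn(8, 2), np.random.randn(8, 2) + 6])
Z = linkage(X, method="ward")
for h in [1.0, 5.0, 50.0]:
    labels = fcluster(Z, t=h, criterion="distance")  # cut at height h
    print(f"cut at h={h:5.1f}: {labels.max()} clusters")
```

Raising the cut height sweeps the partition from one-point-per-cluster down to a single cluster containing everything.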
One subtlety: leaf ordering is not unique. Any internal node's children can be swapped (flipped) without changing the tree structure. Good implementations order leaves to minimize the distance between adjacent points, making the dendrogram easier to read.
We can also measure how faithfully the dendrogram preserves the original pairwise distances using the cophenetic correlation. The cophenetic distance between two points is the height at which they first join the same cluster. If the correlation between original distances and cophenetic distances is high (> 0.8), the dendrogram is a faithful representation.
```python
import matplotlib.pyplot as plt

def get_leaf_order(Z, n):
    """Compute leaf ordering from linkage matrix (no crossing branches)."""
    order = []
    def traverse(node_id):
        if node_id < n:
            order.append(node_id)
        else:
            row = Z[int(node_id) - n]
            traverse(int(row[0]))
            traverse(int(row[1]))
    traverse(n + len(Z) - 1)  # start from root
    return order

def plot_dendrogram(Z, n, ax, labels=None):
    """Draw dendrogram from linkage matrix."""
    order = get_leaf_order(Z, n)
    pos = {node: i for i, node in enumerate(order)}
    for i, (c1, c2, d, _) in enumerate(Z):
        c1, c2 = int(c1), int(c2)
        x1, x2 = pos[c1], pos[c2]
        # Draw the U-shaped merge: two verticals + one horizontal
        ax.plot([x1, x1], [0 if c1 < n else Z[c1 - n][2], d], 'b-', lw=1.5)
        ax.plot([x2, x2], [0 if c2 < n else Z[c2 - n][2], d], 'b-', lw=1.5)
        ax.plot([x1, x2], [d, d], 'b-', lw=1.5)
        pos[n + i] = (x1 + x2) / 2
    if labels:
        ax.set_xticks(range(n))
        ax.set_xticklabels([labels[i] for i in order])
    ax.set_ylabel("Merge distance")

def cophenetic_corr(X, Z):
    """Correlation between original and cophenetic distances."""
    n = len(X)
    orig, coph = [], []
    # Build cluster membership timeline
    members = {i: {i} for i in range(n)}
    merge_dist = {}
    for i, (c1, c2, d, _) in enumerate(Z):
        c1, c2 = int(c1), int(c2)
        new_id = n + i
        for a in members[c1]:
            for b in members[c2]:
                merge_dist[(min(a, b), max(a, b))] = d
        members[new_id] = members[c1] | members[c2]
    for i in range(n):
        for j in range(i + 1, n):
            orig.append(np.sqrt(np.sum((X[i] - X[j]) ** 2)))
            coph.append(merge_dist[(i, j)])
    return np.corrcoef(orig, coph)[0, 1]

# Example
np.random.seed(42)
X = np.vstack([np.random.randn(10, 2) + [0, 0],
               np.random.randn(10, 2) + [5, 5],
               np.random.randn(10, 2) + [10, 0]])
Z = hierarchical_cluster(X, "ward")
fig, ax = plt.subplots(figsize=(10, 4))
plot_dendrogram(Z, len(X), ax)
ax.set_title(f"Ward's Dendrogram (cophenetic r = "
             f"{cophenetic_corr(X, Z):.3f})")
plt.tight_layout()
plt.show()
```
The cophenetic correlation tells you how well the tree approximates the true distance structure. Ward's and average linkage typically score highest (> 0.85); single linkage often scores lower because its chaining behavior compresses many different true distances into similar merge heights.
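As a cross-check, scipy ships the same measure as `scipy.cluster.hierarchy.cophenet`; here is a sketch on data mirroring the example above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

np.random.seed(42)
# Three well-separated Gaussian blobs
X = np.vstack([np.random.randn(10, 2) + [0, 0],
               np.random.randn(10, 2) + [5, 5],
               np.random.randn(10, 2) + [10, 0]])
for method in ["single", "average", "ward"]:
    Z = linkage(X, method=method)
    r, _ = cophenet(Z, pdist(X))  # correlation + cophenetic distance vector
    print(f"{method:8s} cophenetic r = {r:.3f}")
```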
Cutting the Dendrogram — Choosing the Number of Clusters
The dendrogram gives you the full hierarchy, but often you need a flat partition — k discrete clusters. The question is: where do you cut?
The simplest approach is a static cut: draw a horizontal line at height h and count the clusters below. But how do you choose h? Three strategies:
1. Merge distance gap. Sort the merge heights and look for the largest jump. A big gap means the algorithm had to reach much farther to make the next merge — a sign of natural cluster separation. This is the hierarchical equivalent of the elbow method.
2. Inconsistency coefficient. For each merge, compare its height to the average height of recent merges below it. If a merge is much higher than its children, it's "inconsistent" — likely bridging two genuinely different clusters.
3. Mojena's rule. Cut where merge height exceeds mean + k·std of all merge heights, with k ≈ 1.25. Simple but surprisingly effective.
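Strategy 3 is only a few lines on the merge heights. A minimal sketch (the function name `mojena_cut` is ours; the k = 1.25 default follows Mojena's suggestion):

```python
import numpy as np

def mojena_cut(heights, k=1.25):
    """Mojena's stopping rule: cut at the first merge whose height exceeds
    mean + k * std of all merge heights; everything below it stays merged."""
    heights = np.asarray(heights, dtype=float)
    threshold = heights.mean() + k * heights.std()
    above = heights > threshold
    first = np.argmax(above) if above.any() else len(heights)
    n = len(heights) + 1       # n points produce n - 1 merges
    return n - first           # merges before the threshold crossing are kept

print(mojena_cut([1.0, 1.1, 1.2, 5.0]))  # one tall final merge -> 2 clusters
```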
```python
def find_clusters(Z, n, n_clusters):
    """Extract flat cluster labels by cutting the dendrogram."""
    labels = np.zeros(n, dtype=int)
    members = {i: [i] for i in range(n)}
    for i, (c1, c2, d, _) in enumerate(Z):
        c1, c2 = int(c1), int(c2)
        members[n + i] = members[c1] + members[c2]
    # Cut: repeatedly split the cluster that merged at the highest distance
    root = n + len(Z) - 1
    clusters = [root]
    for _ in range(n_clusters - 1):
        highest = max(clusters, key=lambda c: Z[c - n][2] if c >= n else -1)
        if highest < n:
            break  # only singletons left, nothing to split
        row = Z[highest - n]
        clusters.remove(highest)
        clusters.extend([int(row[0]), int(row[1])])
    for label, cid in enumerate(clusters):
        for pt in members[cid]:
            labels[pt] = label
    return labels

def gap_cut(Z):
    """Suggest the number of clusters from the largest jump in merge heights."""
    heights = Z[:, 2]
    gaps = np.diff(heights)
    if len(gaps) == 0:
        return 1
    best_gap_idx = np.argmax(gaps)
    # Merges 0..best_gap_idx are kept, so clusters = (n - 1) - best_gap_idx
    return len(Z) - best_gap_idx

# Demo: find natural clusters
np.random.seed(7)
X = np.vstack([np.random.randn(15, 2) * 0.5 + [0, 0],
               np.random.randn(15, 2) * 0.5 + [4, 4],
               np.random.randn(15, 2) * 0.5 + [8, 1]])
Z = hierarchical_cluster(X, "ward")
k = gap_cut(Z)
labels = find_clusters(Z, len(X), k)
print(f"Suggested clusters: {k}")
print(f"Cluster sizes: {[np.sum(labels == i) for i in range(k)]}")
```
The merge distance gap method is the most intuitive: you're looking for the point where the algorithm has to "reach" significantly farther to make the next merge. On well-separated data, this produces a clear winner. On messier data, you might combine it with domain knowledge or use silhouette scores to validate.
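One way to run that silhouette validation, sketched here with scikit-learn (an extra dependency not used elsewhere in this post): sweep a range of cuts and score each flat partition.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

np.random.seed(7)
# Three tight, well-separated blobs
X = np.vstack([np.random.randn(15, 2) * 0.5 + [0, 0],
               np.random.randn(15, 2) * 0.5 + [4, 4],
               np.random.randn(15, 2) * 0.5 + [8, 1]])
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(f"silhouette peaks at k={best_k}")
```

When the gap method and the silhouette peak agree, you can be fairly confident in the cut; when they disagree, the data likely has structure at more than one scale.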
Single Linkage Equals the Minimum Spanning Tree
Here's one of the most elegant results in clustering theory: single-linkage hierarchical clustering produces exactly the same dendrogram as successively removing edges from the minimum spanning tree in decreasing weight order.
The minimum spanning tree (MST) connects all points with the minimum total edge weight and no cycles. Build it with Kruskal's algorithm: sort all pairwise edges by distance, then greedily add each edge unless it creates a cycle. The connection to single linkage is direct — each MST edge addition corresponds to a merge in the agglomerative algorithm, in the same order.
This equivalence has a practical consequence: since building an MST is O(n² log n) rather than O(n³), you can compute single-linkage clustering much faster than the general algorithm.
```python
class UnionFind:
    """Disjoint-set with path compression and union by rank."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

def mst_single_linkage(X):
    """Single-linkage clustering via MST (Kruskal's algorithm)."""
    n = len(X)
    # All pairwise edges, sorted by distance
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            edges.append((d, i, j))
    edges.sort()
    uf = UnionFind(n)
    linkage = []
    cluster_id = {i: i for i in range(n)}
    next_id = n
    for d, i, j in edges:
        ri, rj = uf.find(i), uf.find(j)
        if ri != rj:
            ci, cj = cluster_id[ri], cluster_id[rj]
            uf.union(ri, rj)
            new_root = uf.find(ri)
            new_size = uf.size[new_root]
            linkage.append([ci, cj, d, new_size])
            cluster_id[new_root] = next_id
            next_id += 1
            if len(linkage) == n - 1:
                break
    return np.array(linkage)

# Verify equivalence
np.random.seed(99)
X = np.random.randn(8, 2)
Z_agg = hierarchical_cluster(X, "single")
Z_mst = mst_single_linkage(X)
print("Agglomerative merge distances:", Z_agg[:, 2].round(3))
print("MST merge distances:         ", Z_mst[:, 2].round(3))
print("Match:", np.allclose(sorted(Z_agg[:, 2]), sorted(Z_mst[:, 2])))
```
The merge distances come out identical (though the cluster IDs may differ). This MST connection also links single-linkage clustering to HDBSCAN — which builds a mutual reachability MST and extracts a density-based hierarchy from it. If you've read the DBSCAN from Scratch post, HDBSCAN is essentially single linkage on a transformed distance space.
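To make the HDBSCAN connection concrete, here is a sketch of its mutual reachability transform (the standalone function and the parameter name `k` are ours for illustration; the full HDBSCAN pipeline does considerably more): d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)), where core_k is the distance to a point's k-th nearest neighbor.

```python
import numpy as np

def mutual_reachability(X, k=3):
    """Mutual reachability distance matrix (sketch). Dense points keep their
    true distances; sparse points get pushed apart by their core distance.
    Single linkage on this matrix approximates a density-based hierarchy."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    core = np.sort(D, axis=1)[:, k]  # k-th nearest neighbor (column 0 is self)
    M = np.maximum(D, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(M, 0)
    return M
```

Feeding this matrix into a single-linkage routine in place of raw Euclidean distances is the essence of the HDBSCAN construction.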
Scaling Up and When to Use What
The elephant in the room: our naive implementation is O(n³) time and O(n²) space. For 10,000 points, that's on the order of 10¹² operations and a distance matrix consuming ~800 MB. Not great.
Fortunately, there are faster algorithms. The nearest-neighbor chain algorithm achieves O(n²) for all four linkages we've covered (single, complete, average, Ward's) by exploiting the fact that when A's nearest neighbor is B and B's nearest neighbor is A, the pair can be merged immediately without changing the final hierarchy. Daniel Müllner's fastcluster library implements this in optimized C++.
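Here is a sketch of the chain idea for Ward's linkage — not Müllner's optimized version, just enough to show the mechanism (the function returns only the sorted merge heights rather than a full linkage matrix):

```python
import numpy as np

def nn_chain_ward(X):
    """Nearest-neighbor chain agglomeration with Ward's linkage (sketch).
    Grow a chain of nearest neighbors; whenever the chain ends in a mutual
    nearest-neighbor pair, merge it. Returns sorted merge heights."""
    n = len(X)
    D = np.full((2 * n, 2 * n), np.inf)  # squared distances, as Ward requires
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.sum((X[i] - X[j]) ** 2)
    size = {i: 1 for i in range(n)}
    active = set(range(n))
    next_id, heights, chain = n, [], []
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))
        top = chain[-1]
        prev = chain[-2] if len(chain) >= 2 else None
        # Nearest active neighbor of top; prefer the previous chain element on ties
        nn, best = None, np.inf
        for k in active:
            if k == top:
                continue
            if D[top, k] < best or (D[top, k] == best and k == prev):
                nn, best = k, D[top, k]
        if nn == prev:
            # Mutual nearest neighbors: merge them
            chain.pop(); chain.pop()
            a, b = top, nn
            new = next_id; next_id += 1
            heights.append(np.sqrt(best))
            # Lance-Williams update for Ward, on squared distances
            for k in active:
                if k in (a, b):
                    continue
                nt = size[a] + size[b] + size[k]
                D[new, k] = D[k, new] = ((size[a] + size[k]) * D[a, k]
                                         + (size[b] + size[k]) * D[b, k]
                                         - size[k] * D[a, b]) / nt
            size[new] = size[a] + size[b]
            active.discard(a); active.discard(b); active.add(new)
        else:
            chain.append(nn)
    return sorted(heights)
```

The trick that makes this O(n²) overall: each cluster enters the chain a bounded number of times, so the expensive global "find the closest pair" scan of the naive algorithm disappears. Ward's reducibility guarantees the chain stays valid after each merge.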
For truly large datasets, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) builds a compact summary tree from a single data pass, then applies hierarchical clustering to the summary nodes instead of individual points.
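A quick sketch with scikit-learn's Birch (the threshold value here is arbitrary, chosen for illustration):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=4, cluster_std=0.6, random_state=0)
# threshold bounds the radius of each CF-tree subcluster; the final global
# clustering then runs on the subcluster centroids, not the raw points
model = Birch(threshold=0.5, n_clusters=4)
labels = model.fit_predict(X)
print(f"{len(model.subcluster_centers_)} summary nodes for {len(X)} points")
```

The summary tree typically holds orders of magnitude fewer nodes than there are points, which is what makes the final hierarchical step affordable.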
So when should you reach for hierarchical clustering over K-Means or DBSCAN?
| Property | K-Means | DBSCAN | Hierarchical |
|---|---|---|---|
| Parameters | k (# clusters) | ε, minPts | Linkage + cut height |
| Cluster shapes | Spherical | Arbitrary | Depends on linkage |
| Handles noise | No | Yes (noise label) | No (but outliers merge last) |
| Output | Flat partition | Flat partition | Full hierarchy + flat |
| Scalability | O(nk) per iteration — fast | O(n log n) with an index | O(n²) to O(n³) |
| Best for | Large n, known k | Arbitrary shapes, noise | Exploratory, hierarchy |
```python
import time

def benchmark_clustering(max_n=500, step=100):
    """Time hierarchical clustering as n grows."""
    from scipy.cluster.hierarchy import linkage as scipy_linkage
    results = []
    for n in range(step, max_n + 1, step):
        X = np.random.randn(n, 2)
        # Our implementation
        t0 = time.time()
        hierarchical_cluster(X, "ward")
        t_ours = time.time() - t0
        # Scipy's optimized version
        t0 = time.time()
        scipy_linkage(X, method="ward")
        t_scipy = time.time() - t0
        results.append((n, t_ours, t_scipy))
        print(f"n={n:4d}  ours={t_ours:.3f}s  scipy={t_scipy:.4f}s  "
              f"ratio={t_ours / max(t_scipy, 1e-6):.0f}x")
    return results

# Uncomment to run:
# benchmark_clustering()
```
On 500 points, our Python implementation takes a few seconds; scipy's C implementation handles it in milliseconds. The O(n³) vs O(n²) gap is dramatic. But the educational value of the from-scratch version is clear: you understand exactly what happens at each merge, and you can modify the linkage criterion to experiment freely.
Try It: Dendrogram Builder
Click the canvas to place up to 15 points. Choose a linkage criterion, then click Run to watch agglomerative merging step by step. The dendrogram builds in real-time below the scatter plot. Use the Cut Height slider to extract flat clusters.
Try It: Linkage Showdown
Compare all four linkage criteria on the same dataset. The best linkage for each shape is highlighted in blue.
References & Further Reading
- Ward, J.H. (1963) — "Hierarchical Grouping to Optimize an Objective Function" — the original Ward's method paper defining minimum-variance linkage
- Lance, G.N. & Williams, W.T. (1967) — "A General Theory of Classificatory Sorting Strategies" — the unifying Lance-Williams recurrence formula
- Müllner, D. (2011) — "Modern Hierarchical, Agglomerative Clustering Algorithms" — nearest-neighbor chain algorithm and fastcluster implementation
- Murtagh, F. & Contreras, P. (2012) — "Algorithms for Hierarchical Clustering: An Overview" — comprehensive survey of agglomerative methods
- Campello, R., Moulavi, D. & Sander, J. (2013) — "Density-Based Clustering Based on Hierarchical Density Estimates" — HDBSCAN, extending single-linkage with density
- Sokal, R.R. & Rohlf, F.J. (1962) — "The Comparison of Dendrograms by Objective Methods" — cophenetic correlation for dendrogram quality