DBSCAN from Scratch: When K-Means Fails and Density Saves the Day
Why K-Means Isn't Enough
K-Means is the most popular clustering algorithm in the world, and it will confidently give you exactly the wrong answer on half the datasets you throw at it. Concentric rings? K-Means splits them in half. Crescent moons? Straight through the middle. Clusters of wildly different sizes? The big one gets carved up while small ones get absorbed.
The problem isn't that K-Means is bad — it's that it assumes all clusters are spherical blobs of roughly equal size. It partitions space using Voronoi cells: straight lines equidistant between centroids. If your data's natural groupings don't respect those boundaries, K-Means will bulldoze right through them. And it forces every point into a cluster. No concept of "this point doesn't belong anywhere." Outliers get dragged into the nearest centroid whether they like it or not.
We explored K-Means in depth in K-Means Clustering from Scratch, and it's genuinely excellent for the right problems. But let's see exactly where it falls apart:
import numpy as np
from sklearn.datasets import make_moons, make_circles
# Generate non-convex datasets
moons_X, moons_y = make_moons(n_samples=300, noise=0.06, random_state=42)
circles_X, circles_y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
def kmeans(X, k, seed=42):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(50):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(centers, new_centers):
            break
        centers = new_centers
    return labels
moon_labels = kmeans(moons_X, 2)
circle_labels = kmeans(circles_X, 2)
# K-Means splits both datasets along a straight line
# Moons: top and bottom halves instead of the two crescents
# Circles: left and right halves instead of inner and outer rings
print(f"Moons — K-Means accuracy: {max(np.mean(moon_labels == moons_y), np.mean(moon_labels != moons_y)):.0%}")
print(f"Circles — K-Means accuracy: {max(np.mean(circle_labels == circles_y), np.mean(circle_labels != circles_y)):.0%}")
# Moons — K-Means accuracy: 73%
# Circles — K-Means accuracy: 50% (no better than random!)
On the circles dataset, K-Means does no better than a coin flip. It literally cannot represent "one ring inside another" because its decision boundary is a straight line. We need an algorithm that understands shape — and that means thinking about density.
The Core Idea — Density Reachability
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was introduced by Ester, Kriegel, Sander, and Xu in 1996, and its core idea is beautifully intuitive: clusters are dense regions separated by sparse regions. Instead of measuring distance to centroids, we ask a simpler question: "How many neighbors does this point have nearby?"
The algorithm classifies every point into one of three categories based on just two parameters — eps (neighborhood radius) and minPts (minimum neighbor count):
- Core points: have at least minPts neighbors within distance eps. These are the backbone of clusters — they live in dense regions.
- Border points: within eps of a core point, but don't have enough neighbors to be core themselves. They live on the edge of clusters.
- Noise points: neither core nor border — isolated outliers in sparse regions. DBSCAN labels these as -1 and moves on.
Two points are density-reachable if you can walk from one to the other through a chain of core points, each within eps of the next. A cluster is simply a maximal set of density-connected points. The number of clusters isn't a parameter — it emerges naturally from the data's density structure.
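The three-way classification is worth seeing in isolation before the full algorithm, since it needs nothing more than neighbor counts. A minimal sketch (the toy dataset, a dense blob plus one distant outlier, is invented for illustration):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point core / border / noise from neighbor counts alone."""
    dists = np.linalg.norm(X[:, None] - X[None], axis=2)
    neighbor_counts = (dists <= eps).sum(axis=1)  # includes the point itself
    is_core = neighbor_counts >= min_pts
    # Border: not core, but within eps of at least one core point
    near_core = (dists[:, is_core] <= eps).any(axis=1)
    return np.where(is_core, "core",
                    np.where(near_core, "border", "noise"))

# Toy data: 50 tightly packed points plus a single far-away outlier
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) * 0.1, [[5.0, 5.0]]])
classes = classify_points(X, eps=0.3, min_pts=5)
print({c: int((classes == c).sum()) for c in ["core", "border", "noise"]})
```

The blob points end up core (or occasionally border at the fringe), while the outlier has only itself within eps and is labeled noise.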
Here's the full algorithm from scratch. The key operation is BFS cluster expansion: start from an unvisited core point, explore its neighborhood, and keep expanding through any core points you find:
import numpy as np
from collections import deque
def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)  # -1 means noise (unassigned)
    cluster_id = 0
    # Precompute pairwise distances (naive O(n^2) approach)
    dists = np.linalg.norm(X[:, None] - X[None], axis=2)
    for i in range(n):
        if labels[i] != -1:  # Already assigned
            continue
        # Find neighbors within eps
        neighbors = np.where(dists[i] <= eps)[0]
        if len(neighbors) < min_pts:
            continue  # Not a core point — leave as noise for now
        # Start a new cluster via BFS
        labels[i] = cluster_id
        queue = deque(neighbors[neighbors != i])
        while queue:
            j = queue.popleft()
            if labels[j] != -1:  # Already claimed by this or another cluster
                continue
            labels[j] = cluster_id  # Was noise — claim it for this cluster
            j_neighbors = np.where(dists[j] <= eps)[0]
            if len(j_neighbors) >= min_pts:  # j is also a core point
                for k in j_neighbors:
                    if labels[k] == -1:
                        queue.append(k)
        cluster_id += 1
    return labels
# Test on moons — DBSCAN nails it
from sklearn.datasets import make_moons
X, _ = make_moons(300, noise=0.06, random_state=42)
labels = dbscan(X, eps=0.15, min_pts=5)
n_clusters = len(set(labels) - {-1})
n_noise = np.sum(labels == -1)
print(f"Clusters found: {n_clusters}, Noise points: {n_noise}")
# Clusters found: 2, Noise points: 3
No K parameter. No assumption about cluster shape. The two moons get perfectly separated because DBSCAN follows the dense chain of points along each crescent, and the sparse gap between them acts as a natural boundary.
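As a sanity check, scikit-learn's DBSCAN uses the same parameter convention (its min_samples counts the point itself, just like our min_pts), so it should reproduce the from-scratch result on this data. A quick sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(300, noise=0.06, random_state=42)
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, Noise points: {n_noise}")
```

Cluster and noise counts are deterministic given eps and min_pts (core status and core-point connectivity don't depend on visit order), so the two implementations should agree.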
Choosing eps and minPts
DBSCAN trades K-Means' cluster count parameter k for two different parameters: eps and minPts. Choosing them wisely is crucial. Get eps wrong and you'll either fragment real clusters into pieces (too small) or merge separate clusters into one blob (too large).
The k-distance graph method, proposed in the original DBSCAN paper, gives you a principled way to choose eps. The idea: for each point, compute the distance to its k-th nearest neighbor (where k = minPts). Sort these distances and plot them. Points inside dense clusters will have small k-distances; noise points will have large ones. The "elbow" in this curve — where the slope changes sharply — is your eps.
For minPts, rules of thumb have evolved since the original paper:
- Ester et al. (1996): minPts >= dim + 1, minimum 3
- Modern practice: minPts = 2 × dim
- For 2D data, minPts = 4 or 5 is a solid starting point
import numpy as np
from scipy.spatial import KDTree
def find_eps_elbow(X, k=5):
    """Use k-distance graph to find a good eps value."""
    tree = KDTree(X)
    # Query k+1 neighbors (includes the point itself)
    dists, _ = tree.query(X, k=k + 1)
    k_dists = dists[:, -1]  # Distance to k-th neighbor
    k_dists_sorted = np.sort(k_dists)
    # Find the elbow: point of maximum curvature
    # Approximate by looking for the biggest acceleration in the curve
    diffs = np.diff(k_dists_sorted)
    accel = np.diff(diffs)
    elbow_idx = np.argmax(accel) + 2
    eps_estimate = k_dists_sorted[elbow_idx]
    print(f"k-distance elbow at index {elbow_idx}/{len(X)}")
    print(f"Suggested eps: {eps_estimate:.4f}")
    print(f"Range around elbow: [{k_dists_sorted[elbow_idx-5]:.4f}, {k_dists_sorted[elbow_idx+5]:.4f}]")
    return eps_estimate
from sklearn.datasets import make_moons
X, _ = make_moons(300, noise=0.06, random_state=42)
eps = find_eps_elbow(X, k=5)
# k-distance elbow at index 283/300
# Suggested eps: 0.1412
# Range around elbow: [0.1178, 0.1830]
The elbow method works well when clusters have similar densities. But here's the fundamental limitation: eps is a global parameter. If your dataset has one tight cluster and one loose cluster, no single eps can handle both. Too small and the loose cluster fragments; too large and the tight cluster absorbs its neighbors. This single-scale problem is what motivates HDBSCAN, which we'll build in Section 5.
Complexity and Spatial Indexing
Our naive DBSCAN implementation has a glaring bottleneck: we precompute an O(n²) distance matrix. For 10,000 points, that's 100 million distances — and it only gets worse from there. The core operation in DBSCAN is the range query: "find all points within distance eps of point p." If we can accelerate that, the whole algorithm speeds up.
Enter the KD-tree. As we explored in KNN from Scratch, a KD-tree partitions space into axis-aligned regions, letting us prune vast swaths of points that can't possibly be within eps. Range queries drop from O(n) to O(log n) on average, making DBSCAN's total complexity O(n log n) instead of O(n²).
import numpy as np
from scipy.spatial import KDTree
from collections import deque
import time
def dbscan_kdtree(X, eps, min_pts):
    """DBSCAN using KD-tree for O(n log n) range queries."""
    tree = KDTree(X)
    n = len(X)
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neighbors = tree.query_ball_point(X[i], eps)
        if len(neighbors) < min_pts:
            continue
        labels[i] = cluster_id
        queue = deque([j for j in neighbors if j != i])
        while queue:
            j = queue.popleft()
            if labels[j] != -1:
                continue
            labels[j] = cluster_id
            j_neighbors = tree.query_ball_point(X[j], eps)
            if len(j_neighbors) >= min_pts:
                for k in j_neighbors:
                    if labels[k] == -1:
                        queue.append(k)
        cluster_id += 1
    return labels
# Benchmark naive vs KD-tree
from sklearn.datasets import make_moons
for n in [500, 2000, 5000]:
    X, _ = make_moons(n, noise=0.06, random_state=42)
    t0 = time.time()
    dists = np.linalg.norm(X[:, None] - X[None], axis=2)  # naive distance matrix alone
    t_naive = time.time() - t0
    t0 = time.time()
    dbscan_kdtree(X, eps=0.15, min_pts=5)
    t_tree = time.time() - t0
    print(f"n={n:>5d} | Naive dist matrix: {t_naive:.3f}s | KD-tree DBSCAN: {t_tree:.3f}s")
# n= 500 | Naive dist matrix: 0.004s | KD-tree DBSCAN: 0.008s
# n= 2000 | Naive dist matrix: 0.051s | KD-tree DBSCAN: 0.025s
# n= 5000 | Naive dist matrix: 0.318s | KD-tree DBSCAN: 0.059s
The crossover happens around n=1000. Beyond that, the KD-tree version pulls ahead rapidly. One caveat: KD-trees degrade in high dimensions (the "curse of dimensionality" strikes again). Above ~20 dimensions, ball trees are a better choice, and above ~50, you may want approximate methods like locality-sensitive hashing. For the 2D clustering we're doing here, KD-trees are perfect.
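scikit-learn ships both tree types behind the same query interface, so switching is a one-line change. A sketch comparing them on low- and high-dimensional data (the sizes and dimensions here are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.RandomState(42)
for dim in [2, 50]:
    X = rng.randn(2000, dim)
    kd = KDTree(X)
    ball = BallTree(X)
    # Same API, same results; only the internal space partitioning differs
    kd_hits = kd.query_radius(X[:1], r=1.0)[0]
    ball_hits = ball.query_radius(X[:1], r=1.0)[0]
    print(f"dim={dim}: {len(kd_hits)} points within r=1.0, "
          f"trees agree: {set(kd_hits) == set(ball_hits)}")
```

In 50 dimensions the radius-1 ball is nearly empty (the query point typically finds only itself), which is the curse of dimensionality in miniature: both trees return correct answers, but pruning buys less and less as dimension grows.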
HDBSCAN — The Parameter-Free Evolution
DBSCAN's biggest weakness is that single global eps. Imagine a dataset with three clusters: one tight group of 50 points, one medium-spread group of 200, and one loose cloud of 100. No single radius can capture all three — it's like trying to use one shoe size for the whole family.
HDBSCAN (Hierarchical DBSCAN), introduced by Campello, Moulavi, and Sander in 2013, solves this elegantly. Instead of picking one density threshold, it examines all possible thresholds and extracts the clusters that persist the longest. The key insight is the mutual reachability distance:
dmreach(a, b) = max(core(a), core(b), d(a, b))
Where core(p) is the distance from point p to its k-th nearest neighbor. This transformation "stretches" distances in sparse regions while leaving dense regions unchanged. Sparse outliers become far from everything; dense cluster members stay close together.
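A tiny worked example makes the stretching concrete. With four 1-D points and min_pts = 2 (toy values chosen for illustration), the dense neighbors barely move under dmreach while the outlier's edge is pushed out to its core distance:

```python
import numpy as np
from scipy.spatial import KDTree

X = np.array([[0.0], [0.1], [0.2], [5.0]])  # three dense points + one outlier
tree = KDTree(X)
dists, _ = tree.query(X, k=3)  # self + 2 nearest neighbors
core = dists[:, -1]            # core distance = 2nd-neighbor distance

def d_mreach(a, b):
    return max(core[a], core[b], abs(X[a, 0] - X[b, 0]))

print(core)            # approx [0.2, 0.1, 0.2, 4.9]
print(d_mreach(0, 1))  # raw distance 0.1 stretched to approx 0.2
print(d_mreach(2, 3))  # raw distance 4.8 stretched to approx 4.9
```

The pair inside the dense region is stretched only up to the local core distance, while the outlier stays far from everything, exactly the behavior the mutual reachability transform is designed for.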
The algorithm proceeds in five steps:
- Compute core distances — for each point, the distance to its k-th nearest neighbor
- Build the mutual reachability graph — edge weights are dmreach(a, b)
- Construct a minimum spanning tree — this captures the cluster hierarchy
- Build the condensed tree — only keep splits where both children have >= min_cluster_size points
- Extract clusters using stability — stability(C) = Σ(1/λdeath − 1/λbirth), summed over the points in C, where λbirth is the distance scale at which C appears and λdeath the scale at which each point falls out of it
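Step 5 is simple arithmetic once the condensed tree is known. A toy calculation with invented λ values: a cluster born at distance scale 1.0 whose three points fall out at scales 0.5, 0.25, and 0.125 scores (2 − 1) + (4 − 1) + (8 − 1) = 11. Clusters that persist to smaller distance scales accumulate larger scores and win the extraction:

```python
# Toy condensed-tree leaf; the lambda (distance scale) values are invented
lambda_birth = 1.0                  # scale at which the cluster appears
lambda_death = [0.5, 0.25, 0.125]   # scale at which each point falls out
stability = sum(1 / ld - 1 / lambda_birth for ld in lambda_death)
print(stability)                    # (2-1) + (4-1) + (8-1) = 11.0
```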
import numpy as np
from scipy.spatial import KDTree
def hdbscan_core(X, min_pts=5):
    """Compute HDBSCAN core distances and mutual reachability MST."""
    tree = KDTree(X)
    n = len(X)
    # Step 1: Core distances (distance to k-th nearest neighbor)
    dists, _ = tree.query(X, k=min_pts + 1)
    core_dists = dists[:, -1]  # k-th neighbor distance
    # Steps 2-3: Build an MST over the mutual reachability graph with Prim's
    # algorithm. d_mreach(a, b) = max(core(a), core(b), d(a, b)) is computed
    # lazily inside the loop, so the full n x n graph is never materialized.
    in_tree = np.zeros(n, dtype=bool)
    min_edge = np.full(n, np.inf)
    min_edge[0] = 0
    parent = np.full(n, -1)
    mst_edges = []
    for _ in range(n):
        # Find the node not yet in tree with smallest edge
        candidates = np.where(~in_tree)[0]
        u = candidates[np.argmin(min_edge[candidates])]
        in_tree[u] = True
        if parent[u] != -1:
            w = max(core_dists[u], core_dists[parent[u]],
                    np.linalg.norm(X[u] - X[parent[u]]))
            mst_edges.append((parent[u], u, w))
        # Relax mutual reachability distances to the remaining nodes
        for v in range(n):
            if in_tree[v]:
                continue
            d = np.linalg.norm(X[u] - X[v])
            mreach = max(core_dists[u], core_dists[v], d)
            if mreach < min_edge[v]:
                min_edge[v] = mreach
                parent[v] = u
    return core_dists, sorted(mst_edges, key=lambda e: e[2])
# Demonstrate on varying-density data
rng = np.random.RandomState(42)
tight = rng.randn(80, 2) * 0.15 + [0, 0]
medium = rng.randn(120, 2) * 0.4 + [3, 0]
loose = rng.randn(60, 2) * 0.8 + [6, 2]
X = np.vstack([tight, medium, loose])
# HDBSCAN reveals the density hierarchy through core distances
core_dists, mst = hdbscan_core(X, min_pts=5)
print(f"Core dists — tight: {core_dists[:80].mean():.3f}, "
f"medium: {core_dists[80:200].mean():.3f}, "
f"loose: {core_dists[200:].mean():.3f}")
print(f"MST edges: {len(mst)}, weight range [{mst[0][2]:.3f}, {mst[-1][2]:.3f}]")
# Core dists — tight: 0.079, medium: 0.214, loose: 0.501
# MST edges: 259, weight range [0.039, 1.826]
# The MST's heaviest edges bridge between clusters — cutting them reveals structure
# Compare: DBSCAN with a single eps can't handle all three densities
from collections import deque
def dbscan_simple(X, eps, min_pts):
    tree = KDTree(X)
    labels = np.full(len(X), -1)
    cid = 0
    for i in range(len(X)):
        if labels[i] != -1: continue
        nb = tree.query_ball_point(X[i], eps)
        if len(nb) < min_pts: continue
        labels[i] = cid
        q = deque([j for j in nb if j != i])
        while q:
            j = q.popleft()
            if labels[j] != -1: continue
            labels[j] = cid
            jnb = tree.query_ball_point(X[j], eps)
            if len(jnb) >= min_pts:
                q.extend(k for k in jnb if labels[k] == -1)
        cid += 1
    return labels
for eps in [0.2, 0.5, 1.0]:
    lbl = dbscan_simple(X, eps, 5)
    nc = len(set(lbl) - {-1})
    nn = np.sum(lbl == -1)
    print(f"DBSCAN eps={eps}: {nc} clusters, {nn} noise")
# DBSCAN eps=0.2: 1 clusters, 168 noise (only finds tight cluster)
# DBSCAN eps=0.5: 2 clusters, 48 noise (merges tight, finds medium)
# DBSCAN eps=1.0: 2 clusters, 5 noise (merges tight+medium, finds loose)
No single eps finds all three clusters. HDBSCAN examines the full hierarchy of density levels and extracts the clusters that remain stable across the widest range of thresholds. The practical library (hdbscan by McInnes, Healy, and Astels) implements optimized versions of these steps and typically needs just one parameter: min_cluster_size.
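To make the hierarchy idea tangible without the full condensed-tree machinery, there's a rough shortcut: cutting k edges of a spanning tree always leaves exactly k + 1 components, so removing the heaviest mutual reachability edges approximates one level of HDBSCAN's hierarchy. A sketch (the Prim construction mirrors hdbscan_core above; how cleanly the three blobs are recovered depends on the data, so only the component bookkeeping is guaranteed):

```python
import numpy as np
from scipy.spatial import KDTree

# Same three-density dataset as above
rng = np.random.RandomState(42)
X = np.vstack([rng.randn(80, 2) * 0.15 + [0, 0],
               rng.randn(120, 2) * 0.4 + [3, 0],
               rng.randn(60, 2) * 0.8 + [6, 2]])
n = len(X)

# Core distances, then a mutual-reachability MST via vectorized Prim
core = KDTree(X).query(X, k=6)[0][:, -1]
in_tree = np.zeros(n, dtype=bool)
min_edge = np.full(n, np.inf); min_edge[0] = 0
parent = np.full(n, -1)
edges = []
for _ in range(n):
    cand = np.where(~in_tree)[0]
    u = cand[np.argmin(min_edge[cand])]
    in_tree[u] = True
    if parent[u] != -1:
        edges.append((parent[u], u, min_edge[u]))
    d = np.linalg.norm(X - X[u], axis=1)
    mreach = np.maximum(np.maximum(core, core[u]), d)
    better = ~in_tree & (mreach < min_edge)
    min_edge[better] = mreach[better]
    parent[better] = u

# Cut the 2 heaviest edges: a tree minus 2 edges is exactly 3 components
edges.sort(key=lambda e: e[2])
uf = np.arange(n)  # union-find with path halving
def find(a):
    while uf[a] != a:
        uf[a] = uf[uf[a]]
        a = uf[a]
    return a
for a, b, _ in edges[:-2]:
    uf[find(a)] = find(b)
labels = np.array([find(i) for i in range(n)])
counts = np.bincount(labels, minlength=n)
sizes = sorted(counts[counts > 0], reverse=True)
print(f"{len(sizes)} components, sizes: {sizes}")
```

If the heaviest MST edges really are the inter-cluster bridges, as the earlier output suggests, the component sizes land near the original 120/80/60 split; the full algorithm replaces this fixed cut with the stability criterion so that no cut count needs to be chosen.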
When to Use What — Clustering Decision Guide
With K-Means, DBSCAN, HDBSCAN, and other clustering algorithms in your toolkit, how do you choose? Here's a practical decision framework:
| Algorithm | Cluster Shape | Parameters | Handles Noise? | Complexity |
|---|---|---|---|---|
| K-Means | Spherical | K (known) | No | O(nKt) |
| DBSCAN | Arbitrary | eps, minPts | Yes | O(n log n)* |
| HDBSCAN | Arbitrary | min_cluster_size | Yes | O(n log n)* |
| GMMs | Elliptical | K, covariance type | Soft | O(nKd²) |
* With spatial indexing. Naive implementations are O(n²).
Use K-Means when your clusters are roughly spherical, similarly-sized, and you know K. It's fast and well-understood. See K-Means from Scratch.
Use DBSCAN when clusters have arbitrary shapes, you don't know the number of clusters, your data has noise/outliers, and the clusters have roughly similar densities.
Use HDBSCAN when clusters have varying densities, you want minimal parameter tuning, and you need robust noise handling. It's the most flexible option and often the best default for exploratory clustering.
Use GMMs when clusters overlap and you want probabilistic ("soft") assignments. See EM from Scratch.
Density-based clustering is one of those ideas that feels obvious in retrospect: instead of forcing data into symmetric blobs, follow where it's dense and let the gaps speak for themselves. DBSCAN gives you that with two parameters, and HDBSCAN takes it further by examining every density scale at once. Play with the demos below to build intuition for how these algorithms see your data differently.
Try It: Cluster Explorer
Click on the canvas to place points, then switch between algorithms to see how they cluster differently. Core points are filled circles, border points are smaller, noise is marked with ×.
Try It: eps Sensitivity Visualizer
This dataset has three clusters with different densities. Drag the eps slider to see how DBSCAN's single-scale limitation plays out, then toggle HDBSCAN to see it handle all densities at once.
References & Further Reading
- Ester, Kriegel, Sander, Xu — "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise" (KDD 1996) — the original DBSCAN paper that started it all
- Campello, Moulavi, Sander — "Density-Based Clustering Based on Hierarchical Density Estimates" (PAKDD 2013) — the HDBSCAN paper introducing mutual reachability and cluster stability
- McInnes, Healy, Astels — "hdbscan: Hierarchical density based clustering" (JOSS 2017) — the practical Python implementation
- scikit-learn DBSCAN documentation — production-ready implementation with spatial tree acceleration
- HDBSCAN documentation — excellent interactive tutorials on parameter selection
- DadOps — K-Means Clustering from Scratch — centroid-based clustering and when it works well
- DadOps — KNN from Scratch — KD-trees, ball trees, and the curse of dimensionality
- DadOps — Expectation-Maximization from Scratch — GMMs for soft clustering with overlapping elliptical clusters