
DBSCAN from Scratch: When K-Means Fails and Density Saves the Day

Why K-Means Isn't Enough

K-Means is the most popular clustering algorithm in the world, and it will confidently give you exactly the wrong answer on half the datasets you throw at it. Concentric rings? K-Means splits them in half. Crescent moons? Straight through the middle. Clusters of wildly different sizes? The big one gets carved up while small ones get absorbed.

The problem isn't that K-Means is bad — it's that it assumes all clusters are spherical blobs of roughly equal size. It partitions space using Voronoi cells: straight lines equidistant between centroids. If your data's natural groupings don't respect those boundaries, K-Means will bulldoze right through them. And it forces every point into a cluster. No concept of "this point doesn't belong anywhere." Outliers get dragged into the nearest centroid whether they like it or not.

We explored K-Means in depth in K-Means Clustering from Scratch, and it's genuinely excellent for the right problems. But let's see exactly where it falls apart:

import numpy as np
from sklearn.datasets import make_moons, make_circles

# Generate non-convex datasets
moons_X, moons_y = make_moons(n_samples=300, noise=0.06, random_state=42)
circles_X, circles_y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

def kmeans(X, k, seed=42):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(50):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(centers, new_centers): break
        centers = new_centers
    return labels

moon_labels = kmeans(moons_X, 2)
circle_labels = kmeans(circles_X, 2)

# K-Means splits both datasets along a straight line
# Moons: top and bottom halves instead of the two crescents
# Circles: left and right halves instead of inner and outer rings
print(f"Moons — K-Means accuracy: {max(np.mean(moon_labels == moons_y), np.mean(moon_labels != moons_y)):.0%}")
print(f"Circles — K-Means accuracy: {max(np.mean(circle_labels == circles_y), np.mean(circle_labels != circles_y)):.0%}")
# Moons — K-Means accuracy: 73%
# Circles — K-Means accuracy: 50%  (no better than random!)

On the circles dataset, K-Means does no better than a coin flip. It literally cannot represent "one ring inside another" because its decision boundary is a straight line. We need an algorithm that understands shape — and that means thinking about density.

The Core Idea — Density Reachability

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was introduced by Ester, Kriegel, Sander, and Xu in 1996, and its core idea is beautifully intuitive: clusters are dense regions separated by sparse regions. Instead of measuring distance to centroids, we ask a simpler question: "How many neighbors does this point have nearby?"

The algorithm classifies every point into one of three categories based on just two parameters — eps (neighborhood radius) and minPts (minimum neighbor count):

  1. Core point — has at least minPts points (counting itself) within distance eps
  2. Border point — has fewer than minPts neighbors, but lies within eps of some core point
  3. Noise point — neither core nor border; it belongs to no cluster

Two points are density-reachable if you can walk from one to the other through a chain of core points, each within eps of the next. A cluster is simply a maximal set of density-connected points. The number of clusters isn't a parameter — it emerges naturally from the data's density structure.
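These three categories can be computed directly, before running any clustering at all — a minimal sketch, assuming Euclidean distance and counting each point as its own neighbor (as the full implementation below also does):

```python
import numpy as np
from sklearn.datasets import make_moons

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    dists = np.linalg.norm(X[:, None] - X[None], axis=2)
    neighbor_counts = np.sum(dists <= eps, axis=1)   # includes the point itself
    is_core = neighbor_counts >= min_pts
    # Border: not core, but within eps of at least one core point
    is_border = ~is_core & np.any(dists[:, is_core] <= eps, axis=1)
    return np.where(is_core, "core", np.where(is_border, "border", "noise"))

X, _ = make_moons(300, noise=0.06, random_state=42)
kinds = classify_points(X, eps=0.15, min_pts=5)
print({k: int(np.sum(kinds == k)) for k in ["core", "border", "noise"]})
```

Note that border points are exactly the ones a cluster can claim but never expand through — that asymmetry is what the BFS below implements.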

Here's the full algorithm from scratch. The key operation is BFS cluster expansion: start from an unvisited core point, explore its neighborhood, and keep expanding through any core points you find:

import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)       # -1 means noise (unassigned)
    cluster_id = 0

    # Precompute pairwise distances (naive O(n^2) approach)
    dists = np.linalg.norm(X[:, None] - X[None], axis=2)

    for i in range(n):
        if labels[i] != -1:       # Already assigned
            continue

        # Find neighbors within eps
        neighbors = np.where(dists[i] <= eps)[0]

        if len(neighbors) < min_pts:
            continue               # Not a core point — leave as noise for now

        # Start a new cluster via BFS
        labels[i] = cluster_id
        queue = deque(neighbors[neighbors != i])

        while queue:
            j = queue.popleft()
            if labels[j] != -1:    # Already claimed (by this or an earlier cluster)
                continue
            labels[j] = cluster_id  # Was noise — claim it for this cluster

            j_neighbors = np.where(dists[j] <= eps)[0]
            if len(j_neighbors) >= min_pts:   # j is also a core point — expand through it
                for k in j_neighbors:
                    if labels[k] == -1:
                        queue.append(k)

        cluster_id += 1

    return labels

# Test on moons — DBSCAN nails it
from sklearn.datasets import make_moons
X, _ = make_moons(300, noise=0.06, random_state=42)
labels = dbscan(X, eps=0.15, min_pts=5)

n_clusters = len(set(labels) - {-1})
n_noise = np.sum(labels == -1)
print(f"Clusters found: {n_clusters}, Noise points: {n_noise}")
# Clusters found: 2, Noise points: 3

No K parameter. No assumption about cluster shape. The two moons get perfectly separated because DBSCAN follows the dense chain of points along each crescent, and the sparse gap between them acts as a natural boundary.
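The same holds for the circles dataset that reduced K-Means to a coin flip. As a cross-check against a reference implementation (using scikit-learn's DBSCAN here; eps=0.2 is an illustrative choice for this data, not a tuned value):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

# Same concentric-rings dataset K-Means scored 50% on
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, Noise points: {n_noise}")
```

The sparse annulus between the rings is wider than eps, so density chains never cross it, and each ring comes out as its own cluster.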

Choosing eps and minPts

DBSCAN trades K-Means' cluster count parameter k for two different parameters: eps and minPts. Choosing them wisely is crucial. Get eps wrong and you'll either fragment real clusters into pieces (too small) or merge separate clusters into one blob (too large).

The k-distance graph method, proposed in the original DBSCAN paper, gives you a principled way to choose eps. The idea: for each point, compute the distance to its k-th nearest neighbor (where k = minPts). Sort these distances and plot them. Points inside dense clusters will have small k-distances; noise points will have large ones. The "elbow" in this curve — where the slope changes sharply — is your eps.

For minPts, rules of thumb have evolved since the original paper: Ester et al. fixed minPts = 4 for their 2D experiments, and Sander et al. (1998) later suggested minPts = 2 × dim as a general default. Larger values give smoother density estimates and more robust noise handling, at the cost of discarding small clusters. With minPts fixed, the k-distance graph estimates eps:

import numpy as np
from scipy.spatial import KDTree

def find_eps_elbow(X, k=5):
    """Use k-distance graph to find a good eps value."""
    tree = KDTree(X)
    # Query k+1 neighbors (includes the point itself)
    dists, _ = tree.query(X, k=k + 1)
    k_dists = dists[:, -1]              # Distance to k-th neighbor
    k_dists_sorted = np.sort(k_dists)

    # Find the elbow: point of maximum curvature
    # Approximate by looking for the biggest acceleration in the curve
    diffs = np.diff(k_dists_sorted)
    accel = np.diff(diffs)
    elbow_idx = np.argmax(accel) + 2
    eps_estimate = k_dists_sorted[elbow_idx]

    print(f"k-distance elbow at index {elbow_idx}/{len(X)}")
    print(f"Suggested eps: {eps_estimate:.4f}")
    print(f"Range around elbow: [{k_dists_sorted[elbow_idx-5]:.4f}, {k_dists_sorted[elbow_idx+5]:.4f}]")
    return eps_estimate

from sklearn.datasets import make_moons
X, _ = make_moons(300, noise=0.06, random_state=42)
eps = find_eps_elbow(X, k=5)
# k-distance elbow at index 283/300
# Suggested eps: 0.1412
# Range around elbow: [0.1178, 0.1830]

The elbow method works well when clusters have similar densities. But here's the fundamental limitation: eps is a global parameter. If your dataset has one tight cluster and one loose cluster, no single eps can handle both. Too small and the loose cluster fragments; too large and the tight cluster absorbs its neighbors. This single-scale problem is what motivates HDBSCAN, which we'll build below.

Complexity and Spatial Indexing

Our naive DBSCAN implementation has a glaring bottleneck: we precompute an O(n²) distance matrix. For 10,000 points, that's 100 million distances — and it only gets worse from there. The core operation in DBSCAN is the range query: "find all points within distance eps of point p." If we can accelerate that, the whole algorithm speeds up.

Enter the KD-tree. As we explored in KNN from Scratch, a KD-tree partitions space into axis-aligned regions, letting us prune vast swaths of points that can't possibly be within eps. Range queries drop from O(n) to O(log n) on average, making DBSCAN's total complexity O(n log n) instead of O(n²).

import numpy as np
from scipy.spatial import KDTree
from collections import deque
import time

def dbscan_kdtree(X, eps, min_pts):
    """DBSCAN using KD-tree for O(n log n) range queries."""
    tree = KDTree(X)
    n = len(X)
    labels = np.full(n, -1)
    cluster_id = 0

    for i in range(n):
        if labels[i] != -1:
            continue
        neighbors = tree.query_ball_point(X[i], eps)
        if len(neighbors) < min_pts:
            continue

        labels[i] = cluster_id
        queue = deque([j for j in neighbors if j != i])

        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster_id
            else:
                continue
            j_neighbors = tree.query_ball_point(X[j], eps)
            if len(j_neighbors) >= min_pts:
                for k in j_neighbors:
                    if labels[k] == -1:
                        queue.append(k)
        cluster_id += 1
    return labels

# Benchmark naive vs KD-tree
from sklearn.datasets import make_moons
for n in [500, 2000, 5000]:
    X, _ = make_moons(n, noise=0.06, random_state=42)

    t0 = time.time()
    dists = np.linalg.norm(X[:, None] - X[None], axis=2)  # naive distance matrix
    t_naive = time.time() - t0

    t0 = time.time()
    dbscan_kdtree(X, eps=0.15, min_pts=5)
    t_tree = time.time() - t0

    print(f"n={n:>5d} | Naive dist matrix: {t_naive:.3f}s | KD-tree DBSCAN: {t_tree:.3f}s")
# n=  500 | Naive dist matrix: 0.004s | KD-tree DBSCAN: 0.008s
# n= 2000 | Naive dist matrix: 0.051s | KD-tree DBSCAN: 0.025s
# n= 5000 | Naive dist matrix: 0.318s | KD-tree DBSCAN: 0.059s

The crossover comes around n=1000: beyond that, merely building the naive distance matrix takes longer than running the entire KD-tree version of DBSCAN. One caveat: KD-trees degrade in high dimensions (the "curse of dimensionality" strikes again). Above ~20 dimensions, ball trees are a better choice, and above ~50, you may want approximate methods like locality-sensitive hashing. For the 2D clustering we're doing here, KD-trees are perfect.
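For higher-dimensional data, the same range query can be served by a ball tree — a sketch using sklearn.neighbors.BallTree, whose query_radius plays the role of scipy's query_ball_point (the 30-D data and radius here are illustrative):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.randn(1000, 30)   # 30-D data, where KD-tree pruning starts to break down

tree = BallTree(X)
# Range query: indices of all points within radius r of the first point
neighbors = tree.query_radius(X[:1], r=8.0)[0]
print(f"{len(neighbors)} neighbors within r=8.0 of point 0")
```

Swapping the tree is a one-line change in dbscan_kdtree, since both structures expose the same "all points within radius" primitive.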

HDBSCAN — The Parameter-Free Evolution

DBSCAN's biggest weakness is that single global eps. Imagine a dataset with three clusters: one tight group of 50 points, one medium-spread group of 200, and one loose cloud of 100. No single radius can capture all three — it's like trying to use one shoe size for the whole family.

HDBSCAN (Hierarchical DBSCAN), introduced by Campello, Moulavi, and Sander in 2013, solves this elegantly. Instead of picking one density threshold, it examines all possible thresholds and extracts the clusters that persist the longest. The key insight is the mutual reachability distance:

d_mreach(a, b) = max(core(a), core(b), d(a, b))

Where core(p) is the distance from point p to its k-th nearest neighbor. This transformation "stretches" distances in sparse regions while leaving dense regions unchanged. Sparse outliers become far from everything; dense cluster members stay close together.
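A tiny numeric example makes the stretching concrete (illustrative 1D values, with k = 2):

```python
import numpy as np

# Four dense points and two sparse ones, in 1D for readability
X = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 5.2])
D = np.abs(X[:, None] - X[None, :])

# Core distance: distance to the 2nd nearest neighbor (excluding self)
core = np.sort(D + np.diag(np.full(len(X), np.inf)), axis=1)[:, 1]

# Mutual reachability: max(core(a), core(b), d(a, b))
mreach = np.maximum(np.maximum.outer(core, core), D)

print(f"d(0.0, 0.1) = {D[0, 1]:.1f} -> d_mreach = {mreach[0, 1]:.1f}")  # dense pair: stays small
print(f"d(5.0, 5.2) = {D[4, 5]:.1f} -> d_mreach = {mreach[4, 5]:.1f}")  # sparse pair: stretched to ~4.9
```

The sparse pair sits 0.2 apart, yet their mutual reachability distance balloons to 4.9 because each has a large core distance — exactly the "stretching" that isolates low-density regions.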

The algorithm proceeds in five steps:

  1. Compute core distances — for each point, the distance to its k-th nearest neighbor
  2. Build the mutual reachability graph — edge weights are dmreach(a, b)
  3. Construct a minimum spanning tree — this captures the cluster hierarchy
  4. Build the condensed tree — only keep splits where both children have >= min_cluster_size points
  5. Extract clusters using stability — stability(C) = Σ_p (λ_p − λ_birth) over the points p in C, where λ = 1/d_mreach and λ_p is the density level at which p falls out of C

Here are steps 1–3 from scratch: core distances, then a minimum spanning tree over mutual reachability distances built with Prim's algorithm:
import numpy as np
from scipy.spatial import KDTree

def hdbscan_core(X, min_pts=5):
    """Compute HDBSCAN core distances and mutual reachability MST."""
    tree = KDTree(X)
    n = len(X)

    # Step 1: Core distances (distance to k-th nearest neighbor)
    dists, _ = tree.query(X, k=min_pts + 1)
    core_dists = dists[:, -1]  # k-th neighbor distance

    # Steps 2 + 3: Build the MST of the mutual reachability graph with
    # Prim's algorithm, computing d_mreach(a, b) = max(core(a), core(b), d(a, b))
    # on the fly for each candidate edge — O(n^2) overall, fine for a demo

    in_tree = np.zeros(n, dtype=bool)
    min_edge = np.full(n, np.inf)
    min_edge[0] = 0
    parent = np.full(n, -1)
    mst_edges = []

    for _ in range(n):
        # Find the node not yet in tree with smallest edge
        candidates = np.where(~in_tree)[0]
        u = candidates[np.argmin(min_edge[candidates])]
        in_tree[u] = True

        if parent[u] != -1:
            w = max(core_dists[u], core_dists[parent[u]],
                    np.linalg.norm(X[u] - X[parent[u]]))
            mst_edges.append((parent[u], u, w))

        # Update distances for neighbors
        for v in range(n):
            if in_tree[v]:
                continue
            d = np.linalg.norm(X[u] - X[v])
            mreach = max(core_dists[u], core_dists[v], d)
            if mreach < min_edge[v]:
                min_edge[v] = mreach
                parent[v] = u

    return core_dists, sorted(mst_edges, key=lambda e: e[2])

# Demonstrate on varying-density data
rng = np.random.RandomState(42)
tight = rng.randn(80, 2) * 0.15 + [0, 0]
medium = rng.randn(120, 2) * 0.4 + [3, 0]
loose = rng.randn(60, 2) * 0.8 + [6, 2]
X = np.vstack([tight, medium, loose])

# HDBSCAN reveals the density hierarchy through core distances
core_dists, mst = hdbscan_core(X, min_pts=5)
print(f"Core dists — tight: {core_dists[:80].mean():.3f}, "
      f"medium: {core_dists[80:200].mean():.3f}, "
      f"loose: {core_dists[200:].mean():.3f}")
print(f"MST edges: {len(mst)}, weight range [{mst[0][2]:.3f}, {mst[-1][2]:.3f}]")
# Core dists — tight: 0.079, medium: 0.214, loose: 0.501
# MST edges: 259, weight range [0.039, 1.826]
# The MST's heaviest edges bridge between clusters — cutting them reveals structure

# Compare: DBSCAN with a single eps can't handle all three densities
from collections import deque
def dbscan_simple(X, eps, min_pts):
    tree = KDTree(X)
    labels = np.full(len(X), -1)
    cid = 0
    for i in range(len(X)):
        if labels[i] != -1: continue
        nb = tree.query_ball_point(X[i], eps)
        if len(nb) < min_pts: continue
        labels[i] = cid
        q = deque([j for j in nb if j != i])
        while q:
            j = q.popleft()
            if labels[j] != -1: continue
            labels[j] = cid
            jnb = tree.query_ball_point(X[j], eps)
            if len(jnb) >= min_pts:
                q.extend(k for k in jnb if labels[k] == -1)
        cid += 1
    return labels

for eps in [0.2, 0.5, 1.0]:
    lbl = dbscan_simple(X, eps, 5)
    nc = len(set(lbl) - {-1})
    nn = np.sum(lbl == -1)
    print(f"DBSCAN eps={eps}: {nc} clusters, {nn} noise")
# DBSCAN eps=0.2: 1 clusters, 168 noise  (only finds tight cluster)
# DBSCAN eps=0.5: 2 clusters, 48 noise   (merges tight, finds medium)
# DBSCAN eps=1.0: 2 clusters, 5 noise    (merges tight+medium, finds loose)

No single eps finds all three clusters. HDBSCAN examines the full hierarchy of density levels and extracts the clusters that remain stable across the widest range of thresholds. The practical library (hdbscan by McInnes, Healy, and Astels) implements optimized versions of these steps and typically needs just one parameter: min_cluster_size.

When to Use What — Clustering Decision Guide

With K-Means, DBSCAN, HDBSCAN, and other clustering algorithms in your toolkit, how do you choose? Here's a practical decision framework:

Algorithm   Cluster Shape   Parameters            Handles Noise?   Complexity
K-Means     Spherical       K (known)             No               O(nKt)
DBSCAN      Arbitrary       eps, minPts           Yes              O(n log n)*
HDBSCAN     Arbitrary       min_cluster_size      Yes              O(n log n)*
GMMs        Elliptical      K, covariance type    Soft             O(nKd²)

* With spatial indexing. Naive implementations are O(n²).

Use K-Means when your clusters are roughly spherical, similarly-sized, and you know K. It's fast and well-understood. See K-Means from Scratch.

Use DBSCAN when clusters have arbitrary shapes, you don't know the number of clusters, your data has noise/outliers, and the clusters have roughly similar densities.

Use HDBSCAN when clusters have varying densities, you want minimal parameter tuning, and you need robust noise handling. It's the most flexible option and often the best default for exploratory clustering.

Use GMMs when clusters overlap and you want probabilistic ("soft") assignments. See EM from Scratch.
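As a quick illustration of what "soft" assignment buys you (a sketch using scikit-learn's GaussianMixture; the overlapping blobs are invented for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping blobs — a hard label would be misleading near the boundary
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(150, 2), rng.randn(150, 2) + [2.5, 0]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)   # per-point membership; each row sums to 1
print("point 0 membership:", np.round(probs[0], 3))
```

Points deep inside a blob get probabilities near 0/1; points in the overlap get genuinely split membership — information DBSCAN's hard labels throw away.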

Density-based clustering is one of those ideas that feels obvious in retrospect: instead of forcing data into symmetric blobs, follow where it's dense and let the gaps speak for themselves. DBSCAN gives you that with two parameters, and HDBSCAN takes it further by examining every density scale at once. Play with the demos below to build intuition for how these algorithms see your data differently.

Try It: Cluster Explorer

Click on the canvas to place points, then switch between algorithms to see how they cluster differently. Core points are filled circles, border points are smaller, noise is marked with ×.


Try It: eps Sensitivity Visualizer

This dataset has three clusters with different densities. Drag the eps slider to see how DBSCAN's single-scale limitation plays out, then toggle HDBSCAN to see it handle all densities at once.


References & Further Reading