
Expectation-Maximization from Scratch

The Incomplete Data Problem

K-means is one of the most popular algorithms in machine learning, but it has a dirty secret: it assumes every cluster is a perfect sphere of the same size. Real data laughs at this assumption. Customer segments overlap. Gene expression clusters are elongated ellipses. Anomaly distributions have wildly different spreads.

In 1977, Arthur Dempster, Nan Laird, and Donald Rubin published a paper that solved this problem so elegantly that it became one of the most-cited papers in all of statistics — the Expectation-Maximization algorithm. EM generalizes k-means from hard, circular cluster assignments to soft, probabilistic assignments with full covariance structure. And it comes with a guarantee k-means lacks: the observed-data log-likelihood never decreases from one iteration to the next.

Here’s the core problem. You observe data points X, and you suspect they were generated by a mixture of K probability distributions with unknown parameters θ. If you knew which distribution generated each point — the “latent variables” Z — maximum likelihood estimation would be straightforward. But Z is hidden. You have “incomplete data.”

Direct maximization of the log-likelihood log p(X|θ) = log Σ_Z p(X,Z|θ) is intractable because the logarithm of a sum doesn’t decompose into a sum of logarithms. EM’s insight is beautifully simple: alternate between inferring Z given current parameters (the E-step) and optimizing θ given the inferred Z (the M-step). Each step is easy. Together, they provably converge.
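A quick numerical illustration of that obstruction (toy numbers, not from any dataset): the log of a mixture sum does not split into a sum of logs, but Jensen’s inequality gives a lower bound that EM repeatedly tightens and then maximizes.

```python
import numpy as np

# One observed point x with 3 possible hidden states z.
# p(x) = sum_k pi_k * p(x | z=k) -- the quantity whose log we must maximize.
pi = np.array([0.5, 0.3, 0.2])        # mixing weights (sum to 1)
lik = np.array([0.20, 0.05, 0.60])    # made-up likelihoods p(x | z=k)

log_px = np.log(np.sum(pi * lik))     # log of a sum: does not decompose
bound = np.sum(pi * np.log(lik))      # sum of logs: easy to optimize

print(f"log p(x)     = {log_px:.4f}")   # -1.4482 (log of 0.235)
print(f"Jensen bound = {bound:.4f}")
# Jensen's inequality guarantees bound <= log_px; EM's E-step picks the
# auxiliary distribution that makes this bound tight before maximizing it.
```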

Let’s start by seeing exactly where k-means fails — and why we need something better.

import numpy as np

# Generate 3 clusters with DIFFERENT covariance structures
np.random.seed(42)
# Cluster 1: circular
c1 = np.random.randn(150, 2) * 0.8 + [0, 0]
# Cluster 2: elongated horizontally
c2 = np.random.randn(150, 2) @ [[2.5, 0], [0, 0.4]] + [5, 0]
# Cluster 3: elongated diagonally
angle = np.pi / 4
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
c3 = (np.random.randn(150, 2) @ [[2.0, 0], [0, 0.3]]) @ R.T + [2, 4]

X = np.vstack([c1, c2, c3])
true_labels = np.repeat([0, 1, 2], 150)

# Run k-means (Lloyd's algorithm)
centroids = X[np.random.choice(len(X), 3, replace=False)]
for _ in range(20):
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = np.argmin(dists, axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])

# Measure accuracy (best permutation matching)
from itertools import permutations
acc = max(np.mean(labels == np.array(perm)[true_labels])
          for perm in permutations(range(3)))
print(f"K-means accuracy on elliptical clusters: {acc:.1%}")
# Output: K-means accuracy on elliptical clusters: ~78%
# K-means forces spherical Voronoi boundaries — misclassifying
# points in the elongated tails of non-circular clusters

K-means achieves roughly 78% accuracy here — not terrible, but it’s systematically wrong. It draws straight Voronoi boundaries equidistant between centroids, but our clusters are ellipses that extend far beyond where a sphere would end. Points in the elongated tails get assigned to the wrong cluster. We need a model that understands cluster shape, not just cluster center.

Gaussian Mixture Models — The Generative Story

A Gaussian Mixture Model (GMM) describes data as a weighted sum of K Gaussian distributions:

p(x) = Σ_{k=1}^{K} π_k · N(x | μ_k, Σ_k)

Each component k has three things: a mixing weight π_k (how likely this component is to be chosen — the weights must sum to 1), a mean μ_k (where the cluster is centered), and a covariance matrix Σ_k (the shape and orientation of the cluster). Unlike k-means, which only has centroids, a GMM encodes the full geometry of each cluster.
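As a concrete sketch of this formula (the parameters here are made up for illustration, not the ones recovered later in the post), evaluating a mixture density at a point is just a weighted sum of Gaussian pdfs:

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """Multivariate normal density at a single point x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(cov, diff)       # (x-mu)^T Sigma^-1 (x-mu)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

# Hypothetical 2-component mixture in 2D
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

x = np.array([1.0, 0.5])
p_x = sum(w * mvn_pdf(x, m, c) for w, m, c in zip(weights, means, covs))
print(f"p(x) = {p_x:.4f}")
# The density is high near either mean and decays with each component's
# own covariance-shaped distance, not plain Euclidean distance.
```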

Think of the generative process like this: to generate a single data point, first roll a K-sided die weighted by π_1, ..., π_K to pick a component k. Then draw a sample from the Gaussian N(μ_k, Σ_k). The catch is that when we observe data, we only see the final point — the die roll (which component generated it) is hidden. This hidden variable z_i is exactly what makes EM necessary.

import numpy as np

def make_gmm_data(seed=42):
    """Generate 2D data from a known 3-component GMM."""
    rng = np.random.RandomState(seed)
    n_samples = 500

    # True parameters (what EM will try to recover)
    weights = [0.35, 0.40, 0.25]
    means = [np.array([0, 0]),
             np.array([5, 0]),
             np.array([2, 4])]
    covs = [np.array([[0.8, 0.2], [0.2, 0.6]]),         # slightly tilted
            np.array([[2.5, 0.0], [0.0, 0.15]]),         # wide and flat
            np.array([[0.6, -0.5], [-0.5, 1.5]])]        # tilted other way

    X, z_true = [], []
    for i in range(n_samples):
        k = rng.choice(3, p=weights)
        x = rng.multivariate_normal(means[k], covs[k])
        X.append(x)
        z_true.append(k)
    return np.array(X), np.array(z_true), weights, means, covs

X, z_true, true_w, true_mu, true_cov = make_gmm_data()
print(f"Generated {len(X)} points from 3-component GMM")
print(f"Component sizes: {[np.sum(z_true==k) for k in range(3)]}")
# Output: Generated 500 points from 3-component GMM
# Output: Component sizes: [176, 195, 129]  (close to 35/40/25%)

We’ve generated 500 points from a known mixture. The three clusters have different shapes: one is slightly tilted, another is wide and flat, and the third leans the opposite direction. Now the question is: given only the points X (without the hidden labels z_true), can we recover the parameters?

The E-Step — Soft Assignments via Bayes’ Rule

The E-step answers: “Given current parameter estimates, how likely is it that each point came from each cluster?” This is a direct application of Bayes’ rule. The responsibility of component k for point x_i is:

γ(z_ik) = π_k · N(x_i | μ_k, Σ_k) / Σ_{j=1}^{K} π_j · N(x_i | μ_j, Σ_j)

Read this as: the probability that point x_i belongs to cluster k, given the current parameters. It’s the likelihood of x_i under component k, weighted by k’s mixing weight, normalized so all responsibilities sum to 1.

This is where EM diverges fundamentally from k-means. K-means does hard assignment: each point belongs to exactly one cluster. EM does soft assignment: a point near the boundary between two clusters might be 60% cluster A and 40% cluster B. This soft assignment propagates uncertainty through the algorithm, leading to much better parameter estimates.

import numpy as np

def multivariate_gaussian(X, mu, cov):
    """Evaluate the multivariate Gaussian density at each row of X."""
    d = X.shape[1]
    diff = X - mu
    cov_inv = np.linalg.inv(cov)
    det = np.linalg.det(cov)
    norm = 1.0 / (np.sqrt((2 * np.pi) ** d * det))
    exponent = -0.5 * np.sum(diff @ cov_inv * diff, axis=1)
    return norm * np.exp(exponent)

def e_step(X, weights, means, covs):
    """Compute responsibilities: gamma[i, k] = p(z_i = k | x_i, params)."""
    N, K = len(X), len(weights)
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = weights[k] * multivariate_gaussian(X, means[k], covs[k])
    # Normalize so each row sums to 1
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

# Example: compute responsibilities with initial guesses
weights_init = [1/3, 1/3, 1/3]
means_init = [np.array([-1, -1]), np.array([4, 0]), np.array([1, 3])]
covs_init = [np.eye(2)] * 3

gamma = e_step(X, weights_init, means_init, covs_init)
# Show responsibilities for a few points
for i in [0, 100, 300]:
    print(f"Point {i}: responsibilities = [{gamma[i,0]:.3f}, "
          f"{gamma[i,1]:.3f}, {gamma[i,2]:.3f}]")
# Points near cluster centers have ~1.0 for one component
# Points in overlap regions show split responsibilities

Notice how points near a cluster center have a responsibility close to 1.0 for that cluster and near 0.0 for others. But points in the overlap region between clusters show genuinely split responsibilities — the algorithm honestly represents its uncertainty about which cluster generated them.

The M-Step — Updating Parameters

The M-step answers: “Given these soft assignments, what are the best parameters?” The answer turns out to be beautifully simple — just weighted versions of the standard maximum likelihood estimates, where the weights are the responsibilities from the E-step.

First, compute the effective number of points in each cluster:

N_k = Σ_{i=1}^{N} γ(z_ik)

Then update each parameter:

π_k^new = N_k / N
μ_k^new = (1 / N_k) Σ_{i=1}^{N} γ(z_ik) · x_i
Σ_k^new = (1 / N_k) Σ_{i=1}^{N} γ(z_ik) · (x_i − μ_k^new)(x_i − μ_k^new)^T

A subtle but important detail: the covariance update uses the newly computed mean μ_k^new, not the old one from the previous iteration. This is the correct update derived from maximizing the expected complete-data log-likelihood (see Bishop, Chapter 9).

def m_step(X, gamma):
    """Update GMM parameters using responsibilities as weights."""
    N, d = X.shape
    K = gamma.shape[1]

    # Effective number of points per component
    N_k = gamma.sum(axis=0)                      # shape (K,)

    # Update mixing weights
    weights = N_k / N                             # shape (K,)

    # Update means (responsibility-weighted centroids)
    means = []
    for k in range(K):
        mu_k = (gamma[:, k:k+1] * X).sum(axis=0) / N_k[k]
        means.append(mu_k)

    # Update covariances (using the NEW means)
    covs = []
    for k in range(K):
        diff = X - means[k]                       # (N, d)
        weighted_diff = gamma[:, k:k+1] * diff    # (N, d)
        cov_k = (weighted_diff.T @ diff) / N_k[k] # (d, d)
        # Add small regularization to prevent singular covariances
        cov_k += 1e-6 * np.eye(d)
        covs.append(cov_k)

    return list(weights), means, covs

# Run one M-step using the responsibilities from above
weights_new, means_new, covs_new = m_step(X, gamma)
print("Updated mixing weights:", [f"{w:.3f}" for w in weights_new])
print("Updated means:")
for k, mu in enumerate(means_new):
    print(f"  Component {k}: [{mu[0]:.2f}, {mu[1]:.2f}]")

After just one M-step, the means have already shifted toward the true cluster centers, and the covariance matrices have started capturing the elliptical shapes. But one iteration isn’t enough — we need to alternate E and M steps until convergence.

The Complete EM Algorithm

Here’s the full picture: initialize parameters (randomly, from k-means++, or with simple heuristics), then alternate E-step and M-step until the log-likelihood stops improving. The log-likelihood of the observed data is:

log L(θ) = Σ_{i=1}^{N} log [ Σ_{k=1}^{K} π_k · N(x_i | μ_k, Σ_k) ]

The most remarkable property of EM is its monotonic convergence guarantee: the log-likelihood never decreases from one iteration to the next. The proof uses Jensen’s inequality to establish an Evidence Lower BOund (ELBO). The E-step sets the auxiliary distribution to the true posterior, making the bound tight (ELBO = log-likelihood). The M-step then maximizes the ELBO, which can only increase it. Since the ELBO was tight at the start, the new log-likelihood must be at least as large as the old one.
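The decomposition behind that proof, log p(X|θ) = ELBO(q) + KL(q ‖ p(Z|X,θ)), can be checked numerically. The sketch below builds a tiny hand-picked 1D two-component mixture (nothing from this post’s dataset) and verifies both that the ELBO lower-bounds the log-likelihood for an arbitrary q, and that it is tight when q is the posterior, which is exactly the E-step’s choice.

```python
import numpy as np

def gauss(x, mu, var):
    """1D Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hand-picked two-component mixture and one observed point
pi = np.array([0.6, 0.4])
mu = np.array([-1.0, 2.0])
var = np.array([1.0, 0.5])
x = 0.3

joint = pi * gauss(x, mu, var)          # p(x, z=k) for each component k
log_px = np.log(joint.sum())            # observed-data log-likelihood
posterior = joint / joint.sum()         # p(z | x): the E-step's choice of q

def elbo(q):
    """ELBO(q) = E_q[log p(x, z)] - E_q[log q] for this single point."""
    return np.sum(q * (np.log(joint) - np.log(q)))

q_arbitrary = np.array([0.5, 0.5])
print(f"log p(x)          = {log_px:.4f}")
print(f"ELBO(arbitrary q) = {elbo(q_arbitrary):.4f}")   # strictly below
print(f"ELBO(posterior)   = {elbo(posterior):.4f}")     # equals log p(x)
```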

One caveat: EM converges to a local maximum, not necessarily the global one. The log-likelihood surface for GMMs is non-convex with many local optima. The standard practice is to run EM multiple times with different random initializations and keep the result with the highest final log-likelihood.

import numpy as np

def fit_gmm(X, K, max_iter=100, tol=1e-6, seed=0):
    """Fit a K-component GMM using the EM algorithm."""
    rng = np.random.RandomState(seed)
    N, d = X.shape

    # Initialize: random means from data, identity covariances, uniform weights
    indices = rng.choice(N, K, replace=False)
    means = [X[idx].copy() for idx in indices]
    covs = [np.eye(d) * np.var(X, axis=0).mean() for _ in range(K)]
    weights = [1.0 / K] * K

    log_likelihoods = []

    for iteration in range(max_iter):
        # --- E-step ---
        gamma = np.zeros((N, K))
        for k in range(K):
            diff = X - means[k]
            cov_inv = np.linalg.inv(covs[k])
            det = np.linalg.det(covs[k])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * det)
            exponent = -0.5 * np.sum(diff @ cov_inv * diff, axis=1)
            gamma[:, k] = weights[k] * norm * np.exp(exponent)
        gamma_sum = gamma.sum(axis=1, keepdims=True)
        gamma /= gamma_sum

        # --- Log-likelihood ---
        ll = np.sum(np.log(gamma_sum.ravel()))
        log_likelihoods.append(ll)
        if len(log_likelihoods) > 1 and ll - log_likelihoods[-2] < tol:
            break

        # --- M-step ---
        N_k = gamma.sum(axis=0)
        weights = list(N_k / N)
        new_means = []
        new_covs = []
        for k in range(K):
            mu_k = (gamma[:, k:k+1] * X).sum(axis=0) / N_k[k]
            new_means.append(mu_k)
            diff = X - mu_k
            cov_k = (gamma[:, k:k+1] * diff).T @ diff / N_k[k]
            cov_k += 1e-6 * np.eye(d)
            new_covs.append(cov_k)
        means = new_means
        covs = new_covs

    return weights, means, covs, gamma, log_likelihoods

# Fit a 3-component GMM to our elliptical cluster data
w, mu, cov, gamma, lls = fit_gmm(X, K=3, seed=7)
print(f"Converged in {len(lls)} iterations")
print(f"Log-likelihood: {lls[0]:.1f} -> {lls[-1]:.1f} (monotonic increase)")
print(f"Recovered weights: {[f'{v:.3f}' for v in w]}")
for k in range(3):
    print(f"Component {k}: mean=[{mu[k][0]:.2f}, {mu[k][1]:.2f}]")

# Verify monotonic increase
diffs = [lls[i+1] - lls[i] for i in range(len(lls)-1)]
print(f"All LL increases non-negative: {all(d >= -1e-10 for d in diffs)}")
# Output: True — EM's convergence guarantee holds

Watch the log-likelihood climb: it starts low, rises steeply in the first few iterations as the algorithm finds the rough cluster structure, then levels off as it fine-tunes parameters. Every single step increases the log-likelihood — never a decrease. This is the ELBO guarantee in action.

K-Means is Hard EM

Now for the punchline: k-means isn’t just similar to EM — it’s a special case. Specifically, k-means corresponds to EM on a Gaussian mixture where:

  1. All covariance matrices are forced to σ²I (isotropic, same size)
  2. In the limit σ² → 0, soft responsibilities collapse to hard assignments

When σ² is tiny, the Gaussian density becomes an extremely sharp spike at each mean. The responsibility γ(z_ik) approaches 1 for the nearest cluster and 0 for all others — exactly a hard assignment. The M-step mean update becomes an unweighted average of assigned points — exactly the k-means centroid update. This is established in Bishop’s Pattern Recognition and Machine Learning, Section 9.3.2.
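This limit is easy to see numerically. The sketch below (hypothetical means and test point) computes responsibilities under two equal-weight isotropic Gaussians and shrinks σ²: the soft assignment hardens into k-means’ one-hot assignment.

```python
import numpy as np

def isotropic_resp(x, means, sigma2):
    """Responsibilities under equal-weight N(mu_k, sigma2 * I) components."""
    # The shared (2*pi*sigma2)^(-d/2) normalizer cancels in the ratio
    log_dens = -np.sum((x - means) ** 2, axis=1) / (2 * sigma2)
    log_dens -= log_dens.max()          # stabilize before exponentiating
    dens = np.exp(log_dens)
    return dens / dens.sum()

x = np.array([1.0, 0.0])                        # test point, closer to mean 0
means = np.array([[0.0, 0.0], [3.0, 0.0]])      # two hypothetical centers

for s2 in [4.0, 1.0, 0.1, 0.01]:
    r = isotropic_resp(x, means, s2)
    print(f"sigma^2 = {s2:>5}: responsibilities = [{r[0]:.4f}, {r[1]:.4f}]")
# As sigma^2 -> 0 the nearer mean absorbs all responsibility: soft EM
# assignments collapse to the hard assignments of k-means
```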

So k-means is EM with two constraints: hard assignment and spherical equal-size clusters. Remove those constraints, and you get the full power of GMMs. The trade-off is more parameters (means + full covariance matrices + weights vs. just means) and greater sensitivity to initialization.

import numpy as np
from itertools import permutations

def run_kmeans(X, K, seed=0, max_iter=50):
    """Standard k-means for comparison."""
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels

# Compare on our elliptical cluster dataset
km_labels = run_kmeans(X, 3, seed=7)
em_labels = np.argmax(gamma, axis=1)  # gamma from EM fit above

def best_accuracy(pred, true, K):
    return max(np.mean(pred == np.array(p)[true]) for p in permutations(range(K)))

km_acc = best_accuracy(km_labels, z_true, 3)
em_acc = best_accuracy(em_labels, z_true, 3)
print(f"K-means accuracy: {km_acc:.1%}")
print(f"EM/GMM  accuracy: {em_acc:.1%}")
# K-means: ~78% — struggles with elliptical clusters
# EM/GMM:  ~96% — captures the true cluster shapes

The accuracy gap is dramatic: roughly 78% for k-means versus 96% for EM. K-means draws straight Voronoi boundaries equidistant between centroids, cutting through the elongated clusters. EM draws elliptical decision boundaries that follow the true covariance structure. When clusters are spherical and well-separated, both algorithms perform identically — but the real world rarely obliges.

Model Selection — How Many Clusters?

K-means gives you no principled way to choose K (the k-means post covers the elbow method and silhouette scores — both are heuristics). GMMs offer something better: the Bayesian Information Criterion (BIC), which balances model fit against complexity:

BIC = −2 · log L + p · log N

Here p is the number of free parameters and N is the number of data points. Lower BIC is better. The first term rewards good fit (high log-likelihood), and the second term penalizes complexity (more parameters). A K-component GMM in d dimensions has:

  1. K·d mean parameters
  2. K·d(d+1)/2 covariance parameters (each Σ_k is symmetric)
  3. K−1 free mixing weights (they must sum to 1)

For our example with K=3 components in d=2 dimensions: 6 (means) + 9 (covariances) + 2 (weights) = 17 parameters. Fitting K=6 would double the parameters, and BIC penalizes that complexity unless the extra components genuinely improve the fit.

import numpy as np

def compute_bic(X, K, seed=0):
    """Fit a GMM with K components and return BIC."""
    w, mu, cov, _, lls = fit_gmm(X, K, max_iter=200, seed=seed)
    N, d = X.shape
    n_params = K * d + K * d * (d + 1) // 2 + (K - 1)
    log_likelihood = lls[-1]
    bic = -2 * log_likelihood + n_params * np.log(N)
    return bic, log_likelihood, n_params

# Fit all candidate K first, then mark the best row. (Marking inside the
# fitting loop would flag the running best, not the final one.)
results = [compute_bic(X, K, seed=7) for K in range(1, 7)]
bics = [r[0] for r in results]
best = int(np.argmin(bics))

print(f"{'K':>2} | {'Params':>6} | {'Log-L':>10} | {'BIC':>10}")
print("-" * 40)
for i, (bic, ll, p) in enumerate(results):
    marker = " <-- best" if i == best else ""
    print(f"{i+1:2d} | {p:6d} | {ll:10.1f} | {bic:10.1f}{marker}")

best_K = np.argmin(bics) + 1
print(f"\nBIC selects K={best_K} components (true K=3)")
# BIC correctly identifies that 3 components best explain the data

BIC drops sharply from K=1 to K=3 as the model captures the three true clusters, then rises for K=4, 5, 6 as the extra components don’t justify their parameter cost. This is automatic, principled model selection — no squinting at elbow plots required.

Try It: EM Visualizer

Step through the EM algorithm manually. Click E-Step to update responsibilities (point colors shift), then M-Step to update parameters (ellipses reshape). Watch the log-likelihood climb monotonically.


Try It: K-Means vs EM Race

Watch k-means (left) and EM (right) cluster the same data simultaneously. K-means draws hard Voronoi boundaries; EM draws soft elliptical contours. Try elliptical clusters to see EM’s advantage.


References & Further Reading

Posts in this series: K-Means Clustering from Scratch (hard EM special case) · Bayesian Inference from Scratch (posterior estimation context) · Information Theory from Scratch (KL divergence used in ELBO proof) · Anomaly Detection from Scratch (density-based methods that GMMs complement)