
Hidden Markov Models from Scratch

Markov Chains — The Memoryless Foundation

Tomorrow's weather depends on today's weather, but not on yesterday's. A chess engine evaluates the current board position, not the sequence of moves that led there. A web user's next click depends on the current page, not their entire browsing history. This elegant simplification — the future depends only on the present, not the past — is called the Markov property, and it's one of the most powerful assumptions in all of mathematics.

A Markov chain formalizes this idea. It's a sequence of random variables X₁, X₂, X₃, ... where each variable's distribution depends only on the previous one:

P(Xₜ₊₁ | Xₜ, Xₜ₋₁, ..., X₁) = P(Xₜ₊₁ | Xₜ)

The entire chain is described by a transition matrix A where Aᵢⱼ = P(next = j | current = i) — the probability of moving from state i to state j. Each row sums to 1 (you have to go somewhere). For a well-behaved chain like the one below — every state reachable from every other, no deterministic cycles — a fascinating property emerges: as you raise A to higher powers, Aⁿ converges to a matrix where every row is identical — the stationary distribution. No matter where the chain starts, it eventually settles into the same long-run frequencies.

Let's build a weather Markov chain and verify this convergence:

import numpy as np

# Transition matrix: Sunny, Rainy, Cloudy
# A[i][j] = P(next=j | current=i)
A = np.array([
    [0.6, 0.2, 0.2],  # Sunny  -> 60% Sunny, 20% Rainy, 20% Cloudy
    [0.3, 0.4, 0.3],  # Rainy  -> 30% Sunny, 40% Rainy, 30% Cloudy
    [0.4, 0.3, 0.3],  # Cloudy -> 40% Sunny, 30% Rainy, 30% Cloudy
])
states = ["Sunny", "Rainy", "Cloudy"]

# Simulate 10,000 steps
np.random.seed(42)
current = 0  # start Sunny
counts = np.zeros(3)
for _ in range(10_000):
    counts[current] += 1
    current = np.random.choice(3, p=A[current])

empirical = counts / counts.sum()

# Stationary distribution: left eigenvector with eigenvalue 1
eigenvalues, eigenvectors = np.linalg.eig(A.T)
idx = np.argmin(np.abs(eigenvalues - 1.0))
stationary = np.real(eigenvectors[:, idx])
stationary = stationary / stationary.sum()

for i, s in enumerate(states):
    print(f"{s:<8} empirical={empirical[i]:.3f}  stationary={stationary[i]:.3f}")
# Sunny    empirical ≈ stationary = 0.465
# Rainy    empirical ≈ stationary = 0.282
# Cloudy   empirical ≈ stationary = 0.254
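The Aⁿ convergence claim can be checked directly with np.linalg.matrix_power — for this matrix the exact stationary vector is [33, 20, 18] / 71:

```python
import numpy as np

A = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3],
              [0.4, 0.3, 0.3]])

# Raise A to a high power: every row converges to the same vector,
# the stationary distribution (exactly [33, 20, 18] / 71 here)
A50 = np.linalg.matrix_power(A, 50)
print(A50.round(4))
# [[0.4648 0.2817 0.2535]
#  [0.4648 0.2817 0.2535]
#  [0.4648 0.2817 0.2535]]
```

All three routes — simulation, the eigenvector computation, and matrix powers — land on the same distribution.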

The empirical frequencies from simulating 10,000 steps match the theoretical stationary distribution almost exactly. Markov chains are the workhorses behind MCMC sampling and the foundation of Markov Decision Processes in reinforcement learning. But their real power emerges when we add a twist: what if you can't observe the states directly?

Hidden States — When You Can't See What's Happening

Here's the scenario that motivates everything that follows. Imagine a casino with two dice: a Fair die (each face has probability 1/6) and a Loaded die (face 6 has probability 1/2, the other five faces split the remaining 1/2). The casino secretly switches between the two dice as you play. You can see the rolls, but not which die produced them. Your job: figure out when the casino was cheating.

This is a Hidden Markov Model. The Markov chain of hidden states (Fair/Loaded) follows transition probabilities the casino controls. Each hidden state emits an observable symbol (the dice roll) according to its own emission distribution. The HMM is fully specified by three components:

  1. Transition matrix A: Aᵢⱼ = P(next hidden state = j | current hidden state = i)
  2. Emission matrix B: Bᵢₖ = P(observe symbol k | hidden state = i)
  3. Initial distribution π: πᵢ = P(X₁ = i)

Together, λ = (A, B, π) defines the model. Every HMM gives rise to three fundamental questions:

  1. Evaluation: Given a sequence of observations, how likely is it under this model? (Forward algorithm)
  2. Decoding: What's the most likely sequence of hidden states? (Viterbi algorithm)
  3. Learning: What model parameters best explain the observations? (Baum-Welch algorithm)

Let's define our casino HMM and generate some data:

import numpy as np

class HMM:
    def __init__(self, A, B, pi):
        self.A = np.array(A, dtype=float)   # N x N transition
        self.B = np.array(B, dtype=float)   # N x M emission
        self.pi = np.array(pi, dtype=float) # N initial probs
        self.N = self.A.shape[0]            # num hidden states
        self.M = self.B.shape[1]            # num observation symbols

    def generate(self, T, seed=42):
        rng = np.random.RandomState(seed)
        states, obs = [], []
        state = rng.choice(self.N, p=self.pi)
        for _ in range(T):
            states.append(state)
            obs.append(rng.choice(self.M, p=self.B[state]))
            state = rng.choice(self.N, p=self.A[state])
        return np.array(states), np.array(obs)

# Dishonest casino: state 0 = Fair, state 1 = Loaded
casino = HMM(
    A=[[0.95, 0.05],   # Fair stays Fair 95%, switches to Loaded 5%
       [0.10, 0.90]],  # Loaded stays Loaded 90%, switches to Fair 10%
    B=[[1/6]*6,                          # Fair die: uniform
       [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]], # Loaded die: 6 has prob 0.5
    pi=[0.5, 0.5]
)

true_states, observations = casino.generate(T=30)
faces = observations + 1  # 0-indexed to 1-indexed

print("Rolls: ", " ".join(str(f) for f in faces))
print("Truth: ", " ".join("F" if s == 0 else "L" for s in true_states))
# Rolls:  3 1 6 6 6 2 6 6 5 4 6 1 6 6 6 6 6 3 2 5 1 6 3 4 2 3 5 1 3 6
# Truth:  F F L L L L L L L L L F F L L L L L F F F F F F F F F F F L

Notice the pattern: when the Loaded die is active, you see clusters of 6s. But individual rolls are ambiguous — a Fair die can roll a 6 too. The algorithms we'll build next solve the problem of reasoning under this uncertainty. They're the conceptual ancestors of attention mechanisms and recurrent neural networks, and understanding them illuminates why those modern architectures work the way they do.

The Forward Algorithm — How Likely Is This Sequence?

The first question: given our casino model and a sequence of dice rolls, what's the probability that this model produced those exact observations? This might seem academic, but it's the foundation for everything else — comparing models, training parameters, and even decoding hidden states.

The brute-force approach: enumerate every possible hidden state sequence, compute the joint probability of that sequence and the observations, then sum them all up. With N hidden states and T time steps, that's Nᵀ sequences — exponential and completely intractable.
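For a sequence short enough to enumerate, the brute-force sum is easy to write down. A sketch with a hypothetical 2-state, 2-symbol model:

```python
import numpy as np
from itertools import product

# Hypothetical tiny model: 2 hidden states, 2 observation symbols
A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # transitions
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # emissions
pi = np.array([0.6, 0.4])                 # initial distribution
obs = [0, 1, 1, 0]

# Sum the joint probability over all N^T = 2^4 = 16 hidden paths
total = 0.0
for path in product(range(2), repeat=len(obs)):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    total += p

print(f"P(O) by summing over {2 ** len(obs)} paths: {total:.6f}")
```

Sixteen paths is nothing, but at T = 30 this loop would need over a billion iterations — hence the Forward algorithm.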

The Forward algorithm uses dynamic programming to collapse this exponential sum into polynomial time. Define the forward variable:

αₜ(i) = P(O₁, O₂, ..., Oₜ, Xₜ = i | λ)

This is the probability of observing the first t symbols and being in hidden state i at time t. The recursion is elegant:

α₁(i) = πᵢ · Bᵢ(O₁)
αₜ₊₁(j) = [ Σᵢ αₜ(i) · Aᵢⱼ ] · Bⱼ(Oₜ₊₁)

Summing the final column of the trellis gives the answer: P(O | λ) = Σᵢ αᵀ(i).

This runs in O(N² · T) time instead of O(Nᵀ · T). For our 2-state casino over 30 rolls, that's 120 operations instead of over a billion. We work in log-space to avoid numerical underflow when multiplying many small probabilities:

def forward(hmm, obs):
    """Forward algorithm in log-space. Returns log(alpha) trellis and log P(O|model)."""
    T = len(obs)
    log_alpha = np.full((T, hmm.N), -np.inf)

    # Base case: t=0
    for i in range(hmm.N):
        log_alpha[0, i] = np.log(hmm.pi[i]) + np.log(hmm.B[i, obs[0]])

    # Recursion: t=1..T-1
    for t in range(1, T):
        for j in range(hmm.N):
            log_sum = np.logaddexp.reduce(
                log_alpha[t-1, :] + np.log(hmm.A[:, j])
            )
            log_alpha[t, j] = log_sum + np.log(hmm.B[j, obs[t]])

    log_prob = np.logaddexp.reduce(log_alpha[T-1, :])
    return log_alpha, log_prob

log_alpha, log_prob = forward(casino, observations)
print(f"Log P(observations | casino HMM) = {log_prob:.2f}")
print(f"P(observations) = {np.exp(log_prob):.2e}")
# Log P(observations | casino HMM) = -40.17
# P(observations) = 3.57e-18

That probability looks tiny, but any specific sequence of 30 dice rolls has a low probability. What matters is comparison: an HMM with only a Fair die would assign even lower probability to a sequence full of 6s, because it can't explain the clustering. The forward algorithm lets us score how well a model fits the data — the first step toward learning.
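The comparison is easy to make exact for that baseline: a model with only a Fair die assigns every roll probability 1/6 no matter what came before, so its log-likelihood depends only on the sequence length:

```python
import numpy as np

T = 30
log_prob_fair_only = T * np.log(1 / 6)  # every roll has probability 1/6
print(f"Fair-only model: log P(O) = {log_prob_fair_only:.2f}")
# Fair-only model: log P(O) = -53.75
```

That's about 13.6 nats worse than the two-die model's -40.17 — strong evidence that a single fair die can't explain the clusters of 6s.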

The Viterbi Algorithm — Finding the Most Likely Hidden Path

Now the question every detective asks: given the dice rolls we observed, what's the single most likely sequence of hidden states? When was the casino using the Fair die, and when did it switch to Loaded?

The Viterbi algorithm is strikingly similar to the Forward algorithm, with one crucial change: replace the sum (Σ) with a max. Instead of accumulating total probability, we track the best probability at each state and remember which predecessor led to it:

δ₁(i) = πᵢ · Bᵢ(O₁)
δₜ₊₁(j) = [ maxᵢ δₜ(i) · Aᵢⱼ ] · Bⱼ(Oₜ₊₁),   ψₜ₊₁(j) = argmaxᵢ δₜ(i) · Aᵢⱼ

After processing all T observations, we start at the highest-scoring final state and follow the backpointers to reconstruct the globally optimal path. This is not a greedy algorithm — each decision at time t considers the entire future through the dynamic programming table. It's the same idea behind beam search in modern language models, but Viterbi gives the provably optimal solution for HMMs:

def viterbi(hmm, obs):
    """Viterbi algorithm in log-space. Returns best state sequence and its log-prob."""
    T = len(obs)
    log_delta = np.full((T, hmm.N), -np.inf)
    psi = np.zeros((T, hmm.N), dtype=int)

    # Base case
    for i in range(hmm.N):
        log_delta[0, i] = np.log(hmm.pi[i]) + np.log(hmm.B[i, obs[0]])

    # Recursion
    for t in range(1, T):
        for j in range(hmm.N):
            scores = log_delta[t-1, :] + np.log(hmm.A[:, j])
            psi[t, j] = np.argmax(scores)
            log_delta[t, j] = scores[psi[t, j]] + np.log(hmm.B[j, obs[t]])

    # Backtrack
    path = np.zeros(T, dtype=int)
    path[T-1] = np.argmax(log_delta[T-1, :])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1, path[t+1]]

    return path, np.max(log_delta[T-1, :])

decoded, log_prob = viterbi(casino, observations)
labels = ["F", "L"]

print("Rolls:   ", " ".join(str(f) for f in faces))
print("Viterbi: ", " ".join(labels[s] for s in decoded))
print("Truth:   ", " ".join(labels[s] for s in true_states))
accuracy = np.mean(decoded == true_states)
print(f"Accuracy: {accuracy:.1%}")
# Rolls:    3 1 6 6 6 2 6 6 5 4 6 1 6 6 6 6 6 3 2 5 1 6 3 4 2 3 5 1 3 6
# Viterbi:  F F L L L L L L L L L F L L L L L F F F F F F F F F F F F L
# Truth:    F F L L L L L L L L L F F L L L L L F F F F F F F F F F F L
# Accuracy: 93.3%

The Viterbi decoder nails 93% of the hidden states. It correctly identifies the long Loaded stretch in positions 3–11 and the Fair stretch in positions 19–29. The few errors happen at transition boundaries, where a single roll is ambiguous — exactly where you'd expect uncertainty.

The Backward Algorithm and Posterior Decoding

Viterbi gives us the single best sequence, but sometimes we want something richer: the probability of being in each hidden state at each time step. For that, we need the Backward algorithm, the mirror image of the Forward algorithm.

The backward variable βₜ(i) = P(Oₜ₊₁, ..., Oᵀ | Xₜ = i, λ) captures the probability of the future observations given the current state. The recursion runs right to left:

βᵀ(i) = 1  (an empty future is certain)
βₜ(i) = Σⱼ Aᵢⱼ · Bⱼ(Oₜ₊₁) · βₜ₊₁(j)

Combining forward and backward gives the posterior probability of each hidden state at each time step: P(Xₜ = i | O, λ) = αₜ(i) · βₜ(i) / P(O | λ). Think of it as attention weights over hidden states — at each time step, the model distributes probability mass across possible states based on all the evidence (past and future).

def backward(hmm, obs):
    """Backward algorithm in log-space. Returns log(beta) trellis."""
    T = len(obs)
    log_beta = np.full((T, hmm.N), -np.inf)
    log_beta[T-1, :] = 0.0  # log(1) = 0

    for t in range(T-2, -1, -1):
        for i in range(hmm.N):
            log_beta[t, i] = np.logaddexp.reduce(
                np.log(hmm.A[i, :]) + np.log(hmm.B[:, obs[t+1]]) + log_beta[t+1, :]
            )
    return log_beta

def posterior_decode(hmm, obs):
    """Compute posterior state probabilities and decode."""
    log_alpha, log_prob = forward(hmm, obs)
    log_beta = backward(hmm, obs)

    # gamma[t, i] = P(X_t = i | O, model)
    log_gamma = log_alpha + log_beta - log_prob
    gamma = np.exp(log_gamma)

    # Posterior decoding: pick most likely state at each t
    path = np.argmax(gamma, axis=1)
    return gamma, path

gamma, posterior_path = posterior_decode(casino, observations)

print("Rolls:     ", " ".join(str(f) for f in faces))
print("Posterior: ", " ".join(labels[s] for s in posterior_path))
print("Viterbi:   ", " ".join(labels[s] for s in decoded))
print("Truth:     ", " ".join(labels[s] for s in true_states))
print(f"\nP(Loaded) at each step:")
print(" ".join(f"{gamma[t,1]:.2f}" for t in range(len(observations))))
# Posterior and Viterbi agree on most positions
# P(Loaded) is high (>0.5) during loaded stretches, low during fair stretches

Posterior decoding maximizes the expected number of correctly labeled states, while Viterbi maximizes the probability of the entire sequence being correct. In practice they often agree, but they can diverge at state boundaries. The posterior probabilities are especially useful for visualizing uncertainty — rather than a hard binary label, you get a continuous confidence score at every time step.
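A useful sanity check on any forward-backward implementation: Σᵢ αₜ(i) · βₜ(i) equals P(O | λ) at every time step t, not just the last. A minimal vectorized check in plain probability space (fine for short sequences), using a small hypothetical model:

```python
import numpy as np

# Hypothetical 2-state, 2-symbol HMM
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 1, 0, 1]
T, N = len(obs), 2

# Forward: alpha[t, i] = P(O_1..O_t, X_t = i)
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

# Backward: beta[t, i] = P(O_{t+1}..O_T | X_t = i)
beta = np.ones((T, N))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

# The same P(O) must come out at every slice of the trellis
totals = (alpha * beta).sum(axis=1)
print(totals)  # all T entries are identical: P(O | lambda)
```

If the totals drift apart across t, the recursion indexing is wrong — a handy unit test before moving on to Baum-Welch.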

Baum-Welch — Learning HMM Parameters with EM

So far, we assumed the casino's transition and emission probabilities were known. In reality, they're exactly what you need to learn. The Baum-Welch algorithm solves this: given only the observed dice rolls (no hidden state labels), find the parameters λ = (A, B, π) that maximize the observation likelihood P(O | λ).

If you've read our post on Expectation-Maximization, you already know the pattern. Baum-Welch is EM, specialized for HMMs:

  1. E-step: run forward-backward under the current parameters to compute γₜ(i) = P(Xₜ = i | O), the expected state occupancies, and ξₜ(i, j) = P(Xₜ = i, Xₜ₊₁ = j | O), the expected transitions.
  2. M-step: re-estimate λ = (A, B, π) from those expected counts — e.g., Aᵢⱼ becomes the expected number of i→j transitions divided by the expected number of visits to state i.

The log-likelihood monotonically increases at every iteration — guaranteed by the EM framework. Let's implement it and watch random parameters converge to the truth:

def baum_welch(obs, N, M, max_iter=100, seed=0):
    """Learn HMM parameters from observations using EM."""
    rng = np.random.RandomState(seed)

    # Random initialization
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    pi = rng.dirichlet(np.ones(N))
    hmm = HMM(A, B, pi)
    T = len(obs)

    for iteration in range(max_iter):
        # E-step: forward-backward
        log_alpha, log_prob = forward(hmm, obs)
        log_beta = backward(hmm, obs)

        # Compute gamma: P(X_t = i | O)
        log_gamma = log_alpha + log_beta - log_prob
        gamma = np.exp(log_gamma)

        # Compute xi: P(X_t = i, X_{t+1} = j | O)
        xi = np.zeros((T-1, N, N))
        for t in range(T-1):
            for i in range(N):
                for j in range(N):
                    xi[t, i, j] = np.exp(
                        log_alpha[t, i] + np.log(hmm.A[i, j])
                        + np.log(hmm.B[j, obs[t+1]]) + log_beta[t+1, j]
                        - log_prob
                    )

        # M-step: update parameters
        hmm.pi = gamma[0]
        for i in range(N):
            denom = gamma[:-1, i].sum()
            for j in range(N):
                hmm.A[i, j] = xi[:, i, j].sum() / denom
            for k in range(M):
                mask = (obs == k)
                hmm.B[i, k] = gamma[mask, i].sum() / gamma[:, i].sum()

        if iteration % 20 == 0:
            print(f"Iter {iteration:3d}: log P(O) = {log_prob:.2f}")

    print(f"Iter {iteration:3d}: log P(O) = {log_prob:.2f}")
    return hmm

# Generate longer sequence for better learning
_, long_obs = casino.generate(T=500, seed=99)
learned = baum_welch(long_obs, N=2, M=6, max_iter=100)

print(f"\nLearned transitions:\n{learned.A.round(3)}")
print(f"True transitions:\n{casino.A}")
print(f"\nLearned emissions:\n{learned.B.round(3)}")
print(f"True emissions:\n{np.array(casino.B).round(3)}")
# Learned parameters converge close to true values

Starting from random parameters, Baum-Welch recovers the casino's true transition and emission probabilities from nothing but the raw dice rolls. The log-likelihood increases monotonically, just as the EM guarantee promises. Note the subtlety: with only two states, the algorithm might swap the labels (calling "Fair" what we called "Loaded" and vice versa). This label symmetry is inherent to unsupervised learning — the model discovers the structure but can't name it.
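One practical way to handle the label symmetry when comparing learned parameters against the truth: try every relabeling of the learned states and keep the closest one. A sketch with hypothetical learned emission values:

```python
import numpy as np
from itertools import permutations

def align_states(B_learned, B_true):
    """Return the relabeling of learned states closest to the true emissions."""
    N = B_true.shape[0]
    return min(permutations(range(N)),
               key=lambda p: np.abs(B_learned[list(p)] - B_true).sum())

B_true = np.array([[1/6] * 6,                          # Fair
                   [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])   # Loaded
# Hypothetical Baum-Welch output with the labels flipped
B_learned = np.array([[0.09, 0.11, 0.10, 0.10, 0.10, 0.50],
                      [0.17, 0.16, 0.17, 0.17, 0.16, 0.17]])

perm = align_states(B_learned, B_true)
print(perm)  # (1, 0): learned state 1 is "Fair", learned state 0 is "Loaded"
```

With only two states this is trivial, but the same brute-force permutation search works for any small N.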

One important caveat: Baum-Welch converges to a local maximum, not necessarily the global one. In practice, you run it multiple times with different random seeds and keep the model with the highest final log-likelihood. With enough data (we used 500 rolls above), the local optima tend to be good ones.

HMMs in the Wild — Speech, DNA, and Beyond

The dishonest casino is a toy example, but the same three algorithms — Forward, Viterbi, Baum-Welch — powered some of the most important applications in computing history:

  1. Speech recognition: for decades, state-of-the-art recognizers modeled each phoneme as a small HMM emitting acoustic features, decoded with Viterbi.
  2. Bioinformatics: profile HMMs align protein families, and gene finders segment raw DNA into coding and non-coding regions — hidden states you can't observe directly.
  3. Natural language processing: classic part-of-speech taggers treat tags as hidden states emitting words.

So why did deep learning replace HMMs? Three fundamental limitations:

  1. First-order memory: The Markov assumption means only one step of history matters. RNNs and transformers can learn arbitrarily long dependencies.
  2. Discrete states: HMMs require a fixed, finite set of hidden states. Neural networks learn continuous hidden representations that can encode far richer information.
  3. Hand-designed features: HMM observations must be pre-defined symbols or Gaussian features. Deep learning learns its own features end-to-end.

But the ideas live on. The Forward algorithm's sum-over-paths computation appears in CTC (Connectionist Temporal Classification) for speech and OCR. Viterbi decoding is used in convolutional error-correcting codes. And the Baum-Welch insight — that you can learn latent structure from observations alone using EM — is the conceptual ancestor of every unsupervised and self-supervised technique in modern ML.

References & Further Reading

Related DadOps posts: Expectation-Maximization from Scratch (Baum-Welch is EM for sequences), Bayesian Inference from Scratch (HMMs as graphical models), RNNs from Scratch (the neural successor to HMMs), Decoding Strategies from Scratch (Viterbi vs beam search), Information Theory from Scratch (entropy of Markov chains), Reinforcement Learning from Scratch (MDPs extend Markov chains), Naive Bayes from Scratch (HMMs generalize Naive Bayes to sequences)

Try It: Markov Chain State Explorer

Watch a token move through a 3-state Markov chain. Edit transition probabilities with the sliders and observe how the stationary distribution changes.


Try It: Viterbi Trellis Decoder

Click the dice to set an observation sequence, then step through the Viterbi algorithm. Watch it find the most likely Fair/Loaded path through the trellis.
