Hidden Markov Models from Scratch
Markov Chains — The Memoryless Foundation
Tomorrow's weather depends on today's weather, but not on yesterday's. A chess engine evaluates the current board position, not the sequence of moves that led there. A web user's next click depends on the current page, not their entire browsing history. This elegant simplification — the future depends only on the present, not the past — is called the Markov property, and it's one of the most powerful assumptions in all of mathematics.
A Markov chain formalizes this idea. It's a sequence of random variables X₁, X₂, X₃, ... where each variable's distribution depends only on the previous one:
P(Xₜ₊₁ | Xₜ, Xₜ₋₁, ..., X₁) = P(Xₜ₊₁ | Xₜ)
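That conditional-independence claim is easy to check empirically. Here's a quick sketch (the two-state matrix below is arbitrary, chosen only for illustration): we estimate the next-step distribution with and without extra conditioning on the state before last, and the two estimates should agree.

```python
import numpy as np

# A hypothetical 2-state chain — the numbers are illustrative only
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

rng = np.random.RandomState(0)
n_steps = 100_000
chain = np.empty(n_steps, dtype=int)
chain[0] = 0
for t in range(1, n_steps):
    chain[t] = rng.choice(2, p=A[chain[t - 1]])

# Estimate P(next=1 | current=0), split by the value of the state
# *before* the current one. The Markov property says the extra
# conditioning should not change the answer.
for prev in (0, 1):
    mask = (chain[:-2] == prev) & (chain[1:-1] == 0)
    p = chain[2:][mask].mean()  # fraction of next-states equal to 1
    print(f"P(next=1 | current=0, previous={prev}) ~ {p:.3f}")
# Both estimates land close to A[0, 1] = 0.3
```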
The entire chain is described by a transition matrix A where Aᵢⱼ = P(next = j | current = i) — the probability of moving from state i to state j. Each row sums to 1 (you have to go somewhere). For an ergodic chain (one that is irreducible and aperiodic, like our weather example) a fascinating property emerges: as you raise A to higher powers, Aⁿ converges to a matrix where every row is identical — the stationary distribution. No matter where the chain starts, it eventually settles into the same long-run frequencies.
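Before the full simulation, the Aⁿ claim can be verified in a few lines; this sketch uses the same three-state weather matrix as the example below.

```python
import numpy as np

# Weather transition matrix: Sunny, Rainy, Cloudy
A = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3],
              [0.4, 0.3, 0.3]])

for n in (1, 2, 5, 20):
    An = np.linalg.matrix_power(A, n)
    print(f"A^{n}:\n{An.round(3)}")
# Row i of A^n is the state distribution n steps after starting in state i.
# By A^20 the rows are numerically identical — the starting state has been
# forgotten, and each row is the stationary distribution.
```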
Let's build a weather Markov chain and verify this convergence:
```python
import numpy as np

# Transition matrix: Sunny, Rainy, Cloudy
# A[i][j] = P(next=j | current=i)
A = np.array([
    [0.6, 0.2, 0.2],  # Sunny  -> 60% Sunny, 20% Rainy, 20% Cloudy
    [0.3, 0.4, 0.3],  # Rainy  -> 30% Sunny, 40% Rainy, 30% Cloudy
    [0.4, 0.3, 0.3],  # Cloudy -> 40% Sunny, 30% Rainy, 30% Cloudy
])
states = ["Sunny", "Rainy", "Cloudy"]

# Simulate 10,000 steps
np.random.seed(42)
current = 0  # start Sunny
counts = np.zeros(3)
for _ in range(10_000):
    counts[current] += 1
    current = np.random.choice(3, p=A[current])
empirical = counts / counts.sum()

# Stationary distribution: left eigenvector with eigenvalue 1
eigenvalues, eigenvectors = np.linalg.eig(A.T)
idx = np.argmin(np.abs(eigenvalues - 1.0))
stationary = np.real(eigenvectors[:, idx])
stationary = stationary / stationary.sum()

for i, s in enumerate(states):
    print(f"{s:<8} empirical={empirical[i]:.3f} stationary={stationary[i]:.3f}")
# Sunny    empirical=0.462 stationary=0.462
# Rainy    empirical=0.276 stationary=0.276
# Cloudy   empirical=0.263 stationary=0.262
```
The empirical frequencies from simulating 10,000 steps match the theoretical stationary distribution almost exactly. Markov chains are the workhorses behind MCMC sampling and the foundation of Markov Decision Processes in reinforcement learning. But their real power emerges when we add a twist: what if you can't observe the states directly?
Hidden States — When You Can't See What's Happening
Here's the scenario that motivates everything that follows. Imagine a casino with two dice: a Fair die (each face has probability 1/6) and a Loaded die (face 6 has probability 1/2, the other five faces split the remaining 1/2). The casino secretly switches between the two dice as you play. You can see the rolls, but not which die produced them. Your job: figure out when the casino was cheating.
This is a Hidden Markov Model. The Markov chain of hidden states (Fair/Loaded) follows transition probabilities the casino controls. Each hidden state emits an observable symbol (the dice roll) according to its own emission distribution. The HMM is fully specified by three components:
- A — Transition matrix: P(next hidden state | current hidden state)
- B — Emission matrix: P(observation | hidden state)
- π — Initial distribution: P(starting hidden state)
Together, λ = (A, B, π) defines the model. Every HMM gives rise to three fundamental questions:
- Evaluation: Given a sequence of observations, how likely is it under this model? (Forward algorithm)
- Decoding: What's the most likely sequence of hidden states? (Viterbi algorithm)
- Learning: What model parameters best explain the observations? (Baum-Welch algorithm)
Let's define our casino HMM and generate some data:
```python
import numpy as np

class HMM:
    def __init__(self, A, B, pi):
        self.A = np.array(A, dtype=float)    # N x N transition
        self.B = np.array(B, dtype=float)    # N x M emission
        self.pi = np.array(pi, dtype=float)  # N initial probs
        self.N = self.A.shape[0]             # num hidden states
        self.M = self.B.shape[1]             # num observation symbols

    def generate(self, T, seed=42):
        rng = np.random.RandomState(seed)
        states, obs = [], []
        state = rng.choice(self.N, p=self.pi)
        for _ in range(T):
            states.append(state)
            obs.append(rng.choice(self.M, p=self.B[state]))
            state = rng.choice(self.N, p=self.A[state])
        return np.array(states), np.array(obs)

# Dishonest casino: state 0 = Fair, state 1 = Loaded
casino = HMM(
    A=[[0.95, 0.05],   # Fair stays Fair 95%, switches to Loaded 5%
       [0.10, 0.90]],  # Loaded stays Loaded 90%, switches to Fair 10%
    B=[[1/6]*6,                           # Fair die: uniform
       [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]],  # Loaded die: 6 has prob 0.5
    pi=[0.5, 0.5],
)

true_states, observations = casino.generate(T=30)
faces = observations + 1  # 0-indexed to 1-indexed
print("Rolls: ", " ".join(str(f) for f in faces))
print("Truth: ", " ".join("F" if s == 0 else "L" for s in true_states))
# Rolls:  3 1 6 6 6 2 6 6 5 4 6 1 6 6 6 6 6 3 2 5 1 6 3 4 2 3 5 1 3 6
# Truth:  F F L L L L L L L L L F F L L L L L F F F F F F F F F F F L
```
Notice the pattern: when the Loaded die is active, you see clusters of 6s. But individual rolls are ambiguous — a Fair die can roll a 6 too. The algorithms we'll build next solve the problem of reasoning under this uncertainty. They're the conceptual ancestors of attention mechanisms and recurrent neural networks, and understanding them illuminates why those modern architectures work the way they do.
The Forward Algorithm — How Likely Is This Sequence?
The first question: given our casino model and a sequence of dice rolls, what's the probability that this model produced those exact observations? This might seem academic, but it's the foundation for everything else — comparing models, training parameters, and even decoding hidden states.
The brute-force approach: enumerate every possible hidden state sequence, compute the joint probability of that sequence and the observations, then sum them all up. With N hidden states and T time steps, that's Nᵀ sequences — exponential and completely intractable.
The Forward algorithm uses dynamic programming to collapse this exponential sum into polynomial time. Define the forward variable:
αₜ(i) = P(O₁, O₂, ..., Oₜ, Xₜ = i | λ)
This is the probability of observing the first t symbols and being in hidden state i at time t. The recursion is elegant:
- Base: α₁(i) = πᵢ · bᵢ(O₁) — start in state i and emit the first observation
- Recurse: αₜ₊₁(j) = [Σᵢ αₜ(i) · aᵢⱼ] · bⱼ(Oₜ₊₁) — sum over all ways to reach state j
- Answer: P(O | λ) = Σᵢ α_T(i) — sum over all ending states
This runs in O(N² · T) time instead of O(Nᵀ · T). For our 2-state casino over 30 rolls, that's 120 operations instead of over a billion. We work in log-space to avoid numerical underflow when multiplying many small probabilities:
```python
def forward(hmm, obs):
    """Forward algorithm in log-space. Returns log(alpha) trellis and log P(O|model)."""
    T = len(obs)
    log_alpha = np.full((T, hmm.N), -np.inf)
    # Base case: t=0
    for i in range(hmm.N):
        log_alpha[0, i] = np.log(hmm.pi[i]) + np.log(hmm.B[i, obs[0]])
    # Recursion: t=1..T-1
    for t in range(1, T):
        for j in range(hmm.N):
            log_sum = np.logaddexp.reduce(
                log_alpha[t-1, :] + np.log(hmm.A[:, j])
            )
            log_alpha[t, j] = log_sum + np.log(hmm.B[j, obs[t]])
    log_prob = np.logaddexp.reduce(log_alpha[T-1, :])
    return log_alpha, log_prob

log_alpha, log_prob = forward(casino, observations)
print(f"Log P(observations | casino HMM) = {log_prob:.2f}")
print(f"P(observations) = {np.exp(log_prob):.2e}")
# Log P(observations | casino HMM) = -40.17
# P(observations) = 3.57e-18
```
That probability looks tiny, but any specific sequence of 30 dice rolls has a low probability. What matters is comparison: an HMM with only a Fair die would assign even lower probability to a sequence full of 6s, because it can't explain the clustering. The forward algorithm lets us score how well a model fits the data — the first step toward learning.
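To convince yourself the dynamic program computes exactly the brute-force sum, here's a small self-contained check. The 8-roll sequence below is a hypothetical one chosen for illustration — short enough that all 2⁸ = 256 hidden paths can be enumerated directly.

```python
import numpy as np
from itertools import product

# Casino parameters, repeated here so the snippet stands alone
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
pi = np.array([0.5, 0.5])

obs = np.array([2, 5, 5, 5, 0, 5, 5, 1])  # hypothetical rolls (0-indexed faces)
T = len(obs)

# Brute force: sum the joint probability over all 2^T hidden paths
brute = 0.0
for path in product(range(2), repeat=T):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, T):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    brute += p

# Forward algorithm: the same quantity in O(N^2 * T) operations
alpha = pi * B[:, obs[0]]
for t in range(1, T):
    alpha = (alpha @ A) * B[:, obs[t]]
fwd = alpha.sum()

print(f"brute force = {brute:.6e}")
print(f"forward     = {fwd:.6e}")
# The two agree to floating-point precision
```

At T=8 the sequence is short enough that raw probabilities don't underflow, so this sketch skips the log-space machinery used in the main implementation.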
The Viterbi Algorithm — Finding the Most Likely Hidden Path
Now the question every detective asks: given the dice rolls we observed, what's the single most likely sequence of hidden states? When was the casino using the Fair die, and when did it switch to Loaded?
The Viterbi algorithm is strikingly similar to the Forward algorithm, with one crucial change: replace the sum (Σ) with a max. Instead of accumulating total probability, we track the best probability at each state and remember which predecessor led to it:
- Base: δ₁(i) = πᵢ · bᵢ(O₁)
- Recurse: δₜ(j) = maxᵢ [δₜ₋₁(i) · aᵢⱼ] · bⱼ(Oₜ)
- Backpointer: ψₜ(j) = argmaxᵢ [δₜ₋₁(i) · aᵢⱼ]
After processing all T observations, we start at the highest-scoring final state and follow the backpointers to reconstruct the globally optimal path. This is not a greedy algorithm — each decision at time t considers the entire future through the dynamic programming table. It's the same idea behind beam search in modern language models, but Viterbi gives the provably optimal solution for HMMs:
```python
def viterbi(hmm, obs):
    """Viterbi algorithm in log-space. Returns best state sequence and its log-prob."""
    T = len(obs)
    log_delta = np.full((T, hmm.N), -np.inf)
    psi = np.zeros((T, hmm.N), dtype=int)
    # Base case
    for i in range(hmm.N):
        log_delta[0, i] = np.log(hmm.pi[i]) + np.log(hmm.B[i, obs[0]])
    # Recursion
    for t in range(1, T):
        for j in range(hmm.N):
            scores = log_delta[t-1, :] + np.log(hmm.A[:, j])
            psi[t, j] = np.argmax(scores)
            log_delta[t, j] = scores[psi[t, j]] + np.log(hmm.B[j, obs[t]])
    # Backtrack
    path = np.zeros(T, dtype=int)
    path[T-1] = np.argmax(log_delta[T-1, :])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1, path[t+1]]
    return path, np.max(log_delta[T-1, :])

decoded, log_prob = viterbi(casino, observations)
labels = ["F", "L"]
print("Rolls:   ", " ".join(str(f) for f in faces))
print("Viterbi: ", " ".join(labels[s] for s in decoded))
print("Truth:   ", " ".join(labels[s] for s in true_states))
accuracy = np.mean(decoded == true_states)
print(f"Accuracy: {accuracy:.1%}")
# Rolls:    3 1 6 6 6 2 6 6 5 4 6 1 6 6 6 6 6 3 2 5 1 6 3 4 2 3 5 1 3 6
# Viterbi:  F F L L L L L L L L L F L L L L L F F F F F F F F F F F F L
# Truth:    F F L L L L L L L L L F F L L L L L F F F F F F F F F F F L
# Accuracy: 93.3%
```
The Viterbi decoder nails 93% of the hidden states. It correctly identifies the long Loaded stretch in positions 3–11 and the Fair stretch in positions 19–29. The few errors happen at transition boundaries, where a single roll is ambiguous — exactly where you'd expect uncertainty.
The Backward Algorithm and Posterior Decoding
Viterbi gives us the single best sequence, but sometimes we want something richer: the probability of being in each hidden state at each time step. For that, we need the Backward algorithm, the mirror image of the Forward algorithm.
The backward variable βₜ(i) = P(Oₜ₊₁, ..., O_T | Xₜ = i, λ) captures the probability of future observations given the current state. The recursion runs right to left:
- Base: β_T(i) = 1 for all states (no future observations)
- Recurse: βₜ(i) = Σⱼ aᵢⱼ · bⱼ(Oₜ₊₁) · βₜ₊₁(j)
Combining forward and backward gives the posterior probability of each hidden state at each time step: P(Xₜ = i | O, λ) = αₜ(i) · βₜ(i) / P(O | λ). Think of it as attention weights over hidden states — at each time step, the model distributes probability mass across possible states based on all the evidence (past and future).
```python
def backward(hmm, obs):
    """Backward algorithm in log-space. Returns log(beta) trellis."""
    T = len(obs)
    log_beta = np.full((T, hmm.N), -np.inf)
    log_beta[T-1, :] = 0.0  # log(1) = 0
    for t in range(T-2, -1, -1):
        for i in range(hmm.N):
            log_beta[t, i] = np.logaddexp.reduce(
                np.log(hmm.A[i, :]) + np.log(hmm.B[:, obs[t+1]]) + log_beta[t+1, :]
            )
    return log_beta

def posterior_decode(hmm, obs):
    """Compute posterior state probabilities and decode."""
    log_alpha, log_prob = forward(hmm, obs)
    log_beta = backward(hmm, obs)
    # gamma[t, i] = P(X_t = i | O, model)
    log_gamma = log_alpha + log_beta - log_prob
    gamma = np.exp(log_gamma)
    # Posterior decoding: pick most likely state at each t
    path = np.argmax(gamma, axis=1)
    return gamma, path

gamma, posterior_path = posterior_decode(casino, observations)
print("Rolls:     ", " ".join(str(f) for f in faces))
print("Posterior: ", " ".join(labels[s] for s in posterior_path))
print("Viterbi:   ", " ".join(labels[s] for s in decoded))
print("Truth:     ", " ".join(labels[s] for s in true_states))
print("\nP(Loaded) at each step:")
print(" ".join(f"{gamma[t,1]:.2f}" for t in range(len(observations))))
# Posterior and Viterbi agree on most positions
# P(Loaded) is high (>0.5) during loaded stretches, low during fair stretches
```
Posterior decoding maximizes the expected number of correctly labeled states, while Viterbi maximizes the probability of the entire sequence being correct. In practice they often agree, but they can diverge at state boundaries. The posterior probabilities are especially useful for visualizing uncertainty — rather than a hard binary label, you get a continuous confidence score at every time step.
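The ordering between the two criteria can be checked directly. This standalone sketch repeats the casino parameters, samples a longer roll sequence (the seed and length are arbitrary), decodes it both ways with vectorized trellises, and verifies the defining property: the Viterbi path's joint log-probability is never lower than the posterior path's, even at positions where the two disagree.

```python
import numpy as np

# Casino parameters, repeated so the snippet stands alone
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.array([[1/6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
pi = np.array([0.5, 0.5])
logA, logB, logpi = np.log(A), np.log(B), np.log(pi)

# Sample a trajectory of T rolls from the model
rng = np.random.RandomState(1)
T = 200
s = rng.choice(2, p=pi)
obs = np.empty(T, dtype=int)
for t in range(T):
    obs[t] = rng.choice(6, p=B[s])
    s = rng.choice(2, p=A[s])

# Forward, backward, and Viterbi trellises (log-space, vectorized over states)
la = np.zeros((T, 2))                 # log alpha
lb = np.zeros((T, 2))                 # log beta
ld = np.zeros((T, 2))                 # log delta
psi = np.zeros((T, 2), dtype=int)     # backpointers
la[0] = ld[0] = logpi + logB[:, obs[0]]
for t in range(1, T):
    la[t] = np.logaddexp.reduce(la[t-1][:, None] + logA, axis=0) + logB[:, obs[t]]
    scores = ld[t-1][:, None] + logA
    psi[t] = np.argmax(scores, axis=0)
    ld[t] = np.max(scores, axis=0) + logB[:, obs[t]]
for t in range(T - 2, -1, -1):
    lb[t] = np.logaddexp.reduce(logA + logB[:, obs[t+1]] + lb[t+1], axis=1)

# Posterior decoding: per-position argmax of alpha * beta
posterior_path = np.argmax(la + lb, axis=1)

# Viterbi decoding: backtrack from the best final state
viterbi_path = np.zeros(T, dtype=int)
viterbi_path[-1] = np.argmax(ld[-1])
for t in range(T - 2, -1, -1):
    viterbi_path[t] = psi[t+1, viterbi_path[t+1]]

def joint_logprob(path):
    """Log-probability of a full (path, observations) pair under the model."""
    lp = logpi[path[0]] + logB[path[0], obs[0]]
    for t in range(1, T):
        lp += logA[path[t-1], path[t]] + logB[path[t], obs[t]]
    return lp

print(f"Viterbi joint   : {joint_logprob(viterbi_path):.2f}")
print(f"Posterior joint : {joint_logprob(posterior_path):.2f}")
print(f"Positions where the paths differ: {np.sum(viterbi_path != posterior_path)}")
```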
Baum-Welch — Learning HMM Parameters with EM
So far, we assumed the casino's transition and emission probabilities were known. In reality, they're exactly what you need to learn. The Baum-Welch algorithm solves this: given only the observed dice rolls (no hidden state labels), find the parameters λ = (A, B, π) that maximize the observation likelihood P(O | λ).
If you've read our post on Expectation-Maximization, you already know the pattern. Baum-Welch is EM, specialized for HMMs:
- E-step: Use the forward-backward algorithm to compute expected sufficient statistics:
- ξₜ(i, j) = P(Xₜ = i, Xₜ₊₁ = j | O, λ) — expected transitions from i to j at time t
- γₜ(i) = P(Xₜ = i | O, λ) — expected occupancy of state i at time t
- M-step: Update parameters using weighted counts:
- aᵢⱼ = Σₜ ξₜ(i, j) / Σₜ γₜ(i) — expected transitions i→j / expected visits to i
- bⱼ(k) = Σₜ γₜ(j) · [Oₜ = k] / Σₜ γₜ(j) — expected emissions of k from j / expected visits to j
- πᵢ = γ₁(i) — expected initial state
The log-likelihood monotonically increases at every iteration — guaranteed by the EM framework. Let's implement it and watch random parameters converge to the truth:
```python
def baum_welch(obs, N, M, max_iter=100, seed=0):
    """Learn HMM parameters from observations using EM."""
    rng = np.random.RandomState(seed)
    # Random initialization
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    pi = rng.dirichlet(np.ones(N))
    hmm = HMM(A, B, pi)
    T = len(obs)
    for iteration in range(max_iter):
        # E-step: forward-backward
        log_alpha, log_prob = forward(hmm, obs)
        log_beta = backward(hmm, obs)
        # Compute gamma: P(X_t = i | O)
        log_gamma = log_alpha + log_beta - log_prob
        gamma = np.exp(log_gamma)
        # Compute xi: P(X_t = i, X_{t+1} = j | O)
        xi = np.zeros((T-1, N, N))
        for t in range(T-1):
            for i in range(N):
                for j in range(N):
                    xi[t, i, j] = np.exp(
                        log_alpha[t, i] + np.log(hmm.A[i, j])
                        + np.log(hmm.B[j, obs[t+1]]) + log_beta[t+1, j]
                        - log_prob
                    )
        # M-step: update parameters
        hmm.pi = gamma[0]
        for i in range(N):
            denom = gamma[:-1, i].sum()
            for j in range(N):
                hmm.A[i, j] = xi[:, i, j].sum() / denom
            for k in range(M):
                mask = (obs == k)
                hmm.B[i, k] = gamma[mask, i].sum() / gamma[:, i].sum()
        if iteration % 20 == 0:
            print(f"Iter {iteration:3d}: log P(O) = {log_prob:.2f}")
    print(f"Iter {iteration:3d}: log P(O) = {log_prob:.2f}")
    return hmm

# Generate longer sequence for better learning
_, long_obs = casino.generate(T=500, seed=99)
learned = baum_welch(long_obs, N=2, M=6, max_iter=100)
print(f"\nLearned transitions:\n{learned.A.round(3)}")
print(f"True transitions:\n{casino.A}")
print(f"\nLearned emissions:\n{learned.B.round(3)}")
print(f"True emissions:\n{casino.B.round(3)}")
# Learned parameters converge close to true values
```
Starting from random parameters, Baum-Welch recovers the casino's true transition and emission probabilities from nothing but the raw dice rolls. The log-likelihood increases monotonically, just as the EM guarantee promises. Note the subtlety: with only two states, the algorithm might swap the labels (calling "Fair" what we called "Loaded" and vice versa). This label symmetry is inherent to unsupervised learning — the model discovers the structure but can't name it.
One important caveat: Baum-Welch converges to a local maximum, not necessarily the global one. In practice, you run it multiple times with different random seeds and keep the model with the highest final log-likelihood. With enough data (we used 500 rolls above), the local optima tend to be good ones.
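A multi-restart driver can be sketched as follows. For self-containment it re-implements a compact Baum-Welch using per-step scaling factors instead of log-space (a standard alternative to the log-space trick); the function name `em_fit`, the seeds, and the regenerated roll data are illustrative, not part of the casino code above.

```python
import numpy as np

def em_fit(obs, N, M, n_iter=50, seed=0):
    """Minimal scaled-probability Baum-Welch for a categorical HMM.
    Returns the per-iteration log-likelihoods and the learned parameters."""
    rng = np.random.RandomState(seed)
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    pi = rng.dirichlet(np.ones(N))
    T = len(obs)
    lls = []
    for _ in range(n_iter):
        # E-step: scaled forward-backward (scaling factors c[t] avoid underflow)
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]
        beta[-1] = 1.0
        for t in range(T-2, -1, -1):
            beta[t] = (A @ (B[:, obs[t+1]] * beta[t+1])) / c[t+1]
        lls.append(np.log(c).sum())  # log P(O | current params)
        gamma = alpha * beta
        xi = (alpha[:-1, :, None] * A[None]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :]
              / c[1:, None, None])
        # M-step: re-estimate from expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros((N, M))
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= B.sum(axis=1, keepdims=True)
    return lls, A, B, pi

# Regenerate casino-style rolls so this snippet runs standalone
rng = np.random.RandomState(123)
A_true = np.array([[0.95, 0.05], [0.10, 0.90]])
B_true = np.array([[1/6]*6, [0.1]*5 + [0.5]])
s = rng.choice(2)
obs = np.empty(500, dtype=int)
for t in range(500):
    obs[t] = rng.choice(6, p=B_true[s])
    s = rng.choice(2, p=A_true[s])

# Several restarts; keep the one with the best final log-likelihood
results = [em_fit(obs, N=2, M=6, seed=sd) for sd in range(5)]
best = max(results, key=lambda r: r[0][-1])
print("Final log-likelihood per restart:", [round(r[0][-1], 1) for r in results])
print("Best restart:", round(best[0][-1], 1))
```

Within each restart the log-likelihood is non-decreasing (the EM guarantee); across restarts the final values can differ, which is exactly the local-optimum effect the caveat describes.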
HMMs in the Wild — Speech, DNA, and Beyond
The dishonest casino is a toy example, but the same three algorithms — Forward, Viterbi, Baum-Welch — powered some of the most important applications in computing history:
- Speech recognition (1980–2012): Hidden states are phonemes, observations are acoustic features (spectral coefficients). Each word is an HMM, and recognizing speech means finding the sequence of word-HMMs that best explains the audio signal. This was the dominant approach for three decades.
- Computational biology: In gene finding, hidden states are functional regions (exons, introns, intergenic), and observations are DNA nucleotides. The Viterbi algorithm traces the most likely gene structure along a chromosome.
- Part-of-speech tagging: Hidden states are grammatical categories (noun, verb, adjective), observations are words. Baum-Welch learns tag transition patterns and word-tag emission probabilities from raw text.
- Financial regime detection: Hidden states are market regimes (bull, bear, sideways), observations are daily returns. Posterior decoding estimates which regime the market is in today.
So why did deep learning replace HMMs? Three fundamental limitations:
- First-order memory: The Markov assumption means only one step of history matters. RNNs and transformers can learn arbitrarily long dependencies.
- Discrete states: HMMs require a fixed, finite set of hidden states. Neural networks learn continuous hidden representations that can encode far richer information.
- Hand-designed features: HMM observations must be pre-defined symbols or Gaussian features. Deep learning learns its own features end-to-end.
But the ideas live on. The Forward algorithm's sum-over-paths computation appears in CTC (Connectionist Temporal Classification) for speech and OCR. Viterbi decoding is used in convolutional error-correcting codes. And the Baum-Welch insight — that you can learn latent structure from observations alone using EM — is the conceptual ancestor of every unsupervised and self-supervised technique in modern ML.
References & Further Reading
- Lawrence Rabiner — A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition (IEEE, 1989) — the definitive HMM tutorial, still the best introduction 35 years later
- Christopher Bishop — Pattern Recognition and Machine Learning, Chapter 13 (2006) — rigorous treatment of sequential data with graphical models
- Kevin Murphy — Machine Learning: A Probabilistic Perspective, Chapter 17 (2012) — comprehensive coverage including extensions to continuous emissions
- Durbin, Eddy, Krogh, Mitchison — Biological Sequence Analysis (Cambridge, 1998) — HMMs applied to protein and nucleic acid sequences
- Jurafsky & Martin — Speech and Language Processing, Chapter 8 (2024 draft) — HMMs for sequence labeling in NLP
Related DadOps posts: Expectation-Maximization from Scratch (Baum-Welch is EM for sequences), Bayesian Inference from Scratch (HMMs as graphical models), RNNs from Scratch (the neural successor to HMMs), Decoding Strategies from Scratch (Viterbi vs beam search), Information Theory from Scratch (entropy of Markov chains), Reinforcement Learning from Scratch (MDPs extend Markov chains), Naive Bayes from Scratch (HMMs generalize Naive Bayes to sequences)
Try It: Markov Chain State Explorer
Watch a token move through a 3-state Markov chain. Edit transition probabilities with the sliders and observe how the stationary distribution changes.
Try It: Viterbi Trellis Decoder
Click the dice to set an observation sequence, then step through the Viterbi algorithm. Watch it find the most likely Fair/Loaded path through the trellis.