Conditional Random Fields from Scratch
The Independent Label Problem
Your classifier tags each word independently. Feed it "The dog sat on the mat" and it returns: The/DET dog/NOUN sat/VERB on/PREP the/DET mat/VERB. That last tag is obviously wrong — "mat" is a noun — but your model doesn't know because it never looks at neighboring predictions. It evaluated "mat" in isolation, and since "mat" can be a verb (as in "to mat hair"), the model picked VERB based on some ambiguous feature.
This is the fundamental limitation of independent classifiers. Whether you're using logistic regression, a neural network with a softmax output, or any other per-token classifier, each prediction ignores the labels at every other position. But in sequence labeling tasks — POS tagging, named entity recognition, gene finding, image segmentation — labels have strong dependencies. Determiners precede nouns. In BIO-style NER, an I-PER tag can only follow B-PER or I-PER, never O or I-LOC. Start codons come before stop codons.
What we need is structured prediction: a model that outputs the entire label sequence at once, scoring not just how well each tag fits its word, but how well adjacent tags flow together. This is exactly what Conditional Random Fields do. Introduced by Lafferty, McCallum, and Pereira in 2001, CRFs model P(y₁, ..., yₙ | x₁, ..., xₙ) — the probability of the complete label sequence conditioned on the complete input sequence. They're the discriminative counterpart to Hidden Markov Models, and they dominated NLP for a decade before deep learning arrived.
From Logistic Regression to Linear-Chain CRFs
If you squint, a CRF is just logistic regression that got promoted from classifying individual items to classifying entire sequences. Let's trace the connection.
Logistic regression models the probability of a single label given an input: P(y | x) = exp(w · f(x, y)) / Z, where f(x, y) is a feature vector and Z = ∑ₖ exp(w · f(x, k)) normalizes over all possible labels. This is a log-linear model — the log probability is linear in the features.
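As a quick numerical illustration (toy weights and features, not anything from a real dataset), the log-linear normalization can be computed directly:

```python
import numpy as np

def logreg_prob(w, feats):
    """P(y = k | x) = exp(w . f(x, k)) / Z for a log-linear classifier.
    feats[k] is the feature vector f(x, k) for label k."""
    scores = feats @ w            # w . f(x, k) for every label
    scores -= scores.max()        # stabilize before exponentiating
    p = np.exp(scores)
    return p / p.sum()            # divide by Z = sum over labels

w = np.array([1.0, -0.5])                               # toy weight vector
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # f(x, k) for 3 labels
p = logreg_prob(w, feats)
print(p)  # a proper distribution over the 3 labels
```

Everything that follows is this same recipe — exponentiate a linear score, divide by Z — with the score and the normalizer extended from single labels to label sequences.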
A linear-chain CRF extends this to sequences. Instead of scoring one label, we score an entire label sequence y = (y₁, ..., yₙ) by summing contributions from two types of features:
- Emission features (unary potentials): How well does label yᵢ match the input at position i? These are exactly like logistic regression features — things like "the word is 'dog' and the label is NOUN."
- Transition features (pairwise potentials): How well does label yᵢ follow label yᵢ₋₁? These capture sequential dependencies — "DET followed by NOUN is common, VERB followed by VERB is rare."
The CRF probability of a label sequence is:
P(y | x) = exp(∑ᵢ [E(yᵢ, x, i) + T(yᵢ, yᵢ₋₁)]) / Z(x)
Here E(yᵢ, x, i) is the emission score for label yᵢ at position i, T(yᵢ, yᵢ₋₁) is the transition score between consecutive labels, and Z(x) is the partition function that normalizes over all possible label sequences.
How does this compare to an HMM? An HMM is generative: it models the joint probability P(x, y) as a product of transition and emission probabilities. A CRF is discriminative: it models P(y | x) directly using arbitrary feature weights. The CRF advantage is flexibility — you can throw in overlapping features like "the word ends in -ing," "the previous word is a determiner," and "the word is capitalized" without worrying about their independence. HMMs can't do this because they need P(xᵢ | yᵢ) to be a proper probability distribution.
```python
import numpy as np

class LinearChainCRF:
    def __init__(self, n_labels, n_features):
        self.n_labels = n_labels
        # Emission weights: (n_labels, n_features)
        self.W = np.zeros((n_labels, n_features))
        # Transition weights: (n_labels, n_labels)
        # T[j, k] = score of transitioning from label j to label k
        self.T = np.zeros((n_labels, n_labels))

    def emission_scores(self, x):
        """Compute emission scores for each position and label."""
        # x: (seq_len, n_features), returns (seq_len, n_labels)
        return x @ self.W.T

    def score(self, x, y):
        """Score a complete label sequence y given input x."""
        emit = self.emission_scores(x)
        total = 0.0
        for i in range(len(y)):
            total += emit[i, y[i]]  # emission at position i
            if i > 0:
                total += self.T[y[i-1], y[i]]  # transition from y[i-1] to y[i]
        return total
```
```python
# Example: 4 POS tags (0=DET, 1=NOUN, 2=VERB, 3=ADJ)
tags = ["DET", "NOUN", "VERB", "ADJ"]
crf = LinearChainCRF(n_labels=4, n_features=5)

# Set transition weights to reflect English POS patterns
crf.T = np.array([
    # DET  NOUN  VERB  ADJ
    [-1.0,  2.0, -0.5,  1.5],  # after DET: NOUN or ADJ likely
    [-0.5, -0.5,  2.0,  0.0],  # after NOUN: VERB likely
    [ 1.0,  1.0, -1.0,  0.5],  # after VERB: DET or NOUN likely
    [-0.5,  2.0, -0.5, -1.0],  # after ADJ: NOUN likely
])

# Dummy input: 4 words, 5 features each
x = np.random.default_rng(42).standard_normal((4, 5))
crf.W = np.random.default_rng(7).standard_normal((4, 5))

seq1 = [0, 1, 2, 1]  # DET NOUN VERB NOUN (natural)
seq2 = [0, 2, 2, 2]  # DET VERB VERB VERB (unnatural)
print(f"Score of '{' '.join(tags[t] for t in seq1)}': {crf.score(x, seq1):.2f}")
print(f"Score of '{' '.join(tags[t] for t in seq2)}': {crf.score(x, seq2):.2f}")
# Output:
# Score of 'DET NOUN VERB NOUN': 7.84
# Score of 'DET VERB VERB VERB': -2.93
```
The natural sequence scores much higher because the transition weights reward DET→NOUN and NOUN→VERB while penalizing VERB→VERB. This is the power of structured prediction: the model considers the entire path, not just individual stops.
The Partition Function and Forward Algorithm
We have a scoring function, but to turn scores into probabilities we need the partition function: Z(x) = ∑ᵧ exp(score(x, y)), which sums over all possible label sequences. With K labels and n positions, there are Kⁿ possible sequences. For our tiny 4-tag, 4-word example that's just 256 sequences. But a realistic scenario — 45 POS tags on a 30-word sentence — gives 45³⁰ ≈ 10⁵⁰. Brute force is impossible.
The rescue comes from dynamic programming, specifically the forward algorithm — the same idea used in HMMs. We define the forward variable α(i, k) as the log-sum-exp of scores of all partial sequences that end with label k at position i.
The recurrence is:
- Base case: α(0, k) = E(k, x, 0) — just the emission score at the first position
- Recursion: α(i, k) = E(k, x, i) + logsumexpⱼ [α(i−1, j) + T(j, k)]
- Result: log Z(x) = logsumexpₖ [α(n−1, k)]
Each step combines all paths arriving at position i with label k by summing (in log-space via logsumexp) over all possible previous labels. The total cost is O(n · K²) — linear in sequence length, quadratic in the number of labels. That's 30 × 45² = 60,750 operations for our 45-tag sentence, versus 10⁵⁰ for brute force.
```python
from itertools import product
from scipy.special import logsumexp

def forward_algorithm(emit, T):
    """Compute log Z(x) using the forward algorithm.

    emit: (seq_len, n_labels) emission scores
    T: (n_labels, n_labels) transition scores, T[j,k] = j -> k
    Returns: log Z(x) and forward table alpha
    """
    n, K = emit.shape
    alpha = np.full((n, K), -np.inf)
    # Base case: first position has only emission scores
    alpha[0] = emit[0]
    # Fill forward table
    for i in range(1, n):
        for k in range(K):
            # All paths arriving at (i, k): previous label j, transition j->k
            alpha[i, k] = emit[i, k] + logsumexp(alpha[i-1] + T[:, k])
    # Total log partition function
    log_Z = logsumexp(alpha[-1])
    return log_Z, alpha

# Verify against brute force on our small example
emit = crf.emission_scores(x)
log_Z_forward, alpha = forward_algorithm(emit, crf.T)

# Brute force: enumerate all 4^4 = 256 sequences
all_scores = []
for seq in product(range(4), repeat=4):
    all_scores.append(crf.score(x, list(seq)))
log_Z_brute = logsumexp(all_scores)

print(f"log Z (forward):     {log_Z_forward:.6f}")
print(f"log Z (brute force): {log_Z_brute:.6f}")
print(f"Match: {np.isclose(log_Z_forward, log_Z_brute)}")
# Output:
# log Z (forward): 9.512743
# log Z (brute force): 9.512743
# Match: True
```
The forward algorithm gives the exact same answer as brute force, but in O(n·K²) time instead of O(Kⁿ). This is the same algorithmic trick that makes HMM inference tractable — the chain structure means we only ever need to consider pairs of adjacent labels, not the full combinatorial explosion.
Viterbi Decoding — Finding the Best Label Sequence
Training gives us weights. Inference asks: given a new input x, what's the most probable label sequence? Formally, we want argmaxᵧ P(y | x) = argmaxᵧ score(x, y) (the partition function Z(x) is the same for all y, so it doesn't affect the argmax).
We face the same exponential search space, and the same DP trick saves us. The Viterbi algorithm is identical to the forward algorithm with one change: replace logsumexp with max. Instead of summing over all paths to compute total probability, we keep only the best path.
- DP variable: δ(i, k) = maximum score of any partial sequence ending with label k at position i
- Backpointer: ψ(i, k) = which previous label j achieved the max
- Traceback: Start at yₙ₋₁ = argmaxₖ δ(n−1, k), then follow backpointers: yᵢ₋₁ = ψ(i, yᵢ)
```python
def viterbi_decode(emit, T):
    """Find the highest-scoring label sequence.

    Returns: best score and best label sequence
    """
    n, K = emit.shape
    delta = np.full((n, K), -np.inf)
    psi = np.zeros((n, K), dtype=int)  # backpointers
    # Base case
    delta[0] = emit[0]
    # Forward pass: max instead of logsumexp
    for i in range(1, n):
        for k in range(K):
            scores = delta[i-1] + T[:, k]  # score from each previous label
            psi[i, k] = np.argmax(scores)  # best previous label
            delta[i, k] = emit[i, k] + scores[psi[i, k]]
    # Traceback: follow backpointers from the end
    y_best = np.zeros(n, dtype=int)
    y_best[-1] = np.argmax(delta[-1])
    for i in range(n - 2, -1, -1):
        y_best[i] = psi[i + 1, y_best[i + 1]]
    return delta[-1].max(), y_best

best_score, best_seq = viterbi_decode(emit, crf.T)
greedy_seq = np.argmax(emit, axis=1)  # independent per-position argmax
print(f"Viterbi: {' '.join(tags[t] for t in best_seq)} (score: {best_score:.2f})")
print(f"Greedy:  {' '.join(tags[t] for t in greedy_seq)} (score: {crf.score(x, greedy_seq):.2f})")
# Output:
# Viterbi: DET NOUN VERB NOUN (score: 7.84)
# Greedy:  DET NOUN NOUN NOUN (score: 5.26)
```
The greedy approach picks the best label at each position independently and produces NOUN at position 2 even though NOUN→NOUN has a poor transition score. Viterbi finds the globally optimal sequence by considering the entire path, choosing VERB at position 2 because the NOUN→VERB transition adds enough score to overcome a slightly lower emission. This is exactly the kind of mistake that CRFs fix: when local evidence is ambiguous, global structure resolves it.
Learning CRF Parameters — Gradient of the Log-Likelihood
We've defined the model and built inference. Now we need to learn the emission weights W and transition weights T from labeled data. The objective is the conditional log-likelihood:
L(W, T) = ∑ log P(y | x) = ∑ [score(x, y) − log Z(x)], where the sums run over the training pairs (x, y)
The gradient of this objective has a beautifully intuitive form:
∇L = (observed feature counts) − (expected feature counts under the model)
The observed counts are trivial: just look at your labeled data and count how many times each feature fires. For emission weights, count how many times label k appears at a position with feature vector xᵢ. For transition weights, count how many times label j is followed by label k.
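Counting the observed statistics is a couple of loops. The helper below is a sketch (the name `observed_counts` and the tiny one-hot dataset are made up for illustration):

```python
import numpy as np

def observed_counts(X, Y, n_labels, n_features):
    """Empirical feature counts over a labeled dataset."""
    obs_W = np.zeros((n_labels, n_features))
    obs_T = np.zeros((n_labels, n_labels))
    for x, y in zip(X, Y):
        for i, label in enumerate(y):
            obs_W[label] += x[i]           # emission features that fired with this label
            if i > 0:
                obs_T[y[i-1], label] += 1  # one more j -> k transition observed
    return obs_W, obs_T

# Two tiny one-hot sequences with 2 labels and 3 features
X = [np.eye(3)[:2], np.eye(3)[:3]]
Y = [[0, 1], [0, 0, 1]]
obs_W, obs_T = observed_counts(X, Y, n_labels=2, n_features=3)
print(obs_T)  # transition 0->1 occurs twice, 0->0 once
```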
The expected counts are harder. We need the marginal probabilities P(yᵢ = k | x) (node marginals) and P(yᵢ = k, yᵢ₋₁ = j | x) (edge marginals) under the current model. Computing these requires the forward-backward algorithm: combine the forward variables α (which summarize everything to the left) with backward variables β (which summarize everything to the right).
The backward variable β(i, k) is the log-sum-exp of scores of all partial sequences from position i+1 to the end, given that position i has label k. It fills right-to-left, mirroring the forward pass. With both in hand, the marginals drop out:
- Node marginal: P(yᵢ = k | x) = exp(α(i, k) + β(i, k) − log Z)
- Edge marginal: P(yᵢ = k, yᵢ₋₁ = j | x) = exp(α(i−1, j) + T(j, k) + E(k, x, i) + β(i, k) − log Z)
```python
def forward_backward(emit, T):
    """Compute node and edge marginals via forward-backward."""
    n, K = emit.shape
    # Forward pass (already implemented above)
    alpha = np.full((n, K), -np.inf)
    alpha[0] = emit[0]
    for i in range(1, n):
        for k in range(K):
            alpha[i, k] = emit[i, k] + logsumexp(alpha[i-1] + T[:, k])
    log_Z = logsumexp(alpha[-1])
    # Backward pass
    beta = np.full((n, K), -np.inf)
    beta[-1] = 0.0  # log(1) = 0
    for i in range(n - 2, -1, -1):
        for j in range(K):
            beta[i, j] = logsumexp(T[j, :] + emit[i+1] + beta[i+1])
    # Node marginals: P(y_i = k | x)
    node_marginals = np.exp(alpha + beta - log_Z)
    # Edge marginals: P(y_i = k, y_{i-1} = j | x)
    edge_marginals = np.zeros((n - 1, K, K))
    for i in range(1, n):
        for j in range(K):
            for k in range(K):
                edge_marginals[i-1, j, k] = np.exp(
                    alpha[i-1, j] + T[j, k] + emit[i, k] + beta[i, k] - log_Z
                )
    return node_marginals, edge_marginals, log_Z

def train_crf(X_train, Y_train, n_labels, n_features, lr=0.1, epochs=50, reg=0.01):
    """Train a CRF with SGD on labeled sequences."""
    crf = LinearChainCRF(n_labels, n_features)
    for epoch in range(epochs):
        total_ll = 0.0
        for x, y in zip(X_train, Y_train):
            emit = crf.emission_scores(x)
            node_marg, edge_marg, log_Z = forward_backward(emit, crf.T)
            score = crf.score(x, y)
            total_ll += score - log_Z
            # Gradient for emission weights
            grad_W = np.zeros_like(crf.W)
            for i in range(len(y)):
                grad_W[y[i]] += x[i]  # observed
                for k in range(n_labels):
                    grad_W[k] -= node_marg[i, k] * x[i]  # expected
            # Gradient for transition weights
            grad_T = np.zeros_like(crf.T)
            for i in range(1, len(y)):
                grad_T[y[i-1], y[i]] += 1.0  # observed
            grad_T -= edge_marg.sum(axis=0)  # expected
            # SGD update (gradient ascent) with L2 regularization
            crf.W += lr * (grad_W - reg * crf.W)
            crf.T += lr * (grad_T - reg * crf.T)
        if (epoch + 1) % 10 == 0:
            avg_ll = total_ll / len(X_train)
            print(f"Epoch {epoch+1}: avg log-likelihood = {avg_ll:.3f}")
    return crf
```
The gradient says: if a feature fires in the training data more than the model expects, increase its weight. If the model expects it more than it actually occurs, decrease it. At convergence, the observed and expected counts match — the model perfectly captures the feature statistics of the training data (modulo regularization). This is identical to the gradient structure in logistic regression, just extended to sequences.
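The observed-minus-expected identity can be checked numerically. The sketch below (tiny arbitrary sizes, random weights, brute-force partition function — all made up for the check) compares the analytic gradient for one transition weight against a central finite difference of the log-likelihood:

```python
import numpy as np
from itertools import product
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n, K, F = 4, 3, 2                # sequence length, labels, features
x = rng.standard_normal((n, F))  # toy input
W = rng.standard_normal((K, F))  # emission weights
T = rng.standard_normal((K, K))  # transition weights
y = [0, 1, 2, 1]                 # "observed" label sequence

def seq_score(W, T, seq):
    emit = x @ W.T
    s = emit[0, seq[0]]
    for i in range(1, n):
        s += emit[i, seq[i]] + T[seq[i-1], seq[i]]
    return s

def log_likelihood(W, T):
    # Brute-force partition function: only 3^4 = 81 sequences
    all_scores = [seq_score(W, T, s) for s in product(range(K), repeat=n)]
    return seq_score(W, T, y) - logsumexp(all_scores)

# Analytic gradient wrt T[j, k]: observed count minus expected count
j, k = 0, 1
observed = sum(1 for i in range(1, n) if y[i-1] == j and y[i] == k)
log_Z = logsumexp([seq_score(W, T, s) for s in product(range(K), repeat=n)])
expected = sum(
    np.exp(seq_score(W, T, s) - log_Z)
    * sum(1 for i in range(1, n) if s[i-1] == j and s[i] == k)
    for s in product(range(K), repeat=n)
)
analytic = observed - expected

# Central finite difference on the same weight
eps = 1e-5
T_hi, T_lo = T.copy(), T.copy()
T_hi[j, k] += eps
T_lo[j, k] -= eps
numeric = (log_likelihood(W, T_hi) - log_likelihood(W, T_lo)) / (2 * eps)
print(f"analytic: {analytic:.6f}  numeric: {numeric:.6f}")
```

The two numbers agree to several decimal places, which is a good sanity check to run whenever you implement forward-backward from scratch.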
CRFs in Practice — Features and the Neural Era
The power of CRFs comes from feature engineering. In classical NLP, a feature template like "current word + current label" automatically generates thousands of binary features: is_word_the_AND_label_DET, is_word_dog_AND_label_NOUN, and so on. Stack multiple templates — word identity, suffix, capitalization, context window — and you get a rich feature space that captures the nuances of language.
Common feature templates for POS tagging:
- Word identity: the word itself (most informative but sparse)
- Suffixes: last 2-3 characters ("-ing" → VERB, "-tion" → NOUN)
- Capitalization: starts with uppercase (proper nouns, sentence starts)
- Word shape: pattern of letters/digits ("Xxxxx", "dd-dd")
- Context: previous/next word (determiners before nouns)
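One way to expand such templates into concrete features is a function like the sketch below (the feature names and the exact template set are illustrative, not a standard):

```python
def word_features(words, i):
    """Expand feature templates for position i (feature names illustrative)."""
    w = words[i]
    shape = "".join("d" if c.isdigit() else "X" if c.isupper() else "x" for c in w)
    feats = {
        f"word={w.lower()}": 1.0,          # word identity
        f"suffix2={w[-2:].lower()}": 1.0,  # last 2 characters
        f"suffix3={w[-3:].lower()}": 1.0,  # last 3 characters
        f"shape={shape}": 1.0,             # letter/digit pattern
        "is_capitalized": float(w[0].isupper()),
    }
    if i > 0:
        feats[f"prev_word={words[i-1].lower()}"] = 1.0
    if i + 1 < len(words):
        feats[f"next_word={words[i+1].lower()}"] = 1.0
    return feats

print(word_features(["The", "dog", "sat"], 0))
```

Each template fires a sparse binary feature per (value, label) pair, so a modest template set over a real corpus easily yields hundreds of thousands of weights.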
In the modern era, a fascinating pattern emerged: BiLSTM-CRF architectures (Huang, Xu & Yu, 2015). Instead of hand-crafting features, a bidirectional LSTM learns emission scores automatically from raw text. But the CRF layer on top still handles the structured output, learning transition patterns between labels. The neural network replaces feature engineering; the CRF replaces the softmax output layer. This combination dominated NER benchmarks for years, and tools like spaCy and Flair still use CRF output layers today.
When should you use a CRF instead of just slapping softmax on each position? When label dependencies are strong and systematic. In NER, the BIO tagging scheme (Begin, Inside, Outside) has hard constraints: I-PER can only follow B-PER or I-PER; it can never appear after O. A softmax output doesn't know this. A CRF learns it from data.
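If you want such constraints to be hard rather than merely learned, one common trick (sketched here with a toy tag set) is to pin forbidden transitions to a large negative score before decoding:

```python
import numpy as np

bio = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # toy BIO tag set
K = len(bio)
T = np.zeros((K, K))

NEG = -1e4  # effectively forbidden, but finite so logsumexp stays well-behaved
for k, to_tag in enumerate(bio):
    if not to_tag.startswith("I-"):
        continue                   # only I-X tags are constrained
    ent = to_tag[2:]
    for j, from_tag in enumerate(bio):
        if from_tag not in (f"B-{ent}", f"I-{ent}"):
            T[j, k] = NEG          # I-X may only follow B-X or I-X

print(T[bio.index("O"), bio.index("I-PER")])      # forbidden: O -> I-PER
print(T[bio.index("B-PER"), bio.index("I-PER")])  # allowed:   B-PER -> I-PER
```

Viterbi will then never emit an illegal transition, while the remaining weights stay free to be learned.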
```python
# Train CRF vs independent classifier on synthetic POS data
rng = np.random.default_rng(42)
vocab = {"the": 0, "a": 1, "dog": 2, "cat": 3, "sat": 4, "ran": 5,
         "big": 6, "red": 7, "on": 8, "mat": 9}

# Tag patterns: DET (ADJ) NOUN VERB ... DET NOUN
# (Our toy tag set has no PREP, so "on" gets labeled DET as a stand-in.)
templates = [
    ([0, 2, 4, 8, 0, 9], [0, 1, 2, 0, 0, 1]),        # the dog sat on the mat
    ([1, 3, 5, 8, 0, 9], [0, 1, 2, 0, 0, 1]),        # a cat ran on the mat
    ([0, 6, 2, 4, 8, 1, 9], [0, 3, 1, 2, 0, 0, 1]),  # the big dog sat on a mat
    ([0, 7, 3, 5, 8, 0, 9], [0, 3, 1, 2, 0, 0, 1]),  # the red cat ran on the mat
]

# Generate 40 training sequences with slight feature noise
X_train, Y_train = [], []
for _ in range(10):
    for words, labels in templates:
        n_feat = len(vocab)
        x = np.zeros((len(words), n_feat))
        for i, w in enumerate(words):
            x[i, w] = 1.0
            x[i] += rng.normal(0, 0.05, n_feat)  # add noise
        X_train.append(x)
        Y_train.append(labels)

trained_crf = train_crf(X_train, Y_train, n_labels=4, n_features=len(vocab),
                        lr=0.05, epochs=50, reg=0.01)

# Check learned transition matrix
print("\nLearned transition weights (rows=from, cols=to):")
print("        DET   NOUN  VERB  ADJ")
for i, name in enumerate(["DET ", "NOUN", "VERB", "ADJ "]):
    print(f"  {name} [{', '.join(f'{v:+.1f}' for v in trained_crf.T[i])}]")

# Test: decode a new sequence
x_test = np.zeros((6, len(vocab)))
for i, w in enumerate([0, 3, 4, 8, 1, 9]):  # the cat sat on a mat
    x_test[i, w] = 1.0
_, pred = viterbi_decode(trained_crf.emission_scores(x_test), trained_crf.T)
print(f"\nPrediction: {' '.join(tags[t] for t in pred)}")
print(f"Expected:   DET NOUN VERB DET DET NOUN")
```
The learned transition matrix tells the story: DET→NOUN and DET→ADJ get high weights (determiners precede nouns and adjectives), NOUN→VERB is strong (subjects precede verbs), and VERB→VERB stays low (consecutive verbs are rare). These patterns emerge automatically from data — we never told the model English grammar rules.
References & Further Reading
- Lafferty, McCallum & Pereira 2001 — Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data — the original CRF paper that started it all
- Sutton & McCallum 2012 — An Introduction to Conditional Random Fields — the most accessible CRF tutorial, covering theory and practice
- Rabiner 1989 — A Tutorial on Hidden Markov Models and Selected Applications — the classic HMM reference, helpful for understanding CRFs' generative counterpart
- Huang, Xu & Yu 2015 — Bidirectional LSTM-CRF Models for Sequence Tagging — the architecture that combined neural feature learning with CRF structure
- Viterbi 1967 — Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm — origin of the Viterbi algorithm, originally developed for decoding communication signals