Information Theory from Scratch: The Math Behind Every Loss Function
1. What Is Information? — Shannon's Surprise
Every time your neural network computes a loss, it's speaking a language invented in 1948 by a 32-year-old mathematician at Bell Labs. Cross-entropy, KL divergence, perplexity — these aren't arbitrary choices. They're theorems. And they all flow from a single, beautiful question: when someone tells you something, how much did you actually learn?
Claude Shannon answered this by connecting information to surprise. Consider two headlines:
- "The sun rose today" — you already knew this would happen. Zero surprise, zero information.
- "A meteor hit the office" — you had no idea. Enormous surprise, enormous information.
The pattern: information is inversely related to probability. Events you expect carry little information. Events that shock you carry a lot. Shannon formalized this with a single equation:
I(x) = -log₂(p(x))
This gives information in bits. A fair coin flip has probability 0.5, so I = -log₂(0.5) = 1 bit. A fair six-sided die roll: I = -log₂(1/6) ≈ 2.58 bits. A loaded die where one face appears 90% of the time? That common face gives only 0.15 bits; the rare faces give much more. The negative log ensures that certain events (p=1) give zero bits, impossible events (p→0) give infinite bits, and everything else falls smoothly between.
But we rarely care about the information of a single event. We want the expected information over an entire distribution — the average surprise. This is entropy:
H(X) = -∑ p(x) log₂(p(x))
Entropy has two extremes. A uniform distribution (all outcomes equally likely) maximizes entropy — maximum uncertainty, maximum average surprise. A deterministic distribution (one outcome with probability 1) has zero entropy — no uncertainty, no surprise. If you've used entropy as a splitting criterion in decision trees, you've already seen this: information gain is the reduction in entropy after a split. And if you've adjusted softmax temperature, you've controlled the entropy of an output distribution — high temperature pushes toward uniform (high entropy), low temperature pushes toward deterministic (low entropy).
Code Block 1: Information Content and Entropy from Scratch
import numpy as np

def information_content(p):
    """Bits of information from an event with probability p."""
    return -np.log2(np.clip(p, 1e-12, 1.0))

def entropy(probs):
    """Shannon entropy H(X) in bits."""
    probs = np.array(probs, dtype=float)
    probs = probs[probs > 0]  # 0·log(0) = 0 by convention
    return -np.sum(probs * np.log2(probs))

# --- Examples ---
# Fair coin: maximum entropy for 2 outcomes
print(f"Fair coin entropy:    {entropy([0.5, 0.5]):.4f} bits")   # 1.0
print(f"Biased coin (90/10):  {entropy([0.9, 0.1]):.4f} bits")   # 0.469
print(f"Certain coin (100/0): {entropy([1.0, 0.0]):.4f} bits")   # 0.0

# Fair die: maximum entropy for 6 outcomes
print(f"Fair die entropy: {entropy([1/6]*6):.4f} bits")  # 2.585
print(f"Loaded die:       {entropy([0.5,0.1,0.1,0.1,0.1,0.1]):.4f}")  # 2.161

# Binary entropy as a function of p
for p in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    h = entropy([p, 1 - p]) if p > 0 else 0.0
    print(f"  p={p:.1f}: H = {h:.4f} bits")
# Peaks at p=0.5 (1.0 bit): the famous concave binary entropy curve
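The softmax-temperature connection mentioned above is easy to verify numerically. A minimal sketch, assuming the standard temperature-scaled softmax (the logits here are arbitrary):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    z = np.array(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(probs):
    """Shannon entropy in bits, using the 0·log(0) = 0 convention."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

logits = [2.0, 1.0, 0.5, 0.1]
for T in [0.1, 1.0, 10.0]:
    print(f"T = {T:>4}: H = {entropy_bits(softmax(logits, T)):.4f} bits")
# Low T concentrates mass on the top logit (entropy toward 0);
# high T flattens toward uniform (entropy toward log2(4) = 2 bits)
```

The limiting cases reproduce the two entropy extremes: T → 0 approaches a deterministic distribution, T → ∞ approaches uniform.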
2. Cross-Entropy — The Cost of Being Wrong
Entropy tells you the minimum bits needed to encode messages from a distribution if you know the true distribution. But what if you don't? What if the true distribution is p but your model thinks it's q? You'll build your encoding scheme around q, and because q is wrong, you'll waste bits. The average bits per message under this mismatch is cross-entropy:
H(p, q) = -∑ p(x) log q(x)
The key insight: cross-entropy is always at least as large as entropy. H(p, q) ≥ H(p), with equality only when q = p. The gap between them has a name — KL divergence (we'll get there in the next section):
H(p, q) - H(p) = D_KL(p || q)
This is the "wasted bits" — the extra cost of using the wrong model. Now here's why this matters for machine learning: when you train a classifier, the true distribution p is fixed (it's determined by the labels). So minimizing cross-entropy H(p, q) with respect to q is exactly the same as minimizing KL divergence D_KL(p || q). The entropy term H(p) is a constant that vanishes in the gradient.
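The decomposition is easy to check directly. A quick numerical sketch, with each quantity computed from its own definition (the distributions p and q are made up for illustration):

```python
import numpy as np

def entropy(p):
    """H(p) in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl(p, q):
    """D_KL(p || q) in bits, from its own definition."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print(f"H(p)         = {entropy(p):.4f} bits")
print(f"H(p, q)      = {cross_entropy(p, q):.4f} bits")
print(f"D_KL(p || q) = {kl(p, q):.4f} bits")
# H(p) + D_KL(p || q) reproduces H(p, q) exactly: the wasted-bits decomposition
print(f"H(p) + D_KL  = {entropy(p) + kl(p, q):.4f} bits")
```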
This is why cross-entropy is the standard classification loss. It's not an arbitrary choice — it's the information-theoretically optimal way to make your model's distribution match the true distribution. If you've read loss functions from scratch, cross-entropy was introduced as a practical loss. Now you see the deeper reason: it minimizes wasted bits.
For binary classification with true label y ∈ {0, 1} and predicted probability ŷ:
H = -[y log(ŷ) + (1-y) log(1-ŷ)]
For multi-class with one-hot true labels y_k and predicted probabilities ŷ_k:
H = -∑_k y_k log(ŷ_k)
Since only one y_k = 1 and the rest are zero, this simplifies to -log(ŷ_c) where c is the correct class. That's the negative log-probability of the right answer — exactly Shannon's information content applied to your model's confidence.
Code Block 2: Cross-Entropy Loss Derived from Information Theory
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    q = np.clip(q, 1e-12, 1.0)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def binary_cross_entropy(y_true, y_pred):
    """BCE for a single sample (in nats, as used in ML)."""
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_onehot, y_pred):
    """Multi-class cross-entropy for a single sample (in nats)."""
    y_onehot = np.array(y_onehot, dtype=float)
    y_pred = np.clip(np.array(y_pred, dtype=float), 1e-12, 1.0)
    return -np.sum(y_onehot * np.log(y_pred))
# --- Show cross-entropy >= entropy ---
p = [0.7, 0.2, 0.1]
q_good = [0.6, 0.3, 0.1] # close to p
q_bad = [0.1, 0.1, 0.8] # very different from p
print(f"H(p) = {cross_entropy(p, p):.4f} bits (entropy)")
print(f"H(p,q_good)= {cross_entropy(p, q_good):.4f} bits")
print(f"H(p,q_bad) = {cross_entropy(p, q_bad):.4f} bits")
# --- Binary cross-entropy example ---
# True label y=1, model predicts 0.9 (confident and correct)
print(f"\nBCE(y=1, pred=0.9) = {binary_cross_entropy(1, 0.9):.4f}")
# True label y=1, model predicts 0.1 (confident and WRONG)
print(f"BCE(y=1, pred=0.1) = {binary_cross_entropy(1, 0.1):.4f}")
# --- Multi-class example ---
y_true = [0, 1, 0] # class 1 is correct
y_pred = [0.1, 0.8, 0.1]
print(f"\nCCE = {categorical_cross_entropy(y_true, y_pred):.4f}")
# This equals -log(0.8) — just the negative log of the correct class
3. KL Divergence — Measuring Distance Between Distributions
We've seen that cross-entropy measures the total bits needed when your model is wrong, and that the "wasted bits" portion is KL divergence. Let's formalize it:
D_KL(p || q) = ∑ p(x) log(p(x) / q(x))
KL divergence has three important properties:
- Non-negative: D_KL(p || q) ≥ 0 always (Gibbs' inequality). You can never waste negative bits.
- Zero iff identical: D_KL(p || q) = 0 if and only if p = q everywhere.
- NOT symmetric: D_KL(p || q) ≠ D_KL(q || p) in general. This is crucial.
That asymmetry isn't a bug — it's a feature, and understanding it is one of the deepest practical insights in modern ML. Consider fitting a simple model q (say, a single Gaussian) to a complex distribution p (say, bimodal with two peaks):
- Forward KL, D_KL(p || q): penalizes q wherever p is high. Result: q spreads out to cover all modes of p, placing itself between the two peaks. This is mode-covering behavior.
- Reverse KL, D_KL(q || p): penalizes q wherever q is high but p is low. Result: q collapses onto one mode of p, avoiding regions where p is near zero. This is mode-seeking behavior.
This asymmetry shows up everywhere. VAEs minimize D_KL(q(z|x) || p(z)); with the learned posterior in the first slot, this is the reverse (mode-seeking) direction, which keeps the encoder from placing mass where the prior has none. RLHF adds a KL penalty D_KL(π_new || π_ref) to prevent the fine-tuned policy from straying too far from the reference model. And DPO implicitly optimizes the same KL-constrained objective but derives a closed-form solution. In every case, the direction of KL determines what the optimization prioritizes.
Code Block 3: KL Divergence and Its Two Directions
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    q = np.clip(q, 1e-12, 1.0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
# --- Asymmetry demonstration ---
p = [0.4, 0.3, 0.2, 0.1] # true distribution
q = [0.25, 0.25, 0.25, 0.25] # uniform model
print(f"D_KL(p || q) = {kl_divergence(p, q):.4f} nats (forward)")
print(f"D_KL(q || p) = {kl_divergence(q, p):.4f} nats (reverse)")
print(f"They differ! KL is NOT symmetric.\n")
# --- Forward vs reverse KL: fitting to bimodal ---
# Bimodal target: mixture of two Gaussians
x = np.linspace(-6, 6, 1000)
dx = x[1] - x[0]
p_bimodal = 0.5 * np.exp(-0.5*(x+2)**2) + 0.5 * np.exp(-0.5*(x-2)**2)
p_bimodal /= p_bimodal.sum() * dx # normalize
# Candidate Gaussians: centered between modes vs on one mode
q_cover = np.exp(-0.5 * (x/2.5)**2) # wide, centered at 0
q_cover /= q_cover.sum() * dx
q_seek = np.exp(-0.5 * (x-2)**2) # narrow, centered on right mode
q_seek /= q_seek.sum() * dx
p_d, qc_d, qs_d = p_bimodal * dx, q_cover * dx, q_seek * dx
fwd_cover = kl_divergence(p_d, qc_d)
fwd_seek = kl_divergence(p_d, qs_d)
rev_cover = kl_divergence(qc_d, p_d)
rev_seek = kl_divergence(qs_d, p_d)
print("Forward KL D_KL(p||q) — mode-covering wins:")
print(f" q_cover: {fwd_cover:.4f} q_seek: {fwd_seek:.4f}")
print("Reverse KL D_KL(q||p) — mode-seeking wins:")
print(f" q_cover: {rev_cover:.4f} q_seek: {rev_seek:.4f}")
4. Mutual Information — What Do Variables Share?
So far we've measured the information in a single variable (entropy) and the distance between two distributions (KL divergence). Mutual information answers a different question: how much does knowing one variable tell you about another?
I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Equivalently, it's the KL divergence between the joint distribution and the product of marginals:
I(X; Y) = D_KL(p(x,y) || p(x)p(y))
The intuition: if X and Y are independent, then p(x,y) = p(x)p(y) and I(X; Y) = 0. Knowing Y tells you nothing about X. If Y is a deterministic function of X, then I(X; Y) = H(X). Knowing Y tells you everything.
This makes mutual information a natural criterion for feature selection: rank features by how much information they share with the target. Features with high MI are the ones that actually help predict the label. If you've built decision trees, you've already used this idea — information gain at each split is exactly the mutual information between the splitting feature (thresholded) and the target label.
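To make the decision-tree link concrete, here is a small sketch of information gain as entropy reduction; the toy dataset and the threshold are invented for illustration. A perfect split recovers the full label entropy, and a split on noise recovers almost nothing, exactly as the MI identity predicts:

```python
import numpy as np

def entropy_bits(labels):
    """Entropy of a discrete label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    children = (len(left) / n) * entropy_bits(left) \
             + (len(right) / n) * entropy_bits(right)
    return entropy_bits(labels) - children

rng = np.random.default_rng(0)
feature = rng.normal(0, 1, 1000)
labels = (feature > 0).astype(int)  # label fully determined by the feature
noise = rng.normal(0, 1, 1000)      # unrelated feature

print(f"Gain from true feature: {information_gain(feature, labels, 0.0):.4f} bits")
print(f"Gain from noise:        {information_gain(noise, labels, 0.0):.4f} bits")
# The informative split recovers ≈ H(labels) ≈ 1 bit; the noise split gives ≈ 0
```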
Mutual information also appears in modern deep learning. Contrastive learning methods like SimCLR and CLIP maximize a lower bound on MI (the InfoNCE objective) between different views of the same data. Good embeddings are ones that preserve the MI between raw input and the downstream task — they keep the relevant information and discard the noise.
Code Block 4: Mutual Information and Feature Selection
import numpy as np

def mutual_information(x, y, bins=20):
    """Estimate MI between continuous x and discrete y using binning."""
    x_binned = np.digitize(x, np.linspace(x.min(), x.max(), bins))
    # Joint and marginal distributions
    joint = np.zeros((bins + 1, int(y.max()) + 1))
    for xi, yi in zip(x_binned, y.astype(int)):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    # MI = sum p(x,y) * log(p(x,y) / (p(x)*p(y)))
    mask = joint > 0
    independent = px * py
    mi = np.sum(joint[mask] * np.log(joint[mask] / independent[mask]))
    return mi
# --- Feature selection on synthetic data ---
rng = np.random.default_rng(42)
n = 2000
# Feature 0: strongly predictive of label
x0 = rng.normal(0, 1, n)
y = (x0 > 0).astype(int) # label determined by x0
# Feature 1: weakly predictive (noisy copy)
x1 = x0 + rng.normal(0, 3, n)
# Feature 2: pure noise (independent of label)
x2 = rng.normal(0, 1, n)
features = [x0, x1, x2]
names = ["x0 (strong)", "x1 (weak)", "x2 (noise)"]
print("Mutual Information with target label:")
for name, feat in zip(names, features):
mi = mutual_information(feat, y)
print(f" {name}: MI = {mi:.4f} nats")
# x0 has highest MI — it determines the label
# x2 has near-zero MI — it's independent noise
5. Perplexity — Evaluating Language Models
Every concept we've built leads to one of the most important metrics in natural language processing: perplexity. If your language model assigns cross-entropy H(p, q) to a test corpus, then perplexity is:
PPL = 2^H(p, q)
Or equivalently (since ML frameworks use the natural log): PPL = e^(cross-entropy loss)
The interpretation is elegant: a perplexity of 10 means the model is "as uncertain as if it were choosing uniformly among 10 options at each step." Lower perplexity means the model is more confident about correct predictions — it's less "perplexed" by the text.
The progression across model architectures is dramatic:
| Model | Perplexity | Interpretation |
|---|---|---|
| Random (uniform over vocab) | ~50,000 | Choosing randomly among all words |
| Unigram (word frequencies) | ~1,000 | Knows which words are common |
| Bigram | ~200 | Knows which word pairs occur |
| LSTM | ~60 | Captures long-range patterns |
| Transformer (GPT-2 scale) | ~20 | Deep contextual understanding |
That's a 2,500× improvement from random to transformer — each architectural advance learns increasingly sophisticated patterns about language. And notice: perplexity and cross-entropy loss are the same thing on different scales. When you watch your language model's training loss decrease from 10.8 to 3.0, it's going from PPL = e^10.8 ≈ 50,000 (random) to PPL = e^3.0 ≈ 20 (transformer). Every reduction in loss is a reduction in the model's "surprise" at real language.
One important caveat, discussed in tokenization from scratch: you can't directly compare perplexity across models with different tokenizers. A model with a 32K vocabulary and a model with a 50K vocabulary are playing different games — more tokens means higher "baseline" perplexity. Always compare perplexity within the same tokenization scheme, or normalize to per-character or per-byte perplexity for fair comparisons. This distinction matters for decoding strategies too — perplexity measures model quality, while decoding strategy determines generation quality.
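A sketch of that per-byte normalization, assuming hypothetical models, losses, and token counts on the same text (all numbers invented): total nats are conserved, so per-token loss times token count equals per-byte loss times byte count.

```python
import numpy as np

def per_byte_perplexity(per_token_loss, n_tokens, n_bytes):
    """Convert per-token cross-entropy (nats) to per-byte perplexity.

    Total nats are conserved: n_tokens * per_token_loss == n_bytes * per_byte_loss.
    """
    per_byte_loss = per_token_loss * n_tokens / n_bytes
    return np.exp(per_byte_loss)

n_bytes = 10_000  # same hypothetical 10,000-byte test text for both models

# Model A: coarse tokenizer -> fewer tokens, higher per-token loss
ppl_a = per_byte_perplexity(3.2, n_tokens=2_000, n_bytes=n_bytes)
# Model B: fine tokenizer -> more tokens, lower per-token loss
ppl_b = per_byte_perplexity(2.0, n_tokens=3_200, n_bytes=n_bytes)

print(f"Model A: per-token PPL = {np.exp(3.2):6.1f}, per-byte PPL = {ppl_a:.3f}")
print(f"Model B: per-token PPL = {np.exp(2.0):6.1f}, per-byte PPL = {ppl_b:.3f}")
# Per-token perplexities differ wildly; per-byte perplexities come out identical,
# revealing that the two hypothetical models compress the text equally well
```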
Code Block 5: Computing Perplexity for Language Models
import numpy as np

def perplexity_from_logprobs(log_probs):
    """Perplexity from per-token log-probabilities (base e)."""
    avg_nll = -np.mean(log_probs)
    return np.exp(avg_nll)

def perplexity_from_loss(cross_entropy_loss):
    """Perplexity from average cross-entropy loss."""
    return np.exp(cross_entropy_loss)
# --- Simulate three language models on the same 10-token sequence ---
vocab_size = 50000
# Model 1: Random (uniform over vocabulary)
random_logprobs = np.full(10, np.log(1.0 / vocab_size))
ppl_random = perplexity_from_logprobs(random_logprobs)
# Model 2: Bigram (moderately concentrated predictions)
bigram_logprobs = np.log([0.005, 0.01, 0.008, 0.02, 0.003,
                          0.015, 0.007, 0.012, 0.009, 0.006])
ppl_bigram = perplexity_from_logprobs(bigram_logprobs)
# Model 3: Transformer (highly concentrated predictions)
transformer_logprobs = np.log([0.15, 0.08, 0.12, 0.20, 0.05,
                               0.10, 0.07, 0.18, 0.09, 0.06])
ppl_transformer = perplexity_from_logprobs(transformer_logprobs)
print("Perplexity comparison:")
print(f" Random: PPL = {ppl_random:,.0f}")
print(f" Bigram: PPL = {ppl_bigram:,.0f}")
print(f" Transformer: PPL = {ppl_transformer:.1f}")
# --- Connection: loss and perplexity are interchangeable ---
loss = 3.0
print(f"\nCross-entropy loss {loss:.1f} = PPL {perplexity_from_loss(loss):.1f}")
loss = 1.5
print(f"Cross-entropy loss {loss:.1f} = PPL {perplexity_from_loss(loss):.1f}")
6. Information Theory in Modern AI — The Big Connections
Here's the payoff. Nearly every algorithm we've studied in this series is optimizing an information-theoretic objective. They all speak the same language — the language of bits, surprise, and divergence. Here's the unifying view:
| Algorithm | What It Optimizes | Information-Theoretic Form |
|---|---|---|
| Classification (any model) | Cross-entropy loss | Minimize D_KL(p_true || q_model) |
| VAEs | ELBO | Minimize reconstruction H(p,q) + KL regularization |
| Diffusion models | Noise prediction loss | Minimize cross-entropy in noise-prediction space |
| Contrastive learning | InfoNCE | Maximize lower bound on MI(view1, view2) |
| Decision trees | Information gain | Maximize MI(split feature, label) |
| RLHF / DPO | Reward with KL penalty | max R(π) - β D_KL(π || π_ref) |
| Language models | Next-token loss | Minimize per-token cross-entropy = log(PPL) |
The entire journey from loss functions to diffusion models is really about one thing: minimizing surprise. Classification minimizes the surprise when your model sees a label. Language models minimize surprise at the next token. VAEs minimize surprise at reconstructed data while keeping the latent space unsurprising. Even RLHF is about optimizing reward while not being too surprisingly different from the reference policy.
Shannon's 1948 insight — that information equals surprise, and that we can quantify surprise with logarithms — turned out to be the mathematical foundation of modern AI. Every cross-entropy loss you've computed, every perplexity score you've checked, every KL penalty you've applied is a direct descendant of that single, beautiful idea.
Code Block 6: The Unifying Table
# The information-theoretic view of machine learning
connections = [
    ("Classification", "Cross-entropy loss", "min D_KL(p_true || q_model)"),
    ("VAEs", "ELBO", "min H(p,q) + D_KL(q(z|x)||p(z))"),
    ("Diffusion Models", "Noise prediction", "min cross-entropy in noise space"),
    ("Contrastive (CL)", "InfoNCE", "max MI lower bound"),
    ("Decision Trees", "Information gain", "max MI(feature, label)"),
    ("RLHF / DPO", "Reward + KL penalty", "max R - beta * D_KL(pi||pi_ref)"),
    ("Language Models", "Next-token prediction", "min per-token H(p,q) = log(PPL)"),
]

print(f"{'Algorithm':<22} {'Loss/Objective':<24} {'Info-Theoretic Form'}")
print("-" * 78)
for algo, loss, info in connections:
    print(f"{algo:<22} {loss:<24} {info}")

print("\nOne language. One framework. Every loss function is about")
print("minimizing surprise — the core insight of information theory.")
Try It: Entropy Explorer
Drag the sliders to set probabilities for each outcome. Probabilities auto-normalize to sum to 1. Watch how entropy changes — uniform = maximum entropy, concentrated = low entropy. The bar heights show optimal code lengths: likely events get short codes, rare events get long codes.
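The code-length claim follows from Shannon's source coding result: an event with probability p gets an ideal code of about -log2(p) bits, and the probability-weighted average of those lengths is exactly the entropy. A quick check with a dyadic distribution (chosen so the ideal lengths come out as whole numbers):

```python
import numpy as np

probs = np.array([0.5, 0.25, 0.125, 0.125])  # dyadic: each p is a power of 1/2
code_lengths = -np.log2(probs)               # ideal code lengths: 1, 2, 3, 3 bits
avg_length = np.sum(probs * code_lengths)    # expected bits per symbol
entropy = -np.sum(probs * np.log2(probs))

print(f"Code lengths:   {code_lengths}")      # [1. 2. 3. 3.]
print(f"Average length: {avg_length:.2f} bits")
print(f"Entropy:        {entropy:.2f} bits")  # identical: 1.75
```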
Try It: KL Divergence Visualizer
Adjust two Gaussian distributions P and Q. Watch how KL divergence changes — and notice the asymmetry: D_KL(P||Q) ≠ D_KL(Q||P). The shaded region shows where the divergence comes from.
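For two Gaussians, the divergence in this visualizer has a well-known closed form, D_KL(N(μ₁,σ₁²) || N(μ₂,σ₂²)) = ln(σ₂/σ₁) + (σ₁² + (μ₁-μ₂)²)/(2σ₂²) - 1/2, which makes the asymmetry easy to check; the parameter values below are arbitrary:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """D_KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats, via the closed form."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu_p, s_p = 0.0, 1.0  # distribution P
mu_q, s_q = 1.0, 2.0  # distribution Q

print(f"D_KL(P || Q) = {kl_gauss(mu_p, s_p, mu_q, s_q):.4f} nats")
print(f"D_KL(Q || P) = {kl_gauss(mu_q, s_q, mu_p, s_p):.4f} nats")
# Identical Gaussians give zero divergence, matching Gibbs' inequality
print(f"D_KL(P || P) = {kl_gauss(mu_p, s_p, mu_p, s_p):.4f} nats")
```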
References & Further Reading
- Claude Shannon — A Mathematical Theory of Communication (1948) — the paper that launched information theory and defined entropy, cross-entropy, and channel capacity
- Cover & Thomas — Elements of Information Theory (2006) — the standard graduate textbook covering entropy, KL divergence, mutual information, and rate-distortion theory
- David MacKay — Information Theory, Inference, and Learning Algorithms (2003) — freely available textbook bridging information theory and machine learning
- Goodfellow, Bengio & Courville — Deep Learning, Chapter 3 — probability and information theory foundations for deep learning
- Christopher Olah — Visual Information Theory — outstanding visual intuition for entropy, cross-entropy, and KL divergence
- DadOps — Loss Functions from Scratch — cross-entropy in the broader context of all loss functions
- DadOps — Decision Trees from Scratch — information gain as a splitting criterion
- DadOps — Autoencoders from Scratch — KL divergence in the VAE loss
- DadOps — RLHF from Scratch — KL penalty for policy optimization