Information Theory from Scratch: The Math Behind Every Loss Function
1. What Is Information? — Shannon's Surprise
Every time your neural network computes a loss, it's speaking a language invented in 1948 by a 32-year-old mathematician at Bell Labs. Cross-entropy, KL divergence, perplexity — these aren't arbitrary choices. They're theorems. And they all flow from a single, beautiful question: when someone tells you something, how much did you actually learn?
Claude Shannon answered this by connecting information to surprise. Consider two headlines:
- "The sun rose today" — you already knew this would happen. Zero surprise, zero information.
- "A meteor hit the office" — you had no idea. Enormous surprise, enormous information.
The pattern: information is inversely related to probability. Events you expect carry little information. Events that shock you carry a lot. Shannon formalized this with a single equation:
I(x) = -log₂(p(x))
This gives information in bits. A fair coin flip has probability 0.5, so I = -log₂(0.5) = 1 bit. A fair six-sided die roll: I = -log₂(1/6) ≈ 2.58 bits. A loaded die where one face appears 90% of the time? That common face gives only 0.15 bits; the rare faces give much more. The negative log ensures that certain events (p=1) give zero bits, impossible events (p→0) give infinite bits, and everything else falls smoothly between.
But we rarely care about the information of a single event. We want the expected information over an entire distribution — the average surprise. This is entropy:
H(X) = -∑ p(x) log₂(p(x))
Entropy has two extremes. A uniform distribution (all outcomes equally likely) maximizes entropy — maximum uncertainty, maximum average surprise. A deterministic distribution (one outcome with probability 1) has zero entropy — no uncertainty, no surprise. If you've used entropy as a splitting criterion in decision trees, you've already seen this: information gain is the reduction in entropy after a split. And if you've adjusted softmax temperature, you've controlled the entropy of an output distribution — high temperature pushes toward uniform (high entropy), low temperature pushes toward deterministic (low entropy).
Code Block 1: Information Content and Entropy from Scratch
import numpy as np

def information_content(p):
    """Bits of information from an event with probability p."""
    return -np.log2(np.clip(p, 1e-12, 1.0))

def entropy(probs):
    """Shannon entropy H(X) in bits."""
    probs = np.array(probs, dtype=float)
    probs = probs[probs > 0]  # 0·log(0) = 0 by convention
    return -np.sum(probs * np.log2(probs))

# --- Examples ---
# Fair coin: maximum entropy for 2 outcomes
print(f"Fair coin entropy:    {entropy([0.5, 0.5]):.4f} bits")   # 1.0
print(f"Biased coin (90/10):  {entropy([0.9, 0.1]):.4f} bits")   # 0.469
print(f"Certain coin (100/0): {entropy([1.0, 0.0]):.4f} bits")   # 0.0

# Fair die: maximum entropy for 6 outcomes
print(f"Fair die entropy: {entropy([1/6]*6):.4f} bits")  # 2.585
print(f"Loaded die:       {entropy([0.5,0.1,0.1,0.1,0.1,0.1]):.4f}")  # 2.161

# Binary entropy as a function of p
for p in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    h = entropy([p, 1 - p]) if p > 0 else 0.0
    print(f"  p={p:.1f}: H = {h:.4f} bits")
# Peaks at p=0.5 (1.0 bit): the famous concave binary entropy curve
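The softmax-temperature connection mentioned above is easy to verify numerically. A minimal sketch, assuming the standard temperature-scaled softmax (the logits here are arbitrary):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    z = np.array(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(probs):
    """Shannon entropy in bits, using the 0·log(0) = 0 convention."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

logits = [2.0, 1.0, 0.5, 0.1]
for T in [0.1, 1.0, 10.0]:
    print(f"T = {T:>4}: H = {entropy_bits(softmax(logits, T)):.4f} bits")
# Low T concentrates mass on the top logit (entropy toward 0);
# high T flattens toward uniform (entropy toward log2(4) = 2 bits)
```

The limiting cases reproduce the two entropy extremes: T → 0 approaches a deterministic distribution, T → ∞ approaches uniform.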
2. Cross-Entropy — The Cost of Being Wrong
Entropy tells you the minimum bits needed to encode messages from a distribution if you know the true distribution. But what if you don't? What if the true distribution is p but your model thinks it's q? You'll build your encoding scheme around q, and because q is wrong, you'll waste bits. The average bits per message under this mismatch is cross-entropy:
H(p, q) = -∑ p(x) log q(x)
The key insight: cross-entropy is always at least as large as entropy. H(p, q) ≥ H(p), with equality only when q = p. The gap between them has a name — KL divergence (we'll get there in the next section):
H(p, q) - H(p) = D_KL(p || q)
This is the "wasted bits" — the extra cost of using the wrong model. Now here's why this matters for machine learning: when you train a classifier, the true distribution p is fixed (it's determined by the labels). So minimizing cross-entropy H(p, q) with respect to q is exactly the same as minimizing KL divergence D_KL(p || q). The entropy term H(p) is a constant that vanishes in the gradient.
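The decomposition is easy to check directly. A quick numerical sketch, with each quantity computed from its own definition (the distributions p and q are made up for illustration):

```python
import numpy as np

def entropy(p):
    """H(p) in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl(p, q):
    """D_KL(p || q) in bits, from its own definition."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print(f"H(p)         = {entropy(p):.4f} bits")
print(f"H(p, q)      = {cross_entropy(p, q):.4f} bits")
print(f"D_KL(p || q) = {kl(p, q):.4f} bits")
# H(p) + D_KL(p || q) reproduces H(p, q) exactly: the wasted-bits decomposition
print(f"H(p) + D_KL  = {entropy(p) + kl(p, q):.4f} bits")
```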
This is why cross-entropy is the standard classification loss. It's not an arbitrary choice — it's the information-theoretically optimal way to make your model's distribution match the true distribution. If you've read loss functions from scratch, cross-entropy was introduced as a practical loss. Now you see the deeper reason: it minimizes wasted bits.
For binary classification with true label y ∈ {0, 1} and predicted probability ŷ:
H = -[y log(ŷ) + (1-y) log(1-ŷ)]
For multi-class with one-hot true labels y_k and predicted probabilities ŷ_k:
H = -∑_k y_k log(ŷ_k)
Since only one y_k = 1 and the rest are zero, this simplifies to -log(ŷ_c) where c is the correct class. That's the negative log-probability of the right answer — exactly Shannon's information content applied to your model's confidence.
Code Block 2: Cross-Entropy Loss Derived from Information Theory
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    q = np.clip(q, 1e-12, 1.0)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def binary_cross_entropy(y_true, y_pred):
    """BCE for a single sample (in nats, as used in ML)."""
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_onehot, y_pred):
    """Multi-class cross-entropy for a single sample (in nats)."""
    y_onehot = np.array(y_onehot, dtype=float)
    y_pred = np.clip(np.array(y_pred, dtype=float), 1e-12, 1.0)
    return -np.sum(y_onehot * np.log(y_pred))
# --- Show cross-entropy >= entropy ---
p = [0.7, 0.2, 0.1]
q_good = [0.6, 0.3, 0.1] # close to p
q_bad = [0.1, 0.1, 0.8] # very different from p
print(f"H(p) = {cross_entropy(p, p):.4f} bits (entropy)")
print(f"H(p,q_good)= {cross_entropy(p, q_good):.4f} bits")
print(f"H(p,q_bad) = {cross_entropy(p, q_bad):.4f} bits")
# --- Binary cross-entropy example ---
# True label y=1, model predicts 0.9 (confident and correct)
print(f"\nBCE(y=1, pred=0.9) = {binary_cross_entropy(1, 0.9):.4f}")
# True label y=1, model predicts 0.1 (confident and WRONG)
print(f"BCE(y=1, pred=0.1) = {binary_cross_entropy(1, 0.1):.4f}")
# --- Multi-class example ---
y_true = [0, 1, 0] # class 1 is correct
y_pred = [0.1, 0.8, 0.1]
print(f"\nCCE = {categorical_cross_entropy(y_true, y_pred):.4f}")
# This equals -log(0.8) — just the negative log of the correct class
3. KL Divergence — Measuring Distance Between Distributions
We've seen that cross-entropy measures the total bits needed when your model is wrong, and that the "wasted bits" portion is KL divergence. Let's formalize it:
D_KL(p || q) = ∑ p(x) log(p(x) / q(x))
KL divergence has three important properties:
- Non-negative: D_KL(p || q) ≥ 0 always (Gibbs' inequality). You can never waste negative bits.
- Zero iff identical: D_KL(p || q) = 0 if and only if p = q everywhere.
- NOT symmetric: D_KL(p || q) ≠ D_KL(q || p) in general. This is crucial.
That asymmetry isn't a bug — it's a feature, and understanding it is one of the deepest practical insights in modern ML. Consider fitting a simple model q (say, a single Gaussian) to a complex distribution p (say, bimodal with two peaks):
- Forward KL, D_KL(p || q): penalizes q wherever p is high. Result: q spreads out to cover all modes of p, placing itself between the two peaks. This is mode-covering behavior.
- Reverse KL, D_KL(q || p): penalizes q wherever q is high but p is low. Result: q collapses onto one mode of p, avoiding regions where p is near zero. This is mode-seeking behavior.
This asymmetry shows up everywhere. VAEs minimize D_KL(q(z|x) || p(z)); with the learned posterior in the first slot, this is the reverse (mode-seeking) direction, which keeps the encoder from placing mass where the prior has none. RLHF adds a KL penalty D_KL(π_new || π_ref) to prevent the fine-tuned policy from straying too far from the reference model. And DPO implicitly optimizes the same KL-constrained objective but derives a closed-form solution. In every case, the direction of KL determines what the optimization prioritizes.
Code Block 3: KL Divergence and Its Two Directions
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    q = np.clip(q, 1e-12, 1.0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
# --- Asymmetry demonstration ---
p = [0.4, 0.3, 0.2, 0.1] # true distribution
q = [0.25, 0.25, 0.25, 0.25] # uniform model
print(f"D_KL(p || q) = {kl_divergence(p, q):.4f} nats (forward)")
print(f"D_KL(q || p) = {kl_divergence(q, p):.4f} nats (reverse)")
print(f"They differ! KL is NOT symmetric.\n")
# --- Forward vs reverse KL: fitting to bimodal ---
# Bimodal target: mixture of two Gaussians
x = np.linspace(-6, 6, 1000)
dx = x[1] - x[0]
p_bimodal = 0.5 * np.exp(-0.5*(x+2)**2) + 0.5 * np.exp(-0.5*(x-2)**2)
p_bimodal /= p_bimodal.sum() * dx # normalize
# Candidate Gaussians: centered between modes vs on one mode
q_cover = np.exp(-0.5 * (x/2.5)**2) # wide, centered at 0
q_cover /= q_cover.sum() * dx
q_seek = np.exp(-0.5 * (x-2)**2) # narrow, centered on right mode
q_seek /= q_seek.sum() * dx
p_d, qc_d, qs_d = p_bimodal * dx, q_cover * dx, q_seek * dx
fwd_cover = kl_divergence(p_d, qc_d)
fwd_seek = kl_divergence(p_d, qs_d)
rev_cover = kl_divergence(qc_d, p_d)
rev_seek = kl_divergence(qs_d, p_d)
print("Forward KL D_KL(p||q) — mode-covering wins:")
print(f" q_cover: {fwd_cover:.4f} q_seek: {fwd_seek:.4f}")
print("Reverse KL D_KL(q||p) — mode-seeking wins:")
print(f" q_cover: {rev_cover:.4f} q_seek: {rev_seek:.4f}")
4. Mutual Information — What Do Variables Share?
So far we've measured the information in a single variable (entropy) and the distance between two distributions (KL divergence). Mutual information answers a different question: how much does knowing one variable tell you about another?
I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Equivalently, it's the KL divergence between the joint distribution and the product of marginals:
I(X; Y) = D_KL(p(x,y) || p(x)p(y))
The intuition: if X and Y are independent, then p(x,y) = p(x)p(y) and I(X; Y) = 0. Knowing Y tells you nothing about X. If Y is a deterministic function of X, then I(X; Y) = H(X). Knowing Y tells you everything.
This makes mutual information a natural criterion for feature selection: rank features by how much information they share with the target. Features with high MI are the ones that actually help predict the label. If you've built decision trees, you've already used this idea — information gain at each split is exactly the mutual information between the splitting feature (thresholded) and the target label.
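To make the decision-tree link concrete, here is a small sketch of information gain as entropy reduction; the toy dataset and the threshold are invented for illustration. A perfect split recovers the full label entropy, and a split on noise recovers almost nothing, exactly as the MI identity predicts:

```python
import numpy as np

def entropy_bits(labels):
    """Entropy of a discrete label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    children = (len(left) / n) * entropy_bits(left) \
             + (len(right) / n) * entropy_bits(right)
    return entropy_bits(labels) - children

rng = np.random.default_rng(0)
feature = rng.normal(0, 1, 1000)
labels = (feature > 0).astype(int)  # label fully determined by the feature
noise = rng.normal(0, 1, 1000)      # unrelated feature

print(f"Gain from true feature: {information_gain(feature, labels, 0.0):.4f} bits")
print(f"Gain from noise:        {information_gain(noise, labels, 0.0):.4f} bits")
# The informative split recovers ≈ H(labels) ≈ 1 bit; the noise split gives ≈ 0
```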
Mutual information also appears in modern deep learning. Contrastive learning methods like SimCLR and CLIP maximize a lower bound on MI (the InfoNCE objective) between different views of the same data. Good embeddings are ones that preserve the MI between raw input and the downstream task — they keep the relevant information and discard the noise.
Code Block 4: Mutual Information and Feature Selection
import numpy as np

def mutual_information(x, y, bins=20):
    """Estimate MI between continuous x and discrete y using binning."""
    x_binned = np.digitize(x, np.linspace(x.min(), x.max(), bins))
    # Joint and marginal distributions
    joint = np.zeros((bins + 1, int(y.max()) + 1))
    for xi, yi in zip(x_binned, y.astype(int)):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    # MI = sum p(x,y) * log(p(x,y) / (p(x)*p(y)))
    mask = joint > 0
    independent = px * py
    mi = np.sum(joint[mask] * np.log(joint[mask] / independent[mask]))
    return mi
# --- Feature selection on synthetic data ---
rng = np.random.default_rng(42)
n = 2000
# Feature 0: strongly predictive of label
x0 = rng.normal(0, 1, n)
y = (x0 > 0).astype(int) # label determined by x0
# Feature 1: weakly predictive (noisy copy)
x1 = x0 + rng.normal(0, 3, n)
# Feature 2: pure noise (independent of label)
x2 = rng.normal(0, 1, n)
features = [x0, x1, x2]
names = ["x0 (strong)", "x1 (weak)", "x2 (noise)"]
print("Mutual Information with target label:")
for name, feat in zip(names, features):
mi = mutual_information(feat, y)
print(f" {name}: MI = {mi:.4f} nats")
# x0 has highest MI — it determines the label
# x2 has near-zero MI — it's independent noise
5. Perplexity — Evaluating Language Models
Every concept we've built leads to one of the most important metrics in natural language processing: perplexity. If your language model assigns cross-entropy H(p, q) to a test corpus, then perplexity is:
PPL = 2^H(p, q)
Or equivalently (since ML frameworks use the natural log): PPL = e^(cross-entropy loss)
The interpretation is elegant: a perplexity of 10 means the model is "as uncertain as if it were choosing uniformly among 10 options at each step." Lower perplexity means the model is more confident about correct predictions — it's less "perplexed" by the text.
The progression across model architectures is dramatic:
| Model | Perplexity | Interpretation |
|---|---|---|
| Random (uniform over vocab) | ~50,000 | Choosing randomly among all words |
| Unigram (word frequencies) | ~1,000 | Knows which words are common |
| Bigram | ~200 | Knows which word pairs occur |
| LSTM | ~60 | Captures long-range patterns |
| Transformer (GPT-2 scale) | ~20 | Deep contextual understanding |
That's a 2,500× improvement from random to transformer — each architectural advance learns increasingly sophisticated patterns about language. And notice: perplexity and cross-entropy loss are the same thing on different scales. When you watch your language model's training loss decrease from 10.8 to 3.0, it's going from PPL = e^10.8 ≈ 50,000 (random) to PPL = e^3.0 ≈ 20 (transformer). Every reduction in loss is a reduction in the model's "surprise" at real language.
One important caveat, discussed in tokenization from scratch: you can't directly compare perplexity across models with different tokenizers. A model with a 32K vocabulary and a model with a 50K vocabulary are playing different games — more tokens means higher "baseline" perplexity. Always compare perplexity within the same tokenization scheme, or normalize to per-character or per-byte perplexity for fair comparisons. This distinction matters for decoding strategies too — perplexity measures model quality, while decoding strategy determines generation quality.
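A sketch of that per-byte normalization, assuming hypothetical models, losses, and token counts on the same text (all numbers invented): total nats are conserved, so per-token loss times token count equals per-byte loss times byte count.

```python
import numpy as np

def per_byte_perplexity(per_token_loss, n_tokens, n_bytes):
    """Convert per-token cross-entropy (nats) to per-byte perplexity.

    Total nats are conserved: n_tokens * per_token_loss == n_bytes * per_byte_loss.
    """
    per_byte_loss = per_token_loss * n_tokens / n_bytes
    return np.exp(per_byte_loss)

n_bytes = 10_000  # same hypothetical 10,000-byte test text for both models

# Model A: coarse tokenizer -> fewer tokens, higher per-token loss
ppl_a = per_byte_perplexity(3.2, n_tokens=2_000, n_bytes=n_bytes)
# Model B: fine tokenizer -> more tokens, lower per-token loss
ppl_b = per_byte_perplexity(2.0, n_tokens=3_200, n_bytes=n_bytes)

print(f"Model A: per-token PPL = {np.exp(3.2):6.1f}, per-byte PPL = {ppl_a:.3f}")
print(f"Model B: per-token PPL = {np.exp(2.0):6.1f}, per-byte PPL = {ppl_b:.3f}")
# Per-token perplexities differ wildly; per-byte perplexities come out identical,
# revealing that the two hypothetical models compress the text equally well
```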
Code Block 5: Computing Perplexity for Language Models
import numpy as np

def perplexity_from_logprobs(log_probs):
    """Perplexity from per-token log-probabilities (base e)."""
    avg_nll = -np.mean(log_probs)
    return np.exp(avg_nll)

def perplexity_from_loss(cross_entropy_loss):
    """Perplexity from average cross-entropy loss."""
    return np.exp(cross_entropy_loss)
# --- Simulate three language models on the same 10-token sequence ---
vocab_size = 50000
# Model 1: Random (uniform over vocabulary)
random_logprobs = np.full(10, np.log(1.0 / vocab_size))
ppl_random = perplexity_from_logprobs(random_logprobs)
# Model 2: Bigram (moderately concentrated predictions)
bigram_logprobs = np.log([0.005, 0.01, 0.008, 0.02, 0.003,
                          0.015, 0.007, 0.012, 0.009, 0.006])
ppl_bigram = perplexity_from_logprobs(bigram_logprobs)
# Model 3: Transformer (highly concentrated predictions)
transformer_logprobs = np.log([0.15, 0.08, 0.12, 0.20, 0.05,
                               0.10, 0.07, 0.18, 0.09, 0.06])
ppl_transformer = perplexity_from_logprobs(transformer_logprobs)
print("Perplexity comparison:")
print(f" Random: PPL = {ppl_random:,.0f}")
print(f" Bigram: PPL = {ppl_bigram:,.0f}")
print(f" Transformer: PPL = {ppl_transformer:.1f}")
# --- Connection: loss and perplexity are interchangeable ---
loss = 3.0
print(f"\nCross-entropy loss {loss:.1f} = PPL {perplexity_from_loss(loss):.1f}")
loss = 1.5
print(f"Cross-entropy loss {loss:.1f} = PPL {perplexity_from_loss(loss):.1f}")
6. Information Theory in Modern AI — The Big Connections
Here's the payoff. Nearly every algorithm we've studied in this series is optimizing an information-theoretic objective. They all speak the same language — the language of bits, surprise, and divergence. Here's the unifying view:
| Algorithm | What It Optimizes | Information-Theoretic Form |
|---|---|---|
| Classification (any model) | Cross-entropy loss | Minimize D_KL(p_true || q_model) |
| VAEs | ELBO | Minimize reconstruction H(p,q) + KL regularization |
| Diffusion models | Noise prediction loss | Minimize cross-entropy in noise-prediction space |
| Contrastive learning | InfoNCE | Maximize lower bound on MI(view1, view2) |
| Decision trees | Information gain | Maximize MI(split feature, label) |
| RLHF / DPO | Reward with KL penalty | max R(π) - β D_KL(π || π_ref) |
| Language models | Next-token loss | Minimize per-token cross-entropy = log(PPL) |
The entire journey from loss functions to diffusion models is really about one thing: minimizing surprise. Classification minimizes the surprise when your model sees a label. Language models minimize surprise at the next token. VAEs minimize surprise at reconstructed data while keeping the latent space unsurprising. Even RLHF is about optimizing reward while not being too surprisingly different from the reference policy.
Shannon's 1948 insight — that information equals surprise, and that we can quantify surprise with logarithms — turned out to be the mathematical foundation of modern AI. Every cross-entropy loss you've computed, every perplexity score you've checked, every KL penalty you've applied is a direct descendant of that single, beautiful idea.
Code Block 6: The Unifying Table
# The information-theoretic view of machine learning
connections = [
    ("Classification", "Cross-entropy loss", "min D_KL(p_true || q_model)"),
    ("VAEs", "ELBO", "min H(p,q) + D_KL(q(z|x)||p(z))"),
    ("Diffusion Models", "Noise prediction", "min cross-entropy in noise space"),
    ("Contrastive (CL)", "InfoNCE", "max MI lower bound"),
    ("Decision Trees", "Information gain", "max MI(feature, label)"),
    ("RLHF / DPO", "Reward + KL penalty", "max R - beta * D_KL(pi||pi_ref)"),
    ("Language Models", "Next-token prediction", "min per-token H(p,q) = log(PPL)"),
]

print(f"{'Algorithm':<22} {'Loss/Objective':<24} {'Info-Theoretic Form'}")
print("-" * 78)
for algo, loss, info in connections:
    print(f"{algo:<22} {loss:<24} {info}")

print("\nOne language. One framework. Every loss function is about")
print("minimizing surprise — the core insight of information theory.")
Try It: Entropy Explorer
Drag the sliders to set probabilities for each outcome. Probabilities auto-normalize to sum to 1. Watch how entropy changes — uniform = maximum entropy, concentrated = low entropy. The bar heights show optimal code lengths: likely events get short codes, rare events get long codes.
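The code-length claim follows from Shannon's source coding result: an event with probability p gets an ideal code of about -log2(p) bits, and the probability-weighted average of those lengths is exactly the entropy. A quick check with a dyadic distribution (chosen so the ideal lengths come out as whole numbers):

```python
import numpy as np

probs = np.array([0.5, 0.25, 0.125, 0.125])  # dyadic: each p is a power of 1/2
code_lengths = -np.log2(probs)               # ideal code lengths: 1, 2, 3, 3 bits
avg_length = np.sum(probs * code_lengths)    # expected bits per symbol
entropy = -np.sum(probs * np.log2(probs))

print(f"Code lengths:   {code_lengths}")      # [1. 2. 3. 3.]
print(f"Average length: {avg_length:.2f} bits")
print(f"Entropy:        {entropy:.2f} bits")  # identical: 1.75
```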
Try It: KL Divergence Visualizer
Adjust two Gaussian distributions P and Q. Watch how KL divergence changes — and notice the asymmetry: D_KL(P||Q) ≠ D_KL(Q||P). The shaded region shows where the divergence comes from.
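For two Gaussians, the divergence in this visualizer has a well-known closed form, D_KL(N(μ₁,σ₁²) || N(μ₂,σ₂²)) = ln(σ₂/σ₁) + (σ₁² + (μ₁-μ₂)²)/(2σ₂²) - 1/2, which makes the asymmetry easy to check; the parameter values below are arbitrary:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """D_KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats, via the closed form."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu_p, s_p = 0.0, 1.0  # distribution P
mu_q, s_q = 1.0, 2.0  # distribution Q

print(f"D_KL(P || Q) = {kl_gauss(mu_p, s_p, mu_q, s_q):.4f} nats")
print(f"D_KL(Q || P) = {kl_gauss(mu_q, s_q, mu_p, s_p):.4f} nats")
# Identical Gaussians give zero divergence, matching Gibbs' inequality
print(f"D_KL(P || P) = {kl_gauss(mu_p, s_p, mu_p, s_p):.4f} nats")
```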
References & Further Reading
- Claude Shannon — A Mathematical Theory of Communication (1948) — the paper that launched information theory and defined entropy, cross-entropy, and channel capacity
- Cover & Thomas — Elements of Information Theory (2006) — the standard graduate textbook covering entropy, KL divergence, mutual information, and rate-distortion theory
- David MacKay — Information Theory, Inference, and Learning Algorithms (2003) — freely available textbook bridging information theory and machine learning
- Goodfellow, Bengio & Courville — Deep Learning, Chapter 3 — probability and information theory foundations for deep learning
- Christopher Olah — Visual Information Theory — outstanding visual intuition for entropy, cross-entropy, and KL divergence
- DadOps — Loss Functions from Scratch — cross-entropy in the broader context of all loss functions
- DadOps — Decision Trees from Scratch — information gain as a splitting criterion
- DadOps — Autoencoders from Scratch — KL divergence in the VAE loss
- DadOps — RLHF from Scratch — KL penalty for policy optimization