
Information Theory from Scratch: The Math Behind Every Loss Function

1. What Is Information? — Shannon's Surprise

Every time your neural network computes a loss, it's speaking a language invented in 1948 by a 32-year-old mathematician at Bell Labs. Cross-entropy, KL divergence, perplexity — these aren't arbitrary choices. They're theorems. And they all flow from a single, beautiful question: when someone tells you something, how much did you actually learn?

Claude Shannon answered this by connecting information to surprise. Consider two headlines: "The sun rose this morning" tells you almost nothing — you already knew it would. "Snow fell in the Sahara today" tells you a lot, precisely because you didn't expect it.

The pattern: information is inversely related to probability. Events you expect carry little information. Events that shock you carry a lot. Shannon formalized this with a single equation:

I(x) = -log₂(p(x))

This gives information in bits. A fair coin flip has probability 0.5, so I = -log₂(0.5) = 1 bit. A fair six-sided die roll: I = -log₂(1/6) ≈ 2.58 bits. A loaded die where one face appears 90% of the time? That common face gives only 0.15 bits; the rare faces give much more. The negative log ensures that certain events (p=1) give zero bits, impossible events (p→0) give infinite bits, and everything else falls smoothly between.

But we rarely care about the information of a single event. We want the expected information over an entire distribution — the average surprise. This is entropy:

H(X) = -∑ p(x) log₂(p(x))

Entropy has two extremes. A uniform distribution (all outcomes equally likely) maximizes entropy — maximum uncertainty, maximum average surprise. A deterministic distribution (one outcome with probability 1) has zero entropy — no uncertainty, no surprise. If you've used entropy as a splitting criterion in decision trees, you've already seen this: information gain is the reduction in entropy after a split. And if you've adjusted softmax temperature, you've controlled the entropy of an output distribution — high temperature pushes toward uniform (high entropy), low temperature pushes toward deterministic (low entropy).
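The temperature effect is easy to see in numbers. Here's a minimal sketch — the `softmax` and `entropy_bits` helpers are standalone illustrations, not part of the code blocks below:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher T flattens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

logits = np.array([2.0, 1.0, 0.5, -1.0])
for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
    print(f"T = {T:>4}: entropy = {entropy_bits(softmax(logits, T)):.4f} bits")
# Low T -> near-deterministic (entropy -> 0)
# High T -> near-uniform (entropy -> log2(4) = 2 bits)
```

The same four logits span nearly the full entropy range as temperature varies — temperature is an entropy dial.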

Code Block 1: Information Content and Entropy from Scratch

import numpy as np

def information_content(p):
    """Bits of information from an event with probability p."""
    return -np.log2(np.clip(p, 1e-12, 1.0))

def entropy(probs):
    """Shannon entropy H(X) in bits."""
    probs = np.array(probs, dtype=float)
    probs = probs[probs > 0]  # 0·log(0) = 0 by convention
    return -np.sum(probs * np.log2(probs))

# --- Examples ---
# Fair coin: maximum entropy for 2 outcomes
print(f"Fair coin entropy: {entropy([0.5, 0.5]):.4f} bits")        # 1.0
print(f"Biased coin (90/10): {entropy([0.9, 0.1]):.4f} bits")      # 0.469
print(f"Certain coin (100/0): {entropy([1.0, 0.0]):.4f} bits")     # 0.0

# Fair die: maximum entropy for 6 outcomes
print(f"Fair die entropy: {entropy([1/6]*6):.4f} bits")             # 2.585
print(f"Loaded die: {entropy([0.5,0.1,0.1,0.1,0.1,0.1]):.4f}")    # 2.161

# Binary entropy as a function of p
for p in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    h = entropy([p, 1-p]) if p > 0 else 0.0
    print(f"  p={p:.1f}: H = {h:.4f} bits")
# Peaks at p=0.5 (1.0 bit) — the famous binary entropy curve

2. Cross-Entropy — The Cost of Being Wrong

Entropy tells you the minimum bits needed to encode messages from a distribution if you know the true distribution. But what if you don't? What if the true distribution is p but your model thinks it's q? You'll build your encoding scheme around q, and because q is wrong, you'll waste bits. The average bits per message under this mismatch is cross-entropy:

H(p, q) = -∑ p(x) log q(x)

The key insight: cross-entropy is always at least as large as entropy. H(p, q) ≥ H(p), with equality only when q = p. The gap between them has a name — KL divergence (we'll get there in the next section):

H(p, q) - H(p) = D_KL(p || q)

This is the "wasted bits" — the extra cost of using the wrong model. Now here's why this matters for machine learning: when you train a classifier, the true distribution p is fixed (it's determined by the labels). So minimizing cross-entropy H(p, q) with respect to q is exactly the same as minimizing KL divergence D_KL(p || q). The entropy term H(p) is a constant that vanishes in the gradient.
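The decomposition H(p, q) = H(p) + D_KL(p || q) is worth checking numerically. A quick self-contained verification, computing all three quantities in bits:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H(p) in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy_bits(p, q):
    """Cross-entropy H(p, q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_bits(p, q):
    """KL divergence D_KL(p || q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

gap = cross_entropy_bits(p, q) - entropy_bits(p)
kl = kl_bits(p, q)
print(f"H(p,q) - H(p) = {gap:.6f} bits")
print(f"D_KL(p||q)    = {kl:.6f} bits")   # the same number
```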

This is why cross-entropy is the standard classification loss. It's not an arbitrary choice — it's the information-theoretically optimal way to make your model's distribution match the true distribution. If you've read loss functions from scratch, you saw cross-entropy introduced there as a practical loss. Now you see the deeper reason: it minimizes wasted bits.

For binary classification with true label y ∈ {0, 1} and predicted probability ŷ:

H = -[y log(ŷ) + (1-y) log(1-ŷ)]

For multi-class with one-hot true labels y_k and predicted probabilities ŷ_k:

H = -∑_k y_k log(ŷ_k)

Since only one y_k = 1 and the rest are zero, this simplifies to -log(ŷ_c) where c is the correct class. That's the negative log-probability of the right answer — exactly Shannon's information content applied to your model's confidence.

Code Block 2: Cross-Entropy Loss Derived from Information Theory

import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    q = np.clip(q, 1e-12, 1.0)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def binary_cross_entropy(y_true, y_pred):
    """BCE for a single sample (in nats, as used in ML)."""
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_onehot, y_pred):
    """Multi-class cross-entropy for a single sample (in nats)."""
    y_pred = np.clip(y_pred, 1e-12, 1.0)
    return -np.sum(y_onehot * np.log(y_pred))

# --- Show cross-entropy >= entropy ---
p = [0.7, 0.2, 0.1]
q_good = [0.6, 0.3, 0.1]    # close to p
q_bad  = [0.1, 0.1, 0.8]    # very different from p

print(f"H(p)       = {cross_entropy(p, p):.4f} bits (entropy)")
print(f"H(p,q_good)= {cross_entropy(p, q_good):.4f} bits")
print(f"H(p,q_bad) = {cross_entropy(p, q_bad):.4f} bits")

# --- Binary cross-entropy example ---
# True label y=1, model predicts 0.9 (confident and correct)
print(f"\nBCE(y=1, pred=0.9) = {binary_cross_entropy(1, 0.9):.4f}")
# True label y=1, model predicts 0.1 (confident and WRONG)
print(f"BCE(y=1, pred=0.1) = {binary_cross_entropy(1, 0.1):.4f}")

# --- Multi-class example ---
y_true = [0, 1, 0]  # class 1 is correct
y_pred = [0.1, 0.8, 0.1]
print(f"\nCCE = {categorical_cross_entropy(y_true, y_pred):.4f}")
# This equals -log(0.8) — just the negative log of the correct class

3. KL Divergence — Measuring Distance Between Distributions

We've seen that cross-entropy measures the total bits needed when your model is wrong, and that the "wasted bits" portion is KL divergence. Let's formalize it:

D_KL(p || q) = ∑ p(x) log(p(x) / q(x))

KL divergence has three important properties:

  1. Non-negative: D_KL(p || q) ≥ 0 always (Gibbs' inequality). You can never waste negative bits.
  2. Zero iff identical: D_KL(p || q) = 0 if and only if p = q everywhere.
  3. NOT symmetric: D_KL(p || q) ≠ D_KL(q || p) in general. This is crucial.

That asymmetry isn't a bug — it's a feature, and understanding it is one of the deepest practical insights in modern ML. Consider fitting a simple model q (say, a single Gaussian) to a complex distribution p (say, bimodal with two peaks). Forward KL, D_KL(p || q), averages over p, so q is penalized wherever p has mass that q misses — the optimal q spreads out to cover both modes. Reverse KL, D_KL(q || p), averages over q, so q is penalized for putting mass where p has none — the optimal q locks onto a single mode.

This asymmetry shows up everywhere. VAEs minimize D_KL(q(z|x) || p(z)) — a reverse-direction KL that keeps the encoder's posterior from placing mass outside the prior. RLHF adds a KL penalty D_KL(π_new || π_ref) to prevent the fine-tuned policy from straying too far from the reference model. And DPO implicitly optimizes the same KL-constrained objective but derives a closed-form solution. In every case, the direction of KL determines what the optimization prioritizes.
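As a sketch of how such a penalty is typically estimated in practice — a per-token Monte Carlo estimate from the log-probs both policies assign to sampled tokens. The function name and all numbers here are illustrative, not any particular library's API:

```python
import numpy as np

def kl_penalty_per_token(logprobs_new, logprobs_ref):
    """Simple per-token Monte Carlo estimate of D_KL(pi_new || pi_ref):
    the log-prob gap on tokens sampled from the new policy."""
    return np.asarray(logprobs_new) - np.asarray(logprobs_ref)

# Log-probs each policy assigns to the same sampled 4-token completion
# (invented numbers, not from a real model)
lp_new = np.log([0.40, 0.25, 0.60, 0.10])
lp_ref = np.log([0.30, 0.20, 0.50, 0.08])

beta = 0.1  # KL penalty coefficient
kl_est = kl_penalty_per_token(lp_new, lp_ref)
print(f"Per-token KL estimates: {np.round(kl_est, 4)}")
print(f"KL penalty term: beta * mean = {beta * kl_est.mean():.4f}")
```

Here the fine-tuned policy is more confident than the reference on every token, so the estimated KL is positive and the penalty pulls it back toward π_ref.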

Code Block 3: KL Divergence and Its Two Directions

import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    q = np.clip(q, 1e-12, 1.0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# --- Asymmetry demonstration ---
p = [0.4, 0.3, 0.2, 0.1]  # true distribution
q = [0.25, 0.25, 0.25, 0.25]  # uniform model

print(f"D_KL(p || q) = {kl_divergence(p, q):.4f} nats (forward)")
print(f"D_KL(q || p) = {kl_divergence(q, p):.4f} nats (reverse)")
print(f"They differ! KL is NOT symmetric.\n")

# --- Forward vs reverse KL: fitting to bimodal ---
# Bimodal target: mixture of two Gaussians
x = np.linspace(-6, 6, 1000)
dx = x[1] - x[0]
p_bimodal = 0.5 * np.exp(-0.5*(x+2)**2) + 0.5 * np.exp(-0.5*(x-2)**2)
p_bimodal /= p_bimodal.sum() * dx  # normalize

# Candidate Gaussians: centered between modes vs on one mode
q_cover = np.exp(-0.5 * (x/2.5)**2)  # wide, centered at 0
q_cover /= q_cover.sum() * dx
q_seek = np.exp(-0.5 * (x-2)**2)     # narrow, centered on right mode
q_seek /= q_seek.sum() * dx

p_d, qc_d, qs_d = p_bimodal * dx, q_cover * dx, q_seek * dx

fwd_cover = kl_divergence(p_d, qc_d)
fwd_seek  = kl_divergence(p_d, qs_d)
rev_cover = kl_divergence(qc_d, p_d)
rev_seek  = kl_divergence(qs_d, p_d)

print("Forward KL D_KL(p||q) — mode-covering wins:")
print(f"  q_cover: {fwd_cover:.4f}  q_seek: {fwd_seek:.4f}")
print("Reverse KL D_KL(q||p) — mode-seeking wins:")
print(f"  q_cover: {rev_cover:.4f}  q_seek: {rev_seek:.4f}")

4. Mutual Information — What Do Variables Share?

So far we've measured the information in a single variable (entropy) and the distance between two distributions (KL divergence). Mutual information answers a different question: how much does knowing one variable tell you about another?

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Equivalently, it's the KL divergence between the joint distribution and the product of marginals:

I(X; Y) = D_KL(p(x,y) || p(x)p(y))

The intuition: if X and Y are independent, then p(x,y) = p(x)p(y) and I(X; Y) = 0. Knowing Y tells you nothing about X. If Y is a deterministic function of X, then I(X; Y) = H(X). Knowing Y tells you everything.
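Both extremes can be checked directly from a joint probability table, using the KL form of mutual information (a small standalone helper, separate from the code block below):

```python
import numpy as np

def mi_from_joint(joint):
    """Mutual information in nats from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

# Independent variables: joint = product of marginals, so MI ~ 0
indep = np.outer([0.5, 0.5], [0.3, 0.7])
print(f"Independent:   I = {mi_from_joint(indep):.4f} nats")

# Y a deterministic copy of X: MI = H(X) = log(2)
det = np.array([[0.5, 0.0],
                [0.0, 0.5]])
print(f"Deterministic: I = {mi_from_joint(det):.4f} nats")   # log(2) ≈ 0.6931
```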

This makes mutual information a natural criterion for feature selection: rank features by how much information they share with the target. Features with high MI are the ones that actually help predict the label. If you've built decision trees, you've already used this idea — information gain at each split is exactly the mutual information between the splitting feature (thresholded) and the target label.
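To make the decision-tree connection concrete, here's a small sketch computing the information gain of a single threshold split on synthetic data (standalone, separate from Code Block 4 below):

```python
import numpy as np

def label_entropy(labels):
    """Entropy in bits of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
y = (x > 0).astype(int)              # label fully determined by the threshold

# Information gain of the split "x > 0" = H(y) - weighted child entropies
left, right = y[x <= 0], y[x > 0]
n = len(y)
gain = label_entropy(y) - (len(left) / n * label_entropy(left)
                           + len(right) / n * label_entropy(right))
print(f"H(y) = {label_entropy(y):.4f} bits")
print(f"Information gain of 'x > 0': {gain:.4f} bits")
# The split determines the label exactly, so both children are pure and
# the gain equals H(y) — the MI between the thresholded feature and the label
```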

Mutual information also appears in modern deep learning. Contrastive learning methods like SimCLR and CLIP maximize a lower bound on MI (the InfoNCE objective) between different views of the same data. Good embeddings are ones that preserve the MI between raw input and the downstream task — they keep the relevant information and discard the noise.

Code Block 4: Mutual Information and Feature Selection

import numpy as np

def mutual_information(x, y, bins=20):
    """Estimate MI between continuous x and discrete y using binning."""
    x_binned = np.digitize(x, np.linspace(x.min(), x.max(), bins))

    # Joint and marginal distributions
    joint = np.zeros((bins + 1, int(y.max()) + 1))
    for xi, yi in zip(x_binned, y.astype(int)):
        joint[xi, yi] += 1
    joint /= joint.sum()

    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)

    # MI = sum p(x,y) * log(p(x,y) / (p(x)*p(y)))
    mask = joint > 0
    independent = px * py
    mi = np.sum(joint[mask] * np.log(joint[mask] / independent[mask]))
    return mi

# --- Feature selection on synthetic data ---
rng = np.random.default_rng(42)
n = 2000

# Feature 0: strongly predictive of label
x0 = rng.normal(0, 1, n)
y = (x0 > 0).astype(int)  # label determined by x0

# Feature 1: weakly predictive (noisy copy)
x1 = x0 + rng.normal(0, 3, n)

# Feature 2: pure noise (independent of label)
x2 = rng.normal(0, 1, n)

features = [x0, x1, x2]
names = ["x0 (strong)", "x1 (weak)", "x2 (noise)"]

print("Mutual Information with target label:")
for name, feat in zip(names, features):
    mi = mutual_information(feat, y)
    print(f"  {name}: MI = {mi:.4f} nats")
# x0 has highest MI — it determines the label
# x2 has near-zero MI — it's independent noise

5. Perplexity — Evaluating Language Models

Every concept we've built leads to one of the most important metrics in natural language processing: perplexity. If your language model assigns cross-entropy H(p, q) to a test corpus, then perplexity is:

PPL = 2^H(p, q)

Or equivalently (since ML uses natural log): PPL = e^(cross-entropy loss)

The interpretation is elegant: a perplexity of 10 means the model is "as uncertain as if it were choosing uniformly among 10 options at each step." Lower perplexity means the model is more confident about correct predictions — it's less "perplexed" by the text.
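That interpretation is exact for uniform predictions: a model that spreads probability evenly over k options has perplexity exactly k.

```python
import numpy as np

# A model exactly as uncertain as a uniform choice among k options
for k in [2, 10, 100, 50000]:
    loss = -np.log(1.0 / k)          # per-token cross-entropy in nats
    print(f"k = {k:>5}: loss = {loss:.4f} nats, PPL = {np.exp(loss):,.1f}")
# Perplexity recovers k — "choosing uniformly among k options"
```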

The progression across model architectures is dramatic:

Model                         Perplexity   Interpretation
Random (uniform over vocab)   ~50,000      Choosing randomly among all words
Unigram (word frequencies)    ~1,000       Knows which words are common
Bigram                        ~200         Knows which word pairs occur
LSTM                          ~60          Captures long-range patterns
Transformer (GPT-2 scale)     ~20          Deep contextual understanding

That's a 2,500× improvement from random to transformer — each architectural advance learns increasingly sophisticated patterns about language. And notice: perplexity and cross-entropy loss are the same thing on different scales. When you watch your language model's training loss decrease from 10.8 to 3.0, it's going from PPL = e^10.8 ≈ 50,000 (random) to PPL = e^3.0 ≈ 20 (transformer). Every reduction in loss is a reduction in the model's "surprise" at real language.

One important caveat, discussed in tokenization from scratch: you can't directly compare perplexity across models with different tokenizers. A model with a 32K vocabulary and a model with a 50K vocabulary are playing different games — more tokens means higher "baseline" perplexity. Always compare perplexity within the same tokenization scheme, or normalize to per-character or per-byte perplexity for fair comparisons. This distinction matters for decoding strategies too — perplexity measures model quality, while decoding strategy determines generation quality.

Code Block 5: Computing Perplexity for Language Models

import numpy as np

def perplexity_from_logprobs(log_probs):
    """Perplexity from per-token log-probabilities (base e)."""
    avg_nll = -np.mean(log_probs)
    return np.exp(avg_nll)

def perplexity_from_loss(cross_entropy_loss):
    """Perplexity from average cross-entropy loss."""
    return np.exp(cross_entropy_loss)

# --- Simulate three language models on the same 10-token sequence ---
vocab_size = 50000

# Model 1: Random (uniform over vocabulary)
random_logprobs = np.full(10, np.log(1.0 / vocab_size))
ppl_random = perplexity_from_logprobs(random_logprobs)

# Model 2: Bigram (moderately concentrated predictions)
bigram_logprobs = np.log([0.005, 0.01, 0.008, 0.02, 0.003,
                          0.015, 0.007, 0.012, 0.009, 0.006])
ppl_bigram = perplexity_from_logprobs(bigram_logprobs)

# Model 3: Transformer (highly concentrated predictions)
transformer_logprobs = np.log([0.15, 0.08, 0.12, 0.20, 0.05,
                               0.10, 0.07, 0.18, 0.09, 0.06])
ppl_transformer = perplexity_from_logprobs(transformer_logprobs)

print("Perplexity comparison:")
print(f"  Random:      PPL = {ppl_random:,.0f}")
print(f"  Bigram:      PPL = {ppl_bigram:,.0f}")
print(f"  Transformer: PPL = {ppl_transformer:.1f}")

# --- Connection: loss and perplexity are interchangeable ---
loss = 3.0
print(f"\nCross-entropy loss {loss:.1f} = PPL {perplexity_from_loss(loss):.1f}")
loss = 1.5
print(f"Cross-entropy loss {loss:.1f} = PPL {perplexity_from_loss(loss):.1f}")
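The tokenizer caveat above can be made concrete: convert per-token loss to per-byte perplexity by dividing by the tokenizer's average bytes per token. All numbers here are invented for illustration:

```python
import numpy as np

def per_byte_perplexity(per_token_loss_nats, avg_bytes_per_token):
    """Convert a per-token cross-entropy loss into per-byte perplexity."""
    return np.exp(per_token_loss_nats / avg_bytes_per_token)

# Two hypothetical models with different tokenizers: Model B has a higher
# per-token loss, but its larger vocabulary yields longer tokens.
ppl_a = per_byte_perplexity(per_token_loss_nats=2.9, avg_bytes_per_token=3.5)
ppl_b = per_byte_perplexity(per_token_loss_nats=3.1, avg_bytes_per_token=4.2)

print(f"Model A (32K vocab): per-byte PPL = {ppl_a:.3f}")
print(f"Model B (50K vocab): per-byte PPL = {ppl_b:.3f}")
# On the per-byte scale, B compares favorably despite its higher per-token loss
```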

6. Information Theory in Modern AI — The Big Connections

Here's the payoff. Nearly every algorithm we've studied in this series is optimizing an information-theoretic objective. They all speak the same language — the language of bits, surprise, and divergence. Here's the unifying view:

Algorithm                    What It Optimizes        Information-Theoretic Form
Classification (any model)   Cross-entropy loss       Minimize D_KL(p_true || q_model)
VAEs                         ELBO                     Minimize reconstruction H(p,q) + KL regularization
Diffusion models             Noise prediction loss    Minimize cross-entropy in noise-prediction space
Contrastive learning         InfoNCE                  Maximize lower bound on MI(view1, view2)
Decision trees               Information gain         Maximize MI(split feature, label)
RLHF / DPO                   Reward with KL penalty   max R(π) - β D_KL(π || π_ref)
Language models              Next-token loss          Minimize per-token cross-entropy = log(PPL)

The entire journey from loss functions to diffusion models is really about one thing: minimizing surprise. Classification minimizes the surprise when your model sees a label. Language models minimize surprise at the next token. VAEs minimize surprise at reconstructed data while keeping the latent space unsurprising. Even RLHF is about optimizing reward while not being too surprisingly different from the reference policy.

Shannon's 1948 insight — that information equals surprise, and that we can quantify surprise with logarithms — turned out to be the mathematical foundation of modern AI. Every cross-entropy loss you've computed, every perplexity score you've checked, every KL penalty you've applied is a direct descendant of that single, beautiful idea.

Code Block 6: The Unifying Table

# The information-theoretic view of machine learning

connections = [
    ("Classification",     "Cross-entropy loss",     "min D_KL(p_true || q_model)"),
    ("VAEs",               "ELBO",                   "min H(p,q) + D_KL(q(z|x)||p(z))"),
    ("Diffusion Models",   "Noise prediction",       "min cross-entropy in noise space"),
    ("Contrastive (CL)",   "InfoNCE",                "max MI lower bound"),
    ("Decision Trees",     "Information gain",       "max MI(feature, label)"),
    ("RLHF / DPO",         "Reward + KL penalty",    "max R - beta * D_KL(pi||pi_ref)"),
    ("Language Models",    "Next-token prediction",  "min per-token H(p,q) = log(PPL)"),
]

print(f"{'Algorithm':<22} {'Loss/Objective':<24} {'Info-Theoretic Form'}")
print("-" * 78)
for algo, loss, info in connections:
    print(f"{algo:<22} {loss:<24} {info}")

print("\nOne language. One framework. Every loss function is about")
print("minimizing surprise — the core insight of information theory.")

Try It: Entropy Explorer

Drag the sliders to set probabilities for each outcome. Probabilities auto-normalize to sum to 1. Watch how entropy changes — uniform = maximum entropy, concentrated = low entropy. The bar heights show optimal code lengths: likely events get short codes, rare events get long codes.
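The code-length claim is Shannon's source-coding idea: an event with probability p gets a code of about ⌈-log₂ p⌉ bits. A quick sketch of the lengths only (not an actual encoder):

```python
import numpy as np

probs = np.array([0.5, 0.25, 0.125, 0.125])          # dyadic distribution
lengths = np.ceil(-np.log2(probs)).astype(int)       # Shannon code lengths

for p, l in zip(probs, lengths):
    print(f"p = {p:.3f} -> code length {l} bits")

avg_len = np.sum(probs * lengths)
H = -np.sum(probs * np.log2(probs))
print(f"Average code length: {avg_len:.3f} bits (entropy: {H:.3f} bits)")
# For dyadic probabilities the Shannon code is optimal: average length = entropy
```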

Try It: KL Divergence Visualizer

Adjust two Gaussian distributions P and Q. Watch how KL divergence changes — and notice the asymmetry: D_KL(P||Q) ≠ D_KL(Q||P). The shaded region shows where the divergence comes from.


References & Further Reading