Loss Functions from Scratch

The Error Signal

Every neural network you’ve ever used learned by being wrong. Not a little wrong — catastrophically, embarrassingly wrong at first. A freshly initialized network looking at a photo of a cat might declare with 99.8% confidence that it’s a forklift. What turns that confused pile of random weights into something useful isn’t clever architecture or massive data. It’s the loss function — the single number that tells the model exactly how badly it messed up.

In our micrograd post, we built an autograd engine and trained a tiny neural network. At the top of the computation graph sat a loss value with gradient = 1.0, driving all of backpropagation. But we used a loss function — mean squared error — without explaining why that formula. In our embeddings post, we trained Word2Vec with a different loss — binary cross-entropy — and didn’t explain why that one either. And in the softmax post, we built the function that produces probabilities but never talked about how to measure whether those probabilities are any good.

Today we fill that gap. We’ll build three loss functions from scratch — mean squared error, binary cross-entropy, and categorical cross-entropy — see why cross-entropy dominates classification, and discover the elegant gradient cancellation that makes it all work. Along the way, we’ll take a detour through information theory to understand why it’s called “cross-entropy” in the first place.

The punchline is surprising: choosing the right loss function isn’t just about measuring error differently. Different loss functions produce different gradients, and the shape of those gradients determines whether your model learns fast or barely learns at all.

Starting Simple: Mean Squared Error

Let’s start where most textbooks start: regression. You have a model that predicts a number, and you want to measure how far off those predictions are from the truth. The most intuitive approach: take the difference, square it, and average.

import numpy as np

def mse_loss(predictions, targets):
    """Mean squared error: average of squared residuals."""
    return np.mean((predictions - targets) ** 2)

# A model predicting house prices (in $100k)
predictions = np.array([2.5, 0.0, 2.1, 7.8])
targets     = np.array([3.0, -0.5, 2.0, 7.5])

loss = mse_loss(predictions, targets)
print(f"MSE: {loss:.4f}")
# MSE: 0.1500

Why square the errors? Three good reasons. First, squaring makes everything positive — errors of −0.5 and +0.5 are equally bad. Second, squaring punishes large errors disproportionately: an error of 2 costs four times as much as an error of 1. Third, the squared function is smooth and differentiable everywhere, which gradient descent loves.

The gradient is equally simple. For a single prediction:

# Gradient of MSE w.r.t. a single prediction
# ∂MSE/∂ŷ = (2/n)(ŷ - y)

def mse_gradient(prediction, target, n):
    """Gradient of MSE with respect to prediction."""
    return (2 / n) * (prediction - target)

# If we predicted 2.5 but the target was 3.0:
grad = mse_gradient(2.5, 3.0, n=4)
print(f"Gradient: {grad:.4f}")
# Gradient: -0.2500
# Negative → push the prediction UP toward 3.0. Makes sense!

The gradient points in the right direction and is proportional to the size of the error. For regression, this is exactly what we want. Up to the 1/n averaging, this is the loss that drove training in our micrograd post: loss = sum((pred - target) ** 2 for ...).
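To see this gradient actually drive learning, here's a minimal sketch (pure Python; the learning rate of 0.5 is an illustrative choice): repeated gradient-descent steps pull a single prediction toward its target.

```python
# A minimal sketch: gradient descent on one prediction using the MSE gradient above.
prediction, target, n = 2.5, 3.0, 4
learning_rate = 0.5

for step in range(5):
    grad = (2 / n) * (prediction - target)  # ∂MSE/∂ŷ, negative here
    prediction -= learning_rate * grad      # step against the gradient
    print(f"step {step}: prediction = {prediction:.4f}")
# The prediction climbs from 2.5 toward the target 3.0 on every step.
```

Each step shrinks the remaining error by a fixed fraction, which is the hallmark of a quadratic loss near its minimum.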

But what happens when we leave regression behind and try to classify things?

When Squared Error Meets Classification

Binary classification: the model should output a probability between 0 and 1. Is this email spam or not? Is this pixel part of a cat or not? We squeeze the model’s raw output through a sigmoid function to get a probability, then compare it to the true label (0 or 1).

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Model's raw output (logit) and true label
z = -4.5     # model's raw score
y = 1        # true label: this IS spam

p = sigmoid(z)  # predicted probability
print(f"Prediction: {p:.6f}")
# Prediction: 0.010987
# Model says 1.1% chance of spam. True label is 1. It's VERY wrong.

The model is 99% confident this email isn’t spam, but it actually is. Let’s try using MSE to punish this mistake:

# MSE on the probability
mse = (p - y) ** 2
print(f"MSE loss: {mse:.6f}")
# MSE loss: 0.978147
# Loss is high (close to 1). Good — the model is very wrong.

The loss is high, which seems right. But the real question isn’t about the loss value — it’s about the gradient. How hard will this loss push the model to fix its mistake?

# To update the model's weights, we need ∂Loss/∂z (gradient w.r.t. the logit)
# Chain rule: ∂MSE/∂z = ∂MSE/∂p · ∂p/∂z
#   ∂MSE/∂p = 2(p - y)
#   ∂p/∂z = sigmoid'(z) = p(1 - p)
# So: ∂MSE/∂z = 2(p - y) · p(1 - p)

def sigmoid_derivative(p):
    return p * (1 - p)

def mse_grad_wrt_logit(p, y):
    """Gradient of MSE w.r.t. the pre-sigmoid logit z."""
    return 2 * (p - y) * sigmoid_derivative(p)

grad = mse_grad_wrt_logit(p, y)
print(f"Prediction: {p:.4f} | Target: {y} | MSE gradient: {grad:.6f}")
# Prediction: 0.0110 | Target: 1 | MSE gradient: -0.021494

Look at that gradient: −0.021. The model is almost maximally wrong — predicting 1% when the answer is 100% — but the gradient is tiny. Why?

Because of sigmoid_derivative(p). The sigmoid derivative is p(1 − p), which is largest at p = 0.5 and vanishes as p approaches 0 or 1. When the model is confidently wrong (p ≈ 0.01), the sigmoid is in its flat region, and the gradient nearly disappears.

It’s like a teacher who speaks up when you’re slightly confused but goes silent when you’re catastrophically wrong — exactly backwards from what learning needs.

Let’s see this across the full range:

Prediction (p)   Target (y)   σ′(z) = p(1−p)   MSE Gradient   Verdict
0.01             1            0.0099           −0.0196        Almost no push!
0.10             1            0.0900           −0.1620        Weak push
0.50             1            0.2500           −0.2500        Moderate push
0.90             1            0.0900           −0.0180        Small push (nearly right)
0.99             1            0.0099           −0.0002        Tiny push (almost perfect)

The strongest gradient is at p = 0.5 — when the model is merely uncertain, not when it’s confidently wrong. A model stuck at p = 0.01 for a positive example would take forever to escape, because the gradient pushing it out is 12× weaker than the gradient at p = 0.5.

We need a loss function whose gradient doesn’t get strangled by the sigmoid derivative. We need cross-entropy.

Cross-Entropy: The Logarithmic Fix

There are two ways to arrive at binary cross-entropy. Both land at the same formula, but each reveals a different piece of the intuition.

Path 1: Maximum Likelihood

Instead of “minimize the distance from the target,” think: “maximize the probability of observing the correct label.” If the model outputs p as the probability of class 1, the likelihood of observing the true label y is:

# Likelihood = p^y · (1-p)^(1-y)
# When y=1: Likelihood = p      (we want p to be high)
# When y=0: Likelihood = 1-p    (we want p to be low)

# Take the log (turns products into sums, easier to optimize):
# log-likelihood = y·log(p) + (1-y)·log(1-p)

# Negate it (because we minimize loss, not maximize likelihood):
# BCE = -[y·log(p) + (1-y)·log(1-p)]
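A quick numerical check of that chain of steps (a sketch, with an arbitrary example value for p): the BCE formula really is the negative log-likelihood.

```python
import numpy as np

p, y = 0.8, 1  # arbitrary example: model says 80%, true label is 1

likelihood = p**y * (1 - p)**(1 - y)              # = p when y=1
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # the BCE formula

print(f"likelihood       = {likelihood:.4f}")           # 0.8000
print(f"-log(likelihood) = {-np.log(likelihood):.4f}")  # 0.2231
print(f"BCE              = {bce:.4f}")                  # 0.2231
```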

Path 2: The Intuitive View

Forget the derivation — just look at what each piece does. When the true label is 1:

# When y=1: BCE = -log(p)
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"  p={p:.2f} → loss = {-np.log(p):.4f}")
#   p=0.99 → loss = 0.0101     ← confident and RIGHT: tiny penalty
#   p=0.90 → loss = 0.1054     ← mostly right: small penalty
#   p=0.50 → loss = 0.6931     ← coin flip: moderate penalty
#   p=0.10 → loss = 2.3026     ← mostly wrong: large penalty
#   p=0.01 → loss = 4.6052     ← confident and WRONG: massive penalty

The −log curve shoots up toward infinity as p approaches zero. A model that assigns near-zero probability to the correct class pays an enormous price. This is exactly the asymmetry we need: infinite punishment for confident mistakes, near-zero punishment for confident correctness.

Here’s the implementation:

def binary_cross_entropy(p, y):
    """Binary cross-entropy loss for a single example."""
    epsilon = 1e-15  # numerical safety: avoid log(0)
    p = np.clip(p, epsilon, 1 - epsilon)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same scenario as before: p=0.011, y=1
loss = binary_cross_entropy(0.011, 1)
print(f"BCE loss: {loss:.4f}")
# BCE loss: 4.5099
# Much higher than MSE's 0.978 — cross-entropy screams louder when wrong
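In practice you average the per-example losses over a batch. A hedged sketch of the batched version (the exact averaging convention varies by framework):

```python
import numpy as np

def binary_cross_entropy_batch(p, y):
    """Mean binary cross-entropy over a batch."""
    epsilon = 1e-15  # numerical safety: avoid log(0)
    p = np.clip(p, epsilon, 1 - epsilon)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

p = np.array([0.9, 0.2, 0.7, 0.05])  # predicted probabilities
y = np.array([1,   0,   1,   0])     # true labels
print(f"Batch BCE: {binary_cross_entropy_batch(p, y):.4f}")
# Batch BCE: 0.1841
```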

The Gradient Magic

Now the gradient. This is where the real magic happens.

# ∂BCE/∂z via chain rule:
#   ∂BCE/∂p = -y/p + (1-y)/(1-p)
#   ∂p/∂z   = p(1-p)            ← the sigmoid derivative
#
# Multiply them:
#   ∂BCE/∂z = [-y/p + (1-y)/(1-p)] · p(1-p)
#           = -y(1-p) + (1-y)p
#           = -y + yp + p - yp
#           = p - y
#
# THE SIGMOID DERIVATIVE CANCELS OUT.

def bce_grad_wrt_logit(p, y):
    """Gradient of BCE w.r.t. the pre-sigmoid logit z."""
    return p - y  # That's it. No sigmoid derivative anywhere.

# Model predicts p=0.011 for a positive example (y=1):
grad_bce = bce_grad_wrt_logit(0.011, 1)
grad_mse = mse_grad_wrt_logit(0.011, 1)

print(f"BCE gradient: {grad_bce:.4f}")
print(f"MSE gradient: {grad_mse:.4f}")
# BCE gradient: -0.9890     ← STRONG push! Nearly -1
# MSE gradient: -0.0215     ← Feeble whisper

Read that again: the sigmoid derivative vanishes from the gradient entirely. The log in cross-entropy was born to cancel the exp in sigmoid. What’s left is beautifully simple: p − y. Predicted minus actual. When the model is confident and wrong, the gradient is close to ±1 — a full-strength correction signal. When the model is confident and right, the gradient is close to 0 — leave it alone.
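If the algebra feels too slick to trust, you can verify the cancellation numerically. A sketch using a central finite difference on the loss as a function of the logit:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_of_logit(z, y):
    """BCE expressed directly as a function of the logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, h = -4.5, 1, 1e-6
numeric  = (bce_of_logit(z + h, y) - bce_of_logit(z - h, y)) / (2 * h)
analytic = sigmoid(z) - y  # the claimed p - y

print(f"Numerical  ∂BCE/∂z: {numeric:.6f}")
print(f"Analytical p - y:   {analytic:.6f}")
# Both print -0.989013: the finite difference matches p - y.
```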

This is exactly the loss we used in our embeddings post to train Word2Vec with negative sampling. Now you know why it works so well.

The Information Theory Connection

Cross-entropy isn’t just a clever hack to fix vanishing gradients. It has a deep theoretical justification that gives it its name. A quick detour into information theory reveals why minimizing cross-entropy is the mathematically optimal objective for classification.

Surprise and Entropy

Information theory starts with a simple idea: surprising events carry more information. If I tell you the sun rose this morning, you learn nothing — that was going to happen anyway. But if I tell you it snowed in July in Miami, that’s genuinely informative. The “surprise” of an event with probability p is −log(p):

# Surprise = -log(p)
# Using natural log (nats), but log₂ gives bits

print(f"Sun rises (p=0.999):  surprise = {-np.log(0.999):.4f} nats")
print(f"Snow in Miami (p=0.001): surprise = {-np.log(0.001):.4f} nats")
# Sun rises (p=0.999):  surprise = 0.0010 nats
# Snow in Miami (p=0.001): surprise = 6.9078 nats

Entropy is the average surprise of a probability distribution — how unpredictable things are overall:

def entropy(p_dist):
    """Shannon entropy of a probability distribution (in nats)."""
    p_dist = np.array(p_dist)
    p_dist = p_dist[p_dist > 0]  # skip zeros (0·log(0) = 0 by convention)
    return -np.sum(p_dist * np.log(p_dist))

# Fair coin: maximum uncertainty
print(f"Fair coin [0.5, 0.5]:    H = {entropy([0.5, 0.5]):.4f} nats")
# Fair coin [0.5, 0.5]:    H = 0.6931 nats

# Loaded coin: mostly predictable
print(f"Loaded coin [0.99, 0.01]: H = {entropy([0.99, 0.01]):.4f} nats")
# Loaded coin [0.99, 0.01]: H = 0.0560 nats

# Certain outcome: no surprise at all
print(f"Sure thing [1.0, 0.0]:   H = {entropy([1.0, 0.0]):.4f} nats")
# Sure thing [1.0, 0.0]:   H = 0.0000 nats
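One more sanity check worth running: entropy is maximized by the uniform distribution, where it equals log K for K outcomes. A sketch reusing the entropy function above:

```python
import numpy as np

def entropy(p_dist):
    """Shannon entropy in nats (same as above)."""
    p_dist = np.array(p_dist)
    p_dist = p_dist[p_dist > 0]
    return -np.sum(p_dist * np.log(p_dist))

for K in [2, 6, 100]:
    uniform = np.ones(K) / K
    print(f"K = {K:>3}: H = {entropy(uniform):.4f} nats, log(K) = {np.log(K):.4f}")
# K =   2: H = 0.6931 nats, log(K) = 0.6931
# K =   6: H = 1.7918 nats, log(K) = 1.7918
# K = 100: H = 4.6052 nats, log(K) = 4.6052
```

This is also why a freshly initialized K-class classifier typically starts with a loss near log K: its roughly uniform predictions are maximally surprised by every label.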

Cross-Entropy: Surprise Under the Wrong Model

Now the key idea. Suppose reality follows distribution P (the true labels), but your model uses distribution Q (its predictions) to anticipate events. Cross-entropy measures your average surprise when using Q to predict events that actually follow P:

def cross_entropy(p_true, q_predicted):
    """Cross-entropy H(P, Q): surprise of Q under P's reality."""
    p = np.array(p_true)
    q = np.array(q_predicted)
    q = np.clip(q, 1e-15, 1.0)  # numerical safety
    return -np.sum(p * np.log(q))

# True distribution: 90% sunny, 10% rainy
P = [0.9, 0.1]

# Model A: good forecast (80% sunny, 20% rainy)
Q_good = [0.8, 0.2]

# Model B: bad forecast (50% sunny, 50% rainy)
Q_bad = [0.5, 0.5]

print(f"Entropy of reality H(P):     {entropy(P):.4f} nats")
print(f"Cross-entropy H(P, Q_good):  {cross_entropy(P, Q_good):.4f} nats")
print(f"Cross-entropy H(P, Q_bad):   {cross_entropy(P, Q_bad):.4f} nats")
# Entropy of reality H(P):     0.3251 nats
# Cross-entropy H(P, Q_good):  0.3618 nats  ← close to H(P), a good model!
# Cross-entropy H(P, Q_bad):   0.6931 nats  ← far from H(P), a bad model!

Cross-entropy is always ≥ entropy. The gap between them is KL divergence — the “wasted bits” from using the wrong distribution:

def kl_divergence(p_true, q_predicted):
    """KL(P || Q) = H(P, Q) - H(P)"""
    return cross_entropy(p_true, q_predicted) - entropy(p_true)

print(f"KL(P || Q_good): {kl_divergence(P, Q_good):.4f} nats")
print(f"KL(P || Q_bad):  {kl_divergence(P, Q_bad):.4f} nats")
# KL(P || Q_good): 0.0367 nats  ← small gap, good model
# KL(P || Q_bad):  0.3681 nats  ← large gap, bad model

Since H(P) is a constant (it depends only on reality, not on the model), minimizing cross-entropy H(P, Q) is equivalent to minimizing KL divergence. In other words: training with cross-entropy loss teaches the model’s distribution to match reality as closely as possible. That’s not a hack — it’s information-theoretically optimal.
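To see the decomposition H(P, Q) = H(P) + KL(P‖Q) in action, sweep a few candidate models Q against the same fixed P (a sketch reusing the functions above):

```python
import numpy as np

def entropy(p_dist):
    p_dist = np.array(p_dist)
    p_dist = p_dist[p_dist > 0]
    return -np.sum(p_dist * np.log(p_dist))

def cross_entropy(p_true, q_predicted):
    q = np.clip(np.array(q_predicted), 1e-15, 1.0)
    return -np.sum(np.array(p_true) * np.log(q))

P = [0.9, 0.1]
for Q in ([0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.2, 0.8]):
    ce = cross_entropy(P, Q)
    print(f"Q={Q}: H(P,Q) = {ce:.4f} = H(P) {entropy(P):.4f} + KL {ce - entropy(P):.4f}")
# The H(P) term never changes; only the KL gap does. It is exactly
# zero when Q matches P and grows as Q drifts away.
```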

Categorical Cross-Entropy: The Softmax Partnership

Binary cross-entropy handles two classes. For multi-class problems — image classification with 1,000 categories, next-token prediction with 100,000+ vocabulary tokens — we extend to categorical cross-entropy.

The setup: the model runs its logits through softmax to produce a probability distribution over K classes. The true label is one-hot encoded: a vector of zeros with a single 1 at the correct class.

def softmax(logits):
    """Numerically stable softmax (subtract-max trick from our softmax post)."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def categorical_cross_entropy(y_true, y_pred):
    """Categorical cross-entropy for one-hot targets."""
    y_pred = np.clip(y_pred, 1e-15, 1.0)
    return -np.sum(y_true * np.log(y_pred))

# 5 classes. True class is index 2.
logits = np.array([2.0, 1.0, 0.5, -1.0, 0.1])
y_true = np.array([0, 0, 1, 0, 0])  # one-hot: class 2

probs = softmax(logits)
loss = categorical_cross_entropy(y_true, probs)

print(f"Probabilities: {np.round(probs, 4)}")
print(f"P(correct class): {probs[2]:.4f}")
print(f"Loss: {loss:.4f}")
# Probabilities: [0.5585 0.2055 0.1246 0.0278 0.0835]
# P(correct class): 0.1246
# Loss: 2.0824

Notice something elegant: since y_true is one-hot, all terms where y_k = 0 vanish. The loss simplifies to just −log(p_c) where c is the index of the true class. You only pay for the probability you assigned to the correct answer.
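You can confirm the simplification directly (a quick sketch reusing the example above):

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.5, -1.0, 0.1])
y_true = np.array([0, 0, 1, 0, 0])  # one-hot: class 2
probs = softmax(logits)

full     = -np.sum(y_true * np.log(probs))  # full one-hot sum
shortcut = -np.log(probs[2])                # just -log(p_c)
print(f"Full formula: {full:.4f} | Shortcut: {shortcut:.4f}")
# Full formula: 2.0824 | Shortcut: 2.0824
```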

The Beautiful Gradient

Here’s where the partnership between softmax and cross-entropy shines. The gradient of categorical cross-entropy with respect to each logit z_k is:

# The gradient of softmax + categorical cross-entropy:
#   ∂L/∂z_k = ŷ_k - y_k   (for ALL k simultaneously)
#
# Predicted probability minus true label. That's it.

def softmax_ce_gradient(probs, y_true):
    """Gradient of (softmax + cross-entropy) w.r.t. logits."""
    return probs - y_true

grad = softmax_ce_gradient(probs, y_true)
print("Gradients per class:")
for k in range(len(grad)):
    marker = " ← true class" if y_true[k] == 1 else ""
    print(f"  class {k}: ŷ={probs[k]:.4f}, y={y_true[k]}, grad={grad[k]:+.4f}{marker}")
# Gradients per class:
#   class 0: ŷ=0.5585, y=0, grad=+0.5585
#   class 1: ŷ=0.2055, y=0, grad=+0.2055
#   class 2: ŷ=0.1246, y=1, grad=-0.8754 ← true class
#   class 3: ŷ=0.0278, y=0, grad=+0.0278
#   class 4: ŷ=0.0835, y=0, grad=+0.0835

This is the same cancellation trick we saw in binary cross-entropy, generalized to K classes. The exponentials in softmax and the logarithm in cross-entropy annihilate each other, leaving a gradient that is exactly: predicted minus actual. Simple, strong, and well-behaved everywhere.

Softmax converts logits to probabilities. Cross-entropy measures how bad those probabilities are. Together, their gradient is just the gap between prediction and reality. This is why they appear as an inseparable pair in virtually every classifier, from MNIST digit recognition to GPT’s next-token prediction.
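To watch that gradient do its job, here's a hedged sketch: apply gradient descent directly to the logits themselves (the learning rate is an illustrative choice) and watch the probability of the true class climb.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.5, -1.0, 0.1])
y_true = np.array([0, 0, 1, 0, 0])  # true class: index 2
lr = 1.0

for step in range(5):
    probs = softmax(logits)
    logits -= lr * (probs - y_true)  # the ŷ - y gradient
    p_correct = softmax(logits)[2]
    print(f"step {step}: P(correct) = {p_correct:.4f}, loss = {-np.log(p_correct):.4f}")
# P(correct class) rises from 0.12 toward 1.0 as the loss falls.
```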

For numerically stable computation during training, you can combine the two into a single log-softmax operation that avoids computing log(exp(...)):

def log_softmax(logits):
    """Log-softmax: log(softmax(z)) computed stably."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def stable_categorical_ce(logits, class_index):
    """Categorical CE using log-softmax (numerically stable)."""
    return -log_softmax(logits)[class_index]

# Same example, stable computation
loss_stable = stable_categorical_ce(logits, class_index=2)
print(f"Stable loss: {loss_stable:.4f}")
# Stable loss: 2.0824  ← same answer, but won't overflow with large logits
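Here's why the fused version matters. With large logits the naive route overflows, while log-softmax stays finite (a sketch; np.errstate just silences the expected warnings):

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

big_logits = np.array([1000.0, 999.0, 998.0])

with np.errstate(over='ignore', invalid='ignore'):
    # exp(1000) overflows to inf, so the division yields inf/inf = nan
    naive = np.log(np.exp(big_logits) / np.sum(np.exp(big_logits)))

stable = log_softmax(big_logits)
print(f"Naive:  {naive}")                # [nan nan nan]
print(f"Stable: {np.round(stable, 4)}")  # [-0.4076 -1.4076 -2.4076]
```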

The Gradient Showdown

Let’s put it all together. The central argument of this post is that loss functions don’t just measure error differently — they push differently. Here’s the comparison that makes it concrete.

For binary classification with a sigmoid output, the gradient with respect to the pre-sigmoid logit z:

predictions = [0.01, 0.05, 0.20, 0.50, 0.80, 0.95, 0.99]
y = 1  # true label

print(f"{'p':>6} | {'MSE grad':>10} | {'BCE grad':>10} | {'Ratio':>8}")
print("-" * 44)
for p in predictions:
    g_mse = 2 * (p - y) * p * (1 - p)   # MSE gradient (includes σ' term)
    g_bce = p - y                         # BCE gradient (clean!)
    ratio = abs(g_bce / g_mse) if abs(g_mse) > 1e-10 else float('inf')
    print(f"{p:6.2f} | {g_mse:+10.6f} | {g_bce:+10.4f} | {ratio:7.1f}x")

#      p |   MSE grad |   BCE grad |    Ratio
# --------------------------------------------
#   0.01 |  -0.019602 |    -0.9900 |    50.5x
#   0.05 |  -0.090250 |    -0.9500 |    10.5x
#   0.20 |  -0.256000 |    -0.8000 |     3.1x
#   0.50 |  -0.250000 |    -0.5000 |     2.0x
#   0.80 |  -0.064000 |    -0.2000 |     3.1x
#   0.95 |  -0.004750 |    -0.0500 |    10.5x
#   0.99 |  -0.000198 |    -0.0100 |    50.5x

At p = 0.01 (model is 99% wrong), BCE’s gradient is 50× stronger than MSE’s. That’s not a minor improvement — it’s the difference between a model that learns in 10 epochs and one that stalls for hundreds.
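That claim is easy to test. A hedged sketch: train a single logit with plain gradient descent under each loss, starting from the confidently wrong z = −4.5, and count steps until the model predicts p > 0.9 (the learning rate and threshold are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def steps_to_learn(grad_fn, z0=-4.5, y=1, lr=1.0, max_steps=10_000):
    """Count gradient steps on the logit until p > 0.9."""
    z = z0
    for step in range(max_steps):
        p = sigmoid(z)
        if p > 0.9:
            return step
        z -= lr * grad_fn(p, y)
    return max_steps

mse_steps = steps_to_learn(lambda p, y: 2 * (p - y) * p * (1 - p))
bce_steps = steps_to_learn(lambda p, y: p - y)
print(f"BCE: {bce_steps} steps | MSE: {mse_steps} steps")
# BCE escapes in a couple dozen steps; MSE needs several times more.
```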

The pattern is clear: MSE's gradient is strangled by the sigmoid derivative at both extremes and peaks only in the uncertain middle, while cross-entropy's gradient scales directly with the size of the error.

This is exactly the gradient schedule you'd design by hand if you could: push hard when the model is wrong, ease off when it's right. Cross-entropy gives you this for free.

Try It: Loss Function Explorer

[Interactive demo: a slider sets the model's predicted probability and a toggle sets the true label; the plot shows how MSE (blue) and cross-entropy (red) respond, especially in the "confidently wrong" zone.]

What We Didn’t Cover

Loss functions are a vast landscape, and this post covered only the three workhorses; plenty of alternatives are worth exploring on your own.

Each has its own gradient story, and now you have the tools to analyze any of them. The question is always the same: what gradients does this loss produce, and do they point in the right direction with the right magnitude?