
Energy-Based Models from Scratch

1. The Energy-Probability Connection

Any probability distribution with a strictly positive density can be written as p(x) = exp(−E(x)) / Z — just take E(x) = −log p(x). This isn’t just a mathematical trick — it’s a complete worldview. Instead of defining probabilities directly (which requires enforcing that everything sums to one — hard), you define an energy function E(x) that assigns a scalar to each configuration, and let the Boltzmann distribution handle the rest.

The physics analogy is immediate: a ball rolling on a landscape settles into valleys. Low-energy states are stable and likely. High-energy states are unstable and rare. Ludwig Boltzmann formalized this in the 1870s for thermodynamic systems, and the same mathematics governs everything from protein folding to modern generative AI.

The distribution p(x) = exp(−E(x)) / Z is called the Gibbs distribution (or Boltzmann distribution). The numerator exp(−E(x)) is easy to evaluate — just plug x into your energy function and exponentiate. The denominator Z = ∫ exp(−E(x)) dx is the partition function, a normalizing constant that sums over every possible configuration of x. For any interesting model, this integral is intractable. You can’t compute it analytically, and numerical integration fails in high dimensions.

This is the central tension of energy-based models: defining the energy is easy, but working with the resulting probability distribution is hard. The entire history of EBMs — from Hopfield networks in 1982 to score-based diffusion models in 2019 — is a series of increasingly clever ways to learn and sample from p(x) without ever computing Z.

Let’s start by building a simple energy landscape and seeing what it implies about probability:

import numpy as np

def energy(x, y, centers, heights, widths):
    """2D energy landscape: sum of inverted Gaussians (wells)."""
    E = 0.0
    for (cx, cy), h, w in zip(centers, heights, widths):
        E -= h * np.exp(-((x - cx)**2 + (y - cy)**2) / (2 * w**2))
    return E

# Three energy wells at different locations
centers = [(-2, -1), (2, 1), (0, 3)]
heights = [3.0, 2.0, 2.5]
widths  = [0.8, 1.0, 0.7]

# Evaluate on a grid
xs = np.linspace(-5, 5, 200)
ys = np.linspace(-3, 5, 200)
X, Y = np.meshgrid(xs, ys)
E = energy(X, Y, centers, heights, widths)

# Unnormalized probability: p(x) proportional to exp(-E(x))
unnorm_p = np.exp(-E)

# The partition function Z would be the integral of unnorm_p over all space
# We can approximate it on our grid:
dx = xs[1] - xs[0]
dy = ys[1] - ys[0]
Z_approx = np.sum(unnorm_p) * dx * dy
print(f"Approximate Z = {Z_approx:.2f}")
# Z_approx ~ 104.53 — grows with domain size and dimensionality

The energy function defines three “wells” — regions of low energy. When we exponentiate the negative energy, those wells become peaks of high probability. The deepest well (height 3.0 at (−2, −1)) becomes the tallest probability peak. Notice that we can approximate Z on a 2D grid, but in 1000 dimensions? Forget it — that’s 200^1000 grid points.

2. Hopfield Networks — Energy Minimization as Memory

Before energy-based models became a framework for generative AI, John Hopfield showed in 1982 that neural networks could be understood through energy functions. A Hopfield network is a recurrent network that stores patterns as low-energy attractors — dips in an energy landscape that the network dynamics naturally fall into.

The setup is elegant: N binary neurons with states s_i ∈ {−1, +1}, connected by symmetric weights w_ij = w_ji with no self-connections. The energy function is:

E = −½ Σ_{i,j} w_ij s_i s_j

The update rule is simple: pick a random neuron i, compute its local field h_i = Σ_j w_ij s_j, and set s_i = sign(h_i). Each such flip is guaranteed to decrease (or maintain) the energy — making E a Lyapunov function. The network always converges to a local minimum.

To store patterns, we use Hebbian learning: W = (1/N) Σ_k p⁽ᵏ⁾ (p⁽ᵏ⁾)ᵀ, where each p⁽ᵏ⁾ is a pattern to memorize. This sets the weights so that each stored pattern is a local energy minimum. When you initialize the network with a corrupted version of a stored pattern and run the dynamics, it flows downhill in energy toward the nearest stored pattern — associative memory through energy minimization.

import numpy as np

class HopfieldNetwork:
    def __init__(self, n_neurons):
        self.n = n_neurons
        self.W = np.zeros((n_neurons, n_neurons))

    def store(self, patterns):
        """Store patterns using Hebbian learning."""
        self.W = np.zeros((self.n, self.n))
        for p in patterns:
            self.W += np.outer(p, p) / self.n
        np.fill_diagonal(self.W, 0)  # no self-connections

    def energy(self, state):
        return -0.5 * state @ self.W @ state

    def recall(self, probe, max_steps=20):
        """Run async dynamics until convergence."""
        state = probe.copy()
        for step in range(max_steps):
            changed = False
            order = np.random.permutation(self.n)
            for i in order:
                h_i = self.W[i] @ state
                new_si = 1 if h_i >= 0 else -1
                if new_si != state[i]:
                    state[i] = new_si
                    changed = True
            if not changed:
                break
        return state

# Store 3 patterns in a 64-neuron network (8x8 grid)
net = HopfieldNetwork(64)
p1 = np.ones(64) * -1; p1[[3,4,11,12,19,20,27,28,35,36,43,44]] = 1  # vertical bar
p2 = np.ones(64) * -1; p2[[24,25,26,27,28,29,30,31]] = 1             # horizontal bar
p3 = np.ones(64) * -1; p3[[0,9,18,27,36,45,54,63]] = 1               # diagonal
net.store([p1, p2, p3])

# Corrupt p1 with 30% noise and recall
corrupted = p1.copy()
flip_idx = np.random.choice(64, size=19, replace=False)
corrupted[flip_idx] *= -1
recovered = net.recall(corrupted)
accuracy = np.mean(recovered == p1)
print(f"Recovery accuracy: {accuracy:.1%}")  # typically 95-100%

The capacity limit is about 0.14N patterns for N neurons — beyond that, the energy landscape becomes too crowded and the attractors merge. But within capacity, it’s remarkable: a content-addressable memory that retrieves stored patterns by descending an energy surface.
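The 0.14N figure is easy to probe empirically. Here’s a self-contained sketch (a compact re-implementation of the network above; the pattern counts and corruption level are arbitrary test parameters): store increasing numbers of random patterns and measure how often a mildly corrupted probe is recovered exactly.

```python
import numpy as np

np.random.seed(0)
N = 64

def store(patterns):
    """Hebbian weights: W = (1/N) sum_k p_k p_k^T, zero diagonal."""
    W = sum(np.outer(p, p) for p in patterns) / N
    np.fill_diagonal(W, 0)
    return W

def recall(W, probe, max_steps=20):
    """Asynchronous updates until no neuron flips."""
    state = probe.copy()
    for _ in range(max_steps):
        changed = False
        for i in np.random.permutation(len(state)):
            s = 1.0 if W[i] @ state >= 0 else -1.0
            if s != state[i]:
                state[i] = s
                changed = True
        if not changed:
            break
    return state

def recall_rate(n_patterns, n_flips=6, trials=20):
    """Fraction of trials where a corrupted random pattern is recovered exactly."""
    patterns = np.random.choice([-1.0, 1.0], size=(n_patterns, N))
    W = store(patterns)
    hits = 0
    for _ in range(trials):
        p = patterns[np.random.randint(n_patterns)]
        probe = p.copy()
        probe[np.random.choice(N, n_flips, replace=False)] *= -1
        hits += np.array_equal(recall(W, probe), p)
    return hits / trials

for n in [2, 5, 9, 15, 25]:
    print(f"{n:2d} patterns ({n/N:.2f}N): recall rate = {recall_rate(n):.0%}")
```

Recall should stay near 100% for a handful of patterns and collapse well before 25 patterns (0.39N), as the attractors crowd together and merge.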

Try It: Hopfield Pattern Recall

Three 8×8 binary patterns are stored in the network. Select a pattern, then click cells to corrupt it. Hit “Recall” to watch the network descend its energy landscape back to the stored pattern.


3. Restricted Boltzmann Machines

Hopfield networks store and recall patterns, but they can’t generate new ones. For that, we need a probabilistic model — one that defines a distribution over configurations and lets us draw samples from it. Enter the Restricted Boltzmann Machine (RBM), introduced by Paul Smolensky in 1986 and brought to prominence by Geoffrey Hinton in 2002.

An RBM has two layers: visible units v (the data we observe) and hidden units h (latent features), with connections only between layers — no within-layer connections. This “restricted” connectivity is the key insight. The energy function is:

E(v, h) = −vᵀWh − bᵀv − cᵀh

The probability of a joint configuration is p(v, h) = exp(−E(v, h)) / Z, where Z sums over all 2^(n+m) possible binary configurations of (v, h). For even modest sizes (say, 784 visible units for MNIST and 500 hidden units), that’s 2^1284 terms — hopelessly intractable.

But here’s the beautiful consequence of the bipartite structure: given the visible units, all hidden units are conditionally independent, and vice versa:

p(h_j = 1 | v) = σ(c_j + Σ_i v_i w_ij),    p(v_i = 1 | h) = σ(b_i + Σ_j w_ij h_j)

where σ is the sigmoid function. This makes block Gibbs sampling efficient: sample all hidden units in parallel given the visibles, then all visibles given the hiddens.

Contrastive Divergence

The gradient of the log-likelihood for an RBM has two terms: ⟨v hᵀ⟩_data − ⟨v hᵀ⟩_model. The first term (the “positive phase”) clamps the visible layer to training data and computes expected correlations — easy. The second term (the “negative phase”) requires samples from the model’s equilibrium distribution — that’s the intractable part, since it means running a Markov chain to convergence.

Hinton’s key contribution in 2002 was showing that you don’t need to wait for convergence. Contrastive Divergence (CD-k) runs just k steps of Gibbs sampling starting from a training example. CD-1 (one step) works surprisingly well in practice:

  1. Start with training example v_0
  2. Sample h_0 from p(h | v_0)
  3. Reconstruct v_1 from p(v | h_0)
  4. Sample h_1 from p(h | v_1)
  5. Update: ΔW = η(v_0 h_0ᵀ − v_1 h_1ᵀ)

The positive phase pushes energy down on real data. The negative phase pushes energy up on the model’s one-step reconstructions. Over training, the model learns to assign low energy to training-like configurations.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = np.random.randn(n_visible, n_hidden) * 0.01
        self.b = np.zeros(n_visible)   # visible bias
        self.c = np.zeros(n_hidden)    # hidden bias
        self.lr = lr

    def sample_hidden(self, v):
        """p(h=1|v) = sigmoid(v @ W + c)"""
        prob_h = sigmoid(v @ self.W + self.c)
        return prob_h, (np.random.rand(*prob_h.shape) < prob_h).astype(float)

    def sample_visible(self, h):
        """p(v=1|h) = sigmoid(h @ W.T + b)"""
        prob_v = sigmoid(h @ self.W.T + self.b)
        return prob_v, (np.random.rand(*prob_v.shape) < prob_v).astype(float)

    def contrastive_divergence(self, v_data, k=1):
        """CD-k: k steps of Gibbs sampling from data."""
        # Positive phase: clamp visible to data
        prob_h0, h0 = self.sample_hidden(v_data)

        # Negative phase: k Gibbs steps
        h_k = h0
        for _ in range(k):
            prob_v_k, v_k = self.sample_visible(h_k)
            prob_h_k, h_k = self.sample_hidden(v_k)

        # Weight update: push energy down on data, up on reconstructions
        batch_size = v_data.shape[0]
        self.W += self.lr * (v_data.T @ prob_h0 - v_k.T @ prob_h_k) / batch_size
        self.b += self.lr * np.mean(v_data - v_k, axis=0)
        self.c += self.lr * np.mean(prob_h0 - prob_h_k, axis=0)
        return np.mean((v_data - prob_v_k) ** 2)  # reconstruction error

# Train on small binary patterns
patterns = np.array([
    [1,1,0,0,0,0],  [1,1,0,0,0,0],  [0,0,0,0,1,1],
    [0,0,0,0,1,1],  [1,0,1,0,1,0],  [1,0,1,0,1,0],
], dtype=float)

rbm = RBM(n_visible=6, n_hidden=4, lr=0.1)
for epoch in range(500):
    err = rbm.contrastive_divergence(patterns, k=1)
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: reconstruction error = {err:.4f}")
# Epoch 0: 0.2847 → Epoch 400: 0.0312
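The class above trains the RBM but never samples from it. As a self-contained sketch (the toy patterns, network size, and hyperparameters are made up for illustration), here’s CD-1 training on two complementary patterns followed by generation — block Gibbs sampling started from pure noise:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

np.random.seed(1)

# Two complementary training patterns (hypothetical toy data)
patterns = np.array([[1,1,1,0,0,0], [0,0,0,1,1,1]], dtype=float)

# Tiny RBM trained with CD-1 (same update rule as the class above)
W = np.random.randn(6, 4) * 0.01
b, c = np.zeros(6), np.zeros(4)
lr = 0.1
for _ in range(2000):
    v0 = patterns
    ph0 = sigmoid(v0 @ W + c)
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
    v1 = (np.random.rand(2, 6) < sigmoid(h0 @ W.T + b)).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / 2
    b += lr * np.mean(v0 - v1, axis=0)
    c += lr * np.mean(ph0 - ph1, axis=0)

# Generate: start 10 chains from uniform noise, run block Gibbs
v = (np.random.rand(10, 6) < 0.5).astype(float)
for _ in range(500):
    h = (np.random.rand(10, 4) < sigmoid(v @ W + c)).astype(float)
    v = (np.random.rand(10, 6) < sigmoid(h @ W.T + b)).astype(float)
print(v)  # most rows should resemble one of the two training patterns
```

The chains forget their random initialization and settle into the low-energy configurations the training carved out — exactly the generative behavior Hopfield networks lack.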

RBMs were historically significant: Hinton’s 2006 paper on deep belief networks used stacked RBMs for layer-wise pre-training, demonstrating that deep networks could be trained effectively. This was a catalyst for the modern deep learning era — though today, RBMs have been largely superseded by models that avoid MCMC entirely.

4. MCMC Sampling from Energy Models

We’ve seen that defining an energy function is easy, but sampling from the resulting distribution p(x) = exp(−E(x))/Z is hard. The central question: how do you draw samples when you can’t compute Z?

Markov Chain Monte Carlo (MCMC) methods construct a sequence of samples whose long-run distribution converges to p(x), without ever evaluating Z. The idea: design a random walk that spends more time in low-energy regions.

Metropolis-Hastings

Propose a candidate x′ from a symmetric proposal distribution q(x′|x) — a Gaussian step, say. Accept it with probability min(1, exp(−(E(x′) − E(x)))). If the proposed point has lower energy, always accept. If higher, accept with a probability that decays exponentially with the energy difference. This guarantees the chain converges to p(x) = exp(−E(x))/Z. The partition function Z cancels in the acceptance ratio — it appears in both numerator and denominator. (For an asymmetric proposal, the Hastings correction multiplies the ratio by q(x|x′)/q(x′|x).)
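A minimal Metropolis sampler on a hypothetical 1D double-well energy shows the recipe in a few lines — the energy is evaluated constantly, but Z never appears:

```python
import numpy as np

np.random.seed(0)

def energy(x):
    """Hypothetical 1D double-well: minima near x = -1 and x = +1."""
    return (x**2 - 1)**2

x = 0.0
samples = []
for step in range(20000):
    x_prop = x + 0.5 * np.random.randn()   # symmetric Gaussian proposal
    # Acceptance uses only the energy difference -- Z has cancelled
    if np.random.rand() < np.exp(-(energy(x_prop) - energy(x))):
        x = x_prop
    if step >= 2000:                       # discard burn-in
        samples.append(x)

samples = np.array(samples)
print(f"mean |x| = {np.mean(np.abs(samples)):.2f}")  # roughly 1: mass sits in the two wells
print(f"visited both wells: {samples.min() < 0 < samples.max()}")
```

The chain hops between the wells at x = ±1, spending time in each in proportion to its Boltzmann weight.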

Langevin Dynamics

Instead of random proposals, Langevin dynamics follows the energy gradient plus noise:

x_{t+1} = x_t − η ∇_x E(x_t) + √(2η) · ε,    ε ∼ N(0, I)

The first term slides downhill on the energy surface. The noise term prevents the chain from getting stuck in a single mode. At equilibrium, the stationary distribution is exactly p(x) ∝ exp(−E(x)).

Here’s the connection that bridges the gap to modern AI: the gradient −∇_x E(x) is the score function ∇_x log p(x). Langevin dynamics is gradient ascent on log-probability plus noise. And the reverse process of a diffusion model is... Langevin dynamics. Denoising is energy descent.

import numpy as np

def energy_2d(xy, centers, heights, widths):
    """Compute energy for array of 2D points."""
    E = np.zeros(len(xy))
    for (cx, cy), h, w in zip(centers, heights, widths):
        E -= h * np.exp(-((xy[:,0]-cx)**2 + (xy[:,1]-cy)**2) / (2*w**2))
    return E

def energy_grad(xy, centers, heights, widths):
    """Gradient of the energy function (analytical)."""
    grad = np.zeros_like(xy)
    for (cx, cy), h, w in zip(centers, heights, widths):
        diff_x = xy[:,0] - cx
        diff_y = xy[:,1] - cy
        gauss = h * np.exp(-(diff_x**2 + diff_y**2) / (2*w**2))
        grad[:,0] += gauss * diff_x / w**2
        grad[:,1] += gauss * diff_y / w**2
    return grad

# Langevin dynamics: sample from the energy landscape
centers = [(-2, -1), (2, 1), (0, 3)]
heights = [3.0, 2.0, 2.5]
widths  = [0.8, 1.0, 0.7]

n_samples = 300
eta = 0.05  # step size
xy = np.random.randn(n_samples, 2) * 3  # start from random locations

for step in range(200):
    grad = energy_grad(xy, centers, heights, widths)
    noise = np.random.randn(n_samples, 2)
    xy = xy - eta * grad + np.sqrt(2 * eta) * noise

# After 200 steps, samples cluster around the three energy minima
print(f"Sample mean energies per cluster:")
for c in centers:
    nearby = xy[np.linalg.norm(xy - c, axis=1) < 1.5]
    if len(nearby) > 0:
        print(f"  Near {c}: {len(nearby)} samples, mean E = "
              f"{np.mean(energy_2d(nearby, centers, heights, widths)):.2f}")

Try It: Energy Landscape Explorer

A 2D energy surface with three wells (dark = low energy = high probability). Click anywhere to drop sample particles, then watch Langevin dynamics guide them into the energy basins. Adjust the step size to see how it affects convergence.


5. Score Matching — Avoiding the Partition Function

Everything we’ve built so far has required either approximating the partition function (RBMs with CD) or designing MCMC samplers that converge slowly. What if we could train a model without ever touching Z?

The key insight is the score function: s(x) = ∇_x log p(x). Let’s see why it’s special:

∇_x log p(x) = ∇_x log [exp(−E(x)) / Z] = −∇_x E(x) − ∇_x log Z = −∇_x E(x)

The gradient of log Z with respect to x is zero because Z is a constant (it doesn’t depend on x). The score is simply the negative energy gradient — and it pins down the distribution completely: it determines E(x) up to an additive constant, and normalization fixes the rest.
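This identity is easy to check numerically. For a toy quadratic energy (a standard Gaussian), we can assign Z any value we like — the finite-difference gradient of log p still equals −∇_x E:

```python
import numpy as np

def E(x):
    return 0.5 * x**2   # energy of a standard Gaussian, up to constants

Z = 123.456             # deliberately arbitrary: the score never sees it

def log_p(x):
    return -E(x) - np.log(Z)   # log p(x) = -E(x) - log Z

xs = np.array([-2.0, -0.5, 0.7, 1.5])
h = 1e-5
fd_score = (log_p(xs + h) - log_p(xs - h)) / (2 * h)   # numerical score
neg_grad_E = -xs                                        # analytic -dE/dx

print(fd_score)     # ≈ [ 2.0,  0.5, -0.7, -1.5]
print(neg_grad_E)   # [ 2.0,  0.5, -0.7, -1.5] -- identical, for any Z
```

Change Z to any other positive constant and the two arrays still agree: the normalizer shifts log p by a constant, which differentiation discards.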

Aapo Hyvärinen’s 2005 score matching paper showed how to train a model sθ(x) to approximate the true score, using an objective that never requires computing Z. The original formulation uses an integration-by-parts trick to avoid the unknown true score, leaving an objective that depends only on the model’s score function and its Jacobian.

Denoising Score Matching

Pascal Vincent showed in 2011 that there’s an even simpler way. Add Gaussian noise to the data: x̃ = x + σε, where ε ∼ N(0, I). The score of the Gaussian corruption kernel — the conditional distribution of x̃ given the clean point — has a closed form, and Vincent proved that regressing onto it matches the score of the noisy marginal p_σ(x̃) in expectation:

∇_x̃ log p_σ(x̃ | x) = (x − x̃) / σ² = −ε / σ

This is just the direction from the noisy point back to the clean point, scaled by the noise level. Training a neural network sθ(x̃, σ) to predict this is straightforward supervised learning — no partition function, no MCMC, no integration by parts.

And here’s the punchline: this is exactly the DDPM training objective. When Ho et al. (2020) train a network εθ(xt, t) to predict the noise ε added at timestep t, they’re doing denoising score matching. The predicted noise points in the direction from the noisy data back to the clean data — that’s the score. Modern diffusion models are EBMs trained with denoising score matching.

import numpy as np

def dsm_loss(score_model, x_clean, sigma=0.5):
    """Denoising score matching loss.

    Train score_model to predict the score of the noisy distribution:
    target = -(x_noisy - x_clean) / sigma^2 = -noise / sigma
    """
    noise = np.random.randn(*x_clean.shape)
    x_noisy = x_clean + sigma * noise

    # The target score: direction from noisy back to clean
    target_score = -noise / sigma

    # Model predicts the score at the noisy point
    predicted_score = score_model(x_noisy)

    # MSE loss between predicted and target score
    loss = np.mean((predicted_score - target_score) ** 2)
    return loss

# Conceptual training loop for a score network
# (In practice, score_model would be a neural network with gradient descent)
#
# for epoch in range(1000):
#     x_batch = sample_training_data(batch_size=64)
#     sigma = np.random.choice([0.1, 0.5, 1.0, 2.0])  # multi-scale
#     loss = dsm_loss(score_model, x_batch, sigma)
#     loss.backward()
#     optimizer.step()
#
# After training: generate samples with annealed Langevin dynamics
# for sigma in [2.0, 1.0, 0.5, 0.1]:  # anneal from high to low noise
#     for step in range(100):
#         x = x + eta * score_model(x, sigma) + sqrt(2*eta) * noise

The multi-scale aspect is crucial: Yang Song and Stefano Ermon showed in 2019 that training at multiple noise levels (σ_1 > σ_2 > ... > σ_L) dramatically improves results. High noise smooths the landscape (easy to navigate, few modes), while low noise preserves detail (precise, multimodal). Annealing from high to low noise during sampling — annealed Langevin dynamics — is exactly the reverse diffusion process.
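To make the annealing concrete without training a network, here’s a sketch on a hypothetical 1D two-mode target where the smoothed score is available in closed form — the target blurred at noise level σ is just an equal mixture of N(−3, σ²) and N(+3, σ²):

```python
import numpy as np

np.random.seed(0)
means = np.array([-3.0, 3.0])   # hypothetical 1D two-mode target

def score(x, sigma):
    """Exact score of the target smoothed at noise level sigma:
    an equal mixture of N(-3, sigma^2) and N(+3, sigma^2)."""
    d = x[:, None] - means[None, :]
    sq = d**2
    # subtract the row-wise min before exponentiating, for numerical stability
    w = np.exp(-(sq - sq.min(axis=1, keepdims=True)) / (2 * sigma**2))
    w /= w.sum(axis=1, keepdims=True)
    return -(w * d).sum(axis=1) / sigma**2

# Annealed Langevin: high noise first (smooth landscape), then low noise
x = np.random.randn(1000) * 0.1          # start between the modes
for sigma in [3.0, 1.0, 0.3]:
    eta = 0.1 * sigma**2                 # scale the step size to the noise level
    for _ in range(200):
        x = x + eta * score(x, sigma) + np.sqrt(2 * eta) * np.random.randn(1000)

frac_right = np.mean(x > 0)
print(f"fraction in the +3 mode: {frac_right:.2f}")   # roughly 0.5
print(f"fraction within 1 of a mode: {np.mean(np.abs(np.abs(x) - 3) < 1):.2f}")
```

Started between the modes, the particles first spread across the smoothed landscape at high σ, then commit to one basin as σ shrinks — ending split roughly evenly between the two modes. Run the low-σ stage alone from the same start and most particles stall near the origin, which is exactly why the annealing matters.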

6. The Grand Unification

We’re now in a position to see the remarkable unity underlying seemingly different generative model families. Through the EBM lens, every major approach is answering the same question: how do you learn and sample from p(x) = exp(−E(x))/Z without computing Z?

The contrastive learning connection is particularly illuminating. In models like SimCLR and CLIP, the similarity function sim(q, k) is an energy: low energy for matching pairs, high energy for mismatched pairs. The InfoNCE loss is:

import numpy as np

def infonce_loss(query, keys_pos, keys_neg, temperature=0.1):
    """InfoNCE loss as energy-based softmax.

    The similarity function is the negative energy:
    sim(q, k) = -E(q, k) = q . k / tau
    """
    # Energy of positive pair (should be LOW = high similarity)
    energy_pos = -np.dot(query, keys_pos) / temperature

    # Energies of negative pairs (should be HIGH = low similarity)
    energies_neg = -np.dot(keys_neg, query) / temperature

    # InfoNCE = -log softmax over all energies
    # = -log [exp(-E_pos) / (exp(-E_pos) + sum exp(-E_neg))]
    all_energies = np.concatenate([[energy_pos], energies_neg])
    log_Z = np.log(np.sum(np.exp(-all_energies)))  # log partition function
    loss = energy_pos + log_Z  # = -log p(positive)

    return loss

# Example: query matches keys_pos, not keys_neg
query = np.array([1.0, 0.0, 0.0])
keys_pos = np.array([0.9, 0.1, 0.0])
keys_neg = np.random.randn(5, 3)

loss = infonce_loss(query, keys_pos, keys_neg)
print(f"InfoNCE loss: {loss:.3f}")
# Low loss when positive pair has much lower energy than negatives

When you train CLIP to match images with their text descriptions, you’re learning an energy function over (image, text) space. Matched pairs sit in energy basins; mismatches are pushed to high-energy plateaus. This is the same principle that Hopfield used for associative memory in 1982, scaled to billions of parameters.

7. Conclusion

Energy-based models are the Rosetta Stone of generative AI. The framework is simple: define an energy E(x), and the distribution follows as p(x) ∝ exp(−E(x)). Every generative model we’ve built on this site — from GANs to VAEs to diffusion models — is a different strategy for the same fundamental problem: learning and sampling from this distribution without computing the intractable partition function Z.

The intellectual lineage spans four decades: Hopfield networks (1982) showed that computation could emerge from energy minimization. RBMs and contrastive divergence (2002) made energy-based learning tractable. Score matching (2005) and denoising score matching (2011) eliminated the partition function entirely. And modern diffusion models (2019–present) brought these ideas to the scale of DALL-E and Stable Diffusion.

The next time you see a generative model, ask: what’s the energy function? Where are the low-energy basins? How does it sample without computing Z? You’ll find that the answers reveal a deep unity beneath the surface diversity of modern AI.

References & Further Reading