Energy-Based Models from Scratch
1. The Energy-Probability Connection
Any probability distribution with a strictly positive density can be written as p(x) = exp(−E(x)) / Z (set E(x) = −log p(x) and the identity is immediate). This isn’t just a mathematical trick — it’s a complete worldview. Instead of defining probabilities directly (which requires enforcing that everything sums to one — hard), you define an energy function E(x) that assigns a scalar to each configuration, and let the Boltzmann distribution handle the rest.
The physics analogy is immediate: a ball rolling on a landscape settles into valleys. Low-energy states are stable and likely. High-energy states are unstable and rare. Ludwig Boltzmann formalized this in the 1870s for thermodynamic systems, and the same mathematics governs everything from protein folding to modern generative AI.
The distribution p(x) = exp(−E(x)) / Z is called the Gibbs distribution (or Boltzmann distribution). The numerator exp(−E(x)) is easy to evaluate — just plug x into your energy function and exponentiate. The denominator Z = ∫ exp(−E(x)) dx is the partition function, a normalizing constant that sums over every possible configuration of x. For any interesting model, this integral is intractable. You can’t compute it analytically, and numerical integration fails in high dimensions.
This is the central tension of energy-based models: defining the energy is easy, but working with the resulting probability distribution is hard. The entire history of EBMs — from Hopfield networks in 1982 to score-based diffusion models in 2019 — is a series of increasingly clever ways to learn and sample from p(x) without ever computing Z.
Let’s start by building a simple energy landscape and seeing what it implies about probability:
import numpy as np
def energy(x, y, centers, heights, widths):
    """2D energy landscape: sum of inverted Gaussians (wells)."""
    E = 0.0
    for (cx, cy), h, w in zip(centers, heights, widths):
        E -= h * np.exp(-((x - cx)**2 + (y - cy)**2) / (2 * w**2))
    return E
# Three energy wells at different locations
centers = [(-2, -1), (2, 1), (0, 3)]
heights = [3.0, 2.0, 2.5]
widths = [0.8, 1.0, 0.7]
# Evaluate on a grid
xs = np.linspace(-5, 5, 200)
ys = np.linspace(-3, 5, 200)
X, Y = np.meshgrid(xs, ys)
E = energy(X, Y, centers, heights, widths)
# Unnormalized probability: p(x) proportional to exp(-E(x))
unnorm_p = np.exp(-E)
# The partition function Z would be the integral of unnorm_p over all space
# We can approximate it on our grid:
dx = xs[1] - xs[0]
dy = ys[1] - ys[0]
Z_approx = np.sum(unnorm_p) * dx * dy
print(f"Approximate Z = {Z_approx:.2f}")
# Z_approx ~ 104.53 — grows with domain size and dimensionality
The energy function defines three “wells” — regions of low energy. When we exponentiate the negative energy, those wells become peaks of high probability. The deepest well (height 3.0 at (−2, −1)) becomes the tallest probability peak. Notice that we can approximate Z on a 2D grid, but in 1000 dimensions? Forget it — that’s 200¹⁰⁰⁰ grid points.
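Normalizing on the grid makes the well-to-peak picture concrete. A small standalone sketch, reusing the same (hypothetical) landscape parameters, that confirms the grid density integrates to one and shows how much mass each well captures:

```python
import numpy as np

def energy(x, y, centers, heights, widths):
    """Same inverted-Gaussian landscape as above."""
    E = 0.0
    for (cx, cy), h, w in zip(centers, heights, widths):
        E -= h * np.exp(-((x - cx)**2 + (y - cy)**2) / (2 * w**2))
    return E

centers = [(-2, -1), (2, 1), (0, 3)]
heights = [3.0, 2.0, 2.5]
widths = [0.8, 1.0, 0.7]

xs = np.linspace(-5, 5, 200)
ys = np.linspace(-3, 5, 200)
X, Y = np.meshgrid(xs, ys)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]

unnorm_p = np.exp(-energy(X, Y, centers, heights, widths))
Z = np.sum(unnorm_p) * dx * dy
p = unnorm_p / Z                        # normalized density on the grid
print(f"{np.sum(p) * dx * dy:.4f}")     # 1.0000 (by construction on the grid)

# Probability mass within radius 1 of each well center
for (cx, cy), h in zip(centers, heights):
    mask = (X - cx)**2 + (Y - cy)**2 < 1.0
    print(f"well at ({cx}, {cy}), depth {h}: mass = {np.sum(p[mask]) * dx * dy:.3f}")
```

The deepest well ends up with the most mass, but not in proportion to its depth: exponentiation amplifies depth differences nonlinearly, and well widths matter too.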
2. Hopfield Networks — Energy Minimization as Memory
Before energy-based models became a framework for generative AI, John Hopfield showed in 1982 that neural networks could be understood through energy functions. A Hopfield network is a recurrent network that stores patterns as low-energy attractors — dips in an energy landscape that the network dynamics naturally fall into.
The setup is elegant: N binary neurons with states sᵢ ∈ {−1, +1}, connected by symmetric weights wᵢⱼ = wⱼᵢ with no self-connections. The energy function is:
E = −½ Σᵢⱼ wᵢⱼ sᵢ sⱼ
The update rule is simple: pick a random neuron i, compute its local field hᵢ = Σⱼ wᵢⱼ sⱼ, and set sᵢ = sign(hᵢ). Each such flip is guaranteed to decrease (or maintain) the energy — making E a Lyapunov function. The network always converges to a local minimum.
To store patterns, we use Hebbian learning: W = (1/N) Σₖ p⁽ᵏ⁾ (p⁽ᵏ⁾)ᵀ, where each p⁽ᵏ⁾ is a pattern to memorize. This sets the weights so that each stored pattern is a local energy minimum. When you initialize the network with a corrupted version of a stored pattern and run the dynamics, it flows downhill in energy toward the nearest stored pattern — associative memory through energy minimization.
import numpy as np
class HopfieldNetwork:
    def __init__(self, n_neurons):
        self.n = n_neurons
        self.W = np.zeros((n_neurons, n_neurons))

    def store(self, patterns):
        """Store patterns using Hebbian learning."""
        self.W = np.zeros((self.n, self.n))
        for p in patterns:
            self.W += np.outer(p, p) / self.n
        np.fill_diagonal(self.W, 0)  # no self-connections

    def energy(self, state):
        return -0.5 * state @ self.W @ state

    def recall(self, probe, max_steps=20):
        """Run async dynamics until convergence."""
        state = probe.copy()
        for step in range(max_steps):
            changed = False
            order = np.random.permutation(self.n)
            for i in order:
                h_i = self.W[i] @ state
                new_si = 1 if h_i >= 0 else -1
                if new_si != state[i]:
                    state[i] = new_si
                    changed = True
            if not changed:
                break
        return state
# Store 3 patterns in a 64-neuron network (8x8 grid)
net = HopfieldNetwork(64)
p1 = np.ones(64) * -1; p1[[3,4,11,12,19,20,27,28,35,36,43,44]] = 1 # vertical bar
p2 = np.ones(64) * -1; p2[[24,25,26,27,28,29,30,31]] = 1 # horizontal bar
p3 = np.ones(64) * -1; p3[[0,9,18,27,36,45,54,63]] = 1 # diagonal
net.store([p1, p2, p3])
# Corrupt p1 with 30% noise and recall
corrupted = p1.copy()
flip_idx = np.random.choice(64, size=19, replace=False)
corrupted[flip_idx] *= -1
recovered = net.recall(corrupted)
accuracy = np.mean(recovered == p1)
print(f"Recovery accuracy: {accuracy:.1%}") # typically 95-100%
The capacity limit is about 0.14N patterns for N neurons — beyond that, the energy landscape becomes too crowded and the attractors merge. But within capacity, it’s remarkable: a content-addressable memory that retrieves stored patterns by descending an energy surface.
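The capacity limit can be watched empirically. A standalone sketch (network size and pattern counts are hypothetical choices) that stores increasing numbers of random patterns and checks which remain stable fixed points of the dynamics:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100  # neurons

def stable_recall(W, p, steps=50):
    """Run synchronous updates starting from the stored pattern itself;
    return True if it stays a (near-)fixed point of the dynamics."""
    s = p.copy()
    for _ in range(steps):
        s_new = np.where(W @ s >= 0, 1.0, -1.0)
        if np.array_equal(s_new, s):
            break
        s = s_new
    return np.mean(s == p) > 0.95

results = {}
for K in [5, 10, 14, 20, 30]:
    patterns = rng.choice([-1.0, 1.0], size=(K, N))
    W = patterns.T @ patterns / N   # Hebbian learning
    np.fill_diagonal(W, 0)
    results[K] = sum(stable_recall(W, p) for p in patterns)
    print(f"K={K:2d} ({K/N:.2f}N): {results[K]}/{K} patterns stable")
# recall typically degrades once K exceeds roughly 0.14N
```

Below capacity, crosstalk between patterns is small Gaussian noise on each neuron's local field; past ~0.14N the noise overwhelms the signal and attractors collapse.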
Try It: Hopfield Pattern Recall
Three 8×8 binary patterns are stored in the network. Select a pattern, then click cells to corrupt it. Hit “Recall” to watch the network descend its energy landscape back to the stored pattern.
3. Restricted Boltzmann Machines
Hopfield networks store and recall patterns, but they can’t generate new ones. For that, we need a probabilistic model — one that defines a distribution over configurations and lets us draw samples from it. Enter the Restricted Boltzmann Machine (RBM), introduced by Paul Smolensky in 1986 and brought to prominence by Geoffrey Hinton in 2002.
An RBM has two layers: visible units v (the data we observe) and hidden units h (latent features), with connections only between layers — no within-layer connections. This “restricted” connectivity is the key insight. The energy function is:
E(v, h) = −vᵀWh − bᵀv − cᵀh
The probability of a joint configuration is p(v, h) = exp(−E(v, h)) / Z, where Z sums over all 2ⁿ⁺ᵐ possible binary configurations of (v, h). For even modest sizes (say, 784 visible units for MNIST and 500 hidden units), that’s 2¹²⁸⁴ terms — hopelessly intractable.
But here’s the beautiful consequence of the bipartite structure: given the visible units, all hidden units are conditionally independent, and vice versa:
- p(hⱼ = 1 | v) = σ(Wⱼ · v + cⱼ)
- p(vᵢ = 1 | h) = σ(Wᵢ · h + bᵢ)
where σ is the sigmoid function. This makes block Gibbs sampling efficient: sample all hidden units in parallel given the visibles, then all visibles given the hiddens.
Contrastive Divergence
The gradient of the log-likelihood for an RBM has two terms: ⟨vhᵀ⟩_data − ⟨vhᵀ⟩_model. The first term (the “positive phase”) clamps the visible layer to training data and computes expected correlations — easy. The second term (the “negative phase”) requires samples from the model’s equilibrium distribution — that’s the intractable part, since it means running a Markov chain to convergence.
Hinton’s key contribution in 2002 was showing that you don’t need to wait for convergence. Contrastive Divergence (CD-k) runs just k steps of Gibbs sampling starting from a training example. CD-1 (one step) works surprisingly well in practice:
- Start with training example v₀
- Sample h₀ from p(h | v₀)
- Reconstruct v₁ from p(v | h₀)
- Sample h₁ from p(h | v₁)
- Update: ΔW = η(v₀h₀ᵀ − v₁h₁ᵀ)
The positive phase pushes energy down on real data. The negative phase pushes energy up on the model’s one-step reconstructions. Over training, the model learns to assign low energy to training-like configurations.
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = np.random.randn(n_visible, n_hidden) * 0.01
        self.b = np.zeros(n_visible)  # visible bias
        self.c = np.zeros(n_hidden)   # hidden bias
        self.lr = lr

    def sample_hidden(self, v):
        """p(h=1|v) = sigmoid(v @ W + c)"""
        prob_h = sigmoid(v @ self.W + self.c)
        return prob_h, (np.random.rand(*prob_h.shape) < prob_h).astype(float)

    def sample_visible(self, h):
        """p(v=1|h) = sigmoid(h @ W.T + b)"""
        prob_v = sigmoid(h @ self.W.T + self.b)
        return prob_v, (np.random.rand(*prob_v.shape) < prob_v).astype(float)

    def contrastive_divergence(self, v_data, k=1):
        """CD-k: k steps of Gibbs sampling from data."""
        # Positive phase: clamp visible to data
        prob_h0, h0 = self.sample_hidden(v_data)
        # Negative phase: k Gibbs steps
        h_k = h0
        for _ in range(k):
            prob_v_k, v_k = self.sample_visible(h_k)
            prob_h_k, h_k = self.sample_hidden(v_k)
        # Weight update: push energy down on data, up on reconstructions
        batch_size = v_data.shape[0]
        self.W += self.lr * (v_data.T @ prob_h0 - v_k.T @ prob_h_k) / batch_size
        self.b += self.lr * np.mean(v_data - v_k, axis=0)
        self.c += self.lr * np.mean(prob_h0 - prob_h_k, axis=0)
        return np.mean((v_data - prob_v_k) ** 2)  # reconstruction error
# Train on small binary patterns
patterns = np.array([
[1,1,0,0,0,0], [1,1,0,0,0,0], [0,0,0,0,1,1],
[0,0,0,0,1,1], [1,0,1,0,1,0], [1,0,1,0,1,0],
], dtype=float)
rbm = RBM(n_visible=6, n_hidden=4, lr=0.1)
for epoch in range(500):
    err = rbm.contrastive_divergence(patterns, k=1)
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: reconstruction error = {err:.4f}")
# Epoch 0: 0.2847 → Epoch 400: 0.0312
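Once trained, an RBM generates new configurations by running block Gibbs sampling from noise. Here is a minimal standalone sketch — not the trained model above, but a hypothetical 6-visible/2-hidden RBM with hand-set weights, chosen so its two hidden units encode mutually exclusive “left half on” and “right half on” patterns:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hand-set weights (hypothetical): hidden unit 0 favors v0..v2 on and v3..v5 off;
# hidden unit 1 favors the mirror image. Negative biases keep units off
# unless a hidden unit supports them.
W = np.array([[ 4.0, -4.0]] * 3 + [[-4.0, 4.0]] * 3)  # shape (6, 2)
b = np.full(6, -2.0)   # visible biases
c = np.full(2, -2.0)   # hidden biases

v = (rng.random(6) < 0.5).astype(float)   # start from random noise
for _ in range(100):                       # block Gibbs: all h | v, then all v | h
    h = (rng.random(2) < sigmoid(v @ W + c)).astype(float)
    v = (rng.random(6) < sigmoid(h @ W.T + b)).astype(float)

print(v)  # chains tend to settle in a low-energy mode: [1,1,1,0,0,0] or [0,0,0,1,1,1]
```

The two half-on patterns sit at energy −4 while the all-off state sits at 0, so the chain spends most of its time in one of the two modes — generation is just equilibrium sampling from the energy landscape.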
RBMs were historically significant: Hinton’s 2006 paper on deep belief networks used stacked RBMs for layer-wise pre-training, demonstrating that deep networks could be trained effectively. This was a catalyst for the modern deep learning era — though today, RBMs have been largely superseded by models that avoid MCMC entirely.
4. MCMC Sampling from Energy Models
We’ve seen that defining an energy function is easy, but sampling from the resulting distribution p(x) = exp(−E(x))/Z is hard. The central question: how do you draw samples when you can’t compute Z?
Markov Chain Monte Carlo (MCMC) methods construct a sequence of samples whose long-run distribution converges to p(x), without ever evaluating Z. The idea: design a random walk that spends more time in low-energy regions.
Metropolis-Hastings
Propose a candidate x′ from some proposal distribution q(x′|x). Accept it with probability min(1, exp(−(E(x′) − E(x)))). If the proposed point has lower energy, always accept. If higher energy, accept with probability that decreases exponentially with the energy difference. This guarantees the chain converges to p(x) = exp(−E(x))/Z. The partition function Z cancels in the acceptance ratio — it appears in both numerator and denominator.
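This acceptance rule can be sketched directly. A minimal sampler on a 1D double-well energy (the energy function and proposal scale here are hypothetical choices; any energy works):

```python
import numpy as np

rng = np.random.default_rng(0)

def E(x):
    """Double-well energy: minima near x = -1 and x = +1, barrier at x = 0."""
    return (x**2 - 1) ** 2

x = 0.0
samples = []
for step in range(20000):
    x_prop = x + rng.normal(0, 0.5)   # symmetric Gaussian proposal
    # Acceptance depends only on the energy difference -- Z has cancelled
    if rng.random() < np.exp(-(E(x_prop) - E(x))):
        x = x_prop                     # downhill moves always accepted
    samples.append(x)

samples = np.array(samples[2000:])     # discard burn-in
print(np.mean(samples < 0), np.mean(samples > 0))  # the chain visits both wells
```

Because the barrier at x = 0 is low relative to the wells, the chain hops between modes; with a deeper barrier, mixing would slow dramatically, which is exactly the failure mode Langevin dynamics and annealing try to address.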
Langevin Dynamics
Instead of random proposals, Langevin dynamics follows the energy gradient plus noise:
xₜ₊₁ = xₜ − η ∇ₓE(xₜ) + √(2η) · ε, ε ∼ N(0, I)
The first term slides downhill on the energy surface. The noise term prevents the chain from getting stuck in a single mode. At equilibrium, the stationary distribution is exactly p(x) ∝ exp(−E(x)).
Here’s the connection that bridges the gap to modern AI: the gradient −∇xE(x) is the score function ∇x log p(x). Langevin dynamics is gradient ascent on log-probability plus noise. And the reverse process of a diffusion model is... Langevin dynamics. Denoising is energy descent.
import numpy as np
def energy_2d(xy, centers, heights, widths):
    """Compute energy for array of 2D points."""
    E = np.zeros(len(xy))
    for (cx, cy), h, w in zip(centers, heights, widths):
        E -= h * np.exp(-((xy[:,0]-cx)**2 + (xy[:,1]-cy)**2) / (2*w**2))
    return E

def energy_grad(xy, centers, heights, widths):
    """Gradient of the energy function (analytical)."""
    grad = np.zeros_like(xy)
    for (cx, cy), h, w in zip(centers, heights, widths):
        diff_x = xy[:,0] - cx
        diff_y = xy[:,1] - cy
        gauss = h * np.exp(-(diff_x**2 + diff_y**2) / (2*w**2))
        grad[:,0] += gauss * diff_x / w**2
        grad[:,1] += gauss * diff_y / w**2
    return grad
# Langevin dynamics: sample from the energy landscape
centers = [(-2, -1), (2, 1), (0, 3)]
heights = [3.0, 2.0, 2.5]
widths = [0.8, 1.0, 0.7]
n_samples = 300
eta = 0.05 # step size
xy = np.random.randn(n_samples, 2) * 3 # start from random locations
for step in range(200):
    grad = energy_grad(xy, centers, heights, widths)
    noise = np.random.randn(n_samples, 2)
    xy = xy - eta * grad + np.sqrt(2 * eta) * noise

# After 200 steps, samples cluster around the three energy minima
print("Sample mean energies per cluster:")
for c in centers:
    nearby = xy[np.linalg.norm(xy - c, axis=1) < 1.5]
    if len(nearby) > 0:
        print(f"  Near {c}: {len(nearby)} samples, mean E = "
              f"{np.mean(energy_2d(nearby, centers, heights, widths)):.2f}")
Try It: Energy Landscape Explorer
A 2D energy surface with three wells (dark = low energy = high probability). Click anywhere to drop sample particles, then watch Langevin dynamics guide them into the energy basins. Adjust the step size to see how it affects convergence.
5. Score Matching — Avoiding the Partition Function
Everything we’ve built so far has required either approximating the partition function (RBMs with CD) or designing MCMC samplers that converge slowly. What if we could train a model without ever touching Z?
The key insight is the score function: s(x) = ∇ₓ log p(x). Let’s see why it’s special:
∇ₓ log p(x) = ∇ₓ log [exp(−E(x)) / Z] = −∇ₓE(x) − ∇ₓ log Z = −∇ₓE(x)
The gradient of log Z with respect to x is zero because Z is a constant (it doesn’t depend on x). The score is simply the negative energy gradient — and it completely specifies the distribution (up to a normalizing constant).
Aapo Hyvärinen’s 2005 score matching paper showed how to train a model sθ(x) to approximate the true score, using an objective that never requires computing Z. The original formulation uses an integration-by-parts trick to avoid the unknown true score, leaving an objective that depends only on the model’s score function and its Jacobian.
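Concretely, after the integration-by-parts step, Hyvärinen's objective reads (writing sθ for the model score; the usual regularity conditions on p are assumed):

```latex
J(\theta)
  = \mathbb{E}_{p_{\mathrm{data}}(x)}
    \left[ \operatorname{tr}\!\big(\nabla_x s_\theta(x)\big)
         + \tfrac{1}{2}\,\lVert s_\theta(x)\rVert^2 \right]
  + \mathrm{const.}
```

The trace-of-Jacobian term requires every diagonal entry of the model's Jacobian, which is expensive in high dimensions. That cost is one reason the denoising variant became the workhorse.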
Denoising Score Matching
Pascal Vincent showed in 2011 that there’s an even simpler way. Add Gaussian noise to the data: x̃ = x + σε, where ε ∼ N(0, I). The score of the noisy distribution has a closed form:
∇x̃ log p_σ(x̃) = (x − x̃) / σ² = −ε / σ
This is just the direction from the noisy point back to the clean point, scaled by the noise level. Training a neural network sθ(x̃, σ) to predict this is straightforward supervised learning — no partition function, no MCMC, no integration by parts.
And here’s the punchline: this is exactly the DDPM training objective. When Ho et al. (2020) train a network εθ(xt, t) to predict the noise ε added at timestep t, they’re doing denoising score matching. The predicted noise points in the direction from the noisy data back to the clean data — that’s the score. Modern diffusion models are EBMs trained with denoising score matching.
import numpy as np
def dsm_loss(score_model, x_clean, sigma=0.5):
    """Denoising score matching loss.

    Train score_model to predict the score of the noisy distribution:
        target = -(x_noisy - x_clean) / sigma^2 = -noise / sigma
    """
    noise = np.random.randn(*x_clean.shape)
    x_noisy = x_clean + sigma * noise
    # The target score: direction from noisy back to clean
    target_score = -noise / sigma
    # Model predicts the score at the noisy point (conditioned on noise level)
    predicted_score = score_model(x_noisy, sigma)
    # MSE loss between predicted and target score
    loss = np.mean((predicted_score - target_score) ** 2)
    return loss

# Conceptual training loop for a score network
# (In practice, score_model would be a neural network with gradient descent)
#
# for epoch in range(1000):
#     x_batch = sample_training_data(batch_size=64)
#     sigma = np.random.choice([0.1, 0.5, 1.0, 2.0])  # multi-scale
#     loss = dsm_loss(score_model, x_batch, sigma)
#     loss.backward()
#     optimizer.step()
#
# After training: generate samples with annealed Langevin dynamics
# (note the plus sign: the score points uphill in log-probability)
# for sigma in [2.0, 1.0, 0.5, 0.1]:  # anneal from high to low noise
#     for step in range(100):
#         x = x + eta * score_model(x, sigma) + sqrt(2*eta) * noise
The multi-scale aspect is crucial: Yang Song and Stefano Ermon showed in 2019 that training at multiple noise levels (σ₁ > σ₂ > … > σ_L) dramatically improves results. High noise smooths the landscape (easy to navigate, few modes), while low noise preserves detail (precise, multimodal). Annealing from high to low noise during sampling — annealed Langevin dynamics — is exactly the reverse diffusion process.
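To make the annealing schedule concrete, here is a runnable sketch on a 1D two-mode toy distribution. Since the Gaussian-smoothed density of a Gaussian mixture is known in closed form, its exact score stands in for a trained network sθ(x̃, σ); all parameters here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
MODES, S0 = np.array([-2.0, 2.0]), 0.1   # data: mixture of N(-2, 0.1^2) and N(2, 0.1^2)

def score(x, sigma):
    """Exact score of the sigma-smoothed mixture (stand-in for a trained model)."""
    var = S0**2 + sigma**2
    d = x[:, None] - MODES[None, :]              # (n, 2) deviations from each mode
    w = np.exp(-d**2 / (2 * var))                # unnormalized mode responsibilities
    w /= w.sum(axis=1, keepdims=True)
    return -(w * d).sum(axis=1) / var            # grad of log p_sigma

x = rng.normal(0, 3, size=1000)                  # start from broad noise
for sigma in [3.0, 1.0, 0.3, 0.1]:               # anneal high -> low noise
    eta = 0.1 * sigma**2                         # step size scaled to noise level
    for _ in range(100):                         # Langevin: ascend score plus noise
        x = x + eta * score(x, sigma) + np.sqrt(2 * eta) * rng.normal(size=x.shape)

print(np.mean(x < 0), np.mean(x > 0))            # roughly balanced modes
print(np.mean(np.abs(np.abs(x) - 2) < 0.5))      # most samples end up near +-2
```

At high σ the two modes blur into one broad basin that every chain can find; as σ shrinks, samples are funneled into increasingly sharp modes. This is the reverse diffusion process in miniature.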
6. The Grand Unification
We’re now in a position to see the remarkable unity underlying seemingly different generative model families. Through the EBM lens, every major approach is answering the same question: how do you learn and sample from p(x) = exp(−E(x))/Z without computing Z?
- Boltzmann machines — explicit energy function, intractable Z, approximate with MCMC (contrastive divergence)
- GANs — the discriminator D(x) implicitly defines an energy E(x) ≈ −log D(x). Adversarial training avoids Z entirely by using the discriminator as a learned critic
- VAEs — the ELBO provides a variational lower bound on log p(x), which is equivalent to bounding the free energy. The encoder learns an efficient proposal distribution
- Diffusion models — the denoising network learns the score ∇x log p(x) = −∇xE(x) via denoising score matching. Sampling uses Langevin dynamics (the reverse process)
- Normalizing flows — build the density from an invertible transformation of a simple base distribution via the change-of-variables formula, so the model is normalized by construction (effectively Z = 1) and exact likelihood computation is possible
- Contrastive learning — defines energy over (query, key) pairs, where the InfoNCE loss is a softmax over negative energies
The contrastive learning connection is particularly illuminating. In models like SimCLR and CLIP, the similarity function sim(q, k) is an energy: low energy for matching pairs, high energy for mismatched pairs. The InfoNCE loss is:
import numpy as np
def infonce_loss(query, keys_pos, keys_neg, temperature=0.1):
    """InfoNCE loss as energy-based softmax.

    The similarity function is the negative energy:
        sim(q, k) = -E(q, k) = q . k / tau
    """
    # Energy of positive pair (should be LOW = high similarity)
    energy_pos = -np.dot(query, keys_pos) / temperature
    # Energies of negative pairs (should be HIGH = low similarity)
    energies_neg = -np.dot(keys_neg, query) / temperature
    # InfoNCE = -log softmax over all energies
    #         = -log [exp(-E_pos) / (exp(-E_pos) + sum exp(-E_neg))]
    all_energies = np.concatenate([[energy_pos], energies_neg])
    log_Z = np.log(np.sum(np.exp(-all_energies)))  # log partition function
    loss = energy_pos + log_Z  # = -log p(positive)
    return loss
# Example: query matches keys_pos, not keys_neg
query = np.array([1.0, 0.0, 0.0])
keys_pos = np.array([0.9, 0.1, 0.0])
keys_neg = np.random.randn(5, 3)
loss = infonce_loss(query, keys_pos, keys_neg)
print(f"InfoNCE loss: {loss:.3f}")
# Low loss when positive pair has much lower energy than negatives
When you train CLIP to match images with their text descriptions, you’re learning an energy function over (image, text) space. Matched pairs sit in energy basins; mismatches are pushed to high-energy plateaus. This is the same principle that Hopfield used for associative memory in 1982, scaled to billions of parameters.
7. Conclusion
Energy-based models are the Rosetta Stone of generative AI. The framework is simple: define an energy E(x), and the distribution follows as p(x) ∝ exp(−E(x)). Every generative model we’ve built on this site — from GANs to VAEs to diffusion models — is a different strategy for the same fundamental problem: learning and sampling from this distribution without computing the intractable partition function Z.
The intellectual lineage spans four decades: Hopfield networks (1982) showed that computation could emerge from energy minimization. RBMs and contrastive divergence (2002) made energy-based learning tractable. Score matching (2005) and denoising score matching (2011) eliminated the partition function entirely. And modern diffusion models (2019–present) brought these ideas to the scale of DALL-E and Stable Diffusion.
The next time you see a generative model, ask: what’s the energy function? Where are the low-energy basins? How does it sample without computing Z? You’ll find that the answers reveal a deep unity beneath the surface diversity of modern AI.
References & Further Reading
- Hopfield — Neural Networks and Physical Systems with Emergent Collective Computational Abilities (1982) — the foundational paper connecting neural networks to energy landscapes
- Hinton & Sejnowski — Learning and Relearning in Boltzmann Machines (1986) — the original Boltzmann machine learning algorithm
- Hinton — Training Products of Experts by Minimizing Contrastive Divergence (2002) — the CD algorithm that made RBM training practical
- Hyvärinen — Estimation of Non-Normalized Statistical Models by Score Matching (2005) — training without the partition function
- LeCun et al. — A Tutorial on Energy-Based Learning (2006) — comprehensive tutorial on the EBM framework
- Vincent — A Connection Between Score Matching and Denoising Autoencoders (2011) — the bridge between denoising and score estimation
- Song & Ermon — Generative Modeling by Estimating Gradients of the Data Distribution (2019) — multi-scale score matching that directly inspired modern diffusion
- Ho, Jain & Abbeel — Denoising Diffusion Probabilistic Models (2020) — DDPM, the paper that launched the diffusion model revolution
- Song et al. — Score-Based Generative Modeling through SDEs (2021) — the unified stochastic differential equation framework