Hopfield Networks from Scratch
1. Content-Addressable Memory
Your brain doesn't search for memories — it completes them. Hear three notes of a song, and the rest floods back. See half a friend's face in a crowd, and you instantly recognize them. This isn't a lookup table with an index. It's something stranger: you're using the content itself as the query, and the memory system fills in what's missing.
In computer science, we call this content-addressable memory. A hash table can retrieve a value given an exact key, but give it a slightly wrong key and it returns nothing. What we want is a system where a noisy, partial pattern retrieves the closest clean, complete pattern. The system should be robust to corruption.
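A two-line contrast makes the distinction concrete (a sketch; the `nearest` helper is a stand-in for the associative memory this post builds):

```python
store = {"twinkle twinkle little star": "how I wonder what you are"}

# Exact-match lookup: a one-character typo retrieves nothing.
print(store.get("twinkle twinkle littel star"))  # None

# What we want instead: match by similarity, not by exact key.
def nearest(query, keys):
    """Return the stored key sharing the most characters position-wise."""
    return max(keys, key=lambda k: sum(a == b for a, b in zip(query, k)))

print(nearest("twinkle twinkle littel star", store))  # twinkle twinkle little star
```

A Hopfield network does this matching implicitly, in parallel, via its dynamics rather than an explicit scan over keys.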
In 1982, physicist John Hopfield proposed exactly this: a recurrent neural network that stores patterns as attractors in an energy landscape. Drop in a corrupted input, and the network's dynamics roll downhill to the nearest stored pattern — like a ball settling into the nearest valley. No backpropagation, no loss function, no gradient descent. Just physics.
If you've read our energy-based models post, you've seen the Hopfield network as a stepping stone to Boltzmann machines and diffusion models. Here, we make Hopfield the destination. We'll go deep on the capacity theory most tutorials skip, build the continuous version that blows past the classical limits, and arrive at a result that still surprises people: the attention mechanism in every Transformer is mathematically identical to one step of a continuous Hopfield network. The most successful architecture in modern AI is, at its core, an associative memory from 1982.
2. Binary Hopfield Networks
A Hopfield network has N neurons, each in one of two states: +1 or −1. Not 0 and 1 — the symmetry around zero matters for the math. The network stores P patterns, where each pattern ξμ is a vector of N values in {−1, +1}.
To store patterns, we use Hebbian learning — "neurons that fire together, wire together." The weight between neurons i and j is the average correlation across all stored patterns:
Wij = (1/P) Σμ ξiμ ξjμ, with Wii = 0 (no self-connections)
In matrix form, that's W = (1/P) Σμ ξμ(ξμ)T with the diagonal zeroed out. (Many texts normalize by 1/N instead; the overall scale of W doesn't affect the sign-based dynamics.) This is just a sum of outer products — no optimization, no iterative training.
To recall a pattern, we start from a corrupted input and apply asynchronous updates: pick a random neuron i, compute its local field hi = Σj Wij xj, and set xi = sign(hi). Repeat until nothing changes. Asynchronous updates are critical — updating all neurons simultaneously can cause oscillations and prevent convergence.
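Before the full implementation, a two-neuron toy makes the synchronous failure mode concrete (a sketch; the weight matrix is hand-picked to show the oscillation, not learned via the Hebbian rule):

```python
import numpy as np

# Two mutually inhibitory neurons; synchronous updates never settle.
W = np.array([[0.0, -1.0],
              [-1.0, 0.0]])
x = np.array([1, 1])
for step in range(4):
    x = np.where(W @ x >= 0, 1, -1)  # update ALL neurons at once
    print(step + 1, x)
# 1 [-1 -1]
# 2 [1 1]
# 3 [-1 -1]
# 4 [1 1]
```

Update one neuron at a time instead and the same system settles immediately; that guarantee is exactly what the energy argument below establishes.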
import numpy as np
def hopfield_store(patterns):
    """Store patterns via Hebbian learning (outer product rule)."""
    N = patterns[0].shape[0]
    W = np.zeros((N, N))
    for p in patterns:
        W += np.outer(p, p)
    W /= len(patterns)
    np.fill_diagonal(W, 0)  # no self-connections
    return W
def hopfield_recall(W, state, max_steps=1000):
    """Asynchronous update until convergence."""
    N = len(state)
    x = state.copy()
    for step in range(max_steps):
        changed = False
        order = np.random.permutation(N)
        for i in order:
            h_i = W[i] @ x
            new_val = 1 if h_i >= 0 else -1
            if new_val != x[i]:
                x[i] = new_val
                changed = True
        if not changed:
            return x, step + 1
    return x, max_steps
# Store 3 patterns of length 64 (imagine 8x8 images)
np.random.seed(42)
patterns = [np.random.choice([-1, 1], size=64) for _ in range(3)]
W = hopfield_store(patterns)
# Corrupt pattern 0 by flipping 25% of bits
corrupted = patterns[0].copy()
flip_idx = np.random.choice(64, size=16, replace=False)
corrupted[flip_idx] *= -1
recalled, steps = hopfield_recall(W, corrupted)
accuracy = np.mean(recalled == patterns[0])
print(f"Flipped 16/64 bits, recalled in {steps} sweeps")
print(f"Recall accuracy: {accuracy:.1%}")
# Flipped 16/64 bits, recalled in 2 sweeps
# Recall accuracy: 100.0%
The network perfectly recovers the original pattern from 25% corruption. No training loop, no loss function — just outer products and sign flips. But why does this work? The answer lies in energy.
3. The Energy Landscape
Every Hopfield network has an energy function:
E(x) = −½ xT W x
This maps every possible state of the network (all 2^N configurations) to a single number. The key theorem is beautiful in its simplicity: asynchronous updates never increase the energy.
Here's the proof in three lines. When neuron i updates, the energy change is:
ΔE = E(xnew) − E(xold) = −Δxi · hi
where Δxi = xinew − xiold and hi = Σj Wij xj is the local field. Since we set xi = sign(hi), the update aligns xi with hi, so Δxi · hi ≥ 0, which means ΔE ≤ 0. Energy can only decrease or stay the same.
Because the state space is finite and energy decreases monotonically, the network must converge to a local minimum. Stored patterns sit in these minima — they're the valleys in the energy landscape. A corrupted pattern starts on a hillside and rolls downhill to the nearest valley.
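The ΔE identity is exact for any symmetric weight matrix with zero diagonal, and it takes a few lines to confirm numerically (a quick sketch with random weights and a random state):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
A = rng.standard_normal((N, N))
W = (A + A.T) / 2        # symmetric weights
np.fill_diagonal(W, 0)   # zero diagonal, as the Hebbian rule requires

x = rng.choice([-1, 1], size=N).astype(float)
E = lambda s: -0.5 * s @ W @ s

i = 7                    # flip an arbitrary neuron
x_new = x.copy()
x_new[i] = -x[i]
delta_x = x_new[i] - x[i]   # = -2 * x[i]
h_i = W[i] @ x

print(np.isclose(E(x_new) - E(x), -delta_x * h_i))  # True
```

Both symmetry and the zero diagonal matter: drop either and the energy argument, and with it the convergence guarantee, falls apart.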
But not every minimum corresponds to a stored pattern. Spurious states emerge — mixture states that blend two or more patterns, and random spin-glass states that don't resemble any pattern at all. These are the price of a simple learning rule.
def hopfield_energy(W, x):
    """Compute energy E = -0.5 * x^T W x."""
    return -0.5 * x @ W @ x

def recall_with_energy(W, state, max_steps=100):
    """Track energy at each sweep during recall."""
    N = len(state)
    x = state.copy()
    energies = [hopfield_energy(W, x)]
    for step in range(max_steps):
        changed = False
        order = np.random.permutation(N)
        for i in order:
            h_i = W[i] @ x
            new_val = 1 if h_i >= 0 else -1
            if new_val != x[i]:
                x[i] = new_val
                changed = True
        energies.append(hopfield_energy(W, x))
        if not changed:
            break
    return x, energies
# Recall from 40% corruption to see energy descent
np.random.seed(7)
corrupted_heavy = patterns[0].copy()
flip_idx = np.random.choice(64, size=26, replace=False)
corrupted_heavy[flip_idx] *= -1
recalled, energies = recall_with_energy(W, corrupted_heavy)
for i, e in enumerate(energies):
    print(f"Sweep {i}: energy = {e:.2f}")
# Sweep 0: energy = -16.00
# Sweep 1: energy = -482.00
# Sweep 2: energy = -482.00
Energy plunges from −16 to −482 in a single sweep as the vast majority of neurons snap to their correct values, then the network is already converged. This monotonic decrease is guaranteed by the math — no hyperparameter tuning, no learning rate schedule. The physics does the work.
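The spurious states mentioned above are easy to exhibit. A classic example is the three-pattern mixture sign(ξ1 + ξ2 + ξ3): it matches none of the stored patterns, yet it sits near a local minimum of its own, shallower than the stored patterns'. A self-contained sketch (the energy ordering, not the exact values, is the point; N is an arbitrary choice):

```python
import numpy as np

# Store three random patterns, then form the mixture state sign(x1 + x2 + x3).
rng = np.random.default_rng(1)
N = 500
pats = [rng.choice([-1, 1], size=N) for _ in range(3)]
W = sum(np.outer(p, p) for p in pats) / 3.0
np.fill_diagonal(W, 0)

energy = lambda s: -0.5 * s @ W @ s
mixture = np.sign(pats[0] + pats[1] + pats[2])  # never zero: odd number of +/-1 terms

print("mixture equals a stored pattern:",
      any((mixture == p).all() for p in pats))  # False
print("mixture energy:", round(energy(mixture), 1))
print("stored-pattern energies:", [round(energy(p), 1) for p in pats])
```

The mixture overlaps each stored pattern by roughly N/2, so its energy is markedly higher (a shallower valley) than any stored pattern's, but it can still trap a recall that starts nearby.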
4. Storage Capacity and the 0.138N Limit
How many patterns can N neurons store? This turns out to have a precise, beautiful answer from statistical mechanics.
When the network is probed with stored pattern μ (so x = ξμ) and the weights are normalized by 1/N, neuron i receives a local field:
hi = ξiμ + (1/N) Σν≠μ ξiν (ξν · ξμ)
The first term is the signal — it pushes neuron i toward the correct value. The second term is cross-talk noise from the other P−1 stored patterns. Each dot product ξν · ξμ is a sum of N independent ±1 products, so by the Central Limit Theorem the total noise term is approximately Gaussian with mean zero and variance roughly (P−1)/N.
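The (P−1)/N variance estimate can be checked by direct simulation (a sketch; the values of N, P, and the sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 1000, 51
samples = []
for _ in range(300):
    xi = rng.choice([-1, 1], size=(P, N))
    # Cross-talk on neuron 0 when recalling pattern 0:
    noise = sum(xi[nu, 0] * (xi[nu] @ xi[0]) for nu in range(1, P)) / N
    samples.append(noise)

print(f"empirical std:          {np.std(samples):.3f}")
print(f"theory sqrt((P-1)/N):   {np.sqrt((P - 1) / N):.3f}")
```

The two numbers should agree to within a few percent, and the empirical histogram is visibly Gaussian, just as the CLT argument predicts.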
Recall fails when noise overwhelms signal — when the cross-talk flips neurons to the wrong value often enough to destabilize the pattern. Amit, Gutfreund, and Sompolinsky showed in 1985, using the replica method from spin-glass theory, that the critical value of the storage ratio α = P/N is:
αc ≈ 0.138
Below this ratio, the network recalls patterns almost perfectly. Above it, recall accuracy collapses catastrophically — not a gradual degradation, but a sharp phase transition. All patterns become corrupted simultaneously, like an overloaded filing cabinet that doesn't just lose the last file but scrambles everything.
def capacity_experiment(N, max_patterns, trials=10):
    """Measure recall accuracy as pattern count increases."""
    ratios, accuracies = [], []
    for P in range(1, max_patterns + 1):
        acc_sum = 0
        for t in range(trials):
            np.random.seed(t * 1000 + P)
            pats = [np.random.choice([-1, 1], size=N) for _ in range(P)]
            W = hopfield_store(pats)
            # Test recall of each pattern from 10% corruption
            for pat in pats:
                corrupted = pat.copy()
                flips = np.random.choice(N, size=max(1, N // 10), replace=False)
                corrupted[flips] *= -1
                recalled, _ = hopfield_recall(W, corrupted, max_steps=50)
                acc_sum += np.mean(recalled == pat)
        avg_acc = acc_sum / (trials * P)
        ratios.append(P / N)
        accuracies.append(avg_acc)
    return ratios, accuracies
ratios, accuracies = capacity_experiment(N=100, max_patterns=25, trials=5)
for r, a in zip(ratios[::3], accuracies[::3]):
    print(f"P/N = {r:.2f}: accuracy = {a:.3f}")
# P/N = 0.01: accuracy = 1.000
# P/N = 0.04: accuracy = 1.000
# P/N = 0.07: accuracy = 1.000
# P/N = 0.10: accuracy = 0.995
# P/N = 0.13: accuracy = 0.996
# P/N = 0.16: accuracy = 0.960
# P/N = 0.19: accuracy = 0.895
# P/N = 0.22: accuracy = 0.855
# P/N = 0.25: accuracy = 0.784
You can see the cliff: near-perfect recall up to P/N ≈ 0.14, then accuracy drops below 0.96 and degrades rapidly. With 100 neurons, you can reliably store about 13–14 patterns. This linear scaling — capacity proportional to N, not N^2 or 2^N — is the fundamental limitation of classical Hopfield networks. Can we do better? We can, but it requires rethinking what a "neuron" does.
5. Continuous Hopfield Networks
Binary neurons are limiting in two ways: they restrict storage to 0.138N patterns, and the binary state space prevents smooth optimization. What if we replace the hard sign() function with something softer?
In the modern formulation by Ramsauer et al. (2020), a continuous Hopfield network stores P patterns ξ1, ..., ξP (now continuous vectors in ℝd) and defines an energy function:
E(x) = −(1/β) log Σμ exp(β ξμ · x) + ½ x · x + const
where β is an inverse temperature parameter. The update rule minimizes this energy by setting:
xnew = Σμ softmax(β ξμ · x) · ξμ
Read that carefully: the new state is a softmax-weighted average of all stored patterns, where the weights depend on how similar each pattern is to the current state. At high β (low temperature), softmax sharpens to a hard argmax and the network snaps to the nearest stored pattern — just like binary Hopfield, but in continuous space.
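The sharpening effect of β is easy to see on a fixed set of similarity scores (a sketch; the scores are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([1.0, 0.5, 0.2])   # pattern similarities xi_mu . x

for beta in [0.1, 1.0, 10.0]:
    w = softmax(beta * scores)
    print(f"beta={beta:>4}: weights = {np.round(w, 3)}")
```

At β = 0.1 the weights are nearly uniform (the retrieved state is a blend of all patterns); at β = 10 virtually all the mass sits on the best-matching pattern, recovering the hard winner-take-all behavior of the binary network.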
The key breakthrough: the log-sum-exp energy function gives the continuous Hopfield network exponential storage capacity. While the binary network stores at most 0.138N patterns, the continuous version can store a number of patterns exponential in the dimension d (on the order of 2^(d/2) under mild assumptions on pattern separation) — exponentially more. The soft attention over patterns creates far richer basins of attraction.
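That claim is easy to probe: store far more patterns than the binary rule could ever hold and test one-step retrieval (a sketch; the pattern count, β, noise level, and success threshold are all illustrative choices):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
d, P = 20, 100                      # 0.138*d would allow only ~3 patterns
pats = rng.standard_normal((P, d))
pats *= np.sqrt(d) / np.linalg.norm(pats, axis=1, keepdims=True)

beta = 4.0
hits = 0
for mu in range(P):
    x = pats[mu] + 0.1 * rng.standard_normal(d)   # lightly corrupted query
    x = softmax(beta * pats @ x) @ pats           # one Hopfield/attention step
    hits += np.linalg.norm(x - pats[mu]) < 0.05
print(f"retrieved {hits}/{P} patterns after one update")
```

With d = 20, the classical 0.138N bound would permit two or three patterns; here nearly all 100 come back from a single update, because random continuous patterns stay well separated in high dimensions.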
def softmax(z):
    """Numerically stable softmax."""
    z_shifted = z - np.max(z)
    e = np.exp(z_shifted)
    return e / np.sum(e)
def continuous_hopfield_energy(patterns, x, beta=1.0):
    """Energy: -(1/beta) * log sum exp(beta * xi . x) + 0.5 * x.x (constants dropped)."""
    dots = np.array([beta * p @ x for p in patterns])
    lse = np.max(dots) + np.log(np.sum(np.exp(dots - np.max(dots))))
    return -lse / beta + 0.5 * np.dot(x, x)

def continuous_hopfield_update(patterns, x, beta=1.0):
    """One update step: x_new = sum softmax(beta * xi . x) * xi"""
    dots = np.array([beta * p @ x for p in patterns])
    weights = softmax(dots)
    return sum(w * p for w, p in zip(weights, patterns))

# Store 5 patterns in R^20, retrieve from noisy query
np.random.seed(42)
d = 20
stored = [np.random.randn(d) for _ in range(5)]
# Normalize patterns for cleaner retrieval
stored = [p / np.linalg.norm(p) * np.sqrt(d) for p in stored]
# Corrupt pattern 0 with substantial noise
query = stored[0] + np.random.randn(d) * 1.5
print(f"Initial distance to pattern 0: {np.linalg.norm(query - stored[0]):.3f}")
# Iterate updates (lower beta = softer attention, shows gradual convergence)
x = query.copy()
energies = [continuous_hopfield_energy(stored, x, beta=0.5)]
for step in range(5):
    x = continuous_hopfield_update(stored, x, beta=0.5)
    energies.append(continuous_hopfield_energy(stored, x, beta=0.5))
    dist = np.linalg.norm(x - stored[0])
    print(f"Step {step+1}: dist_to_p0={dist:.4f}")
print(f"Energy non-increasing: {bool(np.all(np.diff(energies) <= 1e-9))}")
# Initial distance to pattern 0: 6.693
# Step 1: dist_to_p0=0.7075
# Step 2: dist_to_p0=0.0197
# Step 3: dist_to_p0=0.0063
# Step 4: dist_to_p0=0.0061
# Step 5: dist_to_p0=0.0061
# Energy non-increasing: True
By step 3, the continuous Hopfield network has essentially converged, with distance dropping from 6.7 to 0.006, and the energy check confirms the monotone descent that Ramsauer et al. prove holds for every update. (The 1/β factor on the log-sum-exp term matters here; without it, the update rule is not a descent step on the energy.) With a higher β (sharper softmax), convergence is even faster; a single step often suffices. But look at the update rule one more time: xnew = Σ softmax(β ξμ · x) · ξμ. If you've ever implemented a Transformer, this should look very familiar.
6. The Transformer Connection
This is the big reveal. Let's write the continuous Hopfield update and the Transformer's attention mechanism side by side:
Hopfield update: xnew = Σμ softmax(β ξμ · x) · ξμ
Attention: output = Σj softmax(q · kj / √d) · vj
These are the same operation. The mapping:
- Query q = current state x
- Keys kj = stored patterns ξμ
- Values vj = stored patterns ξμ (in the self-retrieval case; in general keys and values can differ)
- Scaling 1/√d = inverse temperature β
Ramsauer et al. proved this equivalence rigorously in their 2020 paper "Hopfield Networks is All You Need" (yes, the title is a wink at Vaswani et al.). The implications are profound:
- Attention has convergence guarantees. Each attention layer minimizes an energy function. The Transformer doesn't just "learn useful features" — it's performing energy minimization with provable convergence.
- Multi-head attention = parallel Hopfield retrievals. Each head retrieves from the same stored patterns (keys/values) but with a different query, exploring different aspects of the associative memory.
- Exponential storage capacity. The softmax-based energy function gives Transformers the capacity to store exponentially many patterns — explaining their remarkable ability to memorize and retrieve from massive training corpora.
- Layer normalization ≈ constraining states to a sphere. This regularizes the energy landscape and prevents states from drifting to regions where the energy function is flat.
Let's verify this numerically. We'll run both computations on the same data and confirm they produce identical results:
def attention(query, keys, values, d_k):
    """Standard scaled dot-product attention for a single query."""
    scores = np.array([query @ k / np.sqrt(d_k) for k in keys])
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))
# Setup: 4 stored patterns in R^8
np.random.seed(123)
d = 8
patterns = [np.random.randn(d) for _ in range(4)]
query = np.random.randn(d)
# Hopfield update with beta = 1/sqrt(d)
beta = 1.0 / np.sqrt(d)
hopfield_out = continuous_hopfield_update(patterns, query, beta=beta)
# Attention with same data: keys = values = patterns
attn_out = attention(query, keys=patterns, values=patterns, d_k=d)
# Compare
print(f"Hopfield output: [{', '.join(f'{v:.4f}' for v in hopfield_out[:4])}...]")
print(f"Attention output: [{', '.join(f'{v:.4f}' for v in attn_out[:4])}...]")
print(f"Max difference: {np.max(np.abs(hopfield_out - attn_out)):.2e}")
# Hopfield output: [-0.8643, 0.5116, 0.4454, -1.3278...]
# Attention output: [-0.8643, 0.5116, 0.4454, -1.3278...]
# Max difference: 2.22e-16
The difference is 2.22×10^−16 — machine epsilon for 64-bit floats, i.e. round-off at the last bit. The Hopfield continuous update and scaled dot-product attention are not just similar — they are mathematically identical. Every time a Transformer processes a token, it's performing associative memory retrieval from a Hopfield network. Every attention head is a parallel pattern-completion engine.
This gives the Transformer something remarkable: a theoretical foundation rooted in statistical physics, not just empirical success. The architecture that powers GPT, BERT, and every modern LLM isn't a clever engineering trick — it's the natural outcome of energy minimization in a continuous associative memory.
7. The Modern Hopfield Renaissance
The Hopfield-to-Transformer connection didn't appear overnight. In 2016, Krotov and Hopfield proposed dense associative memories with polynomial energy functions that achieve super-linear storage capacity. This was the stepping stone to the exponential capacity of the softmax formulation.
In October 2024, John Hopfield shared the Nobel Prize in Physics with Geoffrey Hinton "for foundational discoveries and inventions that enable machine learning with artificial neural networks." The Nobel committee specifically cited Hopfield's 1982 paper — recognizing that the physics-to-computation bridge he built is one of the most consequential ideas in science.
Modern applications extend the Hopfield framework in several directions. Memory-augmented Transformers (like those with retrieval-augmented generation) can be understood as Hopfield networks with external pattern stores. The energy-based interpretation gives us tools to analyze attention failure modes — when patterns are too similar, the energy landscape has shallow minima, explaining why Transformers can struggle with fine-grained distinctions. And Hopfield layers are being integrated directly into architectures for tasks requiring explicit memory retrieval.
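A minimal sketch of that last idea: one continuous-Hopfield step over an external store where keys and values differ (the `hopfield_layer` name, shapes, and data here are illustrative, not from any particular library):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def hopfield_layer(query, keys, values, beta=8.0):
    """One continuous-Hopfield step over an external key-value store."""
    return softmax(beta * keys @ query) @ values

# Tiny store: 4 keys in R^4 mapped to 2-d values.
keys = 3.0 * np.eye(4)
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])

noisy_key = keys[2] + 0.1 * np.array([0.3, -0.5, 0.2, 0.4])
print(hopfield_layer(noisy_key, keys, values))  # ~ [1. 1.]
```

At high β the softmax is effectively an argmax over keys, so the layer behaves like a differentiable nearest-neighbor lookup — the heteroassociative cousin of the self-retrieval case used throughout this post.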
For related perspectives on these ideas, see our posts on energy-based models (Hopfield as EBM precursor), attention mechanisms (the engineering view), and Transformers (the full architecture).
8. Conclusion
We've traveled from 1982 to 2024 in eight sections: binary Hopfield networks with their elegant Hebbian learning rule, the energy landscape that guarantees convergence, the 0.138N capacity wall from statistical mechanics, the continuous reformulation that shatters that wall with exponential capacity, and finally the revelation that attention — the beating heart of every modern LLM — is a Hopfield network update.
The deepest insight isn't just the mathematical equivalence. It's that John Hopfield, working at the intersection of physics and neuroscience four decades ago, discovered the computational primitive that would eventually power the AI revolution. Every time GPT completes your sentence, it's performing pattern completion in an associative memory — doing exactly what Hopfield showed a recurrent network could do, but with continuous activations and exponentially more storage.
Computation is energy minimization. Hopfield knew it in 1982. Now we all do.
References & Further Reading
- Hopfield (1982) — "Neural networks and physical systems with emergent collective computational abilities" — the original paper that started it all
- Amit, Gutfreund & Sompolinsky (1985) — "Storing Infinite Numbers of Patterns in a Spin-Glass Model of Neural Networks" — rigorous derivation of the 0.138N capacity limit
- Ramsauer et al. (2020) — "Hopfield Networks is All You Need" — the modern continuous formulation and the Transformer equivalence proof
- Krotov & Hopfield (2016) — "Dense Associative Memory for Pattern Recognition" — polynomial energy functions with super-linear capacity
- Hopfield & Tank (1985) — "Neural Computation of Decisions in Optimization Problems" — continuous Hopfield networks and optimization
- Vaswani et al. (2017) — "Attention Is All You Need" — the Transformer architecture, unknowingly a deep Hopfield network
- Nobel Prize Committee (2024) — Physics prize citation for Hopfield and Hinton — recognition of the physics-to-AI bridge
- Millidge, Salvatori & Lukasiewicz (2022) — "Universal Hopfield Networks" — a general framework unifying associative memory models