Geometric Deep Learning from Scratch
1. Symmetry as a Design Principle
A cat is still a cat whether it sits in the top-left or bottom-right of an image. A molecule is the same molecule no matter how you number its atoms. A point cloud is the same shape no matter the order in which you list its points. These are symmetries — and they are the single most powerful design principle in all of deep learning.
Every architecture you have encountered in this series was quietly exploiting a symmetry. Convolutional networks share weights across spatial positions because where a feature appears should not matter for detecting it. Graph neural networks aggregate messages from neighbors in an order-independent way because labeling node 3 as node 7 should not change the prediction. Transformers process token sets through attention that treats every position on equal footing (before positional encoding injects order).
These are not accidental design choices. Each one is the unique mathematical consequence of respecting a specific symmetry. Geometric Deep Learning, formalized by Bronstein et al. in 2021, reveals this unifying principle: specify the symmetry of your data, and the architecture follows.
To make this precise, we need two definitions:
- Invariance: the output does not change when you transform the input. Example: image classification is translation-invariant — shifting the whole image should not change the class label.
- Equivariance: the output transforms predictably when you transform the input. Example: feature detection is translation-equivariant — shifting the image shifts the feature map by the same amount.
Invariance is what you want at the end (the cat label should not move). Equivariance is what you want in intermediate layers (the feature map should move with the cat). Let us see this in code:
import numpy as np
# A simple 1D signal
signal = np.array([0, 0, 1, 3, 5, 3, 1, 0, 0, 0], dtype=float)
# A shift (translation) by 2 positions to the right
def shift(x, k):
    return np.roll(x, k)
shifted = shift(signal, 2) # [0, 0, 0, 0, 1, 3, 5, 3, 1, 0]
# INVARIANT operation: sum
print(f"sum(signal) = {signal.sum()}") # 13.0
print(f"sum(shifted) = {shifted.sum()}") # 13.0 -- same!
# EQUIVARIANT operation: convolution with a kernel
kernel = np.array([1, -1]) # edge detector
conv_then_shift = shift(np.convolve(signal, kernel, mode='same'), 2)
shift_then_conv = np.convolve(shifted, kernel, mode='same')
print(f"conv then shift: {conv_then_shift}")
print(f"shift then conv: {shift_then_conv}")
# They match! Convolution commutes with translation.
Summation is invariant: the total does not care about position. Convolution is equivariant: shifting the input shifts the output. This distinction — invariance for final predictions, equivariance for intermediate representations — is the foundation everything else builds on.
2. Groups, Actions, and Representations
To say “this architecture respects that symmetry,” we need a language for symmetries. That language is group theory — but do not let the name intimidate you. A group is just a collection of transformations with three properties: you can compose any two of them, there is a do-nothing transformation (the identity), and every transformation has an undo (an inverse).
Four groups power nearly all of modern deep learning:
- T(2) — translations on a 2D grid. Slide an image left, right, up, down. This is the symmetry group of CNNs.
- Sₙ — permutations of n elements. Relabel graph nodes or reorder set elements. This is the symmetry group of GNNs and Transformers.
- C4 — four rotations by 0°, 90°, 180°, 270°. This is the symmetry group of rotation-equivariant CNNs.
- SO(2) — all rotations in the plane. This is the continuous rotation group, used in steerable networks.
A group action is how a group element transforms data. Translating an image, permuting graph nodes, rotating coordinates — these are all group actions. The key insight: every group action can be written as matrix multiplication. This means we can check equivariance with linear algebra.
import numpy as np
# --- C4: 90-degree rotations on a 4x4 image ---
image = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[0, 0, 0, 0],
[0, 0, 0, 0]])
rot90 = np.rot90(image, k=1) # 90 degrees counterclockwise
rot180 = np.rot90(image, k=2) # 180 degrees
rot270 = np.rot90(image, k=3) # 270 degrees
# Group axioms: rot90 composed 4 times = identity
assert np.array_equal(np.rot90(image, k=4), image)
# --- S_n: permutations on a graph adjacency matrix ---
# Path graph: 0-1-2
A = np.array([[0, 1, 0],
[1, 0, 1],
[0, 1, 0]])
# Permutation: swap node 0 and node 2
P = np.array([[0, 0, 1],
[0, 1, 0],
[1, 0, 0]])
A_permuted = P @ A @ P.T # Correct way to permute a graph
# Node features transform as: x' = P @ x
# Adjacency transforms as: A' = P @ A @ P^T
print(f"Original A:\n{A}")
print(f"Permuted A:\n{A_permuted}")
# Structure preserved: same edges, different labeling
Notice the asymmetry: node features transform as x′ = Px (one permutation matrix), but the adjacency matrix transforms as A′ = PAPᵀ (sandwiched between two). This is because adjacency encodes pairwise relationships — both the row index and column index must be permuted. Understanding how data transforms under the group is half the battle. The other half is building layers that respect it.
3. CNNs from Symmetry: Deriving Convolution
Here is the central miracle of geometric deep learning. We are going to derive convolution — not invent it, not motivate it, but prove that it is the only possible answer to a simple question: “What is the most general linear map that commutes with translation?”
Start with a 1D signal of length n. A general linear map is an n×n weight matrix W that maps input x to output y = Wx. Translation by one position is a shift matrix T that moves every element one step to the right (with wraparound).
We demand equivariance: W·(Tx) = T·(Wx) for all x. This means WT = TW — the weight matrix must commute with the shift.
What matrices commute with all cyclic shifts? Exactly the circulant matrices — matrices where each row is a shifted copy of the first row. And multiplication by a circulant matrix is exactly circular convolution with a kernel defined by that first row.
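Both facts are easy to check numerically. Here is a quick sketch, with a random kernel standing in for learned weights: build a circulant matrix from its kernel, verify that multiplying by it is circular convolution (via the FFT convolution theorem), and verify that it commutes with the cyclic shift:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)   # the kernel (random stand-in for learned weights)
x = rng.standard_normal(n)

# Circulant matrix: C[i, j] = c[(i - j) mod n]; each row is the
# previous row shifted one step to the right.
first_row = c[(-np.arange(n)) % n]
C = np.array([np.roll(first_row, i) for i in range(n)])

# Multiplying by C is circular convolution with c.
# Check via the convolution theorem: fft(c * x) = fft(c) * fft(x).
conv = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))
print(np.allclose(C @ x, conv))           # True

# And C commutes with the cyclic shift, as equivariance demands.
shift = np.roll(np.eye(n), 1, axis=0)     # T: e_i -> e_{i+1}
print(np.allclose(C @ shift, shift @ C))  # True
```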
The parameter reduction is staggering. For a 28×28 image (784 pixels), an unconstrained linear map has 784² = 614,656 parameters. A translation-equivariant linear map with a 3×3 kernel has just 9 parameters. Symmetry did not just suggest an architecture — it compressed it by a factor of 68,000.
import numpy as np
def shift_matrix(n):
    """Cyclic shift matrix: moves element i to position (i+1) % n."""
    T = np.zeros((n, n))
    for i in range(n):
        T[(i + 1) % n, i] = 1.0
    return T
n = 8
T = shift_matrix(n)
# Start with a random weight matrix
W_random = np.random.randn(n, n)
# Project onto the space of matrices that commute with T.
# A matrix commutes with all cyclic shifts iff it is circulant.
# Circulant matrices are diagonal in the Fourier basis.
F = np.fft.fft(np.eye(n), axis=0) / np.sqrt(n) # DFT matrix
F_inv = np.conj(F).T
# Project: zero out off-diagonal elements in Fourier domain
W_fourier = F @ W_random @ F_inv
W_diag = np.diag(np.diag(W_fourier)) # keep diagonal only
W_equivariant = np.real(F_inv @ W_diag @ F) # back to spatial
# Verify: W_equivariant commutes with T
commutator = W_equivariant @ T - T @ W_equivariant
print(f"Max commutator error: {np.max(np.abs(commutator)):.2e}") # ~0
# The equivariant matrix is circulant: each row is a shifted first row
print(f"First row (the kernel): {np.round(W_equivariant[0], 2)}")
print(f"Parameters: {n} (kernel) vs {n*n} (unconstrained)")
# Convolution IS the unique translation-equivariant linear map!
Read that result carefully. We started with a completely arbitrary weight matrix, asked “which part of this respects translation symmetry?”, and got convolution. We did not choose convolution — group theory chose it for us. Every CNN ever trained was implementing the unique answer to a symmetry constraint.
4. GNNs from Symmetry: Message Passing Emerges
Graphs have permutation symmetry: relabeling nodes should not change the computation. But unlike sets, graphs have structure — an adjacency matrix that tells us which nodes are connected. So the symmetry group is not all permutations of n nodes, but only those permutations that preserve the graph structure (the automorphism group).
The question is the same as before: what is the most general linear map on node features that commutes with permutations respecting the graph? The answer: it must be a polynomial in the adjacency matrix.
Why? Any polynomial in A automatically commutes with every permutation P that preserves the graph: if PAPᵀ = A, then P·p(A)·Pᵀ = p(PAPᵀ) = p(A). For many graph families the converse holds as well — the algebra of matrices commuting with all such permutations is spanned by the powers of A: the identity I, A, A², and so on.
And Aᵏ has a beautiful interpretation: entry (i, j) of Aᵏ counts the number of walks of length k from node i to node j. So W = c0I + c1A + c2A² + … computes a weighted sum of features reachable via 0-hop (self), 1-hop (neighbors), 2-hop (neighbors of neighbors), and so on. This is multi-hop message passing.
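The walk-counting fact is easy to verify on a small example (a hypothetical graph: a triangle 0-1-2 with a tail edge 2-3):

```python
import numpy as np

# Triangle 0-1-2 plus a tail edge 2-3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

A2 = A @ A
# A2[i, j] counts walks of length 2 from node i to node j
print(A2[0, 0])  # 2: the walks 0->1->0 and 0->2->0
print(A2[0, 3])  # 1: the walk 0->2->3
```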
The GCN layer from Kipf & Welling (2017) is a special case: W = c0I + c1D^(-1/2)AD^(-1/2), using just the 1-hop neighborhood with degree normalization.
import numpy as np
# 5-node path graph: 0-1-2-3-4
A = np.array([[0,1,0,0,0],
[1,0,1,0,0],
[0,1,0,1,0],
[0,0,1,0,1],
[0,0,0,1,0]], dtype=float)
# Node features: one-hot identity
X = np.eye(5)
# Random unconstrained weight matrix
W_random = np.random.RandomState(42).randn(5, 5)
# Permutation: reverse node ordering (0<->4, 1<->3, 2 stays)
P = np.eye(5)[::-1]
# Test equivariance: does W @ (P @ X) == P @ (W @ X)?
lhs = W_random @ (P @ X)
rhs = P @ (W_random @ X)
print(f"Random W equivariant? {np.allclose(lhs, rhs)}") # False!
# Equivariant alternative: polynomial in A
c0, c1, c2 = 0.5, 0.3, 0.1
W_equiv = c0 * np.eye(5) + c1 * A + c2 * (A @ A)
# This graph is symmetric under reversal, so P @ A @ P.T == A
lhs = W_equiv @ (P @ X)
rhs = P @ (W_equiv @ X)
print(f"Poly(A) equivariant? {np.allclose(lhs, rhs)}") # True!
# W_equiv IS message passing: node 2 aggregates over its 0-, 1-, and 2-hop neighborhoods
print(f"Node 2 receives from: {np.round(W_equiv[2], 2)}")
# [0.1, 0.3, 0.7, 0.3, 0.1] -- weighted by hop distance!
The random matrix fails the equivariance test because it treats specific node indices as meaningful. The polynomial in A passes because it only cares about graph structure — how nodes are connected, not how they are labeled. This is exactly what we want from a graph neural network.
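The degree-normalized GCN form mentioned earlier can be sketched the same way. A minimal propagation step on the same 5-node path graph, with hypothetical coefficients c0 and c1, stays permutation-equivariant:

```python
import numpy as np

# The 5-node path graph 0-1-2-3-4 again
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

# Symmetric degree normalization: D^(-1/2) A D^(-1/2)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt

# One propagation step W = c0*I + c1*A_norm (coefficients are hypothetical)
c0, c1 = 0.5, 0.5
W = c0 * np.eye(5) + c1 * A_norm
X = np.eye(5)                 # one-hot node features

# The path is symmetric under reversal, so equivariance still holds
P = np.eye(5)[::-1]
lhs = W @ (P @ X)
rhs = P @ (W @ X)
print(np.allclose(lhs, rhs))  # True
```

Normalizing by degree keeps the magnitude of aggregated messages comparable across high- and low-degree nodes without sacrificing equivariance, since D transforms the same way as A under any automorphism.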
5. Transformers from Symmetry: Attention on Sets
A set is a graph with no edges — or equivalently, a complete graph where every node connects to every other. The symmetry group is the full permutation group Sₙ: any reordering of elements is valid.
Zaheer et al. (2017) proved a clean characterization: any permutation-invariant function on a set has the form f(X) = ρ(Σᵢ φ(xᵢ)) — apply a function to each element, sum, then transform. This is the DeepSets architecture.
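A minimal sketch of this form, with random weights standing in for learned φ and ρ, shows the invariance directly:

```python
import numpy as np

def deepsets(X, W_phi, W_rho):
    """Sketch of f(X) = rho(sum_i phi(x_i)); weights are hypothetical."""
    phi = np.maximum(X @ W_phi, 0.0)   # per-element phi (a ReLU layer)
    pooled = phi.sum(axis=0)           # permutation-invariant sum pooling
    return pooled @ W_rho              # rho on the pooled summary

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))        # a set of 5 elements in R^3
W_phi = rng.standard_normal((3, 8))
W_rho = rng.standard_normal((8, 2))

perm = rng.permutation(5)
print(np.allclose(deepsets(X, W_phi, W_rho),
                  deepsets(X[perm], W_phi, W_rho)))  # True -- order is irrelevant
```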
But what about equivariant set functions, where permuting inputs should permute outputs? Self-attention is exactly this. Each output token is a weighted combination of all input tokens (via query-key-value), and permuting the input permutes the output in the same way. No architectural trick — attention is equivariant by construction.
This explains a deep fact about Transformers: without positional encoding, a Transformer is a set processor. It treats its input as an unordered collection. Positional encoding is how we deliberately break permutation symmetry to inject sequential structure. For language, word order matters — “dog bites man” differs from “man bites dog.” For sets (like point clouds), we leave permutation symmetry intact.
import numpy as np
def self_attention(X, W_q, W_k, W_v):
    """Minimal self-attention: X is (n_tokens, d_model)."""
    Q = X @ W_q  # queries
    K = X @ W_k  # keys
    V = X @ W_v  # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V
np.random.seed(42)
d = 4
W_q = np.random.randn(d, d) * 0.5
W_k = np.random.randn(d, d) * 0.5
W_v = np.random.randn(d, d) * 0.5
# 3 tokens, each of dimension 4
X = np.random.randn(3, d)
# Permutation: swap token 0 and token 2
P = np.array([[0,0,1],[0,1,0],[1,0,0]], dtype=float)
# Test: attention(P @ X) == P @ attention(X)?
out_original = self_attention(X, W_q, W_k, W_v)
out_permuted_input = self_attention(P @ X, W_q, W_k, W_v)
out_permuted_output = P @ out_original
print(f"Equivariant? {np.allclose(out_permuted_input, out_permuted_output)}")
# True! Self-attention commutes with permutations.
Self-attention is to sets what convolution is to grids and message passing is to graphs: the natural equivariant operation for the domain’s symmetry group. This is the unifying insight of geometric deep learning.
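To see the symmetry being deliberately broken, we can add a positional encoding and rerun the same check. This sketch uses random weights and a random stand-in for the positional encoding:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Same minimal self-attention as above (weights are random stand-ins)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s = s - s.max(axis=-1, keepdims=True)   # numerically stable softmax
    a = np.exp(s) / np.exp(s).sum(axis=-1, keepdims=True)
    return a @ V

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.5 for _ in range(3))
X = rng.standard_normal((3, d))
pos = rng.standard_normal((3, d)) * 0.5     # stand-in positional encoding
P = np.eye(3)[[2, 1, 0]]                    # swap tokens 0 and 2

# Bare attention: permutation-equivariant
print(np.allclose(self_attention(P @ X, Wq, Wk, Wv),
                  P @ self_attention(X, Wq, Wk, Wv)))        # True

# With positions added before attention, the symmetry is broken:
# permuting tokens no longer just permutes the output
print(np.allclose(self_attention(P @ X + pos, Wq, Wk, Wv),
                  P @ self_attention(X + pos, Wq, Wk, Wv)))  # False
```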
6. Steerable CNNs and Rotation Equivariance
Standard CNNs respect translation but ignore rotation. A horizontal edge detector activates strongly on horizontal edges and weakly on vertical ones — even though a vertical edge is just a 90° rotation away. The network must learn separate detectors for each orientation, wasting capacity.
Data augmentation (rotating training images) teaches approximate rotation tolerance, but it is brute force — the network still maintains redundant features internally.
Group convolution (Cohen & Welling, 2016) solves this elegantly: instead of convolving the input with the kernel in one orientation, convolve with all rotated copies of the kernel. For the C4 group (90° rotations), each kernel produces four output channels — one per rotation. The output is no longer a function on the grid alone but a function on the roto-translation group: it tells you not just where a feature is, but at which orientation.
Subsequent layers convolve over both position and orientation, maintaining equivariance throughout. The result: rotate the input image by 90° and the entire intermediate representation rotates the same way, all the way through the network. Only at the final pooling layer do we collapse orientation into an invariant prediction.
import numpy as np
def group_conv_c4(image, kernel):
    """C4 group convolution: convolve with 4 rotated copies of kernel."""
    h, w = image.shape
    kh, kw = kernel.shape
    pad = kh // 2
    padded = np.pad(image, pad, mode='wrap')
    outputs = []
    for r in range(4):  # 0, 90, 180, 270 degrees
        rotated_kernel = np.rot90(kernel, k=r)
        out = np.zeros((h, w))
        for i in range(h):
            for j in range(w):
                patch = padded[i:i+kh, j:j+kw]
                out[i, j] = np.sum(patch * rotated_kernel)
        outputs.append(out)
    return np.stack(outputs)  # shape: (4, h, w)
# Horizontal edge detector
kernel = np.array([[-1, -1, -1],
[ 0, 0, 0],
[ 1, 1, 1]], dtype=float)
# Simple test image with a horizontal bar
image = np.zeros((8, 8))
image[3:5, 1:7] = 1.0
# Group convolution: 4 orientation channels
result = group_conv_c4(image, kernel)
print(f"Output shape: {result.shape}") # (4, 8, 8)
# Rotation 0: strong response (horizontal kernel on horizontal bar)
# Rotation 1: weak response (vertical kernel on horizontal bar)
print(f"Channel 0 (0 deg) max: {result[0].max():.1f}") # strong
print(f"Channel 1 (90 deg) max: {result[1].max():.1f}") # weak
print(f"Channel 2 (180 deg) max: {np.abs(result[2]).max():.1f}") # strong
print(f"Channel 3 (270 deg) max: {result[3].max():.1f}") # weak
# Rotate input 90 deg --> output channels cycle by 1 position
The four output channels form a representation of C4: rotating the input by 90° cycles the channels. Channel 0 becomes channel 1, channel 1 becomes channel 2, and so on. The feature map now tracks both what was detected and at which orientation — and it all transforms correctly under rotation.
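This channel-cycling behavior can be checked numerically. The sketch below re-implements the same C4 group convolution compactly and verifies that rotating the input rotates every feature map and cycles the orientation channels by one position:

```python
import numpy as np

def group_conv_c4(image, kernel):
    """Compact re-implementation of the C4 group convolution above."""
    h, w = image.shape
    kh, kw = kernel.shape
    padded = np.pad(image, kh // 2, mode='wrap')  # cyclic boundary
    return np.stack([
        np.array([[np.sum(padded[i:i+kh, j:j+kw] * np.rot90(kernel, k=r))
                   for j in range(w)] for i in range(h)])
        for r in range(4)
    ])

kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]], dtype=float)
image = np.zeros((8, 8))
image[3:5, 1:7] = 1.0

out = group_conv_c4(image, kernel)
out_rot = group_conv_c4(np.rot90(image), kernel)

# Rotating the input by 90 degrees rotates each feature map AND cycles
# the orientation axis: channel r of the rotated input matches the
# rotated channel (r - 1) mod 4 of the original.
print(all(np.allclose(out_rot[r], np.rot90(out[(r - 1) % 4]))
          for r in range(4)))  # True
```

The cyclic (wrap) padding matters here: it makes translations exactly cyclic, so the equivariance check holds to machine precision rather than only approximately at the borders.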
Steerable CNNs (Weiler & Cesa, 2019) extend this idea to continuous rotations SO(2) using representation theory to constrain the kernels analytically, achieving exact equivariance without discretizing rotations. In practice, rotation-equivariant networks match the accuracy of standard CNNs with about one-quarter the training data on tasks with rotational symmetry, like aerial image segmentation and medical imaging.
Try It: Equivariance Explorer
Draw on the 8×8 grid, then apply transformations. Watch how the convolution output transforms equivariantly (green check) while an unconstrained linear map does not (red X).
7. The Geometric Deep Learning Blueprint
We have derived three architectures from three symmetries. Bronstein et al. (2021) crystallize this into a five-component blueprint that generates any equivariant architecture:
- Domain Ω — where does your data live? A grid, a graph, a set, a group, a manifold?
- Symmetry group G — what transformations leave the domain’s structure unchanged? Translations, permutations, rotations?
- Signal / features — functions on the domain: pixel intensities, node embeddings, point coordinates.
- Equivariant maps — layers that commute with G: convolution, message passing, attention, group convolution.
- Coarsening / pooling — local invariant aggregation that reduces resolution: stride, graph coarsening, set reduction.
Specify the first two components and the rest follows. Here is the unified view:
| Domain | Group | Equiv. Layer | Architecture |
|---|---|---|---|
| Grid | Translation T(2) | Convolution | CNN |
| Graph | Permutation (auto.) | Message Passing | GNN / GCN |
| Set | Permutation Sₙ | Attention / Sum | Transformer / DeepSets |
| Grid + Rotation | T(2) ⋊ C4 | Group Conv. | G-CNN |
| 3D Points | E(3) | Equivariant MP | EGNN |
The frontier pushes further. EGNN (Satorras et al., 2021) builds message passing equivariant to the Euclidean group E(3) — translations, rotations, and reflections in 3D space — enabling molecular dynamics simulation where predictions must not depend on the coordinate frame. SE(3)-Transformers (Fuchs et al., 2020) combine self-attention with SE(3) equivariance for protein structure prediction. Gauge equivariance on manifolds (Weiler et al.) handles data on curved surfaces like the Earth or a brain cortex, where even the notion of “direction” varies from point to point.
Try It: GDL Blueprint Builder
Select a domain to see its symmetry group, equivariant layer, and parameter count. Watch the symmetry group act on sample data.
8. Why Symmetry Matters: Generalization and Parsimony
Symmetry is not just an aesthetic preference — it has a direct, measurable impact on generalization. When a layer is equivariant to a group G, the number of independent parameters shrinks by roughly a factor of |G|. For a 3×3 convolution on a 28×28 image, that is a 68,000× reduction. Fewer parameters means a smaller hypothesis space, which means less overfitting and better generalization from limited data.
This is the bias-variance tradeoff in its purest form. An unconstrained model (no symmetry) has maximum variance — it can fit anything but generalizes poorly. A fully constrained model (wrong symmetry) has maximum bias — it cannot fit the data at all. The sweet spot is the correct symmetry: enough constraint to generalize, enough freedom to express the target function.
Kondor & Trivedi (2018) made this formal: the generalization error of an equivariant model scales with the number of free parameters after accounting for symmetry, not before. This explains why CNNs generalize so well from small datasets — translation equivariance eliminates 99.99% of the parameter space, leaving only the parameters that matter.
The lesson of geometric deep learning is not “use more data” or “use bigger models” — it is “use the right symmetry.” The architecture should match the structure of the problem. When it does, everything else — parameter efficiency, generalization, interpretability — follows for free.
References & Further Reading
- Bronstein et al. — Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges (2021) — the foundational survey that unifies deep learning through symmetry
- Cohen & Welling — Group Equivariant Convolutional Networks (ICML 2016) — introduced group convolution for discrete rotation equivariance
- Weiler & Cesa — General E(2)-Equivariant Steerable CNNs (NeurIPS 2019) — exact continuous rotation equivariance via representation theory
- Maron et al. — Invariant and Equivariant Graph Networks (ICLR 2019) — higher-order equivariant architectures for graphs
- Zaheer et al. — Deep Sets (NeurIPS 2017) — proved the universal form of permutation-invariant functions
- Kipf & Welling — Semi-Supervised Classification with GCNs (ICLR 2017) — the spectral GCN that popularized graph neural networks
- Satorras et al. — E(n) Equivariant Graph Neural Networks (ICML 2021) — Euclidean-equivariant message passing for molecular systems
- Fuchs et al. — SE(3)-Transformers (NeurIPS 2020) — self-attention equivariant to 3D rotations and translations
- Kondor & Trivedi — On the Generalization of Equivariance and Convolution (ICML 2018) — formal generalization bounds for equivariant networks