
Graph Neural Networks from Scratch: How AI Learns to Reason About Relationships

The Missing Data Structure

Every model we've built in this series assumes data lives on a nice, orderly structure. CNNs slide filters over pixel grids. Transformers process token sequences. Even state space models expect data to arrive one step at a time along a timeline. Grids and sequences are comfortable — they have a fixed shape, a natural ordering, and a predictable neighborhood.

But most of the world isn't gridded or sequential. A molecule is a tangle of atoms bonded in arbitrary patterns. A social network is a web of friendships with no canonical "first person." A codebase is a sprawl of functions calling other functions across files. A knowledge graph links concepts through typed relationships. Try flattening any of these into a vector or a sequence and you'll lose the very structure that makes them interesting.

This is the data structure we've been ignoring: the graph. Nodes connected by edges, with no fixed ordering and wildly varying neighborhoods. You can't just feed a graph into a CNN or a transformer — you need a fundamentally different kind of neural network, one that respects the graph's topology.

That's what we'll build today. Starting from pure Python and NumPy, we'll implement three families of Graph Neural Networks: GCN, GraphSAGE, and GAT. All three are variations on a single elegant idea called message passing — and by the end, you'll see that transformers and CNNs are just special cases of this more general framework.

Graphs as Data

Before we can build a neural network for graphs, we need to represent graphs as numbers. A graph has two core components: nodes (the entities) and edges (the connections between them). In code, we represent these with two matrices.

The adjacency matrix A is an N×N binary matrix where A[i][j] = 1 if there's an edge from node i to node j. For undirected graphs, A is symmetric. The node feature matrix X is N×F, where each row holds an F-dimensional feature vector describing that node — an atom's element type, a person's profile, a word's embedding.

import numpy as np

# A small social network: 6 people
# Edges: 0-1, 0-2, 1-2, 1-3, 3-4, 3-5, 4-5
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

# Each person has a 3-dimensional feature vector
X = np.array([
    [1.0, 0.0, 0.2],  # Node 0
    [0.8, 0.3, 0.1],  # Node 1
    [0.9, 0.1, 0.3],  # Node 2
    [0.1, 0.8, 0.5],  # Node 3
    [0.0, 0.9, 0.7],  # Node 4
    [0.2, 0.7, 0.6],  # Node 5
])

# Degree matrix: how many connections each node has
D = np.diag(A.sum(axis=1))  # D[i,i] = number of neighbors of node i
print("Degrees:", A.sum(axis=1))  # [2, 3, 2, 3, 2, 2]

Notice something: nodes 0-1-2 form a tight cluster, and nodes 3-4-5 form another. Node 1 and node 3 are the bridge between them. A good graph neural network should discover this structure from the features and connectivity alone. Also notice that the feature vectors already hint at two groups — nodes 0-2 have high first-feature values, nodes 3-5 have high second-feature values. The graph structure and the features both carry signal, and a GNN combines them.

The degree matrix D counts each node's connections. It becomes crucial for normalization — a node with 100 neighbors shouldn't have 50× the magnitude of a node with 2 neighbors just because it sums more messages.
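To see that normalization in action before building a full layer, here's a quick sketch (reusing the A and X defined above): multiplying by the inverse degree matrix turns summed messages into averaged ones.

```python
import numpy as np

# The 6-node social network and features from above
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
X = np.array([
    [1.0, 0.0, 0.2],
    [0.8, 0.3, 0.1],
    [0.9, 0.1, 0.3],
    [0.1, 0.8, 0.5],
    [0.0, 0.9, 0.7],
    [0.2, 0.7, 0.6],
])

D_inv = np.diag(1.0 / A.sum(axis=1))  # inverse degree matrix
summed = A @ X             # raw sums: magnitude grows with degree
averaged = D_inv @ A @ X   # means: magnitude independent of degree

# Node 0 has neighbors 1 and 2, so its averaged message is (X[1] + X[2]) / 2
```

This D⁻¹A form (a "random-walk" normalization) is the simplest fix; the GCN we build below uses a symmetric variant of the same idea.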

Message Passing — The Heart of GNNs

Every graph neural network, no matter how sophisticated, follows the same three-step recipe. Gilmer et al. named it the message passing framework in 2017, and it unifies every architecture we'll see today:

  1. Message: each node prepares a message to send to its neighbors (often just its current feature vector)
  2. Aggregate: each node collects all incoming messages and combines them (sum, mean, or max)
  3. Update: each node updates its own representation using the aggregated message and a learnable transform

h_v^(k+1) = UPDATE( h_v^(k), AGGREGATE({ h_u^(k) : u ∈ N(v) }) )

Think of it like a game of telephone on a graph. Each node whispers its state to its neighbors. Everyone listens, combines what they heard, and updates their own understanding. After one round, each node knows about its immediate neighbors. After two rounds, it knows about neighbors-of-neighbors. After k rounds, each node's representation encodes information from its entire k-hop neighborhood.
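The k-hop claim is easy to verify with matrix powers: entry (i, j) of (A + I)^k is nonzero exactly when node j's information can reach node i within k rounds. A small check on the graph above:

```python
import numpy as np

# The same 6-node adjacency matrix as above
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

A_hat = A + np.eye(6, dtype=int)  # self-loops keep each node's own state

reach1 = np.linalg.matrix_power(A_hat, 1) > 0  # 1-hop reachability
reach2 = np.linalg.matrix_power(A_hat, 2) > 0  # 2-hop reachability
reach3 = np.linalg.matrix_power(A_hat, 3) > 0  # 3-hop reachability

# After 1 round, node 0 knows nothing about node 3.
# After 2 rounds it does (0 -> 1 -> 3), but still nothing about node 4.
# After 3 rounds, every node is within reach of node 0.
```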

Here's the simplest possible message passing layer — just multiply by the adjacency matrix to gather neighbor features, then apply a linear transform:

def message_passing_layer(A, H, W, activation=True):
    """
    One round of message passing.
    A: adjacency matrix (N x N)
    H: node features (N x F_in)
    W: weight matrix (F_in x F_out)
    """
    # Step 1-2: Gather and aggregate neighbor features
    # A @ H: for each node, sum up the feature vectors of its neighbors
    messages = A @ H  # (N x F_in)

    # Step 3: Transform with learnable weights + nonlinearity
    H_new = messages @ W  # (N x F_out)

    if activation:
        H_new = np.maximum(0, H_new)  # ReLU

    return H_new

# Initialize a random weight matrix
np.random.seed(42)
W = np.random.randn(3, 4) * 0.5  # 3 input features -> 4 output features

H1 = message_passing_layer(A, X, W)
print("After 1 layer, node 0's features:", H1[0].round(3))
# Node 0 now encodes information from nodes 1 and 2 (its neighbors)

That matrix multiply A @ H is doing all the heavy lifting. Row i of the result is the sum of feature vectors from node i's neighbors. This is what makes the adjacency-matrix representation so convenient — matrix multiplication naturally implements "gather from neighbors." It's graph convolution in one line of NumPy.

Every GNN variant we'll see — GCN, GraphSAGE, GAT — is a specific choice of how to compute messages, how to aggregate them, and how to update. The framework is the same; the details differ.

Graph Convolutional Networks — The Spectral Shortcut

Our simple message passing layer has a problem: it only sums neighbor features and ignores the node's own features. Node 0 learns about nodes 1 and 2, but forgets what it already knew about itself. Also, a node with 10 neighbors gets messages with 5× the magnitude of a node with 2 neighbors — the aggregated features scale with degree.

Kipf and Welling's Graph Convolutional Network (2017) fixes both problems with an elegant trick. First, add self-loops: set Ã = A + I, so every node is also its own neighbor. Second, normalize by degree: multiply by D̃^(-1/2) on both sides to ensure that the summed messages are averaged rather than accumulated. The propagation rule is:

H^(l+1) = σ( D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l) )

Where à = A + I (adjacency with self-loops), is the degree matrix of Ã, and σ is a nonlinearity like ReLU. The symmetric normalization -1/2 à D̃-1/2 ensures that the aggregated features are properly scaled regardless of neighborhood size.

Kipf and Welling derived this rule from spectral graph theory — it's a first-order approximation of a Chebyshev spectral filter on the graph Laplacian. But in practice, you don't need to understand the spectral motivation. The spatial intuition is enough: each node averages its neighbors' features (including its own), then applies a learnable linear transform. Here it is in code:

def gcn_layer(A, H, W):
    """
    Graph Convolutional Network layer (Kipf & Welling 2017).
    The whole thing is just: sigma(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
    """
    N = A.shape[0]

    # Add self-loops: every node is its own neighbor
    A_hat = A + np.eye(N)

    # Compute degree matrix of A_hat
    D_hat = np.diag(A_hat.sum(axis=1))

    # Symmetric normalization: D^(-1/2) A D^(-1/2)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

    # Propagate: normalized message passing + learned transform + ReLU
    H_new = np.maximum(0, A_norm @ H @ W)
    return H_new

# Two-layer GCN
W1 = np.random.randn(3, 8) * 0.5   # 3 -> 8 features
W2 = np.random.randn(8, 2) * 0.5   # 8 -> 2 features (for visualization)

H1 = gcn_layer(A, X, W1)           # Layer 1: mix neighbors
H2 = gcn_layer(A, H1, W2)          # Layer 2: mix neighbors-of-neighbors

print("2-layer GCN output:")
for i in range(6):
    print(f"  Node {i}: [{H2[i][0]:.3f}, {H2[i][1]:.3f}]")

That's the entire GCN layer — just five lines of linear algebra. The key insight is that A_norm @ H computes a normalized weighted average of each node's neighborhood (including itself). The weight matrix W then projects those averaged features into a new space. Stack two of these layers and each node's representation captures its 2-hop neighborhood. The resulting embeddings naturally cluster: nodes 0-1-2 end up close together, and nodes 3-4-5 end up close together, because they share neighbors and features.

Connection to CNNs: a 2D convolution is a GCN where the graph is a pixel grid and the adjacency encodes the 3×3 kernel neighborhood. GCNs generalize convolutions from regular grids to arbitrary topologies.
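To make that connection concrete, here's a sketch that builds the adjacency matrix of a pixel grid — using the simpler 4-neighbor connectivity (a full 3×3 kernel would correspond to 8-neighbor adjacency plus the self-loop):

```python
import numpy as np

def grid_adjacency(h, w):
    """Adjacency matrix of an h x w pixel grid with 4-neighbor connectivity."""
    N = h * w
    A = np.zeros((N, N), dtype=int)
    for r in range(h):
        for c in range(w):
            i = r * w + c  # flatten (row, col) to a node index
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    A[i, rr * w + cc] = 1
    return A

A_grid = grid_adjacency(3, 3)
# Running a GCN layer on A_grid with pixel intensities as node features
# is a learned, normalized blur — a convolution on the grid-as-graph.
```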

GraphSAGE — Learning to Generalize

GCN has a fundamental limitation: it's transductive. The normalized adjacency matrix A_norm is computed from the entire graph at once. If a new node appears after training — a new user joins the social network, a new paper enters the citation graph — you'd have to recompute A_norm and retrain. For real-world graphs that grow constantly, this is a dealbreaker.

GraphSAGE (Hamilton, Ying & Leskovec, 2017) fixes this by learning an aggregation function that works on any set of neighbors, not a fixed matrix. The trick: instead of multiplying by the global adjacency matrix, sample a fixed number of neighbors for each node and run a learned aggregator over them. The update rule concatenates a node's own features with the aggregated neighbor features:

h_v^(k) = σ( W · [ h_v^(k-1) ‖ AGG({ h_u^(k-1) : u ∈ sample(N(v)) }) ] )

The ‖ means concatenation: stack the node's own vector alongside the aggregated neighbor vector, then project through a linear layer. This separation lets the network treat "self" and "neighborhood" as distinct sources of information.

def graphsage_layer(A, H, W_self, W_neigh, num_samples=3):
    """
    GraphSAGE layer with mean aggregator and neighbor sampling.
    Learns separate transforms for self-features and neighbor-features.
    """
    N = H.shape[0]
    H_new = np.zeros((N, W_self.shape[1] + W_neigh.shape[1]))

    for v in range(N):
        # Find neighbors
        neighbors = np.where(A[v] > 0)[0]

        if len(neighbors) == 0:
            agg = np.zeros(H.shape[1])
        else:
            # Sample up to num_samples neighbors, without replacement
            sampled = np.random.choice(neighbors,
                        size=min(num_samples, len(neighbors)), replace=False)
            # Mean aggregation over sampled neighbors
            agg = H[sampled].mean(axis=0)

        # Concatenate self features with aggregated neighbor features
        combined = np.concatenate([H[v] @ W_self, agg @ W_neigh])

        # ReLU activation
        H_new[v] = np.maximum(0, combined)

    # L2 normalize (as in the original paper)
    norms = np.linalg.norm(H_new, axis=1, keepdims=True) + 1e-8
    return H_new / norms

W_self = np.random.randn(3, 4) * 0.5
W_neigh = np.random.randn(3, 4) * 0.5

H_sage = graphsage_layer(A, X, W_self, W_neigh, num_samples=2)
print("GraphSAGE output shape:", H_sage.shape)  # (6, 8) - 4 self + 4 neighbor

The crucial difference from GCN: since GraphSAGE only needs a node's local neighborhood (not the full adjacency matrix), it can embed unseen nodes at inference time. A new user joins the social network? Just sample their neighbors and run the learned aggregator. The model generalizes inductively — no retraining needed. This is why Pinterest deployed GraphSAGE (as PinSage) to generate recommendations for billions of pins.
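That inductive step can be sketched in a few lines. Here, embed_new_node is a hypothetical helper (not from the paper — it just mirrors the body of graphsage_layer above): embedding an unseen node needs only its own features, its neighbors' features, and the trained weights.

```python
import numpy as np

def embed_new_node(x_new, neighbor_feats, W_self, W_neigh):
    """Embed a node never seen during training.
    x_new: (F_in,) features of the new node
    neighbor_feats: (num_neigh x F_in) features of its sampled neighbors
    """
    agg = neighbor_feats.mean(axis=0)                    # mean aggregator
    h = np.concatenate([x_new @ W_self, agg @ W_neigh])  # self || neighbors
    h = np.maximum(0, h)                                 # ReLU
    return h / (np.linalg.norm(h) + 1e-8)                # L2 normalize

# A new user joins: their profile plus who they befriended is enough.
# Toy weights just to show the shapes flowing through:
W_self = np.ones((3, 2))
W_neigh = np.ones((3, 2))
x_new = np.array([1.0, 0.0, 0.0])
friends = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
h_new = embed_new_node(x_new, friends, W_self, W_neigh)
```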

Graph Attention Networks — Attending to Neighbors

GCN treats all neighbors equally. GraphSAGE randomly samples them. But in most graphs, some neighbors matter more than others. In a citation network, a groundbreaking paper is more relevant than a tangential reference. In a social network, your close friend's opinion carries more weight than an acquaintance's.

Graph Attention Networks (Veličković et al., 2018) solve this by learning attention weights over each node's neighborhood — the same attention mechanism from our transformers post, but restricted to graph edges instead of all token pairs. For each edge (i, j), the network computes an attention coefficient:

e_ij = LeakyReLU( aᵀ · [W·h_i ‖ W·h_j] )
α_ij = softmax_j(e_ij) = exp(e_ij) / ∑_{k ∈ N(i)} exp(e_ik)

The attention score e_ij measures how important node j's features are to node i. The softmax normalizes across all neighbors, producing weights that sum to 1. The final aggregation is a weighted sum:

def gat_layer(A, H, W, a, negative_slope=0.2):
    """
    Graph Attention Network layer (Velickovic et al. 2018).
    W: linear transform (F_in x F_out)
    a: attention vector (2*F_out,) - learns which neighbor pairs matter
    """
    N = H.shape[0]
    # Linear transform all nodes
    Z = H @ W  # (N x F_out)
    F_out = Z.shape[1]

    # Compute attention scores for all edges
    H_new = np.zeros_like(Z)

    for i in range(N):
        neighbors = np.where(A[i] > 0)[0]
        neighbors = np.append(neighbors, i)  # include self, so never empty

        # Concatenate [z_i || z_j] for each neighbor j
        z_i_repeated = np.tile(Z[i], (len(neighbors), 1))  # (num_neigh x F_out)
        z_j = Z[neighbors]                                  # (num_neigh x F_out)
        concat = np.concatenate([z_i_repeated, z_j], axis=1)  # (num_neigh x 2*F_out)

        # Attention scores: e_ij = LeakyReLU(a^T . [z_i || z_j])
        e = concat @ a  # (num_neigh,)
        e = np.where(e > 0, e, e * negative_slope)  # LeakyReLU

        # Softmax over neighbors
        e_exp = np.exp(e - e.max())  # numerical stability
        alpha = e_exp / (e_exp.sum() + 1e-8)

        # Weighted aggregation
        H_new[i] = (alpha[:, None] * z_j).sum(axis=0)

    return np.maximum(0, H_new)  # ReLU

W_gat = np.random.randn(3, 4) * 0.5
a_vec = np.random.randn(8) * 0.5  # 2 * F_out = 8

H_gat = gat_layer(A, X, W_gat, a_vec)
print("GAT output shape:", H_gat.shape)  # (6, 4)

The attention vector a is the learnable parameter that decides which neighbor pairs are important. After training, you can inspect the attention weights α_ij to see which relationships the network considers most significant — a form of built-in explainability that GCN lacks.

Just like in transformers, GAT uses multi-head attention: run K independent attention heads with separate parameters, then concatenate the outputs. Each head can learn a different notion of "importance." GAT remains efficient at O(V + E) — attention is only computed along existing edges, not between all pairs of nodes.
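Here's one way the multi-head version could be sketched — a vectorized single head (using the exact identity aᵀ[z_i ‖ z_j] = a[:F]·z_i + a[F:]·z_j to avoid the per-node loop), with K heads concatenated. This is a simplification of the loop-based gat_layer above, returning pre-activation outputs:

```python
import numpy as np

def gat_head(A, H, W, a, slope=0.2):
    """One vectorized attention head; scores are masked to existing edges."""
    N = A.shape[0]
    Z = H @ W
    F = Z.shape[1]
    A_hat = A + np.eye(N)                            # self-loops
    e = (Z @ a[:F])[:, None] + (Z @ a[F:])[None, :]  # all pairwise scores
    e = np.where(e > 0, e, slope * e)                # LeakyReLU
    e = np.where(A_hat > 0, e, -1e9)                 # mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)        # softmax over neighbors
    return alpha @ Z                                 # weighted aggregation

def multi_head_gat(A, H, Ws, a_vecs):
    """Run K independent heads and concatenate their outputs."""
    return np.concatenate(
        [gat_head(A, H, W, a) for W, a in zip(Ws, a_vecs)], axis=1)

rng = np.random.default_rng(0)
A = np.array([[0, 1], [1, 0]])          # tiny 2-node graph
H = rng.normal(size=(2, 3))
Ws = [rng.normal(size=(3, 4)) for _ in range(2)]   # 2 heads, F_out = 4 each
a_vecs = [rng.normal(size=8) for _ in range(2)]
out = multi_head_gat(A, H, Ws, a_vecs)  # shape (2, 8): 2 heads x 4 features
```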

The transformer connection is deeper than an analogy. A transformer is literally a GAT on a fully-connected graph where every token is a neighbor of every other token. GAT restricts attention to the graph topology, which is why it's O(V + E) instead of O(V²). When you add full attention to a graph, you get a Graph Transformer — the frontier of modern GNN research.

Try It Yourself

Try It: Message Passing Visualizer

Watch messages flow through a graph. Each node starts with a distinct color. Click Step to run one round of message passing — nodes collect neighbor colors and blend them. After a few steps, connected nodes converge to similar colors as information propagates.


Try It: Graph Attention Explorer

Click any node to see its attention weights. Edge thickness shows how much each neighbor contributes. Toggle between uniform (GCN-style, all neighbors equal) and learned (GAT-style, network decides who matters).


Try It: Over-Smoothing Detector

Slide the Layers control to stack GNN layers. With 2-3 layers, nodes cluster nicely into communities. Past 5-6 layers, all node colors converge — that's over-smoothing. Toggle residual connections to see the fix.


The Over-Smoothing Wall

If one GNN layer captures 1-hop neighborhoods and two layers capture 2-hop neighborhoods, why not stack 50 layers and capture the entire graph? Because of over-smoothing — the bane of deep GNNs.

Each message passing layer averages a node's features with its neighbors. After k layers, each node's representation is a smoothed average of its k-hop neighborhood. Once k is large enough that every node's receptive field covers the entire graph, all node representations converge to nearly the same vector. The network becomes unable to distinguish individual nodes — it sees a homogeneous blob.

Mathematically, repeated application of the normalized adjacency matrix acts as a low-pass filter on the graph spectrum. It smooths out high-frequency signals (differences between nearby nodes) and preserves only the lowest frequency (the global average). This is why most practical GNNs use only 2-4 layers, while transformers happily stack 32 or more.
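You can watch that low-pass effect numerically — a small sketch reusing the A and X from earlier: apply the normalized adjacency repeatedly and the spread of node features shrinks.

```python
import numpy as np

A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
X = np.array([
    [1.0, 0.0, 0.2],
    [0.8, 0.3, 0.1],
    [0.9, 0.1, 0.3],
    [0.1, 0.8, 0.5],
    [0.0, 0.9, 0.7],
    [0.2, 0.7, 0.6],
])

A_hat = A + np.eye(6)
d = A_hat.sum(axis=1)
A_norm = A_hat / np.sqrt(np.outer(d, d))  # D^(-1/2) (A + I) D^(-1/2)

H = X.copy()
spread = []
for k in range(10):
    spread.append(H.std(axis=0).mean())  # average per-feature std across nodes
    H = A_norm @ H                       # one smoothing step: no weights, no ReLU

# spread[0] starts high (two distinct communities) and decays with depth
```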

The solutions mirror ideas from deep learning: residual connections (add the input back to the output, so the network can preserve sharp features), JKNet (concatenate outputs from all layers, letting the final classifier pick the best depth per node), and DropEdge (randomly remove edges during training to slow the smoothing). The over-smoothing demo above lets you see these effects firsthand — slide the layers up and watch diversity collapse, then toggle residual connections and watch it stabilize.

How Powerful Are GNNs?

Here's a surprising theoretical result. Xu et al. (2019) proved that standard message-passing GNNs have a hard ceiling on what they can learn. That ceiling is the Weisfeiler-Leman (WL) graph isomorphism test — a classical algorithm from the 1960s that iteratively refines node "colors" by hashing each node's neighbor color multiset.

The 1-WL test works like message passing: each round, a node updates its color based on its neighbors' colors. Two graphs are distinguishable if they produce different color histograms after convergence. Xu et al. showed that no message-passing GNN can distinguish graphs that 1-WL fails on — and with the right aggregator (sum, not mean), a GNN can be exactly as powerful as 1-WL. This led to the Graph Isomorphism Network (GIN).

def wl_color_refinement(A, num_iters=3):
    """
    Weisfeiler-Leman color refinement (1-WL).
    Iteratively updates node colors based on neighbor color multisets.
    """
    N = A.shape[0]
    colors = [0] * N  # start: every node has the same color
    color_map = {}
    next_color = N

    for iteration in range(num_iters):
        new_colors = []
        for v in range(N):
            neighbors = np.where(A[v] > 0)[0]
            # Hash: (own color, sorted tuple of neighbor colors)
            neighbor_colors = tuple(sorted(colors[u] for u in neighbors))
            key = (colors[v], neighbor_colors)

            if key not in color_map:
                color_map[key] = next_color
                next_color += 1
            new_colors.append(color_map[key])

        colors = new_colors
        print(f"  Iter {iteration + 1}: {colors}")

    return colors

print("WL on our graph:")
wl_colors = wl_color_refinement(A, num_iters=3)
# Nodes with identical local structure get the same final color

The practical takeaway: sum aggregation preserves the multiset of neighbor features — it can tell the difference between one neighbor with feature 1 and three neighbors with feature 1, while mean and max see those two neighborhoods as identical. If you need maximum expressiveness, use sum. For most real-world tasks, however, mean works fine — the WL bound matters more for theory than practice.
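A two-line sanity check of the aggregator claim — the two neighborhoods below are different multisets, but only sum can tell them apart:

```python
import numpy as np

one_neighbor = np.array([1.0])               # multiset {1}
three_neighbors = np.array([1.0, 1.0, 1.0])  # multiset {1, 1, 1}

# mean: 1.0 vs 1.0 — identical.  max: 1.0 vs 1.0 — identical.
# sum:  1.0 vs 3.0 — the multiset size survives.
```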

From Nodes to Graphs — Graph Pooling

Everything so far produces node-level representations. That's great for node classification (which user is a bot?) or link prediction (will these two users become friends?). But what about graph-level tasks — is this molecule toxic? Is this code snippet buggy? Is this brain scan healthy?

We need to collapse all node representations into a single graph vector, and that operation must be permutation-invariant: reordering the nodes shouldn't change the result. The simplest approaches are global pooling — take the sum, mean, or max across all node features:

def graph_readout(H, mode='mean'):
    """
    Collapse node features into a single graph-level vector.
    H: node features (N x F)
    mode: 'sum', 'mean', or 'max'
    """
    if mode == 'sum':
        return H.sum(axis=0)     # preserves graph size info
    elif mode == 'mean':
        return H.mean(axis=0)    # size-invariant
    elif mode == 'max':
        return H.max(axis=0)     # captures extremes
    else:
        raise ValueError(f"Unknown mode: {mode}")

# After running a GCN, pool nodes to get a graph embedding
H_final = gcn_layer(A, X, np.random.randn(3, 4) * 0.5)
graph_vec = graph_readout(H_final, mode='mean')
print("Graph vector:", graph_vec.round(3))  # One vector for the entire graph
# Feed this into a classifier for graph-level prediction

Sum readout preserves information about graph size (a graph with 100 nodes will have a larger vector than one with 10 nodes), while mean readout is size-invariant. The choice depends on your task. For molecular property prediction, where similar-sized molecules shouldn't get unfairly different magnitudes, mean is common. For graph classification where size itself is informative, sum works better.

The CNN parallel is exact: global average pooling before the classifier head in a ConvNet is the graph equivalent of mean readout. Hierarchical pooling methods like DiffPool go further, progressively coarsening the graph like spatial max-pooling in CNNs reduces image resolution.

Where Graph Neural Networks Fit in the Series

GNNs are the most general architecture we've built. Nearly every model in this series is a special case of message passing on a specific graph topology: a CNN is message passing on a pixel-grid graph, a transformer is message passing on a fully connected token graph, and a sequence model is message passing along a chain.


The elementary series has now covered learning from grids (CNNs), learning from sequences (RNNs, transformers, SSMs), learning from examples (supervised, self-supervised, contrastive), learning from rewards (RL), and learning from graphs (GNNs). These aren't separate paradigms — they're all instances of the same idea: let neurons talk to each other, and let the structure of that conversation match the structure of the data. The graph is simply the most honest description of that conversation.