Recurrent Neural Networks from Scratch
The Architecture Transformers Replaced
This series has spent 22 posts building transformers from the ground up — attention, embeddings, positional encoding, feed-forward networks, normalization, and every other component. We built convolutional neural networks to understand what Vision Transformers replaced in the spatial domain. But we never built what transformers replaced in the sequential domain.
From the late 1980s through 2017, Recurrent Neural Networks were the default architecture for anything involving sequences: machine translation, speech recognition, handwriting generation, music composition, time series forecasting. Google Translate's first neural system was an RNN. Siri and Alexa understood your voice through RNNs. They were everywhere.
Then, in 2017, Vaswani et al. published “Attention Is All You Need” and the world moved on. But moved on from what, exactly?
This post answers that question. We'll build three flavors of recurrent network — the vanilla RNN, the LSTM, and the GRU — from pure Python and NumPy. By the end, you'll understand the exact problem that made attention necessary, and why every design choice in transformers (positional encoding, parallelism, the KV cache) exists to fix something RNNs got wrong.
A Neural Network with a Loop
Every neural network we've built so far — from the micrograd perceptron to the feed-forward blocks inside transformers — is a function: input goes in, output comes out, done. Feed it a new input and it has zero memory of what it saw before. Each input is processed in complete isolation.
An RNN breaks this pattern with one small change: it feeds its own output back into itself. At each timestep t, the network takes two inputs — the current input xt and the previous hidden state ht−1 — and produces a new hidden state ht:

ht = tanh(xt · Wxh + ht−1 · Whh + bh)

That's the entire vanilla RNN in one equation. The hidden state ht is a compressed summary of everything the network has seen up to timestep t. It's not a perfect summary — as we'll see, that compression is the root of all RNN problems — but it gives the network something feedforward networks lack: memory.
You can think of it two ways. The folded view: a single box with an arrow looping back into itself, processing one timestep at a time. The unrolled view: copy the box for each timestep and chain them together, with ht flowing from left to right. The unrolled view is just a deep feedforward network where every “layer” shares the same weights. This will matter enormously when we get to backpropagation.
```python
import numpy as np

class RNNCell:
    """A single vanilla RNN cell."""

    def __init__(self, input_dim, hidden_dim):
        # Xavier initialization
        scale_xh = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale_hh = np.sqrt(2.0 / (hidden_dim + hidden_dim))
        self.W_xh = np.random.randn(input_dim, hidden_dim) * scale_xh
        self.W_hh = np.random.randn(hidden_dim, hidden_dim) * scale_hh
        self.b_h = np.zeros(hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x_t, h_prev):
        """One timestep: takes input x_t and previous hidden state h_prev."""
        # h_t = tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
        self.x_t = x_t        # (input_dim,)
        self.h_prev = h_prev  # (hidden_dim,)
        self.z = x_t @ self.W_xh + h_prev @ self.W_hh + self.b_h  # (hidden_dim,)
        self.h_t = np.tanh(self.z)  # (hidden_dim,)
        return self.h_t
```
The key line is `self.z = x_t @ self.W_xh + h_prev @ self.W_hh + self.b_h`. Two matrix multiplications and an addition, squeezed through tanh. That's how a neural network remembers.
The Full Forward Pass
A single RNN cell processes one timestep. To handle a sequence, we loop over each timestep, passing the hidden state forward:
```python
class VanillaRNN:
    """RNN that processes a full sequence."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        self.cell = RNNCell(input_dim, hidden_dim)
        # Output projection: hidden state -> output
        scale_out = np.sqrt(2.0 / (hidden_dim + output_dim))
        self.W_out = np.random.randn(hidden_dim, output_dim) * scale_out
        self.b_out = np.zeros(output_dim)
        self.hidden_dim = hidden_dim

    def forward(self, xs, h_init=None):
        """Process a sequence of inputs.

        xs: list of input vectors, one per timestep
        Returns: list of outputs and list of hidden states
        """
        if h_init is None:
            h_init = np.zeros(self.hidden_dim)
        hiddens = []
        outputs = []
        h = h_init
        for x_t in xs:
            h = self.cell.forward(x_t, h)      # (hidden_dim,)
            hiddens.append(h)
            y_t = h @ self.W_out + self.b_out  # (output_dim,)
            outputs.append(y_t)
        return outputs, hiddens

# Quick test: 5-step sequence, input_dim=3, hidden_dim=8, output_dim=4
rnn = VanillaRNN(input_dim=3, hidden_dim=8, output_dim=4)
xs = [np.random.randn(3) for _ in range(5)]
outputs, hiddens = rnn.forward(xs)
print(f"Sequence length: {len(xs)}")              # 5
print(f"Hidden state shape: {hiddens[0].shape}")  # (8,)
print(f"Output shape: {outputs[0].shape}")        # (4,)
print(f"Final hidden state:\n{hiddens[-1].round(3)}")
```
Notice the sequential dependency: h at timestep t depends on h at timestep t−1. You cannot compute h5 without first computing h1 through h4. This is the fundamental bottleneck that transformers eliminate — but we're getting ahead of ourselves.
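Before training enters the picture, it's worth verifying the memory claim directly: perturb only the first input of a sequence, and the final hidden state moves. A minimal sketch with untrained random weights (the dimensions and the 0.4 weight scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 8, 5
W_xh = rng.standard_normal((input_dim, hidden_dim)) * 0.4
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.4

def final_hidden(xs):
    h = np.zeros(hidden_dim)
    for x_t in xs:  # h at step t needs h at step t-1: strictly sequential
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h

xs = [rng.standard_normal(input_dim) for _ in range(T)]
h_a = final_hidden(xs)

xs_b = [x.copy() for x in xs]
xs_b[0] += 1.0  # nudge only the FIRST input
h_b = final_hidden(xs_b)

print(np.abs(h_a - h_b).max())  # nonzero: the first input survives to the end
```

A stateless layer applied to the last timestep alone would show zero difference; the gap here is the memory.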
Backpropagation Through Time
Training an RNN means computing gradients with respect to the shared weights Wxh and Whh. When you unroll the RNN across time, it becomes a very deep feedforward network where every “layer” uses the same weights. Standard backprop applies — this variant is called Backpropagation Through Time (BPTT).
The key equation: the gradient of the loss at timestep T with respect to the hidden state at an earlier timestep k involves a chain of partial derivatives:

∂LT/∂hk = ∂LT/∂hT · ∏t=k+1…T (∂ht/∂ht−1)

Each factor ∂ht/∂ht−1 is diag(1 − ht²) · Whh (the derivative of tanh times the weight matrix). The gradient flows backward through every single timestep. If your sequence is 100 tokens long, you're multiplying 100 matrices together.
```python
def bptt(cell, xs, hiddens, d_outputs, W_out):
    """Backpropagation Through Time for a vanilla RNN.

    Returns gradients for all parameters.
    """
    T = len(xs)
    hidden_dim = cell.hidden_dim

    # Initialize gradients
    dW_xh = np.zeros_like(cell.W_xh)
    dW_hh = np.zeros_like(cell.W_hh)
    db_h = np.zeros_like(cell.b_h)
    dW_out = np.zeros((hidden_dim, d_outputs[0].shape[0]))
    db_out = np.zeros(d_outputs[0].shape[0])

    # Gradient flowing back from future timesteps
    dh_next = np.zeros(hidden_dim)

    for t in reversed(range(T)):
        # Gradient from output at this timestep
        dy = d_outputs[t]  # (output_dim,)
        dW_out += hiddens[t][:, None] @ dy[None, :]
        db_out += dy
        # Gradient into hidden state: from output + from future
        dh = dy @ W_out.T + dh_next  # (hidden_dim,)
        # Backprop through tanh: d/dz = (1 - tanh^2(z)) * dh
        dtanh = (1 - hiddens[t] ** 2) * dh  # (hidden_dim,)
        # Gradients for weights
        dW_xh += xs[t][:, None] @ dtanh[None, :]
        h_prev = hiddens[t - 1] if t > 0 else np.zeros(hidden_dim)
        dW_hh += h_prev[:, None] @ dtanh[None, :]
        db_h += dtanh
        # Pass gradient to previous timestep
        dh_next = dtanh @ cell.W_hh.T

    return dW_xh, dW_hh, db_h, dW_out, db_out
```
Look at the line `dh_next = dtanh @ cell.W_hh.T`. That's the gradient being passed backward one timestep. It gets multiplied by Whhᵀ at every step. And that repeated multiplication is where everything falls apart.
The Vanishing Gradient Problem
Here's the core issue. When you multiply a gradient by Whhᵀ at every timestep, what happens depends on the singular values of that matrix:
- If the largest singular value σmax < 1: the gradient shrinks exponentially. After 50 timesteps, a gradient of 1.0 becomes 0.9⁵⁰ ≈ 0.005. The network can't learn long-range dependencies.
- If σmax > 1: the gradient explodes exponentially. After 50 timesteps, 1.1⁵⁰ ≈ 117. Training becomes numerically unstable.
The tanh derivative makes things worse — its maximum value is 1 (at zero), and it approaches 0 for large activations, so it can only shrink the gradient further.
This isn't an edge case or a training trick gone wrong. It's a mathematical certainty for vanilla RNNs. Any sequence longer than about 10–20 timesteps will have near-zero gradients for early positions. The network literally cannot learn that the first word in a sentence affects the last word.
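The phase transition is easy to reproduce. The sketch below builds an idealized stand-in for Whh, a scaled orthogonal matrix whose singular values all equal σ, and pushes a unit-norm gradient through T backward steps (the helper name and dimensions are illustrative):

```python
import numpy as np

def gradient_norm_after(T, sigma, d=32, seed=0):
    """Norm of a unit gradient after T backward multiplications by W.T,
    where every singular value of W equals sigma."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
    W = sigma * Q                                     # all singular values = sigma
    g = np.ones(d) / np.sqrt(d)                       # unit-norm starting gradient
    for _ in range(T):
        g = g @ W.T
    return np.linalg.norm(g)

print(gradient_norm_after(50, 0.9))  # ~0.005: vanished
print(gradient_norm_after(50, 1.1))  # ~117:   exploded
```

A real Whh has a spread of singular values, which only makes things worse: some gradient directions vanish while others explode.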
Gradient clipping — capping the gradient norm to a maximum value — fixes the exploding gradient problem. But it can't fix vanishing gradients. You can prevent a gradient from becoming too large, but you can't conjure signal from nothing. This is why gradient clipping is standard practice for RNN training but doesn't solve the fundamental limitation.
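Clipping by global norm takes a few lines. A minimal sketch (the helper name and the max_norm of 5.0 are illustrative; real training loops pick the threshold empirically):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Scale all gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# An "exploded" set of gradients gets rescaled to norm 5; direction is preserved.
grads = [np.full((4, 4), 3.0), np.ones(8)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
print(norm_before)  # ~12.3, over the threshold
```

Note that clipping scales all parameter gradients by one shared factor, which preserves the gradient's direction; clipping each array independently would not.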
Hochreiter and Schmidhuber understood this in 1997, and their solution changed deep learning forever.
Interactive: Vanishing Gradient Visualizer
Watch how gradients flow backward through time. Adjust the weight matrix singular value to see the phase transition between vanishing and exploding gradients. Toggle LSTM to see how the cell state “highway” preserves gradient signal.
LSTM: Learning to Remember
In 1997, Sepp Hochreiter and Jürgen Schmidhuber published the Long Short-Term Memory network — arguably the most important neural architecture between the perceptron and the transformer. The key insight: if multiplying by a weight matrix at every timestep causes gradients to vanish, don't multiply by a weight matrix at every timestep.
The LSTM introduces a cell state ct that flows through time with only element-wise operations — no matrix multiplications. Information on this “highway” can travel across arbitrarily long sequences without the gradient degrading. Three gates control what gets added to, removed from, and read out of the cell state:
Forget gate: ft = σ(Wf · [ht−1, xt] + bf) — what old info to keep
Input gate: it = σ(Wi · [ht−1, xt] + bi) — what new info to write
Candidate: c̃t = tanh(Wc · [ht−1, xt] + bc) — the new info itself
Cell state: ct = ft ⊙ ct−1 + it ⊙ c̃t — blend old and new
Output gate: ot = σ(Wo · [ht−1, xt] + bo) — what to reveal
Hidden state: ht = ot ⊙ tanh(ct) — the output
Look at the cell state update: ct = ft ⊙ ct−1 + it ⊙ c̃t. The ⊙ symbol means element-wise multiplication. There's no Whh matrix in this path. The forget gate ft can be close to 1 for dimensions the network wants to remember, creating a near-perfect gradient highway. This is why LSTMs can learn dependencies across 100+ timesteps where vanilla RNNs fail at 15.
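A back-of-the-envelope comparison makes the highway concrete. Take 0.65 as a typical tanh derivative and 0.97 as a forget gate that has learned to remember (both numbers are illustrative, not derived):

```python
T = 100

# Vanilla RNN backward path: every step multiplies by tanh'(z) (at most 1,
# typically much less) times W_hh. Even with a perfectly behaved weight of
# exactly 1, a typical tanh derivative of ~0.65 shrinks the gradient geometrically.
rnn_factor = 0.65 ** T

# LSTM cell-state path: the only per-step factor is the forget gate itself.
lstm_factor = 0.97 ** T

print(rnn_factor)   # ~2e-19: no learning signal reaches timestep 0
print(lstm_factor)  # ~0.05:  usable gradient after 100 steps
```

Now the full cell, gates and all: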
```python
class LSTMCell:
    """A single LSTM cell with forget, input, and output gates."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        concat_dim = input_dim + hidden_dim
        scale = np.sqrt(2.0 / (concat_dim + hidden_dim))
        # Four gate weight matrices (concatenated input [h, x])
        self.W_f = np.random.randn(concat_dim, hidden_dim) * scale  # forget
        self.W_i = np.random.randn(concat_dim, hidden_dim) * scale  # input
        self.W_c = np.random.randn(concat_dim, hidden_dim) * scale  # candidate
        self.W_o = np.random.randn(concat_dim, hidden_dim) * scale  # output
        # Biases (forget gate bias initialized to 1 — remember by default)
        self.b_f = np.ones(hidden_dim)
        self.b_i = np.zeros(hidden_dim)
        self.b_c = np.zeros(hidden_dim)
        self.b_o = np.zeros(hidden_dim)

    def forward(self, x_t, h_prev, c_prev):
        """One timestep. Returns new hidden state and cell state."""
        concat = np.concatenate([h_prev, x_t])  # (input_dim + hidden_dim,)
        # Gates (sigmoid squashes to [0, 1] — they're "soft switches")
        f = sigmoid(concat @ self.W_f + self.b_f)   # forget gate
        i = sigmoid(concat @ self.W_i + self.b_i)   # input gate
        c_hat = np.tanh(concat @ self.W_c + self.b_c)  # candidate
        o = sigmoid(concat @ self.W_o + self.b_o)   # output gate
        # Cell state update: erase some old, add some new
        c_t = f * c_prev + i * c_hat  # (hidden_dim,)
        # Hidden state: filtered view of cell state
        h_t = o * np.tanh(c_t)  # (hidden_dim,)
        return h_t, c_t

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))
```
One subtle but critical detail: we initialize the forget gate bias to 1 (`self.b_f = np.ones(hidden_dim)`). This means the forget gate starts close to σ(1) ≈ 0.73, so the LSTM defaults to remembering. Without this trick (popularized by Jozefowicz et al., 2015), LSTMs learn much more slowly because they start by erasing everything.
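You can see the effect at initialization. With small early-training activations (the 0.1 scales and dimensions below are illustrative), a zero bias leaves the forget gate hovering at ~0.5, while a bias of 1 pushes it to ~0.73:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 8
concat = rng.standard_normal(input_dim + hidden_dim) * 0.1  # small activations
W_f = rng.standard_normal((input_dim + hidden_dim, hidden_dim)) * 0.1

f_zero_bias = sigmoid(concat @ W_f + np.zeros(hidden_dim))  # forgets ~half per step
f_ones_bias = sigmoid(concat @ W_f + np.ones(hidden_dim))   # keeps ~73% per step

print(f_zero_bias.mean())  # ~0.5
print(f_ones_bias.mean())  # ~0.73
```

A gate at 0.5 halves the retained cell state every timestep, so memory decays like 0.5ᵗ until the gates learn otherwise; starting at 0.73 buys a much longer effective horizon from step one.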
GRU: A Simpler Alternative
In 2014, Cho et al. asked: do we really need three gates and a separate cell state? The Gated Recurrent Unit simplifies the LSTM by merging the forget and input gates into a single “update gate” and combining the cell state and hidden state into one vector. Two gates instead of three, fewer parameters, often comparable performance:
Reset gate: rt = σ(Wr · [ht−1, xt] + br) — how much history feeds the candidate
Update gate: zt = σ(Wz · [ht−1, xt] + bz) — blend ratio of old vs new
Candidate: h̃t = tanh(Wh · [rt ⊙ ht−1, xt] + bh) — proposed new state
Hidden state: ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t — interpolate old and new
The update gate zt plays both roles: when zt ≈ 0, the hidden state is preserved unchanged (like a forget gate of 1); when zt ≈ 1, the hidden state is replaced with the candidate (like an input gate of 1). One gate, two jobs.
```python
class GRUCell:
    """A single GRU cell — simpler than LSTM, often equally effective."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        concat_dim = input_dim + hidden_dim
        scale = np.sqrt(2.0 / (concat_dim + hidden_dim))
        self.W_r = np.random.randn(concat_dim, hidden_dim) * scale  # reset
        self.W_z = np.random.randn(concat_dim, hidden_dim) * scale  # update
        self.W_h = np.random.randn(concat_dim, hidden_dim) * scale  # candidate
        self.b_r = np.zeros(hidden_dim)
        self.b_z = np.zeros(hidden_dim)
        self.b_h = np.zeros(hidden_dim)

    def forward(self, x_t, h_prev):
        """One timestep. Returns new hidden state."""
        concat = np.concatenate([h_prev, x_t])
        r = sigmoid(concat @ self.W_r + self.b_r)  # reset gate
        z = sigmoid(concat @ self.W_z + self.b_z)  # update gate
        # Candidate uses reset-gated version of previous hidden state
        concat_reset = np.concatenate([r * h_prev, x_t])
        h_hat = np.tanh(concat_reset @ self.W_h + self.b_h)
        # Interpolate between old and new
        h_t = (1 - z) * h_prev + z * h_hat  # (hidden_dim,)
        return h_t
```
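One way to check the "zt ≈ 0 preserves the state" claim: clamp the update gate shut with a large negative bias and compare ht to ht−1. A sketch with random untrained weights (the −10 bias is an artificial clamp, not a trained value):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 8
concat_dim = input_dim + hidden_dim
scale = np.sqrt(2.0 / (concat_dim + hidden_dim))
W_r = rng.standard_normal((concat_dim, hidden_dim)) * scale
W_z = rng.standard_normal((concat_dim, hidden_dim)) * scale
W_h = rng.standard_normal((concat_dim, hidden_dim)) * scale

h_prev = rng.standard_normal(hidden_dim)
x_t = rng.standard_normal(input_dim)
concat = np.concatenate([h_prev, x_t])

b_z = -10.0 * np.ones(hidden_dim)  # clamp: z = sigmoid(... - 10) is ~0
r = sigmoid(concat @ W_r)
z = sigmoid(concat @ W_z + b_z)
h_hat = np.tanh(np.concatenate([r * h_prev, x_t]) @ W_h)
h_t = (1 - z) * h_prev + z * h_hat

print(np.abs(h_t - h_prev).max())  # tiny: the state passes through untouched
```

A trained GRU does this selectively, per dimension and per timestep, rather than via a fixed bias.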
Here's the parameter comparison across all three architectures:
| Architecture | Weight Matrices | Parameters (d=128, input=64) | Gates |
|---|---|---|---|
| Vanilla RNN | Wxh, Whh | 24,704 | 0 |
| GRU | Wr, Wz, Wh | 74,112 | 2 |
| LSTM | Wf, Wi, Wc, Wo | 98,816 | 3 + cell |
LSTM has ~4× the parameters of a vanilla RNN. GRU has ~3×. The extra cost buys you the ability to learn dependencies across hundreds of timesteps instead of tens. In practice, GRU and LSTM perform similarly on most tasks — GRU is preferred when parameter budget is tight, LSTM when you need maximum capacity.
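The table's numbers can be checked directly. A quick sketch (counts include gate biases but exclude the output projection, matching the table; the helper names are illustrative):

```python
def rnn_params(input_dim, d):
    return input_dim * d + d * d + d      # W_xh, W_hh, b_h

def gru_params(input_dim, d):
    return 3 * ((input_dim + d) * d + d)  # W_r, W_z, W_h plus biases

def lstm_params(input_dim, d):
    return 4 * ((input_dim + d) * d + d)  # W_f, W_i, W_c, W_o plus biases

print(rnn_params(64, 128))   # 24704
print(gru_params(64, 128))   # 74112
print(lstm_params(64, 128))  # 98816
```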
Character-Level Text Generation
The best way to see the difference between these architectures is to train them on a task that requires memory: predicting the next character in a text sequence. This is the simplest possible language model — the same task GPT does with subword tokens, but at the character level.
The setup: encode each character as a one-hot vector, feed the sequence into the RNN one character at a time, and at each step predict the next character using a softmax over the vocabulary. The loss is cross-entropy, exactly as in the loss functions post.
```python
def char_rnn_forward(cell, xs_onehot, W_out, b_out, h_init, c_init=None):
    """Forward pass for character-level language model.

    Returns outputs (logits), hidden states, and cell states (if LSTM).
    """
    hiddens, cells, outputs = [], [], []
    h = h_init
    c = c_init  # Only used for LSTM
    for x_t in xs_onehot:
        if c is not None:  # LSTM
            h, c = cell.forward(x_t, h, c)
            cells.append(c)
        else:  # Vanilla RNN or GRU
            h = cell.forward(x_t, h)
        hiddens.append(h)
        logits = h @ W_out + b_out  # (vocab_size,)
        outputs.append(logits)
    return outputs, hiddens, cells

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy_loss(logits_list, targets_onehot):
    """Average cross-entropy loss over all timesteps."""
    loss = 0.0
    for logits, target in zip(logits_list, targets_onehot):
        probs = softmax(logits)
        loss -= np.log(probs[target.argmax()] + 1e-12)
    return loss / len(logits_list)

# Training sketch (simplified):
# 1. Sample a random window from the text
# 2. One-hot encode the characters
# 3. Forward pass through the RNN
# 4. Compute cross-entropy loss at each timestep
# 5. BPTT to get gradients
# 6. Update weights with gradient descent
# 7. Repeat for thousands of iterations
```
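The payoff of that loop is generation: feed the model's own sample back in as the next input. The sketch below uses untrained random weights and a toy five-character vocabulary, so the output is gibberish, but the feed-the-sample-back-in structure is exactly what a trained model uses (all names, sizes, and scales here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("helo ")  # toy vocabulary
V, H = len(vocab), 16

# Untrained weights; a real run would use parameters fitted by the BPTT loop.
W_xh = rng.standard_normal((V, H)) * 0.3
W_hh = rng.standard_normal((H, H)) * 0.3
W_out = rng.standard_normal((H, V)) * 0.3
b_h, b_out = np.zeros(H), np.zeros(V)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample(seed_char, n_chars, temperature=1.0):
    """Generate n_chars characters, feeding each sample back as the next input."""
    h = np.zeros(H)
    idx = vocab.index(seed_char)
    out = [seed_char]
    for _ in range(n_chars):
        x = np.zeros(V)
        x[idx] = 1.0                                    # one-hot input
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)          # one RNN step
        probs = softmax((h @ W_out + b_out) / temperature)
        idx = rng.choice(V, p=probs)                    # sample the next char
        out.append(vocab[idx])
    return "".join(out)

print(sample("h", 20))
```

Note the O(1) state: generation carries only the fixed-size h forward, no growing cache.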
What does each architecture learn? Given enough training on English text:
- Vanilla RNN: learns character frequencies, common bigrams (th, he, in), and short words. Beyond ~10 characters of context, it loses coherence and generates plausible-looking but meaningless character sequences.
- LSTM: learns word-level patterns, common phrases, bracket matching (opening and closing quotes), and maintains topic coherence for 50+ characters. Can generate recognizable English sentences.
- GRU: similar quality to LSTM on most text. Slightly faster to train due to fewer parameters.
Andrej Karpathy's legendary 2015 blog post “The Unreasonable Effectiveness of Recurrent Neural Networks” showed character-level LSTMs generating Shakespeare, Wikipedia articles, LaTeX source code, and even C code with matched braces. The LSTM was learning structure far beyond what anyone expected from a character-level model.
Interactive: RNN Memory Trace
Type a sentence and watch the hidden state evolve character by character. Each column is one hidden dimension, color-coded from blue (negative) to red (positive). Compare how vanilla RNN states “drift” while LSTM states preserve specific information across long sequences.
The Bridge to Transformers
Now we can state precisely what's wrong with RNNs and why transformers exist.
The bottleneck problem. An RNN encodes a 100-word sentence into a single hidden vector — typically 256 to 1024 dimensions. Everything the model knows about the sentence must be compressed into that one vector. For short sentences this works fine. For long ones, early information gets overwritten. The LSTM slows the overwriting, but it can't stop it.
In 2014, Bahdanau, Cho, and Bengio proposed a radical fix: don't compress. Instead, let the decoder look back at all encoder hidden states and compute a weighted sum based on relevance. This is the attention mechanism, and it was originally designed for RNNs, not transformers.
Three years later, Vaswani et al. asked the obvious next question: if attention is doing all the heavy lifting, why keep the RNN at all? Strip out the recurrence entirely, keep only attention, add positional encoding to tell the model about token order (since it no longer gets position for free from sequential processing), and you have the transformer.
The payoff is parallelism. An RNN must process tokens sequentially — ht depends on ht−1, full stop. A transformer computes attention over all tokens simultaneously. The attention computation is O(n²) in sequence length (more work per step), but it's fully parallelizable across GPU cores. On modern hardware, doing more math in parallel is vastly faster than doing less math sequentially.
| Property | RNN | Transformer |
|---|---|---|
| Processing | Sequential (ht depends on ht−1) | Parallel (all tokens at once) |
| Long-range deps | Vanishing gradient limits range | Direct attention at any distance |
| Position info | Free (from sequential order) | Must be added (positional encoding) |
| Inference memory | O(1) per token (fixed hidden state) | O(n) per token (growing KV cache) |
| Training compute | O(n) sequential, can't parallelize | O(n²) but fully parallelizable |
One area where RNNs actually win: inference memory. An RNN processes one token and updates its fixed-size hidden state — O(1) memory regardless of sequence length. A transformer must store the KV cache for all previous tokens, which grows linearly with sequence length. This is why state space models like Mamba (Gu & Dao, 2023) revisit the recurrent approach — they achieve transformer-quality results with RNN-like O(1) inference cost. The story isn't over.
Interactive: Sequential vs Parallel Processing
See the fundamental difference between RNN and transformer processing. The RNN must process tokens one at a time (each depends on the previous). The transformer processes all tokens in parallel. Adjust the sequence length to see how the gap widens.
Connections to the Series
RNNs connect to nearly every concept in the elementary series — in many cases, they motivated the very design choices we studied:
- Micrograd — BPTT is standard backpropagation unrolled across timesteps; the same chain rule, applied across a deeper computational graph
- Attention — attention was invented specifically to fix the RNN bottleneck (Bahdanau et al., 2014), then Vaswani et al. removed the RNN entirely
- Positional Encoding — RNNs get position information for free from sequential processing; transformers need explicit positional encoding because they process all tokens in parallel
- KV Cache — transformers need the KV cache for efficient autoregressive generation; RNNs are naturally autoregressive with O(1) state
- Embeddings — character and word embeddings feed into RNNs the same way token embeddings feed into transformers
- Loss Functions — cross-entropy loss for next-token prediction is identical whether the backbone is an RNN or transformer
- Normalization — LayerNorm + residual connections in transformers solve the same gradient flow problem that LSTM gates were designed for
- Softmax & Temperature — the same softmax distribution and temperature sampling apply to RNN outputs
- Decoding Strategies — beam search and top-k sampling were originally developed for RNN-based sequence models
- ConvNets — CNNs conquered space while RNNs conquered time; together they were the two pre-transformer paradigms for perception
References & Further Reading
- Elman (1990) — Finding Structure in Time — the paper that introduced the simple recurrent network (Elman network)
- Hochreiter & Schmidhuber (1997) — Long Short-Term Memory — the LSTM paper that solved the vanishing gradient problem for sequence modeling
- Cho et al. (2014) — Learning Phrase Representations using RNN Encoder-Decoder — introduced the GRU as a simpler alternative to LSTM
- Bahdanau, Cho & Bengio (2015) — Neural Machine Translation by Jointly Learning to Align and Translate — the paper that introduced attention for RNN-based machine translation
- Vaswani et al. (2017) — Attention Is All You Need — removed the RNN entirely, keeping only attention
- Pascanu, Mikolov & Bengio (2013) — On the difficulty of training recurrent neural networks — rigorous analysis of the vanishing and exploding gradient problem
- Karpathy (2015) — The Unreasonable Effectiveness of Recurrent Neural Networks — the blog post that showed the world what character-level LSTMs can do
- Gu & Dao (2023) — Mamba: Linear-Time Sequence Modeling with Selective State Spaces — the modern RNN revival, achieving transformer quality with recurrent-like O(1) inference