Recurrent Neural Networks from Scratch
The Architecture Transformers Replaced
This series has spent 22 posts building transformers from the ground up — attention, embeddings, positional encoding, feed-forward networks, normalization, and every other component. We built convolutional neural networks to understand what Vision Transformers replaced in the spatial domain. But we never built what transformers replaced in the sequential domain.
From the late 1980s through 2017, Recurrent Neural Networks were the default architecture for anything involving sequences: machine translation, speech recognition, handwriting generation, music composition, time series forecasting. Google Translate's first neural system was an RNN. Siri and Alexa understood your voice through RNNs. They were everywhere.
Then, in 2017, Vaswani et al. published “Attention Is All You Need” and the world moved on. But moved on from what, exactly?
This post answers that question. We'll build three flavors of recurrent network — the vanilla RNN, the LSTM, and the GRU — from pure Python and NumPy. By the end, you'll understand the exact problem that made attention necessary, and why every design choice in transformers (positional encoding, parallelism, the KV cache) exists to fix something RNNs got wrong.
A Neural Network with a Loop
Every neural network we've built so far — from the micrograd perceptron to the feed-forward blocks inside transformers — is a function: input goes in, output comes out, done. Feed it a new input and it has zero memory of what it saw before. Each input is processed in complete isolation.
An RNN breaks this pattern with one small change: it feeds its own output back into itself. At each timestep t, the network takes two inputs — the current input xt and the previous hidden state ht−1 — and produces a new hidden state ht:

ht = tanh(xt · Wxh + ht−1 · Whh + bh)

That's the entire vanilla RNN in one equation. The hidden state ht is a compressed summary of everything the network has seen up to timestep t. It's not a perfect summary — as we'll see, that compression is the root of all RNN problems — but it gives the network something feedforward networks lack: memory.
You can think of it two ways. The folded view: a single box with an arrow looping back into itself, processing one timestep at a time. The unrolled view: copy the box for each timestep and chain them together, with ht flowing from left to right. The unrolled view is just a deep feedforward network where every “layer” shares the same weights. This will matter enormously when we get to backpropagation.
```python
import numpy as np

class RNNCell:
    """A single vanilla RNN cell."""

    def __init__(self, input_dim, hidden_dim):
        # Xavier initialization
        scale_xh = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale_hh = np.sqrt(2.0 / (hidden_dim + hidden_dim))
        self.W_xh = np.random.randn(input_dim, hidden_dim) * scale_xh
        self.W_hh = np.random.randn(hidden_dim, hidden_dim) * scale_hh
        self.b_h = np.zeros(hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x_t, h_prev):
        """One timestep: takes input x_t and previous hidden state h_prev."""
        # h_t = tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
        self.x_t = x_t        # (input_dim,)
        self.h_prev = h_prev  # (hidden_dim,)
        self.z = x_t @ self.W_xh + h_prev @ self.W_hh + self.b_h  # (hidden_dim,)
        self.h_t = np.tanh(self.z)  # (hidden_dim,)
        return self.h_t
```
The key line is `self.z = x_t @ self.W_xh + h_prev @ self.W_hh + self.b_h`. Two matrix multiplications and an addition, squeezed through tanh. That's how a neural network remembers.
The Full Forward Pass
A single RNN cell processes one timestep. To handle a sequence, we loop over each timestep, passing the hidden state forward:
```python
class VanillaRNN:
    """RNN that processes a full sequence."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        self.cell = RNNCell(input_dim, hidden_dim)
        # Output projection: hidden state -> output
        scale_out = np.sqrt(2.0 / (hidden_dim + output_dim))
        self.W_out = np.random.randn(hidden_dim, output_dim) * scale_out
        self.b_out = np.zeros(output_dim)
        self.hidden_dim = hidden_dim

    def forward(self, xs, h_init=None):
        """Process a sequence of inputs.

        xs: list of input vectors, one per timestep
        Returns: list of outputs and list of hidden states
        """
        if h_init is None:
            h_init = np.zeros(self.hidden_dim)
        hiddens = []
        outputs = []
        h = h_init
        for x_t in xs:
            h = self.cell.forward(x_t, h)      # (hidden_dim,)
            hiddens.append(h)
            y_t = h @ self.W_out + self.b_out  # (output_dim,)
            outputs.append(y_t)
        return outputs, hiddens

# Quick test: 5-step sequence, input_dim=3, hidden_dim=8, output_dim=4
rnn = VanillaRNN(input_dim=3, hidden_dim=8, output_dim=4)
xs = [np.random.randn(3) for _ in range(5)]
outputs, hiddens = rnn.forward(xs)
print(f"Sequence length: {len(xs)}")              # 5
print(f"Hidden state shape: {hiddens[0].shape}")  # (8,)
print(f"Output shape: {outputs[0].shape}")        # (4,)
print(f"Final hidden state:\n{hiddens[-1].round(3)}")
```
Notice the sequential dependency: h at timestep t depends on h at timestep t−1. You cannot compute h5 without first computing h1 through h4. This is the fundamental bottleneck that transformers eliminate — but we're getting ahead of ourselves.
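Before training enters the picture, it's worth verifying the memory claim directly: perturb only the first input of a sequence, and the final hidden state moves. A minimal sketch with untrained random weights (the dimensions and the 0.4 weight scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 8, 5
W_xh = rng.standard_normal((input_dim, hidden_dim)) * 0.4
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.4

def final_hidden(xs):
    h = np.zeros(hidden_dim)
    for x_t in xs:  # h at step t needs h at step t-1: strictly sequential
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h

xs = [rng.standard_normal(input_dim) for _ in range(T)]
h_a = final_hidden(xs)

xs_b = [x.copy() for x in xs]
xs_b[0] += 1.0  # nudge only the FIRST input
h_b = final_hidden(xs_b)

print(np.abs(h_a - h_b).max())  # nonzero: the first input survives to the end
```

A stateless layer applied to the last timestep alone would show zero difference; the gap here is the memory.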
Backpropagation Through Time
Training an RNN means computing gradients with respect to the shared weights Wxh and Whh. When you unroll the RNN across time, it becomes a very deep feedforward network where every “layer” uses the same weights. Standard backprop applies — this variant is called Backpropagation Through Time (BPTT).
The key equation: the gradient of the loss at timestep T with respect to the hidden state at an earlier timestep k involves a chain of partial derivatives:

∂LT/∂hk = ∂LT/∂hT · ∏t=k+1…T (∂ht/∂ht−1)

Each factor ∂ht/∂ht−1 is diag(1 − ht²) · Whh (the derivative of tanh times the weight matrix). The gradient flows backward through every single timestep. If your sequence is 100 tokens long, you're multiplying 100 matrices together.
```python
def bptt(cell, xs, hiddens, d_outputs, W_out):
    """Backpropagation Through Time for a vanilla RNN.

    Returns gradients for all parameters.
    """
    T = len(xs)
    hidden_dim = cell.hidden_dim

    # Initialize gradients
    dW_xh = np.zeros_like(cell.W_xh)
    dW_hh = np.zeros_like(cell.W_hh)
    db_h = np.zeros_like(cell.b_h)
    dW_out = np.zeros((hidden_dim, d_outputs[0].shape[0]))
    db_out = np.zeros(d_outputs[0].shape[0])

    # Gradient flowing back from future timesteps
    dh_next = np.zeros(hidden_dim)

    for t in reversed(range(T)):
        # Gradient from output at this timestep
        dy = d_outputs[t]  # (output_dim,)
        dW_out += hiddens[t][:, None] @ dy[None, :]
        db_out += dy
        # Gradient into hidden state: from output + from future
        dh = dy @ W_out.T + dh_next  # (hidden_dim,)
        # Backprop through tanh: d/dz = (1 - tanh^2(z)) * dh
        dtanh = (1 - hiddens[t] ** 2) * dh  # (hidden_dim,)
        # Gradients for weights
        dW_xh += xs[t][:, None] @ dtanh[None, :]
        h_prev = hiddens[t - 1] if t > 0 else np.zeros(hidden_dim)
        dW_hh += h_prev[:, None] @ dtanh[None, :]
        db_h += dtanh
        # Pass gradient to previous timestep
        dh_next = dtanh @ cell.W_hh.T

    return dW_xh, dW_hh, db_h, dW_out, db_out
```
Look at the line `dh_next = dtanh @ cell.W_hh.T`. That's the gradient being passed backward one timestep. It gets multiplied by Whhᵀ at every step. And that repeated multiplication is where everything falls apart.
The Vanishing Gradient Problem
Here's the core issue. When you multiply a gradient by Whhᵀ at every timestep, what happens depends on the singular values of that matrix:
- If the largest singular value σmax < 1: the gradient shrinks exponentially. After 50 timesteps, a gradient of 1.0 becomes 0.9⁵⁰ ≈ 0.005. The network can't learn long-range dependencies.
- If σmax > 1: the gradient explodes exponentially. After 50 timesteps, 1.1⁵⁰ ≈ 117. Training becomes numerically unstable.
The tanh derivative makes things worse — its maximum value is 1 (at zero), and it approaches 0 for large activations, so it can only shrink the gradient further.
This isn't an edge case or a training trick gone wrong. It's a mathematical certainty for vanilla RNNs. Any sequence longer than about 10–20 timesteps will have near-zero gradients for early positions. The network literally cannot learn that the first word in a sentence affects the last word.
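The phase transition is easy to reproduce. The sketch below builds an idealized stand-in for Whh, a scaled orthogonal matrix whose singular values all equal σ, and pushes a unit-norm gradient through T backward steps (the helper name and dimensions are illustrative):

```python
import numpy as np

def gradient_norm_after(T, sigma, d=32, seed=0):
    """Norm of a unit gradient after T backward multiplications by W.T,
    where every singular value of W equals sigma."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
    W = sigma * Q                                     # all singular values = sigma
    g = np.ones(d) / np.sqrt(d)                       # unit-norm starting gradient
    for _ in range(T):
        g = g @ W.T
    return np.linalg.norm(g)

print(gradient_norm_after(50, 0.9))  # ~0.005: vanished
print(gradient_norm_after(50, 1.1))  # ~117:   exploded
```

A real Whh has a spread of singular values, which only makes things worse: some gradient directions vanish while others explode.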
Gradient clipping — capping the gradient norm to a maximum value — fixes the exploding gradient problem. But it can't fix vanishing gradients. You can prevent a gradient from becoming too large, but you can't conjure signal from nothing. This is why gradient clipping is standard practice for RNN training but doesn't solve the fundamental limitation.
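Clipping by global norm takes a few lines. A minimal sketch (the helper name and the max_norm of 5.0 are illustrative; real training loops pick the threshold empirically):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Scale all gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# An "exploded" set of gradients gets rescaled to norm 5; direction is preserved.
grads = [np.full((4, 4), 3.0), np.ones(8)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
print(norm_before)  # ~12.3, over the threshold
```

Note that clipping scales all parameter gradients by one shared factor, which preserves the gradient's direction; clipping each array independently would not.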
Hochreiter and Schmidhuber understood this in 1997, and their solution changed deep learning forever.
Interactive: Vanishing Gradient Visualizer
Watch how gradients flow backward through time. Adjust the weight matrix singular value to see the phase transition between vanishing and exploding gradients. Toggle LSTM to see how the cell state “highway” preserves gradient signal.
LSTM: Learning to Remember
In 1997, Sepp Hochreiter and Jürgen Schmidhuber published the Long Short-Term Memory network — arguably the most important neural architecture between the perceptron and the transformer. The key insight: if multiplying by a weight matrix at every timestep causes gradients to vanish, don't multiply by a weight matrix at every timestep.
The LSTM introduces a cell state ct that flows through time with only element-wise operations — no matrix multiplications. Information on this “highway” can travel across arbitrarily long sequences without the gradient degrading. Three gates control what gets added to, removed from, and read out of the cell state:
Forget gate: ft = σ(Wf · [ht−1, xt] + bf) — what old info to keep
Input gate: it = σ(Wi · [ht−1, xt] + bi) — what new info to write
Candidate: c̃t = tanh(Wc · [ht−1, xt] + bc) — the new info itself
Cell state: ct = ft ⊙ ct−1 + it ⊙ c̃t — blend old and new
Output gate: ot = σ(Wo · [ht−1, xt] + bo) — what to reveal
Hidden state: ht = ot ⊙ tanh(ct) — the output
Look at the cell state update: ct = ft ⊙ ct−1 + it ⊙ c̃t. The ⊙ symbol means element-wise multiplication. There's no Whh matrix in this path. The forget gate ft can be close to 1 for dimensions the network wants to remember, creating a near-perfect gradient highway. This is why LSTMs can learn dependencies across 100+ timesteps where vanilla RNNs fail at 15.
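A back-of-the-envelope comparison makes the highway concrete. Take 0.65 as a typical tanh derivative and 0.97 as a forget gate that has learned to remember (both numbers are illustrative, not derived):

```python
T = 100

# Vanilla RNN backward path: every step multiplies by tanh'(z) (at most 1,
# typically much less) times W_hh. Even with a perfectly behaved weight of
# exactly 1, a typical tanh derivative of ~0.65 shrinks the gradient geometrically.
rnn_factor = 0.65 ** T

# LSTM cell-state path: the only per-step factor is the forget gate itself.
lstm_factor = 0.97 ** T

print(rnn_factor)   # ~2e-19: no learning signal reaches timestep 0
print(lstm_factor)  # ~0.05:  usable gradient after 100 steps
```

Now the full cell, gates and all: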
```python
class LSTMCell:
    """A single LSTM cell with forget, input, and output gates."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        concat_dim = input_dim + hidden_dim
        scale = np.sqrt(2.0 / (concat_dim + hidden_dim))
        # Four gate weight matrices (concatenated input [h, x])
        self.W_f = np.random.randn(concat_dim, hidden_dim) * scale  # forget
        self.W_i = np.random.randn(concat_dim, hidden_dim) * scale  # input
        self.W_c = np.random.randn(concat_dim, hidden_dim) * scale  # candidate
        self.W_o = np.random.randn(concat_dim, hidden_dim) * scale  # output
        # Biases (forget gate bias initialized to 1 — remember by default)
        self.b_f = np.ones(hidden_dim)
        self.b_i = np.zeros(hidden_dim)
        self.b_c = np.zeros(hidden_dim)
        self.b_o = np.zeros(hidden_dim)

    def forward(self, x_t, h_prev, c_prev):
        """One timestep. Returns new hidden state and cell state."""
        concat = np.concatenate([h_prev, x_t])  # (input_dim + hidden_dim,)
        # Gates (sigmoid squashes to [0, 1] — they're "soft switches")
        f = sigmoid(concat @ self.W_f + self.b_f)   # forget gate
        i = sigmoid(concat @ self.W_i + self.b_i)   # input gate
        c_hat = np.tanh(concat @ self.W_c + self.b_c)  # candidate
        o = sigmoid(concat @ self.W_o + self.b_o)   # output gate
        # Cell state update: erase some old, add some new
        c_t = f * c_prev + i * c_hat  # (hidden_dim,)
        # Hidden state: filtered view of cell state
        h_t = o * np.tanh(c_t)  # (hidden_dim,)
        return h_t, c_t

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))
```
One subtle but critical detail: we initialize the forget gate bias to 1 (`self.b_f = np.ones(hidden_dim)`). This means the forget gate starts close to σ(1) ≈ 0.73, so the LSTM defaults to remembering. Without this trick (popularized by Jozefowicz et al., 2015), LSTMs learn much more slowly because they start by erasing everything.
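You can see the effect at initialization. With small early-training activations (the 0.1 scales and dimensions below are illustrative), a zero bias leaves the forget gate hovering at ~0.5, while a bias of 1 pushes it to ~0.73:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 8
concat = rng.standard_normal(input_dim + hidden_dim) * 0.1  # small activations
W_f = rng.standard_normal((input_dim + hidden_dim, hidden_dim)) * 0.1

f_zero_bias = sigmoid(concat @ W_f + np.zeros(hidden_dim))  # forgets ~half per step
f_ones_bias = sigmoid(concat @ W_f + np.ones(hidden_dim))   # keeps ~73% per step

print(f_zero_bias.mean())  # ~0.5
print(f_ones_bias.mean())  # ~0.73
```

A gate at 0.5 halves the retained cell state every timestep, so memory decays like 0.5ᵗ until the gates learn otherwise; starting at 0.73 buys a much longer effective horizon from step one.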
GRU: A Simpler Alternative
In 2014, Cho et al. asked: do we really need three gates and a separate cell state? The Gated Recurrent Unit simplifies the LSTM by merging the forget and input gates into a single “update gate” and combining the cell state and hidden state into one vector. Two gates instead of three, fewer parameters, often comparable performance:
Reset gate: rt = σ(Wr · [ht−1, xt] + br) — how much history feeds the candidate
Update gate: zt = σ(Wz · [ht−1, xt] + bz) — blend ratio of old vs new
Candidate: h̃t = tanh(Wh · [rt ⊙ ht−1, xt] + bh) — proposed new state
Hidden state: ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t — interpolate old and new
The update gate zt plays both roles: when zt ≈ 0, the hidden state is preserved unchanged (like a forget gate of 1); when zt ≈ 1, the hidden state is replaced with the candidate (like an input gate of 1). One gate, two jobs.
```python
class GRUCell:
    """A single GRU cell — simpler than LSTM, often equally effective."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        concat_dim = input_dim + hidden_dim
        scale = np.sqrt(2.0 / (concat_dim + hidden_dim))
        self.W_r = np.random.randn(concat_dim, hidden_dim) * scale  # reset
        self.W_z = np.random.randn(concat_dim, hidden_dim) * scale  # update
        self.W_h = np.random.randn(concat_dim, hidden_dim) * scale  # candidate
        self.b_r = np.zeros(hidden_dim)
        self.b_z = np.zeros(hidden_dim)
        self.b_h = np.zeros(hidden_dim)

    def forward(self, x_t, h_prev):
        """One timestep. Returns new hidden state."""
        concat = np.concatenate([h_prev, x_t])
        r = sigmoid(concat @ self.W_r + self.b_r)  # reset gate
        z = sigmoid(concat @ self.W_z + self.b_z)  # update gate
        # Candidate uses reset-gated version of previous hidden state
        concat_reset = np.concatenate([r * h_prev, x_t])
        h_hat = np.tanh(concat_reset @ self.W_h + self.b_h)
        # Interpolate between old and new
        h_t = (1 - z) * h_prev + z * h_hat  # (hidden_dim,)
        return h_t
```
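One way to check the "zt ≈ 0 preserves the state" claim: clamp the update gate shut with a large negative bias and compare ht to ht−1. A sketch with random untrained weights (the −10 bias is an artificial clamp, not a trained value):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 8
concat_dim = input_dim + hidden_dim
scale = np.sqrt(2.0 / (concat_dim + hidden_dim))
W_r = rng.standard_normal((concat_dim, hidden_dim)) * scale
W_z = rng.standard_normal((concat_dim, hidden_dim)) * scale
W_h = rng.standard_normal((concat_dim, hidden_dim)) * scale

h_prev = rng.standard_normal(hidden_dim)
x_t = rng.standard_normal(input_dim)
concat = np.concatenate([h_prev, x_t])

b_z = -10.0 * np.ones(hidden_dim)  # clamp: z = sigmoid(... - 10) is ~0
r = sigmoid(concat @ W_r)
z = sigmoid(concat @ W_z + b_z)
h_hat = np.tanh(np.concatenate([r * h_prev, x_t]) @ W_h)
h_t = (1 - z) * h_prev + z * h_hat

print(np.abs(h_t - h_prev).max())  # tiny: the state passes through untouched
```

A trained GRU does this selectively, per dimension and per timestep, rather than via a fixed bias.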
Here's the parameter comparison across all three architectures:
| Architecture | Weight Matrices | Parameters (d=128, input=64) | Gates |
|---|---|---|---|
| Vanilla RNN | Wxh, Whh | 24,704 | 0 |
| GRU | Wr, Wz, Wh | 74,112 | 2 |
| LSTM | Wf, Wi, Wc, Wo | 98,816 | 3 + cell |
LSTM has ~4× the parameters of a vanilla RNN. GRU has ~3×. The extra cost buys you the ability to learn dependencies across hundreds of timesteps instead of tens. In practice, GRU and LSTM perform similarly on most tasks — GRU is preferred when parameter budget is tight, LSTM when you need maximum capacity.
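The table's numbers can be checked directly. A quick sketch (counts include gate biases but exclude the output projection, matching the table; the helper names are illustrative):

```python
def rnn_params(input_dim, d):
    return input_dim * d + d * d + d      # W_xh, W_hh, b_h

def gru_params(input_dim, d):
    return 3 * ((input_dim + d) * d + d)  # W_r, W_z, W_h plus biases

def lstm_params(input_dim, d):
    return 4 * ((input_dim + d) * d + d)  # W_f, W_i, W_c, W_o plus biases

print(rnn_params(64, 128))   # 24704
print(gru_params(64, 128))   # 74112
print(lstm_params(64, 128))  # 98816
```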
Character-Level Text Generation
The best way to see the difference between these architectures is to train them on a task that requires memory: predicting the next character in a text sequence. This is the simplest possible language model — the same task GPT does with subword tokens, but at the character level.
The setup: encode each character as a one-hot vector, feed the sequence into the RNN one character at a time, and at each step predict the next character using a softmax over the vocabulary. The loss is cross-entropy, exactly as in the loss functions post.
```python
def char_rnn_forward(cell, xs_onehot, W_out, b_out, h_init, c_init=None):
    """Forward pass for character-level language model.

    Returns outputs (logits), hidden states, and cell states (if LSTM).
    """
    hiddens, cells, outputs = [], [], []
    h = h_init
    c = c_init  # Only used for LSTM
    for x_t in xs_onehot:
        if c is not None:  # LSTM
            h, c = cell.forward(x_t, h, c)
            cells.append(c)
        else:  # Vanilla RNN or GRU
            h = cell.forward(x_t, h)
        hiddens.append(h)
        logits = h @ W_out + b_out  # (vocab_size,)
        outputs.append(logits)
    return outputs, hiddens, cells

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy_loss(logits_list, targets_onehot):
    """Average cross-entropy loss over all timesteps."""
    loss = 0.0
    for logits, target in zip(logits_list, targets_onehot):
        probs = softmax(logits)
        loss -= np.log(probs[target.argmax()] + 1e-12)
    return loss / len(logits_list)

# Training sketch (simplified):
# 1. Sample a random window from the text
# 2. One-hot encode the characters
# 3. Forward pass through the RNN
# 4. Compute cross-entropy loss at each timestep
# 5. BPTT to get gradients
# 6. Update weights with gradient descent
# 7. Repeat for thousands of iterations
```
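The payoff of that loop is generation: feed the model's own sample back in as the next input. The sketch below uses untrained random weights and a toy five-character vocabulary, so the output is gibberish, but the feed-the-sample-back-in structure is exactly what a trained model uses (all names, sizes, and scales here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("helo ")  # toy vocabulary
V, H = len(vocab), 16

# Untrained weights; a real run would use parameters fitted by the BPTT loop.
W_xh = rng.standard_normal((V, H)) * 0.3
W_hh = rng.standard_normal((H, H)) * 0.3
W_out = rng.standard_normal((H, V)) * 0.3
b_h, b_out = np.zeros(H), np.zeros(V)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample(seed_char, n_chars, temperature=1.0):
    """Generate n_chars characters, feeding each sample back as the next input."""
    h = np.zeros(H)
    idx = vocab.index(seed_char)
    out = [seed_char]
    for _ in range(n_chars):
        x = np.zeros(V)
        x[idx] = 1.0                                    # one-hot input
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)          # one RNN step
        probs = softmax((h @ W_out + b_out) / temperature)
        idx = rng.choice(V, p=probs)                    # sample the next char
        out.append(vocab[idx])
    return "".join(out)

print(sample("h", 20))
```

Note the O(1) state: generation carries only the fixed-size h forward, no growing cache.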
What does each architecture learn? Given enough training on English text:
- Vanilla RNN: learns character frequencies, common bigrams (th, he, in), and short words. Beyond ~10 characters of context, it loses coherence and generates plausible-looking but meaningless character sequences.
- LSTM: learns word-level patterns, common phrases, bracket matching (opening and closing quotes), and maintains topic coherence for 50+ characters. Can generate recognizable English sentences.
- GRU: similar quality to LSTM on most text. Slightly faster to train due to fewer parameters.
Andrej Karpathy's legendary 2015 blog post “The Unreasonable Effectiveness of Recurrent Neural Networks” showed character-level LSTMs generating Shakespeare, Wikipedia articles, LaTeX source code, and even C code with matched braces. The LSTM was learning structure far beyond what anyone expected from a character-level model.
Interactive: RNN Memory Trace
Type a sentence and watch the hidden state evolve character by character. Each column is one hidden dimension, color-coded from blue (negative) to red (positive). Compare how vanilla RNN states “drift” while LSTM states preserve specific information across long sequences.
The Bridge to Transformers
Now we can state precisely what's wrong with RNNs and why transformers exist.
The bottleneck problem. An RNN encodes a 100-word sentence into a single hidden vector — typically 256 to 1024 dimensions. Everything the model knows about the sentence must be compressed into that one vector. For short sentences this works fine. For long ones, early information gets overwritten. The LSTM slows the overwriting, but it can't stop it.
In 2014, Bahdanau, Cho, and Bengio proposed a radical fix: don't compress. Instead, let the decoder look back at all encoder hidden states and compute a weighted sum based on relevance. This is the attention mechanism, and it was originally designed for RNNs, not transformers.
Three years later, Vaswani et al. asked the obvious next question: if attention is doing all the heavy lifting, why keep the RNN at all? Strip out the recurrence entirely, keep only attention, add positional encoding to tell the model about token order (since it no longer gets position for free from sequential processing), and you have the transformer.
The payoff is parallelism. An RNN must process tokens sequentially — ht depends on ht−1, full stop. A transformer computes attention over all tokens simultaneously. The attention computation is O(n²) in sequence length (more work per step), but it's fully parallelizable across GPU cores. On modern hardware, doing more math in parallel is vastly faster than doing less math sequentially.
| Property | RNN | Transformer |
|---|---|---|
| Processing | Sequential (ht depends on ht−1) | Parallel (all tokens at once) |
| Long-range deps | Vanishing gradient limits range | Direct attention at any distance |
| Position info | Free (from sequential order) | Must be added (positional encoding) |
| Inference memory | O(1) per token (fixed hidden state) | O(n) per token (growing KV cache) |
| Training compute | O(n) sequential, can't parallelize | O(n²) but fully parallelizable |
One area where RNNs actually win: inference memory. An RNN processes one token and updates its fixed-size hidden state — O(1) memory regardless of sequence length. A transformer must store the KV cache for all previous tokens, which grows linearly with sequence length. This is why state space models like Mamba (Gu & Dao, 2023) revisit the recurrent approach — they achieve transformer-quality results with RNN-like O(1) inference cost. The story isn't over.
Interactive: Sequential vs Parallel Processing
See the fundamental difference between RNN and transformer processing. The RNN must process tokens one at a time (each depends on the previous). The transformer processes all tokens in parallel. Adjust the sequence length to see how the gap widens.
Connections to the Series
RNNs connect to nearly every concept in the elementary series — in many cases, they motivated the very design choices we studied:
- Micrograd — BPTT is standard backpropagation unrolled across timesteps; the same chain rule, applied across a deeper computational graph
- Attention — attention was invented specifically to fix the RNN bottleneck (Bahdanau et al., 2014), then Vaswani et al. removed the RNN entirely
- Positional Encoding — RNNs get position information for free from sequential processing; transformers need explicit positional encoding because they process all tokens in parallel
- KV Cache — transformers need the KV cache for efficient autoregressive generation; RNNs are naturally autoregressive with O(1) state
- Embeddings — character and word embeddings feed into RNNs the same way token embeddings feed into transformers
- Loss Functions — cross-entropy loss for next-token prediction is identical whether the backbone is an RNN or transformer
- Normalization — LayerNorm + residual connections in transformers solve the same gradient flow problem that LSTM gates were designed for
- Softmax & Temperature — the same softmax distribution and temperature sampling apply to RNN outputs
- Decoding Strategies — beam search and top-k sampling were originally developed for RNN-based sequence models
- ConvNets — CNNs conquered space while RNNs conquered time; together they were the two pre-transformer paradigms for perception
References & Further Reading
- Elman (1990) — Finding Structure in Time — the paper that introduced the simple recurrent network (Elman network)
- Hochreiter & Schmidhuber (1997) — Long Short-Term Memory — the LSTM paper that solved the vanishing gradient problem for sequence modeling
- Cho et al. (2014) — Learning Phrase Representations using RNN Encoder-Decoder — introduced the GRU as a simpler alternative to LSTM
- Bahdanau, Cho & Bengio (2015) — Neural Machine Translation by Jointly Learning to Align and Translate — the paper that introduced attention for RNN-based machine translation
- Vaswani et al. (2017) — Attention Is All You Need — removed the RNN entirely, keeping only attention
- Pascanu, Mikolov & Bengio (2013) — On the difficulty of training recurrent neural networks — rigorous analysis of the vanishing and exploding gradient problem
- Karpathy (2015) — The Unreasonable Effectiveness of Recurrent Neural Networks — the blog post that showed the world what character-level LSTMs can do
- Gu & Dao (2023) — Mamba: Linear-Time Sequence Modeling with Selective State Spaces — the modern RNN revival, achieving transformer quality with recurrent-like O(1) inference