
Encoder-Decoder from Scratch

The Sequence-to-Sequence Problem

The transformer was born as an encoder-decoder. We just forgot.

When Vaswani et al. published "Attention Is All You Need" in 2017, the architecture had two halves: an encoder that reads the input and a decoder that writes the output. Then GPT came along and said "what if we just keep the decoder?" — and the simplification worked so well for language modeling that the encoder half quietly faded from mainstream attention. But the full encoder-decoder architecture never went away. It powers Whisper (speech recognition), T5 (text-to-text), BART (summarization), and every production machine translation system. This post builds it from scratch.

The fundamental challenge is called sequence-to-sequence (seq2seq): map a variable-length input to a variable-length output. Translation turns "the cat sat" into "le chat s'est assis" — three tokens in, five tokens out. Summarization condenses a paragraph into a sentence. Speech recognition converts thousands of audio frames into a few dozen words. The input and output can differ in length, structure, and even modality.

The breakthrough insight from Sutskever, Vinyals, and Le (2014) was deceptively simple: use one neural network (the encoder) to read the entire input and compress it into a fixed-length vector, then use a second network (the decoder) to generate the output token-by-token, conditioned on that vector. Two networks, one "context vector" bridging them.

Let's build the original LSTM-based seq2seq. We'll train it on a reversal task — input ABCD, output DCBA — because it's simple enough to fit in a code block but hard enough to expose the architecture's bottleneck.

import numpy as np

def sigmoid(x): return 1 / (1 + np.exp(-np.clip(x, -15, 15)))
def tanh_act(x): return np.tanh(np.clip(x, -15, 15))

def lstm_cell(x, h, c, W, b):
    """Single LSTM step: x (input_dim,), h (hidden,), c (hidden,)"""
    concat = np.concatenate([x, h])          # (input_dim + hidden,)
    gates = W @ concat + b                    # (4*hidden,)
    hid = h.shape[0]
    f = sigmoid(gates[:hid])                  # forget gate
    i = sigmoid(gates[hid:2*hid])             # input gate
    o = sigmoid(gates[2*hid:3*hid])           # output gate
    g = tanh_act(gates[3*hid:])               # candidate
    c_new = f * c + i * g
    h_new = o * tanh_act(c_new)
    return h_new, c_new

def encode(sequence, emb, W_enc, b_enc, hidden_dim):
    """Read input sequence, return final hidden state."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for token in sequence:
        x = emb[token]
        h, c = lstm_cell(x, h, c, W_enc, b_enc)
    return h, c  # The "context vector"

def decode(context_h, context_c, target_len, emb, W_dec, b_dec, W_out, b_out, vocab_size):
    """Generate output from context vector using greedy decoding."""
    h, c = context_h, context_c
    outputs = []
    token = 0  # Start token
    for _ in range(target_len):
        x = emb[token]
        h, c = lstm_cell(x, h, c, W_dec, b_dec)
        logits = W_out @ h + b_out
        token = np.argmax(logits)
        outputs.append(token)
    return outputs

# Setup: vocab = {0: START, 1: A, 2: B, 3: C, 4: D, 5: E}
np.random.seed(42)
vocab_size, emb_dim, hidden_dim = 6, 8, 32
emb = np.random.randn(vocab_size, emb_dim) * 0.1
W_enc = np.random.randn(4*hidden_dim, emb_dim+hidden_dim) * 0.05
b_enc = np.zeros(4*hidden_dim)
W_dec = np.random.randn(4*hidden_dim, emb_dim+hidden_dim) * 0.05
b_dec = np.zeros(4*hidden_dim)
W_out = np.random.randn(vocab_size, hidden_dim) * 0.05
b_out = np.zeros(vocab_size)

# Test reversal: input [1,2,3,4] (ABCD) → should output [4,3,2,1] (DCBA)
inp = [1, 2, 3, 4]
h_ctx, c_ctx = encode(inp, emb, W_enc, b_enc, hidden_dim)
result = decode(h_ctx, c_ctx, 4, emb, W_dec, b_dec, W_out, b_out, vocab_size)

print(f"Context vector shape: {h_ctx.shape}")  # (32,) — entire input in one vector!
print(f"Input:  {inp}")
print(f"Output: {result}")  # Random before training — the architecture is correct
# After training: perfect at length 4, degrades at length 8+

The architecture works, but notice the pinch point: the entire input sequence gets squeezed into a single 32-dimensional vector. For our 4-token reversal, that's fine. But try length 16 and the context vector can't hold enough information. This is the bottleneck problem — and it motivated the single most important idea in deep learning.

Attention Solves the Bottleneck

In 2015, Bahdanau, Cho, and Bengio proposed an elegant fix: instead of forcing the entire input through a single vector, let the decoder look back at every encoder hidden state, every time it generates a token. This is cross-attention.

At each decoder step, we compute a similarity score between the current decoder hidden state and each encoder hidden state. Those scores become attention weights (via softmax), and we take a weighted sum of encoder states to form a dynamic context vector. The context changes at every step: when the decoder generates "le", it focuses on "the"; when it generates "chat", it shifts focus to "cat".

Bahdanau used additive attention: the score is computed by passing the decoder and encoder states through a small neural network.

score(h_dec, h_enc) = v^T tanh(W_1 h_enc + W_2 h_dec)

This looks more complex than dot-product attention, but the intuition is the same: how relevant is each input position to what I'm currently generating? The attention post covers the formal mechanics; here we see why it was invented — to fix the seq2seq bottleneck.

def attention(dec_hidden, enc_outputs, W1, W2, v):
    """Bahdanau additive attention.
    dec_hidden: (hidden,)  — current decoder state
    enc_outputs: (seq_len, hidden) — all encoder states
    Returns: context (hidden,), weights (seq_len,)
    """
    # Score each encoder position against decoder state
    # W1 @ enc + W2 @ dec, then through tanh, then v dot product
    scores = np.array([
        v @ tanh_act(W1 @ enc_h + W2 @ dec_hidden)
        for enc_h in enc_outputs
    ])
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()           # softmax
    context = weights @ enc_outputs             # weighted sum
    return context, weights

def encode_all(sequence, emb, W_enc, b_enc, hidden_dim):
    """Encode and return ALL hidden states (not just the last one)."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    all_h = []
    for token in sequence:
        x = emb[token]
        h, c = lstm_cell(x, h, c, W_enc, b_enc)
        all_h.append(h.copy())
    return np.array(all_h), h, c  # (seq_len, hidden), final_h, final_c

def decode_with_attention(enc_outputs, init_h, init_c, target_len,
                          emb, W_dec, b_dec, W_out, b_out,
                          W_a1, W_a2, v_a, vocab_size):
    """Decode with attention over encoder outputs."""
    h, c = init_h, init_c
    outputs, all_weights = [], []
    token = 0  # Start token
    for _ in range(target_len):
        # Attend to encoder outputs
        ctx, weights = attention(h, enc_outputs, W_a1, W_a2, v_a)
        all_weights.append(weights)
        # Concatenate attention context with token embedding
        x = np.concatenate([emb[token], ctx])
        h, c = lstm_cell(x, h, c, W_dec, b_dec)
        logits = W_out @ h + b_out
        token = np.argmax(logits)
        outputs.append(token)
    return outputs, np.array(all_weights)

# Setup attention parameters
attn_dim = 16
W_a1 = np.random.randn(attn_dim, hidden_dim) * 0.05
W_a2 = np.random.randn(attn_dim, hidden_dim) * 0.05
v_a = np.random.randn(attn_dim) * 0.05
# Decoder now takes emb_dim + hidden_dim as input (token + attention context)
W_dec_attn = np.random.randn(4*hidden_dim, emb_dim+hidden_dim+hidden_dim) * 0.05

# Encode with all states preserved, then decode with attention
enc_states, h_final, c_final = encode_all([1,2,3,4], emb, W_enc, b_enc, hidden_dim)
result, attn_wts = decode_with_attention(enc_states, h_final, c_final, 4,
                                         emb, W_dec_attn, b_dec, W_out, b_out,
                                         W_a1, W_a2, v_a, vocab_size)
print(f"Encoder outputs shape: {enc_states.shape}")   # (4, 32) — all states kept!
print(f"Attention weights shape: {attn_wts.shape}")   # (4, 4) — one row per decoder step
print(f"Bottleneck model: 1 vector of dim 32")
print(f"Attention model:  4 vectors of dim 32 — no information lost")

# After training, attention weights show alignment:
# Decoder step 0 (generating D) → attends to position 3 (D in input)
# Decoder step 1 (generating C) → attends to position 2 (C in input)
# This is the classic alignment heatmap from the Bahdanau paper

The difference is stark. The bottleneck model crams the input into a single vector of shape (32,). The attention model keeps all encoder states — shape (4, 32) — and lets the decoder dynamically select which ones matter. For a reversal task, the attention weights learn a clean diagonal pattern: to output position i, attend to input position n-i. For translation, the pattern follows the word alignment between languages.
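That anti-diagonal pattern can be written down directly. A minimal sketch (not trained weights, just the target pattern) of what the attention matrix converges toward on the reversal task:

```python
import numpy as np

n = 4
# Ideal reversal alignment: output position i attends to input position n-1-i
ideal = np.fliplr(np.eye(n))
print(ideal)
# Row 0 (generating D) puts all its weight on column 3 (D in the input)
```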

This is the insight that changed everything: don't compress, attend. The attention mechanism is so powerful that it rendered the fixed context vector obsolete and paved the way for the transformer.

The Transformer Encoder-Decoder

Vaswani et al. (2017) took the next logical step: if attention is doing all the heavy lifting, why keep the LSTM at all? Replace the recurrent backbone entirely with self-attention layers, and you get the transformer encoder-decoder — the original architecture from "Attention Is All You Need."

The encoder uses self-attention: each input position attends to every other input position. There's no causal mask — the encoder is fully bidirectional. Token A can see token D and vice versa. This builds a rich, context-aware representation of the entire input.

The decoder is more complex. Each layer has two attention mechanisms:

  1. Masked self-attention — the decoder attends to previous output positions only (causal mask, just like in a decoder-only transformer)
  2. Cross-attention — the decoder attends to all encoder positions, using encoder outputs as keys and values

Why two attentions? Self-attention lets the decoder build coherent output (each generated token considers all previous tokens). Cross-attention grounds that output in the input (each generated token can reach back to any input token). They serve fundamentally different roles.

The information flow is: input tokens → encoder (bidirectional self-attention, N layers) → encoder hidden states → decoder cross-attends to those states while generating output autoregressively. Let's build it.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multihead_attention(Q, K, V, W_q, W_k, W_v, W_o, n_heads, mask=None):
    """Multi-head attention. Q,K,V: (seq_len, d_model)"""
    d_model = Q.shape[-1]
    d_head = d_model // n_heads
    q = (Q @ W_q).reshape(-1, n_heads, d_head).transpose(1,0,2)  # (heads, seq, d_head)
    k = (K @ W_k).reshape(-1, n_heads, d_head).transpose(1,0,2)
    v = (V @ W_v).reshape(-1, n_heads, d_head).transpose(1,0,2)
    scores = q @ k.transpose(0,2,1) / np.sqrt(d_head)            # (heads, seq_q, seq_k)
    if mask is not None:
        scores = scores + mask  # mask is -inf where blocked
    weights = softmax(scores, axis=-1)
    out = (weights @ v).transpose(1,0,2).reshape(-1, d_model)     # (seq_q, d_model)
    return out @ W_o, weights[0]  # Return first head's weights for visualization

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU activation

def encoder_layer(x, params):
    """One encoder layer: self-attention + FFN, both with residual + layer norm."""
    attn_out, _ = multihead_attention(x, x, x,
        params['Wq_s'], params['Wk_s'], params['Wv_s'], params['Wo_s'],
        n_heads=2)
    x = layer_norm(x + attn_out)               # Residual + norm
    ff_out = ffn(x, params['W1'], params['b1'], params['W2'], params['b2'])
    return layer_norm(x + ff_out)               # Residual + norm

def decoder_layer(x, enc_out, params, causal_mask):
    """One decoder layer: masked self-attn + cross-attn + FFN."""
    # 1. Masked self-attention (decoder attends to itself, causally)
    self_attn, _ = multihead_attention(x, x, x,
        params['Wq_s'], params['Wk_s'], params['Wv_s'], params['Wo_s'],
        n_heads=2, mask=causal_mask)
    x = layer_norm(x + self_attn)
    # 2. Cross-attention (decoder queries, encoder keys/values)
    cross_attn, cross_weights = multihead_attention(x, enc_out, enc_out,
        params['Wq_c'], params['Wk_c'], params['Wv_c'], params['Wo_c'],
        n_heads=2)
    x = layer_norm(x + cross_attn)
    # 3. Feed-forward
    ff_out = ffn(x, params['W1'], params['b1'], params['W2'], params['b2'])
    return layer_norm(x + ff_out), cross_weights

def make_causal_mask(seq_len):
    mask = np.full((seq_len, seq_len), -1e9)
    return np.triu(mask, k=1)  # Upper triangle = blocked

# Initialize a small transformer encoder-decoder
d_model, n_heads = 32, 2
d_ff = 64

def init_attn_params():
    return {
        'Wq_s': np.random.randn(d_model, d_model)*0.05,
        'Wk_s': np.random.randn(d_model, d_model)*0.05,
        'Wv_s': np.random.randn(d_model, d_model)*0.05,
        'Wo_s': np.random.randn(d_model, d_model)*0.05,
        'W1': np.random.randn(d_model, d_ff)*0.05, 'b1': np.zeros(d_ff),
        'W2': np.random.randn(d_ff, d_model)*0.05, 'b2': np.zeros(d_model),
    }

def init_decoder_params():
    p = init_attn_params()
    p.update({  # Add cross-attention weights
        'Wq_c': np.random.randn(d_model, d_model)*0.05,
        'Wk_c': np.random.randn(d_model, d_model)*0.05,
        'Wv_c': np.random.randn(d_model, d_model)*0.05,
        'Wo_c': np.random.randn(d_model, d_model)*0.05,
    })
    return p

# Run one forward pass
enc_params = init_attn_params()
dec_params = init_decoder_params()
emb_table = np.random.randn(vocab_size, d_model) * 0.1

# Encode: input tokens → bidirectional self-attention (no mask!)
enc_input = emb_table[[1,2,3,4]]                        # ABCD embeddings
enc_output = encoder_layer(enc_input, enc_params)        # (4, 32)

# Decode: output tokens → masked self-attention + cross-attention to encoder
dec_input = emb_table[[0,4,3,2]]                         # START,D,C,B (teacher forcing)
mask = make_causal_mask(4)
dec_output, cross_wts = decoder_layer(dec_input, enc_output, dec_params, mask)

print(f"Encoder output: {enc_output.shape}")   # (4, 32) — bidirectional repr
print(f"Decoder output: {dec_output.shape}")   # (4, 32) — with cross-attention
print(f"Cross-attention weights shape: {cross_wts.shape}")  # (4, 4) — decoder→encoder
print(f"\nKey difference from decoder-only:")
print(f"  Encoder sees ALL positions (bidirectional)")
print(f"  Decoder sees past positions (causal) + ALL encoder positions (cross-attn)")

Look at the shapes. The encoder produces a (4, 32) matrix — one rich, bidirectional representation per input token. The decoder produces (4, 32) too, but each of those vectors was built from three information sources: (1) the output token embedding, (2) masked self-attention over previous output tokens, and (3) cross-attention over all encoder states. Compare this to a decoder-only transformer, which has only sources (1) and (2).

The cross-attention weights are a (4, 4) matrix — for each decoder position, how much it attended to each encoder position. This is exactly the alignment matrix from Bahdanau, now computed with dot-product attention inside a transformer.
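Reading a hard alignment out of such a matrix is just an argmax per decoder row. A quick sketch using made-up weights of the kind a trained reversal model produces:

```python
import numpy as np

# Hypothetical cross-attention weights after training on reversal:
# each row is one decoder step's softmax over the 4 encoder positions
W = np.array([
    [0.05, 0.05, 0.10, 0.80],   # generating D: weight on input position 3
    [0.05, 0.10, 0.80, 0.05],   # generating C: weight on input position 2
    [0.10, 0.80, 0.05, 0.05],   # generating B: weight on input position 1
    [0.80, 0.10, 0.05, 0.05],   # generating A: weight on input position 0
])
alignment = W.argmax(axis=1)
print(alignment)   # [3 2 1 0], the anti-diagonal from the Bahdanau heatmaps
```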

Encoder-Decoder vs Decoder-Only

This is the question that shaped modern AI: if the encoder-decoder works, why did GPT drop the encoder?

A decoder-only model (GPT, LLaMA, Claude) handles seq2seq by concatenating input and output into a single sequence: [input tokens | SEP | output tokens], all processed left-to-right with causal attention. The "input" tokens can attend to each other, but only in one direction — token 3 can see tokens 1 and 2, but not token 4. There's no separate encoder, no cross-attention, just one unified sequence.

An encoder-decoder model (T5, BART, Whisper) processes the input bidirectionally in the encoder — every token sees every other token — then generates output with cross-attention grounding. Two separate stacks, two attention patterns.
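The visibility difference between the two can be sketched directly as boolean masks; the row sums count how many positions each input token can see:

```python
import numpy as np

n = 4
causal = np.tril(np.ones((n, n), dtype=bool))  # decoder-only: position i sees 0..i
bidir = np.ones((n, n), dtype=bool)            # encoder: every position sees all n

print(causal.sum(axis=1))  # [1 2 3 4]: the first token sees only itself
print(bidir.sum(axis=1))   # [4 4 4 4]: full bidirectional context
```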

The tradeoff comes down to this:

Aspect               | Encoder-Decoder                  | Decoder-Only
---------------------|----------------------------------|---------------------------------
Input understanding  | Bidirectional — full context     | Unidirectional (causal mask)
Architecture         | Two stacks + cross-attention     | One stack — simpler
Scaling              | More complex to scale            | Scales elegantly
Best for             | Translation, ASR, summarization  | Open-ended generation, reasoning
KV caching           | Encoder cached once, reused      | Single growing cache
Major models         | T5, BART, Whisper, NLLB          | GPT-4, Claude, LLaMA

Let's see the difference in code. A decoder-only model handles translation by treating it as a continuation task — prefix the input, then generate:

def prefix_lm_forward(input_tokens, output_tokens, emb, params, d_model):
    """Decoder-only approach to seq2seq: concatenate input + output."""
    # Combine into single sequence: [input | output]
    full_seq = input_tokens + output_tokens
    x = emb[full_seq]                                    # (inp+out, d_model)
    mask = make_causal_mask(len(full_seq))                # Causal: left-to-right
    # Self-attention only — no cross-attention, no encoder
    attn_out, _ = multihead_attention(x, x, x,
        params['Wq_s'], params['Wk_s'], params['Wv_s'], params['Wo_s'],
        n_heads=2, mask=mask)
    out = layer_norm(x + attn_out)
    ff_out = ffn(out, params['W1'], params['b1'], params['W2'], params['b2'])
    return layer_norm(out + ff_out)

def enc_dec_forward(input_tokens, output_tokens, emb, enc_params, dec_params):
    """Encoder-decoder approach: encode bidirectionally, decode with cross-attention."""
    # Encoder: bidirectional (no mask)
    enc_out = encoder_layer(emb[input_tokens], enc_params)
    # Decoder: causal self-attention + cross-attention to encoder
    dec_mask = make_causal_mask(len(output_tokens))
    dec_out, cross_wts = decoder_layer(emb[output_tokens], enc_out, dec_params, dec_mask)
    return dec_out, cross_wts

# Compare on same task: reverse [1,2,3,4] → [4,3,2,1]
inp, out = [1,2,3,4], [0,4,3,2]  # START + target

# Decoder-only: sees input causally (token 1 can't see tokens 2, 3, 4)
prefix_out = prefix_lm_forward(inp, out, emb_table, init_attn_params(), d_model)
print(f"Decoder-only output: {prefix_out.shape}")  # (8, 32) — full concat

# Encoder-decoder: encoder sees everything, decoder cross-attends
ed_out, ed_wts = enc_dec_forward(inp, out, emb_table, enc_params, dec_params)
print(f"Encoder-decoder output: {ed_out.shape}")   # (4, 32) — just decoder output
print(f"Cross-attention matrix: {ed_wts.shape}")    # (4, 4)

print("\nThe key difference:")
print("  Decoder-only: input token 1 sees [1] — just itself")
print("  Encoder:      input token 1 sees [1,2,3,4] — ALL tokens")
print("  This bidirectionality is why encoder-decoder wins for translation")

The critical difference is what the input tokens can see. In the decoder-only model, input token 1 can only see itself (causal mask). In the encoder, input token 1 sees all four input tokens simultaneously. For translation, this matters enormously — the meaning of a word often depends on what comes after it ("bank" means something different in "river bank" vs "bank account"). The encoder captures this bidirectional context; the decoder-only model has to approximate it through massive scale.

So why did decoder-only win for LLMs? Three reasons: (1) simpler architecture means easier scaling, (2) open-ended generation doesn't have a clear "input" to encode separately, and (3) with enough parameters, causal attention can implicitly learn bidirectional-like representations. But for structured tasks with clear input/output boundaries — translation, speech recognition, summarization — the encoder-decoder's explicit separation still wins.

Try It: Encoder-Decoder Visualizer

Type a short sequence and watch tokens flow through the encoder (bidirectional self-attention), then see the decoder generate output token-by-token with cross-attention arrows reaching back to the encoder.


The Encoder-Decoder Zoo

The encoder-decoder paradigm didn't stay in academia. It powers some of the most widely deployed AI systems in the world.

T5 (Raffel et al., 2020) showed that every NLP task can be framed as text-to-text seq2seq. Classification? Input: "classify: I love this movie" → Output: "positive". Translation? Input: "translate English to German: The house is big" → Output: "Das Haus ist groß". The encoder builds a bidirectional representation of the input prompt, and the decoder generates the answer. T5 was pre-trained with span corruption: randomly mask spans of tokens, then predict the missing spans. This denoising objective teaches the encoder to understand context and the decoder to generate completions — a natural fit for the architecture.

BART (Lewis et al., 2020) took a similar approach but with more aggressive corruption: token masking, deletion, sentence permutation, and text rotation. BART's encoder learns to understand corrupted text, and its decoder learns to reconstruct the original. This makes BART particularly good at summarization — the encoder processes the full document bidirectionally, and the decoder generates a concise summary grounded in that representation.

Whisper (Radford et al., 2023) is perhaps the most elegant use of encoder-decoder. The input is audio (mel spectrogram frames) and the output is text. These are fundamentally different modalities with different sequence lengths — exactly the scenario where a separate encoder shines. The encoder processes the audio with bidirectional self-attention, building a rich audio representation. The decoder generates the transcript token-by-token, cross-attending to the audio features. You can't easily do this with a decoder-only model because the audio and text "tokens" live in completely different spaces.

NLLB (No Language Left Behind, 2022) scales encoder-decoder to 200+ languages for machine translation. The shared encoder-decoder architecture handles any language pair: same encoder, same decoder, different input/output languages. Cross-attention bridges the language gap.

The pattern is clear: encoder-decoder excels when input and output are structurally different — different modalities (audio → text), different languages (English → German), or different formats (document → summary).

Training Tricks

Training encoder-decoder models introduces challenges that don't exist in decoder-only models. The most important one is teacher forcing.

During training, the decoder generates tokens one at a time. At each step, it needs the previous token as input. But which "previous token"? The one the model predicted (which might be wrong) or the ground-truth token? If we feed the model's own predictions, early errors cascade — one wrong token derails the entire sequence. If we feed ground-truth tokens, the decoder sees perfect input during training but imperfect input during inference.

Teacher forcing feeds the ground-truth previous tokens during training. It's like a student practicing with the answer key visible — they learn fast but never experience mistakes. Without it, training barely converges. With it, there's a train/inference mismatch called exposure bias.

# Teacher forcing vs free-running decoding

def train_step_teacher_forcing(enc_out, target_tokens, params):
    """Training: feed ground-truth tokens to decoder.
    (decode_step is a placeholder for one full decoder forward pass.)"""
    dec_input = target_tokens[:-1]   # [START, t1, t2, ..., tn-1]
    dec_target = target_tokens[1:]   # [t1, t2, ..., tn]
    # Decoder always sees the CORRECT previous token
    # Fast convergence, but exposure bias at inference time
    return decode_step(enc_out, dec_input, params)

def train_step_scheduled_sampling(enc_out, target_tokens, params, epsilon):
    """Scheduled sampling: mix ground-truth and model predictions.
    (decode_one_step is a placeholder for a single decoder step.)"""
    outputs = []
    prev_token = target_tokens[0]  # START
    for t in range(1, len(target_tokens)):
        pred = decode_one_step(enc_out, prev_token, params)
        outputs.append(pred)
        # Choose the NEXT step's input: ground truth with probability
        # epsilon, otherwise the model's own prediction
        if np.random.random() < epsilon:
            prev_token = target_tokens[t]  # Ground truth
        else:
            prev_token = pred              # Model's own output
    return outputs

# Epsilon schedule: start at 1.0 (all teacher forcing), decay to 0.0
# This gradually exposes the model to its own mistakes
for epoch in range(100):
    epsilon = max(0.0, 1.0 - epoch / 80)  # Linear decay over 80 epochs
    # Early training: epsilon ≈ 1.0 → almost all teacher forcing
    # Late training:  epsilon ≈ 0.0 → almost all free-running

Scheduled sampling (Bengio et al., 2015) bridges the gap: start with 100% teacher forcing, then gradually replace ground-truth tokens with the model's own predictions. By the end of training, the model is comfortable with its own imperfect inputs. Think of it as removing the training wheels slowly.

Two other tricks matter for encoder-decoder training. Label smoothing softens the target from a hard one-hot vector to a slightly uniform distribution (e.g., 90% on the correct token, 10% spread across all others). This prevents the model from becoming overconfident and improves generalization. And cross-attention caching: at inference time, the encoder runs once and its key/value matrices are cached, then reused at every decoder step. This is why encoder-decoder can be faster than decoder-only for long inputs — the encoder computation is amortized. The KV cache post covers the caching mechanics in detail.
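Label smoothing is a two-line transform on the target distribution. A minimal sketch, spreading the smoothing mass over the full vocabulary (one common formulation; whether the correct class is included in the spread varies by implementation):

```python
import numpy as np

def smooth_targets(target_idx, vocab_size, eps=0.1):
    """Soft target: (1 - eps) on the correct token plus eps spread
    uniformly over the whole vocabulary."""
    dist = np.full(vocab_size, eps / vocab_size)
    dist[target_idx] += 1.0 - eps
    return dist

dist = smooth_targets(3, vocab_size=6, eps=0.1)
print(dist.round(3))          # correct token gets ~0.917, the rest ~0.017 each
print(round(dist.sum(), 10))  # 1.0
```

Training against this soft distribution with cross-entropy penalizes the model for pushing the correct token's probability all the way to 1.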

When to Use What

After building both architectures from scratch, here's the practical decision framework:

Use encoder-decoder when:

  1. Input and output are structurally different: different modalities (audio → text), different languages, or different formats (document → summary)
  2. The task has a clear input/output boundary: translation, speech recognition, summarization
  3. A long input is reused across many output tokens, so the one-time encoder pass can be cached and amortized

Use decoder-only when:

  1. Generation is open-ended and there's no clean "input" to encode separately
  2. Architectural simplicity and scaling matter most: one stack, one attention pattern
  3. The task is general language modeling, chat, or multi-step reasoning

The modern landscape is converging. Some decoder-only models like Gemini handle multimodal inputs by encoding images and audio into "tokens" that fit the causal attention pattern. But the encoder-decoder isn't obsolete — it's specialized. Whisper is still an encoder-decoder. So is every production translation system. The architecture persists because some problems genuinely benefit from bidirectional encoding plus cross-attention grounding.

Try It: Architecture Race

Watch encoder-decoder and decoder-only train simultaneously on a sequence reversal task. Adjust sequence length to see where each architecture shines.


Conclusion

The transformer was born as an encoder-decoder in "Attention Is All You Need." The decoder-only simplification (GPT) proved spectacularly effective for language modeling, but the full architecture never disappeared — it just specialized.

We traced the historical arc: LSTM seq2seq introduced the encoder-decoder paradigm with a fixed context vector. Attention solved the bottleneck by letting the decoder look back at all encoder states. The transformer replaced LSTMs with self-attention and cross-attention. And finally, the decoder-only model asked "what if we skip the encoder?" — and for open-ended generation, the answer was "it works brilliantly."

But Whisper, T5, BART, and every translation system prove that when input and output are structurally different, the full encoder-decoder still reigns. The two attention types tell the whole story: self-attention builds rich representations within a sequence; cross-attention bridges two separate worlds. Understanding both architectures — and when each shines — is what separates someone who uses transformers from someone who truly understands them.
