
Decoding Strategies from Scratch: How LLMs Choose Their Next Word

The Last Mile of Language Generation

Every time you hit "send" in ChatGPT, the model does something remarkable. It takes your prompt, runs it through billions of parameters across dozens of attention layers, and arrives at a single question: what word comes next?

But here's the twist — it doesn't actually know. What it has is a probability distribution. A ranked list of every possible next token, each with a confidence score. The token "Paris" might get 0.42, "Lyon" might get 0.08, "France" might get 0.06, and 49,997 other tokens share the remaining mass.

The method you use to pick from that distribution — the decoding strategy — fundamentally shapes the output. The same model with the same prompt produces boring repetition, creative prose, or incoherent nonsense depending entirely on this one choice.

In our softmax & temperature post, we built the function that converts raw logits into probabilities, and we briefly touched on sampling strategies. Today we go deep. We'll implement every major decoding strategy from scratch in Python, see exactly why each one exists by watching the previous one fail, and build an interactive playground where you can watch these strategies compete in real time.

Our Toy Language Model

Real LLMs have vocabularies of 50,000–100,000 tokens and use neural networks to predict the next one. We don't need all that machinery to understand decoding. Instead, we'll build a tiny bigram model — a hand-crafted transition table that says "given the current word, here's how likely each next word is."

import numpy as np

vocab = ["the", "cat", "dog", "sat", "ran", "on", "a", "big",
         "small", "mat", "park", "happy", "and", "was", "red"]

# Transition logits: given a word, how likely is each next word?
# Only strong connections listed — everything else gets -1.0
transitions = {
    "the":   {"cat": 3.2, "dog": 2.8, "big": 1.5, "red": 1.2,
              "small": 1.0, "mat": 0.8, "park": 0.7},
    "cat":   {"sat": 3.5, "ran": 2.8, "was": 2.0, "and": 1.0},
    "dog":   {"ran": 3.0, "sat": 2.0, "was": 2.0, "and": 1.5},
    "sat":   {"on": 3.5, "and": 1.0, "happy": 0.3},
    "ran":   {"on": 1.5, "and": 1.0, "a": 0.5},
    "on":    {"the": 3.5, "a": 2.5, "big": 0.5},
    "a":     {"big": 2.5, "small": 2.0, "red": 1.8, "happy": 1.2,
              "cat": 0.8, "dog": 0.8, "mat": 0.5},
    "big":   {"cat": 2.5, "dog": 2.0, "mat": 1.5, "red": 0.8},
    "small": {"cat": 2.5, "dog": 2.0, "mat": 1.5, "red": 0.8},
    "mat":   {"and": 2.5, "was": 2.0},
    "park":  {"and": 2.0, "was": 1.5, "on": 0.8},
    "happy": {"cat": 2.5, "dog": 2.0, "and": 1.5},
    "and":   {"the": 3.0, "a": 2.5, "ran": 1.0, "sat": 0.8},
    "was":   {"happy": 2.0, "big": 1.5, "small": 1.2, "red": 1.0,
              "on": 0.5},
    "red":   {"cat": 2.5, "dog": 2.0, "mat": 1.8},
}

def get_logits(token):
    """Return logits for the next token. Unknown transitions get -1.0."""
    logits = np.full(len(vocab), -1.0)
    if token in transitions:
        for word, value in transitions[token].items():
            logits[vocab.index(word)] = value
    return logits

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities with temperature scaling."""
    scaled = logits / temperature
    exp_vals = np.exp(scaled - np.max(scaled))  # numerical stability
    return exp_vals / exp_vals.sum()

This gives us a 15-word vocabulary with hand-crafted transition preferences. After "the", the model strongly prefers "cat" and "dog". After "cat", it expects "sat" or "ran". The get_logits function returns raw scores, and softmax converts them to probabilities — exactly like the final layer of a real language model. The -1.0 default means unspecified transitions are very unlikely but never truly impossible.
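As a quick sanity check, here's the distribution the model assigns after "the". The snippet mirrors just that one row of the table so it runs on its own:

```python
import numpy as np

vocab = ["the", "cat", "dog", "sat", "ran", "on", "a", "big",
         "small", "mat", "park", "happy", "and", "was", "red"]
# The "the" row of the transition table above; everything else defaults to -1.0.
row = {"cat": 3.2, "dog": 2.8, "big": 1.5, "red": 1.2,
       "small": 1.0, "mat": 0.8, "park": 0.7}

logits = np.full(len(vocab), -1.0)
for word, value in row.items():
    logits[vocab.index(word)] = value

probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in sorted(zip(vocab, probs), key=lambda t: -t[1])[:3]:
    print(f"{word}: {p:.2f}")
# cat: 0.42
# dog: 0.28
# big: 0.08
```

Even the "impossible" transitions keep about 0.6% each, and that small residue is exactly the long tail that pure sampling will stumble into later.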

Greedy Decoding: Always Pick the Winner

The simplest possible strategy: at every step, pick the token with the highest probability. No randomness, no exploration — just argmax.

def greedy_decode(start_token, num_tokens=15):
    """Always pick the most probable next token."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits)
        next_idx = np.argmax(probs)
        tokens.append(vocab[next_idx])
    return tokens

print(" ".join(greedy_decode("the")))
# the cat sat on the cat sat on the cat sat on the cat sat on

And there it is — the classic failure mode. Greedy decoding falls into a repetition loop. "The cat sat on" has the highest probability at every step, so the model follows the same path forever. It's locally optimal (each individual token is the best choice) but globally terrible (the sequence is useless).

This happens because greedy decoding has no memory and no exploration. It can't say "I already wrote 'the cat sat on' — let me try something different." It just stares at the probabilities and picks the peak, every single time.

Greedy decoding still has its place — for factual Q&A or code generation, where there's one clearly correct answer, argmax is fast and reliable. But for anything requiring creativity or variety, we need randomness. What if we embraced it instead of running from it?

Pure Random Sampling: The Other Extreme

Instead of always picking the top token, what if we sampled randomly from the entire probability distribution? Each token gets selected in proportion to its probability — likely tokens appear often, unlikely tokens appear rarely, but everything has a shot.

def random_sample(start_token, num_tokens=15):
    """Sample from the full probability distribution."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits)
        next_idx = np.random.choice(len(vocab), p=probs)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(42)
print(" ".join(random_sample("the")))
# Example: the dog and a happy park on big mat was on a red cat ran

Better — no repetition loop. But look carefully: "happy park"? "on big mat"? The text wanders into incoherent territory because even low-probability tokens occasionally get selected.

This is the long tail problem. In our 15-word vocabulary, the bottom 8 tokens after "the" share about 5% of the total probability. That sounds small, but over a long sequence, rare picks accumulate and the text drifts into nonsense. In a real LLM with 50,000 tokens, the long tail is massive — thousands of tokens that are individually unlikely but collectively hold substantial probability mass.
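We can measure that tail directly, again mirroring just the "the" row of the table so the snippet stands alone:

```python
import numpy as np

# The "the" row: 7 specified logits, 8 defaults at -1.0 (as in the table above).
logits = np.array([3.2, 2.8, 1.5, 1.2, 1.0, 0.8, 0.7] + [-1.0] * 8)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

tail_mass = np.sort(probs)[:8].sum()   # mass held by the 8 least likely tokens
print(f"tail mass after 'the': {tail_mass:.3f}")
# tail mass after 'the': 0.050
```

Roughly one sample in twenty lands in that tail, so over a 15-token sequence you should expect at least one off-distribution pick more often than not.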

We need something between these extremes. A dial that lets us move smoothly from "boringly predictable" to "chaotically random."

Temperature: The Confidence Dial

We built temperature from scratch in the softmax post, so here's the quick recap: divide the logits by a temperature value T before applying softmax. Low temperature sharpens the distribution (the rich get richer), high temperature flattens it (everyone gets more equal).

def temperature_sample(start_token, num_tokens=15, temperature=1.0):
    """Sample with temperature-scaled probabilities."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits, temperature=temperature)
        next_idx = np.random.choice(len(vocab), p=probs)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(7)
# Low temperature: almost greedy
print("T=0.3:", " ".join(temperature_sample("the", 15, 0.3)))
# T=0.3: the cat sat on the cat sat on the cat sat on the cat sat on

# Medium temperature: balanced
print("T=0.8:", " ".join(temperature_sample("the", 15, 0.8)))
# T=0.8: the cat ran and a big dog sat on the dog was happy cat and

# High temperature: chaotic
print("T=1.5:", " ".join(temperature_sample("the", 15, 1.5)))
# T=1.5: the mat ran park and happy red small sat happy on the cat mat a

Temperature gives us a useful spectrum. At T=0.3, the distribution is so sharp that it's essentially greedy. At T=0.8, we get nice variety while keeping coherence. At T=1.5, the flattened distribution lets too many unlikely tokens through.

But temperature has a fundamental limitation: it's a global knob. The same temperature applies to every prediction, regardless of context. When the model sees "The capital of France is ___" it's very confident ("Paris" at 95%), and low temperature is perfect. But when it sees "The sunset was ___", there are genuinely many good options ("beautiful", "stunning", "red", "fading"), and higher temperature helps.

You can't set one temperature that handles both cases well. What we really want is to truncate the distribution — chop off the unreliable tail and only sample from the plausible tokens.
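To see the tension concretely, here's a sketch comparing the top-token probability in a confident context ("sat") and an uncertain one ("a") across temperatures. The logit rows are copied from the transition table above:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    scaled = logits / temperature
    exp_vals = np.exp(scaled - np.max(scaled))
    return exp_vals / exp_vals.sum()

# Rows from the transition table (the order of entries doesn't matter here):
sat_logits = np.array([3.5, 1.0, 0.3] + [-1.0] * 12)                   # confident
a_logits = np.array([2.5, 2.0, 1.8, 1.2, 0.8, 0.8, 0.5] + [-1.0] * 8)  # uncertain

for T in (0.5, 1.0, 1.5):
    p_sat = softmax(sat_logits, T).max()
    p_a = softmax(a_logits, T).max()
    print(f"T={T}: P(top | 'sat')={p_sat:.2f}  P(top | 'a')={p_a:.2f}")
```

At every temperature the confident context stays far more peaked than the uncertain one; no single T treats both well.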

Top-k Sampling: Cutting the Tail

Top-k sampling takes a simple approach to truncation: keep only the k most probable tokens, zero out everything else, and renormalize so the remaining probabilities sum to 1. This was popularized by Fan et al. in 2018 and became the default sampling strategy for GPT-2's release.

def top_k_sample(start_token, num_tokens=15, k=5, temperature=1.0):
    """Sample from only the top-k most probable tokens."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits, temperature=temperature)

        # Zero out everything except the top k
        top_k_idx = np.argsort(probs)[-k:]
        filtered = np.zeros_like(probs)
        filtered[top_k_idx] = probs[top_k_idx]
        filtered /= filtered.sum()  # renormalize

        next_idx = np.random.choice(len(vocab), p=filtered)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(7)
print("k=3:", " ".join(top_k_sample("the", 15, k=3)))
# k=3: the cat ran and the dog sat on a big cat sat on a big

print("k=8:", " ".join(top_k_sample("the", 15, k=8)))
# k=8: the cat ran and a red dog was big mat and the big dog sat on

With k=3, we only consider the three most likely tokens at each step. The text stays coherent because garbage tokens are excluded, but we still get variety because we're sampling among the top candidates rather than always picking the winner.

But top-k has a subtle problem. Watch what happens with different distributions:

Scenario A: After "sat", our model is very confident — "on" has about 80% probability. With k=8, we're including seven tokens the model considers unlikely. We're adding noise to a confident prediction.

Scenario B: After "a", the model is uncertain — probability is spread across "big" (32%), "small" (19%), "red" (16%), "happy" (9%), and more. With k=3, we're chopping off perfectly valid options like "happy" and "cat". We're restricting a genuinely open-ended choice.

The fundamental problem: k is fixed, but the right number of candidates varies with every prediction. Sometimes the model needs 2 candidates, sometimes 20. A fixed window can't adapt. This exact frustration led to the most elegant solution in the decoding toolkit.

Top-p (Nucleus) Sampling: The Adaptive Threshold

In 2019, Ari Holtzman and colleagues published "The Curious Case of Neural Text Degeneration," a paper that changed how the entire industry generates text. Their key observation: the probability distributions that language models produce vary wildly in shape. Sometimes they're peaked (one clear winner), sometimes they're flat (many valid options). A good decoding strategy should adapt to this shape automatically.

Their solution — nucleus sampling — replaces the fixed count k with a cumulative probability threshold p. Sort the tokens by probability, then include tokens from the top until their cumulative probability reaches p. The set of included tokens is called the nucleus.

def nucleus_sample(start_token, num_tokens=15, p=0.9, temperature=1.0):
    """Sample from the smallest token set whose cumulative probability exceeds p."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits, temperature=temperature)

        # Sort by probability, descending
        sorted_idx = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_idx]

        # Find nucleus: smallest set with cumulative prob >= p
        cumulative = np.cumsum(sorted_probs)
        cutoff = np.searchsorted(cumulative, p) + 1

        # Keep only nucleus tokens
        nucleus_idx = sorted_idx[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[nucleus_idx] = probs[nucleus_idx]
        filtered /= filtered.sum()

        next_idx = np.random.choice(len(vocab), p=filtered)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(7)
print("p=0.9:", " ".join(nucleus_sample("the", 15, p=0.9)))
# p=0.9: the cat ran and the big dog was happy cat sat on a big cat ran

The magic is in how the nucleus size adapts automatically. Let's trace through two predictions to see it:

# After "sat": model is confident
probs_after_sat = softmax(get_logits("sat"))
# "on"=0.80, "and"=0.07, "happy"=0.03, ...
# Cumulative: 0.80, 0.86, 0.89, 0.90 → nucleus = 4 tokens

# After "a": model is uncertain
probs_after_a = softmax(get_logits("a"))
# "big"=0.32, "small"=0.19, "red"=0.16, "happy"=0.09, "cat"=0.06, ...
# Cumulative: 0.32, 0.52, 0.67, 0.76, 0.82, 0.88, 0.92 → nucleus = 7 tokens

print("Nucleus after 'sat': 4 tokens (confident)")
print("Nucleus after 'a':   7 tokens (uncertain)")

When the model is confident, the nucleus stays small — just a handful of tokens. When the model is genuinely uncertain, the nucleus expands to include all the plausible options. This is exactly the adaptive behavior that top-k couldn't provide.

In practice, p=0.9 to p=0.95 works well for most tasks. The nucleus captures the "core" of the distribution — all the tokens the model actually considered — while excluding the unreliable long tail that causes incoherent text. There's a beautiful connection to information theory here: nucleus sampling implicitly respects the entropy of each distribution, including more tokens when uncertainty is genuinely high.
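That connection is easy to check numerically. The sketch below computes the Shannon entropy of each distribution alongside its nucleus size, with logit rows copied from the table above:

```python
import numpy as np

def softmax(logits):
    exp_vals = np.exp(logits - np.max(logits))
    return exp_vals / exp_vals.sum()

def entropy(probs):
    """Shannon entropy in nats: higher means more uncertainty."""
    return float(-(probs * np.log(probs)).sum())

def nucleus_size(probs, p=0.9):
    """Number of tokens in the smallest set with cumulative prob >= p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p) + 1)

sat_probs = softmax(np.array([3.5, 1.0, 0.3] + [-1.0] * 12))
a_probs = softmax(np.array([2.5, 2.0, 1.8, 1.2, 0.8, 0.8, 0.5] + [-1.0] * 8))

print(f"'sat': H={entropy(sat_probs):.2f} nats, nucleus={nucleus_size(sat_probs)}")
print(f"'a':   H={entropy(a_probs):.2f} nats, nucleus={nucleus_size(a_probs)}")
# 'sat': H=0.97 nats, nucleus=4
# 'a':   H=2.02 nats, nucleus=7
```

Higher entropy, bigger nucleus: the threshold p tracks the distribution's uncertainty without us ever measuring it explicitly.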

Nucleus sampling elegantly solves the truncation problem. But what about a completely different approach — one that considers entire sequences rather than individual tokens?

Beam Search: Planning Ahead

Every strategy we've seen so far is myopic — it picks one token at a time without considering how that choice affects future tokens. Greedy picks the best single token, but that might lead to a dead end. Beam search takes a different approach: instead of committing to one path, it explores multiple paths simultaneously and picks the best sequence.

The algorithm maintains b partial sequences (called "beams"). At each step, it extends every beam with every possible next token, scores the resulting sequences by their total log-probability, and keeps only the top b candidates.

def beam_search(start_token, num_tokens=10, beam_width=3):
    """Find the most probable sequence by exploring multiple paths."""
    # Each beam: (token_list, cumulative_log_probability)
    beams = [([start_token], 0.0)]

    for step in range(num_tokens):
        candidates = []
        for tokens, score in beams:
            logits = get_logits(tokens[-1])
            probs = softmax(logits)
            log_probs = np.log(probs + 1e-10)

            for i, lp in enumerate(log_probs):
                new_seq = tokens + [vocab[i]]
                candidates.append((new_seq, score + lp))

        # Keep only the top beam_width candidates
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]

    return beams[0][0]  # return the highest-scoring beam

print(" ".join(beam_search("the", num_tokens=12, beam_width=3)))
# the cat sat on the cat sat on the cat sat on the

Wait — beam search also produces repetition? Yes, and for an interesting reason. Beam search optimizes for the most probable sequence, and in language models, repetitive patterns genuinely have high joint probability. "The cat sat on" repeated is a perfectly valid, high-probability sequence. Beam search finds it because that's literally its job — finding the globally most likely output.

In practice, beam search adds a length penalty to avoid favoring short sequences (shorter sequences have fewer probability terms multiplied together, so they naturally score higher). Production systems also use n-gram blocking to prevent repetition.
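A common form is the length normalization from Google's NMT system (Wu et al., 2016): divide each beam's log-probability by a power of its length. The scores below are illustrative numbers, not from our toy model:

```python
def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5 + length) / 6) ** alpha

def normalized_score(log_prob, length, alpha=0.6):
    # Longer beams accumulate more negative log-prob terms; dividing by
    # the penalty keeps them competitive with shorter ones.
    return log_prob / length_penalty(length, alpha)

# Raw score favors the short beam; the normalized score flips the ranking:
print(normalized_score(-4.0, length=5))    # short beam
print(normalized_score(-4.2, length=12))   # longer beam now scores higher
```

With alpha=0 the penalty vanishes and you're back to raw log-probability; alpha near 1 aggressively favors longer outputs.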

Beam search shines in tasks where there's one "right" answer: machine translation ("Translate: Le chat est noir"), image captioning, or summarization. It fell out of favor for open-ended generation because it's deterministic — same input always produces the same output — and because nucleus sampling produces more natural, diverse text. But understanding beam search matters because it reveals a fundamental insight: the best individual tokens don't always form the best sequence.

Repetition Penalties & the Modern Decoding Stack

We've seen that every strategy — even nucleus sampling — can sometimes produce repetitive text. The solution is a simple but effective hack: penalize tokens that have already appeared.

def sample_with_penalty(start_token, num_tokens=15, p=0.9,
                         temperature=0.8, repetition_penalty=1.3):
    """Nucleus sampling with repetition penalty."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])

        # Penalize previously generated tokens
        for prev_token in set(tokens):
            idx = vocab.index(prev_token)
            if logits[idx] > 0:
                logits[idx] /= repetition_penalty  # reduce positive logits
            else:
                logits[idx] *= repetition_penalty  # push negative logits further down

        probs = softmax(logits, temperature=temperature)

        # Apply nucleus sampling
        sorted_idx = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[sorted_idx])
        cutoff = np.searchsorted(cumulative, p) + 1
        nucleus_idx = sorted_idx[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[nucleus_idx] = probs[nucleus_idx]
        filtered /= filtered.sum()

        next_idx = np.random.choice(len(vocab), p=filtered)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(7)
print(" ".join(sample_with_penalty("the", 20)))
# the cat sat on a big dog ran and was happy small mat park red the cat and a small dog

The repetition penalty divides the logit of any previously-seen token by a penalty factor, making it less likely to appear again. Notice how the output now explores more of the vocabulary — "park" and "small" and "mat" all get their moment.

Real LLM APIs offer a few flavors of repetition control: a presence penalty (a flat penalty applied once a token has appeared at all), a frequency penalty (which grows with how often the token has appeared), and the multiplicative repetition penalty we implemented above. OpenAI's API exposes the first two; open-source stacks like Hugging Face transformers expose the third.

The modern production decoding stack typically combines temperature, nucleus sampling, and a touch of repetition control. Sensible starting points are temperature around 0.7–0.8 with top-p around 0.9–0.95: lower the temperature for factual or code-generation tasks, raise it for creative writing, and add penalties only when you actually observe repetition.
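OpenAI's API, for instance, documents two subtractive penalties applied to the logits before sampling: a frequency penalty scaled by how many times a token has already appeared, and a one-time presence penalty. A minimal sketch (the function and parameter names here are mine):

```python
import numpy as np

def apply_penalties(logits, counts, freq_penalty=0.5, presence_penalty=0.5):
    """Subtract freq_penalty per prior occurrence of each token, plus a
    flat presence_penalty if the token has appeared at all."""
    counts = np.asarray(counts, dtype=float)
    return logits - counts * freq_penalty - (counts > 0) * presence_penalty

logits = np.array([2.0, 2.0, 2.0])
counts = np.array([0, 1, 3])   # how often each token has been generated so far
print(apply_penalties(logits, counts))
# [2. 1. 0.]
```

Unlike our multiplicative version, the subtractive form behaves identically for positive and negative logits, which avoids the sign-dependent branching in sample_with_penalty.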

Newer strategies continue to push the frontier. Min-p sampling scales the truncation threshold relative to the top token's probability, providing even better adaptation than top-p. Speculative decoding uses a small draft model to generate candidate tokens quickly, then has the large model verify them in parallel — same output quality, much faster generation.
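Here's a sketch of min-p truncation: the cutoff scales with the top token's probability, so confident distributions keep few candidates and flat ones keep many (the distributions below are illustrative):

```python
import numpy as np

def min_p_filter(probs, min_p=0.1):
    """Keep tokens with prob >= min_p * max(probs), then renormalize."""
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

confident = np.array([0.80, 0.10, 0.05, 0.05])
flat = np.array([0.30, 0.25, 0.25, 0.20])

print((min_p_filter(confident) > 0).sum())   # cutoff 0.08: 2 tokens survive
print((min_p_filter(flat) > 0).sum())        # cutoff 0.03: all 4 survive
```

Like nucleus sampling, the kept set adapts per prediction, but the threshold is anchored to the winner's probability rather than to cumulative mass.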

Try It: The Decoding Playground

Watch how different strategies choose tokens from the same probability distribution. Click Step to generate one token at a time, or Auto to watch the model write continuously.


Conclusion

We've traveled from the simplest possible strategy (pick the argmax) to the state of the art (nucleus sampling with penalties), and each step was motivated by a concrete failure of the previous approach:

  1. Greedy always picks the top token → falls into repetition loops
  2. Random sampling considers everything → long-tail tokens cause incoherence
  3. Temperature adjusts confidence globally → can't adapt to varying contexts
  4. Top-k truncates to a fixed window → too many candidates when confident, too few when uncertain
  5. Top-p (nucleus) adapts the window to the distribution's shape → the elegant solution
  6. Beam search optimizes whole sequences → deterministic, better suited for translation than conversation
  7. Repetition penalties discourage tokens from reappearing → the finishing touch

With this post, the inference pipeline is complete: tokenize → embed → attend → softmax → decode. We've now built every major piece of how a language model turns your prompt into a response, all from scratch. The next time you adjust a "temperature" slider or wonder why an LLM repeats itself, you'll know exactly what's happening under the hood.

References & Further Reading