Decoding Strategies from Scratch: How LLMs Choose Their Next Word
The Last Mile of Language Generation
Every time you hit "send" in ChatGPT, the model does something remarkable. It takes your prompt, runs it through billions of parameters across dozens of attention layers, and arrives at a single question: what word comes next?
But here's the twist — it doesn't actually know. What it has is a probability distribution. A ranked list of every possible next token, each with a confidence score. The token "Paris" might get 0.42, "Lyon" might get 0.08, "France" might get 0.06, and 49,997 other tokens share the remaining mass.
The method you use to pick from that distribution — the decoding strategy — fundamentally shapes the output. The same model with the same prompt produces boring repetition, creative prose, or incoherent nonsense depending entirely on this one choice.
In our softmax & temperature post, we built the function that converts raw logits into probabilities, and we briefly touched on sampling strategies. Today we go deep. We'll implement every major decoding strategy from scratch in Python, see exactly why each one exists by watching the previous one fail, and build an interactive playground where you can watch these strategies compete in real time.
Our Toy Language Model
Real LLMs have vocabularies of 50,000–100,000 tokens and use neural networks to predict the next one. We don't need all that machinery to understand decoding. Instead, we'll build a tiny bigram model — a hand-crafted transition table that says "given the current word, here's how likely each next word is."
import numpy as np

vocab = ["the", "cat", "dog", "sat", "ran", "on", "a", "big",
         "small", "mat", "park", "happy", "and", "was", "red"]

# Transition logits: given a word, how likely is each next word?
# Only strong connections listed — everything else gets -1.0
transitions = {
    "the": {"cat": 3.2, "dog": 2.8, "big": 1.5, "red": 1.2,
            "small": 1.0, "mat": 0.8, "park": 0.7},
    "cat": {"sat": 3.5, "ran": 2.8, "was": 2.0, "and": 1.0},
    "dog": {"ran": 3.0, "sat": 2.0, "was": 2.0, "and": 1.5},
    "sat": {"on": 3.5, "and": 1.0, "happy": 0.3},
    "ran": {"on": 1.5, "and": 1.0, "a": 0.5},
    "on": {"the": 3.5, "a": 2.5, "big": 0.5},
    "a": {"big": 2.5, "small": 2.0, "red": 1.8, "happy": 1.2,
          "cat": 0.8, "dog": 0.8, "mat": 0.5},
    "big": {"cat": 2.5, "dog": 2.0, "mat": 1.5, "red": 0.8},
    "small": {"cat": 2.5, "dog": 2.0, "mat": 1.5, "red": 0.8},
    "mat": {"and": 2.5, "was": 2.0},
    "park": {"and": 2.0, "was": 1.5, "on": 0.8},
    "happy": {"cat": 2.5, "dog": 2.0, "and": 1.5},
    "and": {"the": 3.0, "a": 2.5, "ran": 1.0, "sat": 0.8},
    "was": {"happy": 2.0, "big": 1.5, "small": 1.2, "red": 1.0,
            "on": 0.5},
    "red": {"cat": 2.5, "dog": 2.0, "mat": 1.8},
}

def get_logits(token):
    """Return logits for the next token. Unknown transitions get -1.0."""
    logits = np.full(len(vocab), -1.0)
    if token in transitions:
        for word, value in transitions[token].items():
            logits[vocab.index(word)] = value
    return logits

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities with temperature scaling."""
    scaled = logits / temperature
    exp_vals = np.exp(scaled - np.max(scaled))  # numerical stability
    return exp_vals / exp_vals.sum()
This gives us a 15-word vocabulary with hand-crafted transition preferences. After "the", the model strongly prefers "cat" and "dog". After "cat", it expects "sat" or "ran". The get_logits function returns raw scores, and softmax converts them to probabilities — exactly like the final layer of a real language model. The -1.0 default means unspecified transitions are very unlikely but never truly impossible.
Greedy Decoding: Always Pick the Winner
The simplest possible strategy: at every step, pick the token with the highest probability. No randomness, no exploration — just argmax.
def greedy_decode(start_token, num_tokens=15):
    """Always pick the most probable next token."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits)
        next_idx = np.argmax(probs)
        tokens.append(vocab[next_idx])
    return tokens

print(" ".join(greedy_decode("the")))
# the cat sat on the cat sat on the cat sat on the cat sat on
And there it is — the classic failure mode. Greedy decoding falls into a repetition loop. "The cat sat on" has the highest probability at every step, so the model follows the same path forever. It's locally optimal (each individual token is the best choice) but globally terrible (the sequence is useless).
This happens because greedy decoding has no memory and no exploration. It can't say "I already wrote 'the cat sat on' — let me try something different." It just stares at the probabilities and picks the peak, every single time.
Greedy decoding still has its place — for factual Q&A or code generation, where there's one clearly correct answer, argmax is fast and reliable. But for anything requiring creativity or variety, we need randomness. What if we embraced it instead of running from it?
Pure Random Sampling: The Other Extreme
Instead of always picking the top token, what if we sampled randomly from the entire probability distribution? Each token gets selected in proportion to its probability — likely tokens appear often, unlikely tokens appear rarely, but everything has a shot.
def random_sample(start_token, num_tokens=15):
    """Sample from the full probability distribution."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits)
        next_idx = np.random.choice(len(vocab), p=probs)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(42)
print(" ".join(random_sample("the")))
# Example: the dog and a happy park on big mat was on a red cat ran
Better — no repetition loop. But look carefully: "happy park"? "on big mat"? The text wanders into incoherent territory because even low-probability tokens occasionally get selected.
This is the long tail problem. In our 15-word vocabulary, the bottom 8 tokens after "the" share about 5% of the total probability. That sounds small, but over a long sequence, rare picks accumulate and the text drifts into nonsense. In a real LLM with 50,000 tokens, the long tail is massive — thousands of tokens that are individually unlikely but collectively hold substantial probability mass.
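That tail mass is easy to measure. Here's a quick check, with the after-"the" logits copied inline so the snippet runs on its own:

```python
import numpy as np

# After "the": seven listed transitions, plus eight tokens at the -1.0 default.
logits = np.array([3.2, 2.8, 1.5, 1.2, 1.0, 0.8, 0.7] + [-1.0] * 8)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

tail_mass = probs[-8:].sum()  # combined probability of the eight tail tokens
print(f"Long-tail mass after 'the': {tail_mass:.1%}")  # Long-tail mass after 'the': 5.0%
```

Individually each tail token sits around 0.6%, but together they get picked roughly once every twenty tokens, which is all it takes to derail a sentence.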
We need something between these extremes. A dial that lets us move smoothly from "boringly predictable" to "chaotically random."
Temperature: The Confidence Dial
We built temperature from scratch in the softmax post, so here's the quick recap: divide the logits by a temperature value T before applying softmax. Low temperature sharpens the distribution (the rich get richer), high temperature flattens it (everyone gets more equal).
def temperature_sample(start_token, num_tokens=15, temperature=1.0):
    """Sample with temperature-scaled probabilities."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits, temperature=temperature)
        next_idx = np.random.choice(len(vocab), p=probs)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(7)

# Low temperature: almost greedy
print("T=0.3:", " ".join(temperature_sample("the", 15, 0.3)))
# T=0.3: the cat sat on the cat sat on the cat sat on the cat sat on

# Medium temperature: balanced
print("T=0.8:", " ".join(temperature_sample("the", 15, 0.8)))
# T=0.8: the cat ran and a big dog sat on the dog was happy cat and

# High temperature: chaotic
print("T=1.5:", " ".join(temperature_sample("the", 15, 1.5)))
# T=1.5: the mat ran park and happy red small sat happy on the cat mat a
Temperature gives us a useful spectrum. At T=0.3, the distribution is so sharp that it's essentially greedy. At T=0.8, we get nice variety while keeping coherence. At T=1.5, the flattened distribution lets too many unlikely tokens through.
But temperature has a fundamental limitation: it's a global knob. The same temperature applies to every prediction, regardless of context. When the model sees "The capital of France is ___" it's very confident ("Paris" at 95%), and low temperature is perfect. But when it sees "The sunset was ___", there are genuinely many good options ("beautiful", "stunning", "red", "fading"), and higher temperature helps.
You can't set one temperature that handles both cases well. What we really want is to truncate the distribution — chop off the unreliable tail and only sample from the plausible tokens.
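The tension is visible on two tiny made-up distributions (the logits below are illustrative, not from the bigram model). Lowering T sharpens both distributions and raising T flattens both; no single setting treats the confident case and the uncertain case differently:

```python
import numpy as np

def softmax_t(logits, T):
    scaled = logits / T
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

confident = np.array([5.0, 1.0, 0.5, 0.0])  # one clear winner, like "Paris"
uncertain = np.array([1.2, 1.1, 1.0, 0.9])  # many good options, like "beautiful"

for T in (0.5, 1.0, 1.5):
    c = softmax_t(confident, T)[0]
    u = softmax_t(uncertain, T)[0]
    print(f"T={T}: top-token prob, confident={c:.2f}, uncertain={u:.2f}")
```

Whatever T you pick, it moves both columns in the same direction: a T high enough to loosen up the uncertain case also injects noise into the confident one.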
Top-k Sampling: Cutting the Tail
Top-k sampling takes a simple approach to truncation: keep only the k most probable tokens, zero out everything else, and renormalize so the remaining probabilities sum to 1. This was popularized by Fan et al. in 2018 and became the default sampling strategy for GPT-2's release.
def top_k_sample(start_token, num_tokens=15, k=5, temperature=1.0):
    """Sample from only the top-k most probable tokens."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits, temperature=temperature)
        # Zero out everything except the top k
        top_k_idx = np.argsort(probs)[-k:]
        filtered = np.zeros_like(probs)
        filtered[top_k_idx] = probs[top_k_idx]
        filtered /= filtered.sum()  # renormalize
        next_idx = np.random.choice(len(vocab), p=filtered)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(7)
print("k=3:", " ".join(top_k_sample("the", 15, k=3)))
# k=3: the cat ran and the dog sat on a big cat sat on a big
print("k=8:", " ".join(top_k_sample("the", 15, k=8)))
# k=8: the cat ran and a red dog was big mat and the big dog sat on
With k=3, we only consider the three most likely tokens at each step. The text stays coherent because garbage tokens are excluded, but we still get variety because we're sampling among the top candidates rather than always picking the winner.
But top-k has a subtle problem. Watch what happens with different distributions:
Scenario A: After "sat", our model is very confident — "on" has about 80% probability. With k=8, we're including 7 tokens that the model thinks are very unlikely. We're adding noise to a confident prediction.
Scenario B: After "a", the model is uncertain — probability is spread across "big" (32%), "small" (19%), "red" (16%), "happy" (9%), and more. With k=3, we're chopping off perfectly valid options like "happy" and "cat". We're restricting a genuinely open-ended choice.
The fundamental problem: k is fixed, but the right number of candidates varies with every prediction. Sometimes the model needs 2 candidates, sometimes 20. A fixed window can't adapt. This exact frustration led to the most elegant solution in the decoding toolkit.
Top-p (Nucleus) Sampling: The Adaptive Threshold
In 2019, Ari Holtzman and colleagues published "The Curious Case of Neural Text Degeneration," a paper that changed how the entire industry generates text. Their key observation: the probability distributions that language models produce vary wildly in shape. Sometimes they're peaked (one clear winner), sometimes they're flat (many valid options). A good decoding strategy should adapt to this shape automatically.
Their solution — nucleus sampling — replaces the fixed count k with a cumulative probability threshold p. Sort the tokens by probability, then include tokens from the top until their cumulative probability reaches p. The set of included tokens is called the nucleus.
def nucleus_sample(start_token, num_tokens=15, p=0.9, temperature=1.0):
    """Sample from the smallest token set whose cumulative probability reaches p."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        probs = softmax(logits, temperature=temperature)
        # Sort by probability, descending
        sorted_idx = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_idx]
        # Find nucleus: smallest set with cumulative prob >= p
        cumulative = np.cumsum(sorted_probs)
        cutoff = np.searchsorted(cumulative, p) + 1
        # Keep only nucleus tokens
        nucleus_idx = sorted_idx[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[nucleus_idx] = probs[nucleus_idx]
        filtered /= filtered.sum()
        next_idx = np.random.choice(len(vocab), p=filtered)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(7)
print("p=0.9:", " ".join(nucleus_sample("the", 15, p=0.9)))
# p=0.9: the cat ran and the big dog was happy cat sat on a big cat ran
The magic is in how the nucleus size adapts automatically. Let's trace through two predictions to see it:
# After "sat": model is confident
probs_after_sat = softmax(get_logits("sat"))
# "on"=0.80, "and"=0.07, "happy"=0.03, ...
# Cumulative: 0.80, 0.86, 0.89, 0.90 → nucleus = 4 tokens

# After "a": model is uncertain
probs_after_a = softmax(get_logits("a"))
# "big"=0.32, "small"=0.19, "red"=0.16, "happy"=0.09, "cat"=0.06, ...
# Cumulative: 0.32, 0.52, 0.67, 0.76, 0.82, 0.88, 0.92 → nucleus = 7 tokens

print("Nucleus after 'sat': 4 tokens (confident)")
print("Nucleus after 'a': 7 tokens (uncertain)")
When the model is confident, the nucleus is tiny — just a handful of tokens. When the model is genuinely uncertain, the nucleus expands to include all the plausible options. This is exactly the adaptive behavior that top-k couldn't provide.
In practice, p=0.9 to p=0.95 works well for most tasks. The nucleus captures the "core" of the distribution — all the tokens the model actually considered — while excluding the unreliable long tail that causes incoherent text. There's a beautiful connection to information theory here: nucleus sampling implicitly respects the entropy of each distribution, including more tokens when uncertainty is genuinely high.
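To make that connection concrete, here's a small sketch (the logits for "sat" and "a" are reproduced inline from the transition table; the helper name is mine) that computes the Shannon entropy of each distribution alongside its p=0.9 nucleus size. The higher-entropy distribution gets the bigger nucleus without anyone tuning anything:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def entropy_and_nucleus_size(logits, p=0.9):
    probs = np.sort(softmax(logits))[::-1]      # descending
    entropy = -(probs * np.log(probs)).sum()    # Shannon entropy, in nats
    size = np.searchsorted(np.cumsum(probs), p) + 1
    return entropy, size

after_sat = np.array([3.5, 1.0, 0.3] + [-1.0] * 12)                   # confident
after_a = np.array([2.5, 2.0, 1.8, 1.2, 0.8, 0.8, 0.5] + [-1.0] * 8)  # uncertain

for name, logits in [("sat", after_sat), ("a", after_a)]:
    entropy, size = entropy_and_nucleus_size(logits)
    print(f"after '{name}': entropy={entropy:.2f} nats, nucleus size={size}")
```

The confident distribution after "sat" has roughly half the entropy of the uncertain one after "a", and its nucleus is correspondingly smaller (4 tokens vs 7).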
Nucleus sampling elegantly solves the truncation problem. But what about a completely different approach — one that considers entire sequences rather than individual tokens?
Beam Search: Planning Ahead
Every strategy we've seen so far is myopic — it picks one token at a time without considering how that choice affects future tokens. Greedy picks the best single token, but that might lead to a dead end. Beam search takes a different approach: instead of committing to one path, it explores multiple paths simultaneously and picks the best sequence.
The algorithm maintains b partial sequences (called "beams"). At each step, it extends every beam with every possible next token, scores the resulting sequences by their total log-probability, and keeps only the top b candidates.
def beam_search(start_token, num_tokens=10, beam_width=3):
    """Find the most probable sequence by exploring multiple paths."""
    # Each beam: (token_list, cumulative_log_probability)
    beams = [([start_token], 0.0)]
    for _ in range(num_tokens):
        candidates = []
        for tokens, score in beams:
            logits = get_logits(tokens[-1])
            probs = softmax(logits)
            log_probs = np.log(probs + 1e-10)
            for i, lp in enumerate(log_probs):
                new_seq = tokens + [vocab[i]]
                candidates.append((new_seq, score + lp))
        # Keep only the top beam_width candidates
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]  # return the highest-scoring beam

print(" ".join(beam_search("the", num_tokens=12, beam_width=3)))
# the cat sat on the cat sat on the cat sat on the
Wait — beam search also produces repetition? Yes, and for an interesting reason. Beam search optimizes for the most probable sequence, and in language models, repetitive patterns genuinely have high joint probability. "The cat sat on" repeated is a perfectly valid, high-probability sequence. Beam search finds it because that's literally its job — finding the globally most likely output.
In practice, beam search adds a length penalty to avoid favoring short sequences (shorter sequences have fewer probability terms multiplied together, so they naturally score higher). Production systems also use n-gram blocking to prevent repetition.
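Both tricks are short enough to sketch. The length penalty below follows the formula from Google's 2016 neural machine translation system (Wu et al.); the n-gram blocker is a simple hand-rolled check, and both helper names are mine rather than any library's API:

```python
def length_normalized_score(log_prob_sum, length, alpha=0.6):
    """GNMT-style length penalty: score = log P / ((5 + len)^a / 6^a)."""
    penalty = ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)
    return log_prob_sum / penalty

def creates_repeated_ngram(tokens, candidate, n=3):
    """Would appending `candidate` recreate an n-gram already in `tokens`?"""
    if len(tokens) < n - 1:
        return False
    new_ngram = tuple(tokens[-(n - 1):] + [candidate])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in seen

# Length penalty: a longer beam is no longer automatically outscored.
print(length_normalized_score(-12.0, 8))    # less negative than the raw -12.0

# n-gram blocking: refuse to restart the "the cat sat on" loop.
loop = ["the", "cat", "sat", "on", "the", "cat", "sat", "on"]
print(creates_repeated_ngram(loop, "the"))  # True: "sat on the" already appeared
print(creates_repeated_ngram(loop, "a"))    # False: "sat on a" is new
```

In a real beam search loop, candidates that would repeat an n-gram get their log-probability set to -inf before the top-b cut, and final beams are ranked by the normalized score instead of the raw sum.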
Beam search shines in tasks where there's one "right" answer: machine translation ("Translate: Le chat est noir"), image captioning, or summarization. It fell out of favor for open-ended generation because it's deterministic — same input always produces the same output — and because nucleus sampling produces more natural, diverse text. But understanding beam search matters because it reveals a fundamental insight: the best individual tokens don't always form the best sequence.
Repetition Penalties & the Modern Decoding Stack
We've seen that every strategy — even nucleus sampling — can sometimes produce repetitive text. The solution is a simple but effective hack: penalize tokens that have already appeared.
def sample_with_penalty(start_token, num_tokens=15, p=0.9,
                        temperature=0.8, repetition_penalty=1.3):
    """Nucleus sampling with repetition penalty."""
    tokens = [start_token]
    for _ in range(num_tokens):
        logits = get_logits(tokens[-1])
        # Penalize previously generated tokens
        for prev_token in set(tokens):
            idx = vocab.index(prev_token)
            if logits[idx] > 0:
                logits[idx] /= repetition_penalty  # reduce positive logits
            else:
                logits[idx] *= repetition_penalty  # push negative logits further down
        probs = softmax(logits, temperature=temperature)
        # Apply nucleus sampling
        sorted_idx = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[sorted_idx])
        cutoff = np.searchsorted(cumulative, p) + 1
        nucleus_idx = sorted_idx[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[nucleus_idx] = probs[nucleus_idx]
        filtered /= filtered.sum()
        next_idx = np.random.choice(len(vocab), p=filtered)
        tokens.append(vocab[next_idx])
    return tokens

np.random.seed(7)
print(" ".join(sample_with_penalty("the", 20)))
# the cat sat on a big dog ran and was happy small mat park red the cat and a small dog
The repetition penalty shrinks the logit of any previously-seen token — dividing positive logits by the penalty factor and multiplying negative ones by it — making that token less likely to appear again. Notice how the output now explores more of the vocabulary: "park" and "small" and "mat" all get their moment.
Real LLM APIs and inference libraries offer three flavors of repetition control (OpenAI's API exposes the frequency and presence penalties; Hugging Face transformers also supports a repetition_penalty parameter):
- Repetition penalty — scales logits of any token that appeared in the output
- Frequency penalty — penalty proportional to how many times a token appeared (repeat offenders get punished harder)
- Presence penalty — flat penalty for any token that appeared at all (binary: seen or not seen)
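The latter two can be sketched in a few lines, following the logit-offset formula described in OpenAI's API documentation (the helper name and exact signature here are mine): subtract frequency_penalty once per occurrence of a token, and presence_penalty once if the token occurred at all.

```python
import numpy as np

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Frequency penalty scales with occurrence count; presence penalty is flat."""
    logits = logits.copy()
    counts = np.bincount(generated_ids, minlength=len(logits))
    logits -= frequency_penalty * counts        # repeat offenders punished harder
    logits -= presence_penalty * (counts > 0)   # binary: seen or not seen
    return logits

logits = np.zeros(5)
out = apply_penalties(logits, [1, 1, 2], frequency_penalty=0.5, presence_penalty=0.2)
print(out)  # token 1 (seen twice): -1.2, token 2 (seen once): -0.7, rest: 0.0
```

Because the penalty is applied to logits before softmax, even a small offset compounds: a token penalized by 1.2 logits loses roughly two thirds of its probability mass.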
The modern production decoding stack typically combines temperature + nucleus sampling + penalties. Here's what the defaults look like across major providers, and what to adjust for different tasks:
- Factual Q&A / code: temperature=0 (or very low), no sampling needed. You want the single most likely answer.
- General conversation: temperature=0.7, top_p=0.9. Balanced diversity without going off the rails.
- Creative writing: temperature=0.9, top_p=0.95, presence_penalty=0.3. Encourage variety and exploration.
- Brainstorming: temperature=1.0–1.2, top_p=0.95, presence_penalty=0.6. Push toward novel territory.
Newer strategies continue to push the frontier. Min-p sampling scales the truncation threshold relative to the top token's probability, providing even better adaptation than top-p. Speculative decoding uses a small draft model to generate candidate tokens quickly, then has the large model verify them in parallel — same output quality, much faster generation.
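Min-p is only a few lines; here's a sketch (the function name is mine): keep every token whose probability is at least min_p times the top token's probability, then renormalize. When the model is confident the bar is high and few tokens survive; when the distribution is flat the bar drops and more survive.

```python
import numpy as np

def min_p_filter(probs, min_p=0.1):
    """Keep tokens with prob >= min_p * max(probs), then renormalize."""
    threshold = min_p * probs.max()   # the bar scales with model confidence
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

probs = np.array([0.6, 0.25, 0.1, 0.05])
print(min_p_filter(probs, min_p=0.2))  # drops the 0.1 and 0.05 tokens
```

Note the contrast with top-p: instead of accumulating mass from the top, min-p sets a per-token bar relative to the winner, which behaves more gracefully on very flat distributions.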
Try It: The Decoding Playground
Watch how different strategies choose tokens from the same probability distribution. Click Step to generate one token at a time, or Auto to watch the model write continuously.
Conclusion
We've traveled from the simplest possible strategy (pick the argmax) to the state of the art (nucleus sampling with penalties), and each step was motivated by a concrete failure of the previous approach:
- Greedy always picks the top token → falls into repetition loops
- Random sampling considers everything → long-tail tokens cause incoherence
- Temperature adjusts confidence globally → can't adapt to varying contexts
- Top-k truncates to a fixed window → too many candidates when confident, too few when uncertain
- Top-p (nucleus) adapts the window to the distribution's shape → the elegant solution
- Beam search optimizes whole sequences → deterministic, better suited for translation than conversation
- Repetition penalties discourage tokens from reappearing → the finishing touch
With this post, the inference pipeline is complete: tokenize → embed → attend → softmax → decode. We've now built every major piece of how a language model turns your prompt into a response, all from scratch. The next time you adjust a "temperature" slider or wonder why an LLM repeats itself, you'll know exactly what's happening under the hood.
References & Further Reading
- Holtzman et al. — The Curious Case of Neural Text Degeneration (2019) — The paper that introduced nucleus (top-p) sampling and showed why maximization-based decoding degenerates
- Fan et al. — Hierarchical Neural Story Generation (2018) — Popularized top-k sampling for open-ended text generation
- Radford et al. — Language Models are Unsupervised Multitask Learners (2019) — The GPT-2 paper that brought top-k sampling to mainstream attention
- Vaswani et al. — Attention Is All You Need (2017) — The Transformer paper, which used beam search for machine translation
- DadOps — Softmax & Temperature from Scratch — Our deep dive into the softmax function and temperature scaling that feeds into decoding
- DadOps — Attention Is All You Need (To Implement) — Building the attention mechanism that produces the logits we decode from