Softmax & Temperature from Scratch
The Function Behind Every LLM Decision
Every time you talk to ChatGPT, Claude, or Gemini, the very last thing that happens before you see a word is softmax. The model has produced a wall of raw scores — one number for each of the 100,000+ tokens in its vocabulary — and softmax converts those scores into a probability distribution. It's how the model goes from "I sort of prefer these words" to "I'm choosing this one."
And that "temperature" slider you've seen in every AI playground? The one you crank up for creative writing and turn down for code generation? It's not some vague creativity knob. It's a single number plugged into the softmax equation — the exact same equation that Ludwig Boltzmann used in 1868 to describe how gas molecules distribute themselves across energy states.
In our attention post, we used softmax without much explanation — a three-line helper function that we waved past to focus on queries, keys, and values. Today we open that black box. We'll build softmax from first principles, break it on purpose, fix it with a trick that every production system uses, then explore how temperature reshapes the distribution. Along the way, we'll bust five myths that trip up even experienced practitioners.
By the end, you'll have written about 80 lines of Python that implement softmax, temperature scaling, top-k sampling, and top-p (nucleus) sampling — plus an interactive demo where you can watch temperature reshape a probability distribution in real time.
Why Not Just Divide by the Sum?
Let's start with the obvious question: we want to turn a list of raw scores into probabilities. Probabilities need to be positive and sum to 1. Why not just divide each score by the total?
```python
import numpy as np

# Raw scores (logits) from a model's output layer
logits = np.array([2.0, 1.0, -1.0, 3.0, -0.5])

# Attempt 1: divide by sum
probs = logits / np.sum(logits)
print(probs)
# [ 0.444  0.222 -0.222  0.667 -0.111]
# ^^ Negative "probabilities"! That's not a valid distribution.
```
Negative logits produce negative "probabilities," which is nonsensical. A probability of −22% for a word doesn't mean anything.
Okay, what about taking absolute values first?
```python
# Attempt 2: absolute values then normalize
probs = np.abs(logits) / np.sum(np.abs(logits))
print(probs)
# [0.267 0.133 0.133 0.400 0.067]
# ^^ Looks plausible, but -1.0 and +1.0 get the same probability!
#    We've lost all ordering information.
```
A score of −1.0 (the model dislikes this token) gets the same probability as +1.0 (the model likes it). That's wrong.
What about min-max normalization?
```python
# Attempt 3: min-max scaling to [0, 1]
probs = (logits - logits.min()) / (logits.max() - logits.min())
probs = probs / probs.sum()
print(probs)
# [0.316 0.211 0.000 0.421 0.053] (approximately)
# ^^ Zero probability for the minimum! And breaks entirely
#    when all logits are equal (division by zero).
```
The minimum-scored token gets exactly zero probability, and the whole thing collapses if all logits are equal. We need something better.
What we actually need is a function that:
- Maps any real number to a positive value (no negatives)
- Preserves ordering (bigger input → bigger output)
- Gives us numbers that can be normalized to sum to 1
There's a function that does all three at once: the exponential.
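Before formalizing it, here's a quick numerical check of those three properties (a small NumPy sketch):

```python
import numpy as np

# The three properties we need, verified numerically
x = np.array([-100.0, -1.0, 0.0, 1.0, 3.0])
e = np.exp(x)

print(np.all(e > 0))                          # always positive -> True
print(np.all(np.diff(e) > 0))                 # strictly increasing -> True
print(np.isclose((e / e.sum()).sum(), 1.0))   # normalizes to sum 1 -> True
```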
Softmax from First Principles
The exponential function exp(x) = e^x is the perfect building block. It's always positive (exp(−100) ≈ 3.7 × 10⁻⁴⁴, tiny but never negative), it's strictly increasing (bigger input always gives bigger output), and it amplifies differences between values: equal steps in the input become equal ratios in the output, so a gap of 1 in the logits becomes a factor of e ≈ 2.718 in the result.
The softmax formula simply exponentiates each score and then normalizes by the total:
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
Let's implement it and see what happens with our same logits:
```python
def softmax_naive(z):
    """Softmax: exponentiate and normalize."""
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, -1.0, 3.0, -0.5])
probs = softmax_naive(logits)
print(probs)
# [0.237 0.087 0.012 0.645 0.019]
print(f"Sum: {probs.sum():.6f}")
# Sum: 1.000000
```
Now we're talking. Every probability is positive, they sum to 1, and the ordering is preserved: the logit of 3.0 gets the highest probability (64.5%), the logit of −1.0 gets the lowest (1.2%). Notice how aggressively softmax amplifies the gaps — a 3.0 vs 2.0 difference in logits translates to a 64% vs 24% difference in probability.
Let's trace through the math by hand for a tiny example to build intuition:
```python
# Three logits: a clear winner, a runner-up, a loser
z = np.array([5.0, 2.0, 1.0])

# Step 1: exponentiate each
exps = np.exp(z)
print(f"exp(5.0) = {exps[0]:.2f}")  # 148.41
print(f"exp(2.0) = {exps[1]:.2f}")  # 7.39
print(f"exp(1.0) = {exps[2]:.2f}")  # 2.72

# Step 2: sum them up
total = np.sum(exps)
print(f"Total: {total:.2f}")  # 158.52

# Step 3: divide each by total
probs = exps / total
print(f"Probabilities: {probs}")
# [0.936 0.047 0.017]
# The "5.0" logit captures 93.6% of the probability mass!
```
One property of softmax that turns out to be crucial is shift invariance: you can add any constant to all logits and the result doesn't change.
```python
z = np.array([5.0, 2.0, 1.0])
print(softmax_naive(z))
# [0.936 0.047 0.017]

# Add 100 to everything
print(softmax_naive(z + 100))
# [0.936 0.047 0.017] ← identical!

# Subtract 5 from everything
print(softmax_naive(z - 5))
# [0.936 0.047 0.017] ← still identical!
```
Why? Because the constant cancels out in the fraction:
exp(z_i + c) / Σ_j exp(z_j + c) = (exp(c) · exp(z_i)) / (exp(c) · Σ_j exp(z_j)) = exp(z_i) / Σ_j exp(z_j)
The exp(c) factor appears in both numerator and denominator, so it divides out. Only the relative differences between logits matter — not their absolute values. This property is about to save us from a very nasty numerical bug.
The Numerical Stability Trick
Our naive softmax works beautifully on small numbers. But try this:
```python
# Logits that a real model might produce
big_logits = np.array([1000.0, 1001.0, 999.0])
print(softmax_naive(big_logits))
# [nan nan nan] ← oh no.
```
What happened? exp(1000) is a number with over 400 digits. A 64-bit float overflows to infinity at around exp(709), and a 32-bit float at around exp(88), so np.exp(1000) returns inf. And inf / inf = nan.
This isn't an edge case. LLM logits routinely hit large values — the output layer projects into a vocabulary of 100K+ tokens, and scores of ±50 or more are common. Our naive implementation would explode in production.
The fix exploits the shift invariance we just proved. Before exponentiating, subtract the maximum value:
```python
def softmax(z):
    """Numerically stable softmax."""
    z_shifted = z - np.max(z)  # shift so max is 0
    e = np.exp(z_shifted)      # largest exp is now exp(0) = 1
    return e / np.sum(e)

# Same big logits, now stable
big_logits = np.array([1000.0, 1001.0, 999.0])
print(softmax(big_logits))
# [0.245 0.665 0.090] ← works perfectly!
```
After the shift, the largest value in z_shifted is 0, so the largest exponent is exp(0) = 1. Overflow is impossible. The smaller values might underflow to near-zero, but that's fine — they genuinely represent tiny probabilities.
This three-line trick is what every production system uses. PyTorch's torch.softmax, TensorFlow's tf.nn.softmax, NumPy-based inference servers — they all subtract the max first. It's free (just one extra pass over the array) and it prevents catastrophic failure.
For computing log(softmax(z)) — which you need for cross-entropy loss during training — there's an extended version built on LogSumExp that stays entirely in the log domain to avoid taking log(0). But for inference, the subtract-max trick is all you need.
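To make that concrete, here's a minimal sketch of a log-domain softmax built on the same subtract-max idea (illustrative, not a production implementation):

```python
import numpy as np

def log_softmax(z):
    """log(softmax(z)) computed without ever forming softmax(z) itself,
    so tiny probabilities never round to 0 before the log is taken."""
    z_shifted = z - np.max(z)
    # np.log(np.sum(np.exp(...))) is the LogSumExp of the shifted logits
    return z_shifted - np.log(np.sum(np.exp(z_shifted)))

z = np.array([1000.0, 1001.0, 999.0])
print(log_softmax(z))           # finite values, no nan
print(np.exp(log_softmax(z)))   # matches the stable softmax: ~[0.245 0.665 0.090]
```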
Let's verify our stable version matches the naive one on safe inputs:
```python
# Verify: both give identical results on safe inputs
safe_logits = np.array([2.0, 1.0, -1.0, 3.0, -0.5])
print(f"Naive:  {softmax_naive(safe_logits)}")
print(f"Stable: {softmax(safe_logits)}")
# Naive:  [0.237 0.087 0.012 0.645 0.019]
# Stable: [0.237 0.087 0.012 0.645 0.019] ← identical!

# But only the stable version handles extreme inputs
extreme = np.array([500.0, 502.0, 498.0])
print(f"Naive:  {softmax_naive(extreme)}")  # [nan nan nan]
print(f"Stable: {softmax(extreme)}")        # [0.117 0.867 0.016]
```
Enter Temperature: From Boltzmann to ChatGPT
In 1868, the Austrian physicist Ludwig Boltzmann was studying how gas molecules distribute themselves across different energy states in thermal equilibrium. He derived a formula for the probability of finding a molecule in state i with energy E_i at temperature T:

P(state_i) = exp(−E_i / kT) / Z
Here Z is the partition function (the normalizing sum over all states) and k is Boltzmann's constant. Look familiar? Strip away the physics notation and you have exactly softmax with a temperature parameter. The mapping is direct:
| Physics (Boltzmann) | Machine Learning |
|---|---|
| Energy E_i | Negative logit −z_i |
| Temperature T | Temperature parameter T |
| Partition function Z | Denominator Σ_j exp(z_j / T) |
| Boltzmann constant k | Absorbed into the logit scale |
This isn't a metaphor. It's literally the same equation with different variable names. When ML researchers adopted the term "temperature," they were acknowledging the physics heritage.
The implementation is one extra division:
```python
def softmax_with_temperature(z, T=1.0):
    """Softmax with temperature scaling."""
    z_scaled = z / T                         # scale logits by temperature
    z_shifted = z_scaled - np.max(z_scaled)  # numerical stability
    e = np.exp(z_shifted)
    return e / np.sum(e)
```
Let's watch what temperature does to the same set of logits:
```python
logits = np.array([2.0, 1.0, 0.5, 0.0, -0.5])
tokens = ["mat", "rug", "floor", "carpet", "ground"]

for T in [0.1, 0.5, 1.0, 2.0, 5.0]:
    probs = softmax_with_temperature(logits, T)
    top = tokens[np.argmax(probs)]
    print(f"T={T:4.1f}  {np.array2string(probs, precision=3, floatmode='fixed')}"
          f"  top: {top} ({probs.max():.1%})")

# T= 0.1  [1.000 0.000 0.000 0.000 0.000]  top: mat (100.0%)
# T= 0.5  [0.826 0.112 0.041 0.015 0.006]  top: mat (82.6%)
# T= 1.0  [0.553 0.203 0.123 0.075 0.045]  top: mat (55.3%)
# T= 2.0  [0.366 0.222 0.173 0.135 0.105]  top: mat (36.6%)
# T= 5.0  [0.261 0.213 0.193 0.175 0.158]  top: mat (26.1%)
```
The pattern is clear. As temperature rises:
- T → 0 ("frozen"): The distribution collapses to a spike on the highest logit. This is greedy decoding — the model always picks its top choice. In Boltzmann's world, at absolute zero, every molecule settles into the lowest energy state.
- T = 1 (baseline): Standard softmax, the model's default confidence levels.
- T → ∞ ("boiling"): All logits get divided by a huge number, approaching zero. exp(0) = 1 for all tokens, so the distribution flattens toward uniform — every word is equally likely. The model becomes a random word generator.
Think of temperature as a confidence dial. Low temperature says "I'm very sure, just pick the best one." High temperature says "I see several options — let me explore." The model's underlying preferences (the logits) don't change. Temperature only controls how much those preferences matter when it's time to make a choice.
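We can verify that claim directly: at any temperature the ranking of tokens is unchanged, and the ratio between any two probabilities depends only on the logit gap divided by T (a small sketch, re-defining the function so it runs standalone):

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    z = z / T
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5, 0.0, -0.5])
for T in [0.5, 1.0, 2.0]:
    p = softmax_with_temperature(z, T)
    # The ordering of tokens never changes, only how stretched the gaps are
    print(np.argsort(p)[::-1])   # always [0 1 2 3 4]
    # The ratio p_i / p_j is exactly exp((z_i - z_j) / T)
    print(np.isclose(p[0] / p[1], np.exp((z[0] - z[1]) / T)))  # True
```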
Here's a practical guide to what different temperature values feel like:
| Temperature | Behavior | Good For |
|---|---|---|
| 0.0 – 0.3 | Near-deterministic, repetitive | Factual Q&A, code generation, structured output |
| 0.5 – 0.7 | Focused but with some variety | Technical writing, summarization |
| 0.8 – 1.0 | Balanced (model default) | General conversation, explanations |
| 1.0 – 1.5 | More varied and creative | Creative writing, brainstorming |
| > 2.0 | Increasingly chaotic | Rarely useful in practice |
We can quantify the "spread" of a distribution using entropy — a single number that measures how uncertain the distribution is. Maximum entropy means a uniform distribution (total uncertainty), minimum entropy means a single spike (total certainty):
```python
def entropy(probs):
    """Shannon entropy in bits."""
    p = probs[probs > 0]  # filter out zeros to avoid log(0)
    return -np.sum(p * np.log2(p))

logits = np.array([2.0, 1.0, 0.5, 0.0, -0.5])
for T in [0.1, 0.5, 1.0, 2.0, 5.0]:
    probs = softmax_with_temperature(logits, T)
    H = entropy(probs)
    print(f"T={T:4.1f}  entropy={H:.3f} bits (max possible: {np.log2(5):.3f})")

# T= 0.1  entropy=0.001 bits (max possible: 2.322)
# T= 0.5  entropy=0.904 bits (max possible: 2.322)
# T= 1.0  entropy=1.795 bits (max possible: 2.322)
# T= 2.0  entropy=2.181 bits (max possible: 2.322)
# T= 5.0  entropy=2.300 bits (max possible: 2.322)
```
At T=0.1, entropy is essentially 0 — a one-hot spike. At T=5.0, entropy is 2.300 out of a maximum possible 2.322 bits (for 5 tokens), meaning the distribution is nearly uniform. Temperature is literally an entropy dial.
Softmax in the Wild
Now that we've built softmax and temperature from scratch, let's spot them in the architecture of a real transformer. If you've read our attention post, you've already seen softmax in action — you just might not have noticed how central it is.
1. Attention Weights
The core attention formula:
Attention(Q, K, V) = softmax(Q K^T / √d_k) · V

Softmax converts raw dot-product similarity scores into a probability distribution over positions. Each token "attends" to every other token with a weight between 0 and 1, and the weights sum to 1. But here's something we glossed over in the attention post: the √d_k scaling factor is itself a kind of temperature.
```python
# The attention scaling factor IS a temperature
d_k = 64                    # typical head dimension
T_effective = np.sqrt(d_k)  # = 8.0

# This is equivalent to:
# softmax(QK^T / 8.0) = softmax_with_temperature(QK^T, T=8.0)
```
When d_k = 64, dividing by √64 = 8 is equivalent to applying softmax with temperature T=8. Without this scaling, the dot products grow with dimension (their variance is proportional to d_k), pushing softmax into its near-one-hot saturation regime where gradients vanish. The scaling keeps the distribution smooth enough for gradients to flow.
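We can check that variance claim with a quick simulation using random unit-variance vectors (a sketch, not a real attention computation):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in [16, 64, 256]:
    # Random "queries" and "keys" with unit-variance components
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))
    scores = np.sum(q * k, axis=1)  # 10,000 raw dot products
    # Unscaled variance grows like d_k; after /sqrt(d_k) it stays near 1
    print(d_k, round(scores.var(), 1), round((scores / np.sqrt(d_k)).var(), 2))
```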
2. The Output Layer
The final layer of a decoder-only transformer (GPT, Claude, Llama) produces logits over the entire vocabulary — typically 32,000 to 200,000+ tokens. Softmax converts these logits into a probability distribution over next tokens. This is where temperature scaling is applied during text generation.
3. Mixture of Experts Routing
In MoE architectures like Mixtral, a routing network uses softmax to compute gating weights that decide which expert(s) process each token. The "temperature" of this routing softmax controls whether tokens are sent to one dominant expert or distributed more evenly.
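Here's a toy sketch of top-2 softmax gating (the expert count, router scores, and renormalization are illustrative, not Mixtral's actual code):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Toy router: one score per expert for the current token
router_logits = np.array([1.2, 0.3, 2.1, -0.5])

# Keep the top 2 experts, then renormalize their scores with softmax
top2 = np.argsort(router_logits)[-2:]
gates = softmax(router_logits[top2])
print(top2, gates)  # experts 0 and 2 share the token, roughly 29% / 71%
```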
Sampling Strategies: Top-k, Top-p, and Friends
Temperature alone controls the shape of the probability distribution, but in practice it's combined with truncation strategies that cut off the long tail of low-probability tokens. Here's the full sampling pipeline that runs every time an LLM generates a token:
- Model produces logits (raw scores for each vocabulary token)
- Temperature scaling: divide logits by T
- Top-k filtering: keep only the k highest-scored tokens
- Top-p filtering: keep the smallest set of tokens whose cumulative probability exceeds p
- Renormalize and sample
Top-k Sampling
The simplest truncation: pick a fixed number k and zero out everything else.
```python
def top_k_filter(logits, k):
    """Mask all logits except the top k (masked entries become -inf,
    which softmax turns into exactly zero probability)."""
    if k >= len(logits):
        return logits.copy()
    indices = np.argsort(logits)[-k:]         # indices of top k
    filtered = np.full_like(logits, -np.inf)
    filtered[indices] = logits[indices]
    return filtered
```
```python
logits = np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.5, -0.5, -1.0])
tokens = ["mat", "rug", "floor", "carpet", "ground", "tile", "dirt", "mud"]

filtered = top_k_filter(logits, k=3)
probs = softmax(filtered)
for tok, p in zip(tokens, probs):
    if p > 0.001:
        print(f"  {tok:8s} {p:.1%}")
#   mat      50.7%
#   rug      30.7%
#   floor    18.6%
```
The problem with top-k: k is static. When the model is confident (one token has 95% probability), k=50 still wastes computation on 49 irrelevant tokens. When the model is uncertain (flat distribution), k=50 might cut off plausible continuations.
Top-p (Nucleus) Sampling
Introduced by Holtzman et al. (2020), top-p is smarter. Instead of a fixed count, it keeps the smallest set of tokens whose cumulative probability exceeds a threshold p:
```python
def top_p_filter(logits, p):
    """Keep the smallest set of tokens with cumulative probability >= p."""
    probs = softmax(logits)
    sorted_indices = np.argsort(probs)[::-1]  # highest prob first
    sorted_probs = probs[sorted_indices]
    cumulative = np.cumsum(sorted_probs)
    # Find cutoff: first index where cumulative >= p
    cutoff_idx = np.searchsorted(cumulative, p) + 1
    keep_indices = sorted_indices[:cutoff_idx]
    filtered = np.full_like(logits, -np.inf)
    filtered[keep_indices] = logits[keep_indices]
    return filtered
```
```python
# With p=0.9, nucleus size adapts to model confidence
logits = np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.5, -0.5, -1.0])
tokens = ["mat", "rug", "floor", "carpet", "ground", "tile", "dirt", "mud"]

filtered = top_p_filter(logits, p=0.9)
probs = softmax(filtered)
print("Top-p=0.9 nucleus:")
for tok, p_val in zip(tokens, probs):
    if p_val > 0.001:
        print(f"  {tok:8s} {p_val:.1%}")
# Top-p=0.9 nucleus:
#   mat      42.9%
#   rug      26.0%
#   floor    15.8%
#   carpet    9.6%
#   ground    5.8%
# (5 tokens capture >90% of the mass)
```
The nucleus size is dynamic. When the model is confident, the nucleus shrinks to just one or two tokens. When the model is uncertain, the nucleus expands to include more candidates. This elegantly solves the static-k problem.
How Temperature Interacts with Truncation
Temperature and top-p are not independent knobs — they interact in important ways. Temperature comes first and reshapes the distribution. Top-p then truncates it. At low temperature, top-p=0.9 might keep only 1–2 tokens (the distribution is already a spike). At high temperature, top-p=0.9 might keep hundreds of tokens (the distribution is nearly flat).
```python
# Temperature affects nucleus size
logits = np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.5, -0.5, -1.0])

for T in [0.3, 1.0, 2.0]:
    scaled = logits / T
    probs = softmax(scaled)
    sorted_probs = np.sort(probs)[::-1]
    cumulative = np.cumsum(sorted_probs)
    nucleus_size = np.searchsorted(cumulative, 0.9) + 1
    print(f"T={T:.1f}: nucleus size for p=0.9 is {nucleus_size} tokens")

# T=0.3: nucleus size for p=0.9 is 2 tokens
# T=1.0: nucleus size for p=0.9 is 5 tokens
# T=2.0: nucleus size for p=0.9 is 6 tokens
```
A practical starting point: temperature 0.7 with top-p 0.9 works well for most conversational tasks. For code generation, try temperature 0.2 with top-p 0.95. Always tune them together, not independently.
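Putting the whole pipeline in one function makes the order of operations explicit. This is a compact, self-contained sketch (it re-implements the filters inline rather than calling the versions above):

```python
import numpy as np

def sample_token(logits, T=0.7, k=50, p=0.9, rng=None):
    """One pass through the pipeline: temperature -> top-k -> top-p -> sample."""
    if rng is None:
        rng = np.random.default_rng()
    z = logits / T                                  # 1. temperature scaling
    order = np.argsort(z)[::-1]
    z_masked = np.full_like(z, -np.inf)
    z_masked[order[:k]] = z[order[:k]]              # 2. top-k filtering
    probs = np.exp(z_masked - z_masked.max())
    probs /= probs.sum()
    sorted_idx = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[sorted_idx]), p) + 1
    keep = sorted_idx[:cutoff]                      # 3. top-p filtering
    final = np.zeros_like(probs)
    final[keep] = probs[keep]
    final /= final.sum()                            # 4. renormalize
    return rng.choice(len(logits), p=final)         # 5. sample

logits = np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.5, -0.5, -1.0])
rng = np.random.default_rng(42)
print([int(sample_token(logits, rng=rng)) for _ in range(10)])
```

With these defaults the nucleus for this toy distribution is the top four tokens, so every sampled index falls in that set.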
Try It: The Temperature Dial
Interactive: Temperature & Sampling Explorer
Adjust the temperature and sampling parameters to see how they reshape the probability distribution. The scenario: the model has just seen "The cat sat on the" and is choosing the next word.
Five Things Everyone Gets Wrong
1. "Softmax" Is a Soft Version of Max
It's not. Softmax is a smooth approximation of argmax — it returns a distribution pointing toward the index of the maximum, not the maximum value itself. The function that smoothly approximates the max operation is LogSumExp: log(sum(exp(z))). Even Goodfellow, Bengio, and Courville acknowledged in their Deep Learning textbook that "softargmax" would be a more accurate name. The wrong name stuck.
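You can see the naming mix-up numerically: LogSumExp hugs the true maximum, while softmax returns a distribution pointing at its index:

```python
import numpy as np

z = np.array([5.0, 2.0, 1.0])
print(np.max(z))                   # 5.0 -- the hard max
print(np.log(np.sum(np.exp(z))))   # ~5.066 -- LogSumExp, the smooth *max*
print(np.argmax(z))                # 0 -- the hard argmax
# softmax(z) ~ [0.936 0.047 0.017]: a smoothed one-hot pointing at index 0,
# which is why "softargmax" would be the honest name
```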
2. Temperature 0 Makes LLMs Deterministic
Temperature 0 (implemented as greedy decoding / argmax) removes sampling randomness, but the output can still vary between runs due to floating-point non-associativity in GPU parallel reductions, dynamic batching on inference servers changing the numerical execution path, and hardware-level nondeterminism in operations like atomicAdd. Research has documented accuracy variations of up to 10–15% across "identical" T=0 runs with the same prompt.
3. Higher Temperature Makes the Model "More Creative"
Temperature does not modify the model's weights, knowledge, or reasoning. It only changes the sampling distribution over the model's existing predictions. A token the model scored as unlikely at T=1 is still unlikely at T=1.5 — it's just slightly less unlikely relative to the top tokens. The model isn't thinking more creatively; you're rolling a less-loaded die.
4. Softmax Outputs Are Calibrated Probabilities
When a model says 80% confidence, it's not necessarily right 80% of the time. Modern neural networks are notoriously overconfident. Guo et al. (2017) showed that deep networks' softmax outputs are poorly calibrated and proposed — ironically — temperature scaling as a post-hoc fix: fit a single temperature T on a validation set to improve the match between predicted confidence and actual accuracy.
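Here's a toy sketch of that post-hoc fix on synthetic data: a deliberately "overconfident" set of validation logits is softened by grid-searching a single T (the data, grid, and 5-class setup are made up for illustration):

```python
import numpy as np

def nll(logits, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# Synthetic overconfident "model": it puts a huge logit on its predicted
# class, but that prediction is only correct ~70% of the time
rng = np.random.default_rng(0)
n, classes = 1000, 5
labels = rng.integers(0, classes, size=n)
preds = np.where(rng.random(n) < 0.7, labels, (labels + 1) % classes)
logits = rng.standard_normal((n, classes))
logits[np.arange(n), preds] += 5.0

# Fit the single scalar T that minimizes validation NLL
Ts = np.linspace(0.5, 5.0, 46)
best_T = Ts[int(np.argmin([nll(logits, labels, T) for T in Ts]))]
print(best_T)  # comes out above 1: softening the logits fixes the overconfidence
```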
5. You Should Max Out Temperature for Maximum Diversity
Past a certain point (roughly T>2), high temperature doesn't produce interesting variety — it degrades into incoherent nonsense. The distribution becomes so flat that low-quality tokens get sampled frequently. If you want diversity with coherence, it's better to use moderate temperature (1.0–1.3) with top-p sampling than to crank temperature to 3.0.
What We Didn't Cover
There's more to the softmax story than fits in one post. Some threads worth pulling:
- Gumbel-softmax — a trick for making discrete sampling differentiable, enabling backpropagation through categorical choices
- Softmax alternatives — sparsemax (produces exact zeros), entmax (tunable sparsity), and "softmax1" (adding 1 to the denominator so attention heads can attend to nothing)
- The Jacobian — softmax's derivative is a full matrix, diag(s) − s s^T, which matters when you need to backpropagate through softmax manually
- Min-p sampling — a newer alternative to top-p that sets a floor relative to the top token's probability, scaling naturally with model confidence
- Speculative decoding — how draft-model probabilities and main-model probabilities are compared using softmax to speed up generation
References & Further Reading
- Vaswani et al. — "Attention Is All You Need" (2017) — the transformer paper that put softmax center stage
- Holtzman et al. — "The Curious Case of Neural Text Degeneration" (2020) — introduced nucleus (top-p) sampling
- Bridle — "Training Stochastic Model Recognition Algorithms" (1989) — the paper that coined the term "softmax"
- Guo et al. — "On Calibration of Modern Neural Networks" (2017) — temperature scaling for calibration
- Evan Miller — "Attention Is Off By One" — the softmax1 proposal and its implications for attention
- Jay Mody — "Numerically Stable Softmax" — an excellent walkthrough of the stability tricks
- Boltzmann distribution (Wikipedia) — the physics origin of the softmax + temperature equation
- "Turning Up the Heat: Min-p Sampling" (2024) — the newest sampling strategy, designed for high-temperature coherence