
RLHF from Scratch: How Language Models Learn What Humans Want

The Alignment Gap

We've built a complete transformer from scratch across this series. Tokenization to break text into subwords. Embeddings to turn tokens into vectors. Positional encoding, attention, feed-forward networks, normalization — every component of a modern transformer. We've even optimized inference with KV caching and speculative decoding, learned to fine-tune with LoRA, and shrunk models with quantization.

And yet — a model trained on all this machinery to predict the next token is not a useful assistant. Ask it "How do I cook pasta?" and it might continue with "is a question that many people ask when..." instead of actually telling you how to cook pasta. It's an autocomplete engine, not a conversationalist. Worse, it'll happily generate harmful content, because harmful text appeared in its training data, and predicting the next token doesn't distinguish between helpful and harmful.

The gap between "autocomplete on steroids" and "useful AI assistant" is called the alignment problem. The solution that powered ChatGPT, Claude, and every modern conversational AI is RLHF — Reinforcement Learning from Human Feedback.

RLHF is a three-stage pipeline:

Pretrained LLM → SFT (teach format) → Reward Model (learn preferences) → PPO/DPO (optimize behavior)
Stage 1: Imitate → Stage 2: Judge → Stage 3: Improve

Let's build each stage from scratch.

Stage 1: Supervised Fine-Tuning (SFT)

Before we can teach a model which responses are better, we need to teach it how to respond at all. A base model trained on internet text doesn't know the format of a conversation. SFT is the apprenticeship stage — you show the model thousands of examples of high-quality instruction-response pairs written by human demonstrators, and it learns to imitate the format.

Here's the surprising part: the training objective is exactly the same as pretraining — cross-entropy next-token prediction. The same loss function we built in our loss functions post. What changes is the data (curated demonstrations instead of raw internet text) and where we compute loss (response tokens only, not the prompt).

import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def sft_loss(model, prompt_tokens, response_tokens):
    """
    Supervised fine-tuning: cross-entropy only on the response.

    prompt_tokens:   [What, is, the, capital, of, France, ?]
    response_tokens: [The, capital, of, France, is, Paris, .]
    """
    # Concatenate into one sequence
    full_sequence = prompt_tokens + response_tokens
    # full_sequence shape: (prompt_len + response_len,)

    # Forward pass — get logits at every position
    logits = model.forward(full_sequence[:-1])  # predict next token
    # logits shape: (seq_len - 1, vocab_size)

    targets = full_sequence[1:]  # shifted by 1
    # targets shape: (seq_len - 1,)

    # Build the loss mask: 0 for prompt positions, 1 for response positions
    prompt_len = len(prompt_tokens)
    mask = np.zeros(len(targets))
    mask[prompt_len - 1:] = 1.0  # only compute loss on response tokens
    # mask: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    #        ^--- prompt ---^  ^---- response ----^

    # Cross-entropy loss, masked to response tokens only
    # (log_softmax: log(exp(x_i) / sum(exp(x_j))) — numerically stable)
    log_probs = log_softmax(logits)   # (seq_len - 1, vocab_size)
    token_losses = -log_probs[np.arange(len(targets)), targets]

    masked_loss = (token_losses * mask).sum() / mask.sum()
    return masked_loss

The mask is the key detail. By zeroing out the loss on prompt tokens, we tell the model: "don't learn to generate the user's question — learn to generate the answer." The model still processes the prompt (it needs the context), but gradients only flow from the response portion.

OpenAI's InstructGPT used roughly 13,000 demonstration examples for SFT. That's tiny compared to the trillions of pretraining tokens, but it's enough to teach the format of a helpful response. In practice, SFT is often done with LoRA adapters to keep compute costs low.

SFT teaches format, not preferences. Two different SFT-trained responses to "explain quantum mechanics" might both be grammatically correct and on-topic — but one could be sycophantic, verbose, or subtly wrong. To know which is better, we need a different kind of training signal.

Stage 2: Training a Reward Model

Imagine hiring a food critic. Their job isn't to cook — it's to taste two dishes and tell you which one is better. That's exactly what a reward model does: given a prompt and a response, it outputs a single number — a quality score.

Architecture

The reward model starts as a copy of the SFT model, but with one surgical change: the language modeling head (the final linear layer that projects to vocabulary size) is replaced with a scalar value head — a single linear layer that maps the hidden state at the last token position to one number.

class RewardModel:
    """
    Same transformer backbone, but outputs a scalar reward
    instead of a vocabulary distribution.
    """
    def __init__(self, base_model, hidden_dim):
        self.backbone = base_model       # frozen or fine-tuned transformer
        self.value_head = Linear(hidden_dim, 1)  # projects to scalar

    def forward(self, prompt_and_response_tokens):
        # Run through the transformer backbone
        hidden_states = self.backbone.encode(prompt_and_response_tokens)
        # hidden_states shape: (seq_len, hidden_dim)

        # Take the hidden state at the LAST token position
        last_hidden = hidden_states[-1]   # (hidden_dim,)

        # Project to a single scalar — the "reward"
        reward = self.value_head(last_hidden)  # scalar
        return reward

The Bradley-Terry Preference Model

How do we train this reward model? Not with absolute scores — humans are terrible at assigning consistent numbers on a 1-10 scale. Instead, we show annotators a prompt and two responses, and ask: which is better?

The mathematical foundation is the Bradley-Terry model, which you might know from chess Elo ratings. If response A has latent "strength" (reward) r_A and response B has strength r_B, the probability that a human prefers A over B is:

P(A > B) = σ(rA − rB) = 1 / (1 + exp(-(rA − rB)))

This is just the logistic sigmoid — the same function from our softmax post (sigmoid is softmax with 2 classes). The crucial property: only the difference in rewards matters. Adding 100 to both rewards doesn't change anything, just like adding 100 to every chess player's Elo doesn't change who wins.
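That shift invariance is easy to verify numerically. A minimal check, using the same reward values as the worked example below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

r_a, r_b = 2.1, -0.8
p1 = sigmoid(r_a - r_b)                   # P(A > B)
p2 = sigmoid((r_a + 100) - (r_b + 100))   # same rewards, both shifted by 100
print(p1, p2)  # both ≈ 0.948; the shift has no effect
```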

The loss function follows directly. Given a prompt x, a chosen (preferred) response y_c, and a rejected response y_r:

L = −log σ(r(x, yc) − r(x, yr))

This is binary cross-entropy applied to the reward difference. When the reward model correctly scores the chosen response higher, the loss is small. When it gets the ranking wrong, the loss is large. The same cross-entropy we've seen everywhere, wearing a different hat.

def reward_model_loss(reward_model, prompt, chosen_response, rejected_response):
    """
    Bradley-Terry preference loss for reward model training.
    """
    # Score both responses
    r_chosen  = reward_model.forward(prompt + chosen_response)   # scalar
    r_rejected = reward_model.forward(prompt + rejected_response) # scalar

    # Bradley-Terry loss: push r_chosen above r_rejected
    # L = -log(sigmoid(r_chosen - r_rejected))
    diff = r_chosen - r_rejected
    loss = -np.log(sigmoid(diff))
    return loss

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Example: reward model sees two responses to "Explain gravity"
# Chosen: "Gravity is a fundamental force..."  (clear, accurate)
# Rejected: "Well, gravity is like, you know..." (vague, rambling)
# After training: r_chosen = 2.1, r_rejected = -0.8
# P(chosen > rejected) = sigmoid(2.1 - (-0.8)) = sigmoid(2.9) = 0.948
# Loss = -log(0.948) = 0.053  — small, because the model got it right!

In practice, annotators rank K responses (say, 4-7 completions) rather than just comparing two. Each ranking produces C(K, 2) pairwise comparisons, dramatically multiplying the training signal. InstructGPT found this approach more data-efficient than collecting pairs independently. They collected about 33,000 comparisons, and trained the reward model for just one epoch to avoid overfitting.
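Expanding a ranking into pairwise training examples is a one-liner with `itertools`. A sketch, assuming the annotation arrives as a best-to-worst ordered list (the input format here is an illustration, not a specific dataset schema):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """
    Convert a best-to-worst ranking of K responses into
    C(K, 2) (chosen, rejected) training pairs.
    combinations() preserves input order, so in every emitted
    tuple the first element outranks the second.
    """
    return list(combinations(ranked_responses, 2))

pairs = ranking_to_pairs(["A", "B", "C", "D"])  # K = 4
print(len(pairs))  # C(4, 2) = 6
```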

Stage 3a: PPO — Teaching with Reinforcement Learning

We now have a reward model that can score any response. The goal of Stage 3 is to use these scores to improve the language model. This is where reinforcement learning enters the picture.

The objective is deceptively simple: maximize expected reward while staying close to the SFT model.

maxθ E[r(x, y)] − β · DKL(πθ || πref)

Here πθ is the language model we're optimizing (the "policy" in RL language), πref is the frozen SFT model (the "reference"), and β controls how tight the leash is. The KL divergence term measures how far the policy has drifted from the reference — it's the sum of per-token log-probability differences.

Why the leash? Without it, the model will reward-hack. We'll see this in the next section, but the short version: the reward model is an imperfect proxy for human preferences, and optimizing hard against an imperfect proxy produces degenerate behavior. The KL penalty keeps the model close to "sane" SFT behavior.
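The per-token KL sum can be sketched directly. A minimal sketch; the log-probability arrays are made-up numbers standing in for what the two models would assign to one sampled response:

```python
import numpy as np

def kl_penalty(logp_policy, logp_ref, beta=0.01):
    """
    Single-sample estimate of the KL term: sum over response
    tokens of (log pi_theta - log pi_ref), scaled by beta.
    """
    per_token = logp_policy - logp_ref
    return beta * per_token.sum()

# Hypothetical per-token log-probs for a 3-token response;
# the policy assigns its own sample higher probability than the reference
logp_policy = np.array([-1.2, -0.8, -2.0])
logp_ref    = np.array([-1.5, -1.1, -2.1])
print(kl_penalty(logp_policy, logp_ref))  # 0.01 * (0.3 + 0.3 + 0.1) ≈ 0.007
```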

The Clipped Surrogate Objective

PPO (Proximal Policy Optimization) from Schulman et al. (2017) is the algorithm used to solve this optimization. Its core idea: don't change the policy too much in one step. The mechanism is a clipped surrogate objective.

First, compute the policy ratio — how much more (or less) likely is the current policy to generate a token compared to the old policy that collected the data:

rt(θ) = πθ(tokent | context) / πold(tokent | context)

Then clip this ratio and take the minimum with the unclipped version:

LCLIP = E[min(rt · At, clip(rt, 1−ε, 1+ε) · At)]

Where At is the advantage — how much better this action was than expected — and ε is typically 0.2. The clipping works like a recipe adjustment: "If the dish was better than expected, add a bit more of that ingredient — but no more than 20% more. If it was worse, reduce it — but no more than 20% less."

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """
    PPO clipped surrogate objective.

    log_probs_new: log pi_theta(a|s) for each token    — (batch,)
    log_probs_old: log pi_old(a|s) from data collection — (batch,)
    advantages:    A_t = reward - value_baseline         — (batch,)
    epsilon:       clipping range (default 0.2)
    """
    # Policy ratio: how much has the policy changed?
    ratio = np.exp(log_probs_new - log_probs_old)  # pi_new / pi_old

    # Unclipped objective
    surr1 = ratio * advantages

    # Clipped objective — limit how much the ratio can change
    clipped_ratio = np.clip(ratio, 1 - epsilon, 1 + epsilon)
    surr2 = clipped_ratio * advantages

    # Take the MINIMUM — the conservative (pessimistic) estimate
    # This prevents catastrophically large policy updates
    loss = -np.minimum(surr1, surr2).mean()
    return loss

# Example: a token had positive advantage (good action)
# ratio = 1.5 (policy now 50% more likely to generate this token)
# clipped_ratio = clip(1.5, 0.8, 1.2) = 1.2
# With A_t = 0.3:
#   surr1 = 1.5 * 0.3 = 0.45
#   surr2 = 1.2 * 0.3 = 0.36
#   min(0.45, 0.36) = 0.36 — gradient vanishes past the clip boundary
# The policy wants to increase this token's probability even more,
# but PPO says "you've already increased it by 20%, that's enough for now"
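A quick numerical check of that walkthrough, computing the same quantities directly:

```python
import numpy as np

ratio     = np.array([1.5])   # policy is now 50% more likely to pick the token
advantage = np.array([0.3])   # positive advantage: a good action
eps = 0.2

surr1 = ratio * advantage                             # unclipped: 0.45
surr2 = np.clip(ratio, 1 - eps, 1 + eps) * advantage  # clipped:   1.2 * 0.3
loss  = -np.minimum(surr1, surr2).mean()
print(loss)  # ≈ -0.36: the clipped branch wins, so the gradient stops at the boundary
```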

The Full PPO Training Loop

Putting it all together, one PPO training step looks like this:

def ppo_training_step(policy, ref_policy, reward_model, value_fn,
                      prompts, beta=0.01, epsilon=0.2):
    """
    One PPO update step for RLHF.

    policy:       the language model we're optimizing (pi_theta)
    ref_policy:   frozen SFT model (pi_ref)
    reward_model: trained reward scorer
    value_fn:     value function baseline (critic)
    prompts:      batch of prompts to generate responses for
    """
    all_losses = []

    for prompt in prompts:
        # 1. GENERATE — sample a response from the current policy
        response, log_probs_old = policy.generate(prompt)

        # 2. SCORE — get reward from the reward model
        reward = reward_model.forward(prompt + response)

        # 3. KL PENALTY — penalize divergence from the reference
        log_probs_ref = ref_policy.log_prob(prompt, response)
        kl_per_token = log_probs_old - log_probs_ref  # token-level KL
        total_reward = reward - beta * kl_per_token.sum()

        # 4. ADVANTAGE — how much better than expected?
        value = value_fn.forward(prompt + response)
        advantage = total_reward - value

        # 5. PPO UPDATE — clipped surrogate loss
        log_probs_new = policy.log_prob(prompt, response)
        loss = ppo_clipped_loss(log_probs_new, log_probs_old,
                                advantage, epsilon)
        all_losses.append(loss)

    # Backpropagate and update policy weights
    avg_loss = np.mean(all_losses)
    policy.backward(avg_loss)
    policy.optimizer_step()  # Adam, typically

    return avg_loss

Notice the complexity: we need four models in memory simultaneously — the policy, the reference (frozen copy), the reward model, and the value function (critic). For a 7B parameter model, that's roughly 56GB of weights in FP16. This is a significant engineering challenge, and one reason why a simpler alternative was desperately needed.
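The back-of-the-envelope arithmetic behind that number:

```python
params_per_model = 7e9   # 7B parameters
bytes_per_param  = 2     # FP16
n_models         = 4     # policy, reference, reward model, critic

total_gb = params_per_model * bytes_per_param * n_models / 1e9
print(total_gb)  # 56.0 GB of weights alone, before activations and optimizer state
```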

Goodhart's Law: When Reward Models Go Wrong

"When a measure becomes a target, it ceases to be a good measure." — Charles Goodhart

The reward model is a proxy for human preferences, not the real thing. As you optimize harder against this proxy, something breaks. Researchers have documented striking failure modes in real RLHF systems:

Gao et al. (2022) established scaling laws for reward model overoptimization: as you increase the KL budget (allowing more divergence from the reference), the proxy reward keeps climbing, but the true reward — actual human preference — peaks and then degrades. The KL penalty in the PPO objective is the primary defense, but it's imperfect. The InstructGPT team also found that mixing a small amount of the original pretraining gradient into PPO updates helped prevent catastrophic forgetting.

This is why alignment is an ongoing research challenge, not a solved problem. The reward model captures a useful signal about human preferences, but optimizing too hard against it is like teaching to a bad test — you get high scores without real understanding.

Stage 3b: DPO — The Elegant Shortcut

PPO works, but look at what it requires: four models, an RL training loop with its own set of hyperparameters (epsilon, beta, GAE lambda, learning rate schedules), a value function that needs its own training, and careful engineering to keep it all stable. What if we could skip all of that?

In 2023, Rafailov et al. published a paper with a delightful subtitle: "Your Language Model is Secretly a Reward Model." They showed that you can go directly from preference data to an optimal policy with a single supervised loss — no reward model, no RL loop, no value function. This is Direct Preference Optimization (DPO).

The Derivation

The math is worth following step by step, because the partition function cancellation at the end is genuinely beautiful.

Step 1: Start with the same RLHF objective as PPO — maximize reward minus KL penalty.

Step 2: Derive the closed-form optimal policy. Using variational calculus, the policy that maximizes this objective for any reward function r is:

π*(y|x) = (1/Z(x)) · πref(y|x) · exp(r(x,y) / β)

where Z(x) is a partition function that sums over all possible responses. This is intractable — you can't sum over every possible token sequence. So how does DPO work?

Step 3: Rearrange to solve for the reward:

r(x, y) = β · log(π*(y|x) / πref(y|x)) + β · log Z(x)

Step 4: Substitute into the Bradley-Terry preference model. For chosen response yc and rejected response yr:

P(yc > yr) = σ(r(x, yc) − r(x, yr))

When we expand the rewards, something magical happens: the partition function cancels.

r(x, yc) − r(x, yr) = β · log(π*/πref)(yc) − β · log(π*/πref)(yr) + β log Z(x) − β log Z(x)

The Z(x) terms are identical (both depend only on the prompt, not the response) and subtract to zero. The intractable normalization constant simply disappears because Bradley-Terry depends only on reward differences.
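The cancellation can be checked numerically on a toy problem where the response space is small enough to enumerate. The probabilities and rewards below are made up for illustration:

```python
import numpy as np

beta   = 0.1
pi_ref = np.array([0.5, 0.3, 0.2])   # reference probs over 3 toy responses
r      = np.array([1.0, 2.5, 0.5])   # made-up rewards

# Closed-form optimal policy: pi* proportional to pi_ref * exp(r / beta)
unnorm  = pi_ref * np.exp(r / beta)
Z       = unnorm.sum()               # tractable only because we enumerated y
pi_star = unnorm / Z

# Recovering an absolute reward needs the log Z term...
r_recovered = beta * np.log(pi_star / pi_ref) + beta * np.log(Z)
assert np.allclose(r_recovered, r)

# ...but a reward DIFFERENCE does not: Z cancels
diff_direct = r[1] - r[0]
diff_policy = beta * (np.log(pi_star[1] / pi_ref[1])
                      - np.log(pi_star[0] / pi_ref[0]))
print(diff_direct, diff_policy)  # both 1.5 (up to float error)
```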

Step 5: Replace the unknown optimal policy π* with our trainable policy πθ, and we get the DPO loss:

LDPO = −E[log σ(β · (log πθ(yc|x)/πref(yc|x) − log πθ(yr|x)/πref(yr|x)))]

That's it. A classification loss on preference pairs. No reward model. No RL. Just supervised learning with a clever loss function.

def dpo_loss(policy, ref_policy, prompt, chosen, rejected, beta=0.1):
    """
    Direct Preference Optimization loss.
    Collapses reward modeling + RL into a single supervised loss.

    beta: controls how much the policy can diverge from the reference.
          Small beta = trust preferences more, allow more divergence.
          Large beta = trust reference more, stay closer to SFT.
    """
    # Log-probabilities under the current policy
    log_pi_chosen  = policy.log_prob(prompt, chosen)      # scalar
    log_pi_rejected = policy.log_prob(prompt, rejected)    # scalar

    # Log-probabilities under the frozen reference (SFT model)
    log_ref_chosen  = ref_policy.log_prob(prompt, chosen)  # scalar
    log_ref_rejected = ref_policy.log_prob(prompt, rejected) # scalar

    # Log-probability RATIOS — the implicit reward
    log_ratio_chosen  = log_pi_chosen - log_ref_chosen
    log_ratio_rejected = log_pi_rejected - log_ref_rejected

    # DPO loss — push chosen ratio above rejected ratio
    logit = beta * (log_ratio_chosen - log_ratio_rejected)
    loss = -np.log(sigmoid(logit))
    return loss

# The implicit reward for any response:
# r_hat(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
# Your language model IS a reward model — reward is how much
# the policy upweights a response relative to the reference.

Notice what disappeared: no reward model to train, no value function, no advantage estimation, no clipping. DPO needs only two models in memory (policy + frozen reference) instead of four. It has essentially one hyperparameter (β) instead of a dozen. And the implicit reward — β · log(πθ/πref) — gives you a reward model for free.

Think of it this way: standard RLHF trains a food critic, then uses the critic's scores to guide a chef through trial and error. DPO shows the chef pairs of dishes directly — "this one is better than that one" — and the chef internalizes the preferences in one step, no critic needed.

Beyond RLHF: Modern Alignment

The field moves fast. Here are three important extensions that push beyond the original RLHF pipeline.

RLAIF — AI Feedback Instead of Human Feedback

Human annotation is expensive and slow. What if we used an LLM to judge responses instead? This is RLAIF (Reinforcement Learning from AI Feedback). Generate multiple responses, ask a capable model "which is better according to these principles?", and train on the AI-generated preferences. Dramatically cheaper, but the AI judge inherits its own biases — it prefers longer responses, shows positional bias (favoring the first option), and may prefer its own outputs.

Constitutional AI

Anthropic's approach makes the alignment criteria explicit and auditable. Instead of training from opaque human preferences, you write a constitution — a set of natural-language principles like "Choose the response that is most helpful" or "Choose the response that is least likely to cause harm." The model critiques its own responses against these principles, revises them, and the revised responses become training data. Stage 2 uses RLAIF: the model itself judges which response better satisfies the constitution. This makes the "what should the model optimize for?" question transparent and adjustable.

KTO — Beyond Paired Preferences

DPO still requires paired preferences: "response A is better than B for this prompt." Ethayarajh et al. (2024) showed that even this is unnecessary. KTO (Kahneman-Tversky Optimization) works with just binary feedback: "this response is good" or "this response is bad" — like a thumbs up/down button. Grounded in Kahneman and Tversky's prospect theory from behavioral economics, KTO models the asymmetry that humans feel losses more strongly than equivalent gains. It matches DPO performance despite using weaker supervision, and handles contradictory feedback from different annotators more gracefully.

The trend is clear: each generation of alignment techniques needs less supervision — from detailed rankings (RLHF), to pairwise comparisons (DPO), to binary signals (KTO). The math gets simpler, the training gets more stable, and the results hold up.

Try It: The Preference Arena

Click on the response you prefer. Watch the reward model learn your preferences in real time, and see how the policy shifts — while the KL divergence meter tracks how far it's drifted from the base model.


What We Didn't Cover

RLHF sits at the intersection of deep learning, reinforcement learning, and human psychology. We've built the core pipeline, but there's a vast landscape beyond it.


We've now built the complete pipeline from raw text to aligned assistant:

tokenize → embed → position → normalize → attend (+ KV cache) → FFN → softmax → loss → optimize → decode → fine-tune → quantize → align

From subword tokens to a model that follows instructions, respects preferences, and stays on its leash. Every piece built from scratch, every equation derived from first principles. That's the whole thing.