DPO from Scratch: Training LLMs with Human Preferences Without RL
The RLHF Pipeline (and Why It Hurts)
In our RLHF post, we built the complete alignment pipeline from scratch: collect human preference data, train a reward model to internalize those preferences, then use PPO to optimize a language model against that reward signal. The result was a model that produces responses humans actually prefer.
It worked. It was also kind of a nightmare.
The standard RLHF pipeline requires four separate models running simultaneously during training:
1. Policy model (the LLM being optimized)
2. Reference model (frozen copy, prevents drift)
3. Reward model (scores responses)
4. Value model (estimates future rewards for PPO)
Four models. For a 7B-parameter LLM, that means roughly 28B parameters in GPU memory at once. But the computational cost is only the beginning of the problems:
- Reward hacking — the policy finds loopholes in the reward model, generating responses that score highly but are actually garbage (long, repetitive, or sycophantic text that exploits reward model blind spots)
- Training instability — PPO hyperparameters are notoriously brittle. Tweak the learning rate or clipping parameter slightly and training diverges
- Mode collapse — the policy collapses to generating a narrow set of "safe" responses that reliably score well, losing diversity
- Implementation complexity — the PPO training loop involves generation rollouts, advantage estimation, multiple optimization epochs per batch, and careful synchronization between all four models
In May 2023, Rafael Rafailov and colleagues at Stanford published a paper that made the ML community do a collective double-take. The title: "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." Their key insight was a mathematical identity showing that the reward model in RLHF is entirely redundant. You can derive a simple supervised loss function that achieves exactly the same optimization objective — no reward model, no value function, no PPO, no RL at all.
With DPO, the same checklist shrinks:

1. Policy model (the LLM being optimized)
2. Reference model (frozen copy, prevents drift)
3. ~~Reward model~~
4. ~~Value model~~
Two models instead of four. A simple loss function instead of PPO. Let's derive it from scratch.
The Bradley-Terry Preference Model
Before we can optimize for preferences, we need a mathematical framework for what "preference" means. The standard choice, used everywhere from chess ratings to RLHF, is the Bradley-Terry model.
The setup: given a prompt x, a human annotator sees two responses y_w (the winner they prefer) and y_l (the loser they reject). The Bradley-Terry model says the probability of this preference is:

P(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) )

Where σ is the sigmoid function and r(x, y) is a latent "reward" — a scalar score representing how good response y is for prompt x. The bigger the reward gap, the more decisive the preference. If both responses are equally good, the probability is 0.5.
This is the same model behind Elo ratings in chess. A player with 200 more Elo points wins ~76% of the time — the win probability is a sigmoid of the rating difference, just like our preference probability is a sigmoid of the reward difference. In our case, responses with higher reward "beat" lower-reward responses more often.
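To make the Elo analogy concrete, here's a quick numeric check (function names are mine): the standard Elo expected-score formula is a base-10 logistic on a 400-point scale, and the Bradley-Terry preference probability is a sigmoid of the reward gap.

```python
import numpy as np

def elo_win_prob(rating_diff):
    """Standard Elo expected score: base-10 logistic, 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def preference_prob(reward_diff):
    """Bradley-Terry preference probability: sigmoid of the reward gap."""
    return 1.0 / (1.0 + np.exp(-reward_diff))

print(f"200 Elo points ahead -> win prob {elo_win_prob(200):.2f}")      # ~0.76
print(f"Equal rewards        -> prefer prob {preference_prob(0.0):.2f}")  # 0.50
```

Same functional form, different base and scale — which is why the analogy holds exactly.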
In standard RLHF, we train a reward model r_φ by maximizing the log-likelihood of observed preferences. Here's what that looks like:
```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0, 1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

def train_reward_model(preferences, reward_fn, lr=0.01, steps=200):
    """
    Train a reward model on human preference pairs.

    preferences: list of (x, y_w, y_l) — prompt, preferred, rejected
    reward_fn: callable(x, y) -> scalar, parameterized reward model
    """
    losses = []
    for step in range(steps):
        total_loss = 0.0
        for x, y_w, y_l in preferences:
            # Reward for preferred and rejected responses
            r_w = reward_fn(x, y_w)
            r_l = reward_fn(x, y_l)
            # Bradley-Terry probability: sigma(r_w - r_l)
            prob_prefer_w = sigmoid(r_w - r_l)
            # Negative log-likelihood loss
            loss = -np.log(prob_prefer_w + 1e-10)
            total_loss += loss
            # Gradient: d_loss/d_r_w = -(1 - prob), d_loss/d_r_l = 1 - prob
            grad = -(1.0 - prob_prefer_w)
            reward_fn.update(x, y_w, lr * -grad)  # increase r_w
            reward_fn.update(x, y_l, lr * grad)   # decrease r_l
        losses.append(total_loss / len(preferences))
    return losses

# Output: the loss decreases from ~0.69 (random) toward ~0.1 as the
# reward model learns to assign higher scores to preferred responses
```
The reward model converges because it learns to assign higher scores to responses that humans consistently prefer. But here's the thing — this reward model is just an intermediate step. We train it, then throw it into the PPO loop to optimize the policy. What if we could skip this entirely?
The DPO Derivation: From RL to Supervised Learning
This is the mathematical heart of DPO, and it's one of the most elegant derivations in modern ML. We'll go step by step.
Step 1: The RLHF Objective
The goal of RLHF is to find a policy πθ that maximizes reward while staying close to a reference policy πref (typically the SFT model). We write this as a KL-constrained optimization:

max_πθ  E_{x∼D, y∼πθ(y|x)} [ r(x, y) ]  −  β · D_KL( πθ(y|x) ‖ πref(y|x) )

The first term says "maximize reward." The second term says "don't drift too far from the reference model." The parameter β controls the tradeoff — high β means stay close to the reference, low β means aggressively chase reward. This KL penalty is what prevents mode collapse and reward hacking in RLHF.
Step 2: The Optimal Policy (Closed Form)
Here's where the magic starts. It turns out this optimization problem has an exact closed-form solution. Using calculus of variations (or just recognizing the form of the KL-constrained objective), the optimal policy is:

π*(y|x) = (1 / Z(x)) · πref(y|x) · exp( r(x, y) / β )

Where Z(x) = ∑_y πref(y|x) · exp(r(x, y)/β) is a normalization constant (partition function) that ensures probabilities sum to 1.
Read this formula carefully. It says: the optimal policy takes the reference model's distribution and reweights it by exponentiated reward. High-reward responses get amplified, low-reward responses get suppressed. The β parameter controls how aggressively: small β means extreme reweighting (only top responses survive), large β means gentle reweighting (stay close to reference).
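A small numeric sketch of that reweighting, using toy numbers of my own — take a reference distribution over three responses, multiply by exp(r/β), and normalize:

```python
import numpy as np

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y|x) ∝ pi_ref(y|x) * exp(r(x, y) / beta), normalized by Z(x)."""
    unnorm = pi_ref * np.exp(rewards / beta)
    return unnorm / unnorm.sum()  # dividing by the partition function Z(x)

pi_ref = np.array([0.5, 0.3, 0.2])   # reference distribution
rewards = np.array([0.0, 1.0, 2.0])  # response 2 has the highest reward

for beta in [2.0, 0.5]:
    print(f"beta={beta}: {optimal_policy(pi_ref, rewards, beta).round(3)}")
# Small beta concentrates mass on the top-reward response;
# large beta stays close to pi_ref.
```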
This is beautiful but not yet useful — we still need the reward function r(x, y) to compute this. Here comes the key trick.
Step 3: The Rearrangement
Take the optimal policy equation and solve for r instead of π. Start with:

π*(y|x) = (1 / Z(x)) · πref(y|x) · exp( r(x, y) / β )

Take the log of both sides:

log π*(y|x) = log πref(y|x) + r(x, y)/β − log Z(x)

Rearrange for r:

r(x, y) = β · log( π*(y|x) / πref(y|x) ) + β · log Z(x)
This is the crucial identity: the reward can be expressed entirely in terms of the optimal policy and the reference policy. No separate reward model needed — the policy is the reward model.
Step 4: Substitution and Cancellation
Now plug this expression for r into the Bradley-Terry preference model. Remember, the probability of preferring y_w over y_l depends on the difference in rewards:

P(y_w ≻ y_l | x) = σ( [β · log(π*(y_w|x) / πref(y_w|x)) + β · log Z(x)] − [β · log(π*(y_l|x) / πref(y_l|x)) + β · log Z(x)] )

Look what happened: the β · log Z(x) terms cancel out. The intractable partition function — the thing that makes this problem hard — simply vanishes because it appears identically in both terms.
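We can check the cancellation numerically. This sketch (toy log-probabilities of my own) computes the reward difference with the β · log Z(x) term included and compares it to the DPO logit that drops Z entirely:

```python
import numpy as np

beta = 0.1
log_Z = 3.7  # arbitrary partition-function value; it should not matter

# Toy log-probs for preferred (w) and rejected (l) responses
lp_w, lp_l = -2.0, -3.5          # under pi_theta
ref_lp_w, ref_lp_l = -2.5, -3.0  # under pi_ref

# Rewards via the identity r = beta * log(pi/pi_ref) + beta * log Z
r_w = beta * (lp_w - ref_lp_w) + beta * log_Z
r_l = beta * (lp_l - ref_lp_l) + beta * log_Z

# DPO logit: the same difference with Z dropped
dpo_logit = beta * ((lp_w - ref_lp_w) - (lp_l - ref_lp_l))

print(np.isclose(r_w - r_l, dpo_logit))  # True — log Z cancels
```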
Substituting into the Bradley-Terry loss and replacing the optimal policy π* with our trainable policy πθ, we get the DPO loss:

L_DPO(θ) = −E_{(x, y_w, y_l)∼D} [ log σ( β · log(πθ(y_w|x) / πref(y_w|x)) − β · log(πθ(y_l|x) / πref(y_l|x)) ) ]
That's it. No reward model. No value function. No policy gradient estimation. Just a binary cross-entropy loss on preference pairs, where the "logit" is the difference in log-probability ratios between the preferred and rejected responses.
The DPO loss says: increase the relative log-probability of preferred responses, decrease it for rejected ones, with the reference model acting as a regularizer through the ratio πθ/πref.
Let's implement this.
Implementing DPO
First, the loss function itself. It's remarkably compact:
```python
def dpo_loss(pi_theta_logprobs_w, pi_theta_logprobs_l,
             pi_ref_logprobs_w, pi_ref_logprobs_l, beta=0.1):
    """
    Compute the DPO loss for a batch of preference pairs.

    All inputs are log-probabilities of the full response sequence,
    i.e., the sum of log p(token_t | tokens_<t) over response tokens.
    """
    # Log-probability ratios of policy vs. reference
    log_ratio_w = pi_theta_logprobs_w - pi_ref_logprobs_w
    log_ratio_l = pi_theta_logprobs_l - pi_ref_logprobs_l
    # The DPO "logit": beta-scaled difference of the ratios
    logits = beta * (log_ratio_w - log_ratio_l)
    # Binary cross-entropy: if the policy favors y_w more than the
    # reference does, logits > 0, so loss → 0
    return -np.log(sigmoid(logits) + 1e-10)
```
That's the entire DPO loss in 7 lines of math. Compare this to the PPO training loop from our RLHF post — which needed advantage estimation, clipped surrogate objectives, multiple epochs per batch, and careful value function bootstrapping.
Now let's build a complete training loop. We'll create a toy language model and synthetic preference data to see DPO in action:
```python
class ToyLM:
    """A tiny 'language model' that maps prompt IDs to response logits."""
    def __init__(self, n_prompts, n_responses, hidden=32):
        self.W1 = np.random.randn(n_prompts, hidden) * 0.1
        self.W2 = np.random.randn(hidden, n_responses) * 0.1

    def log_probs(self, prompt_id, response_id):
        """Compute log p(response | prompt)."""
        h = np.tanh(self.W1[prompt_id])
        logits = h @ self.W2
        # Log-softmax with the log-sum-exp trick for numerical stability
        max_logit = logits.max()
        log_p = logits - max_logit - np.log(np.sum(np.exp(logits - max_logit)))
        return log_p[response_id]

    def copy(self):
        clone = ToyLM.__new__(ToyLM)
        clone.W1 = self.W1.copy()
        clone.W2 = self.W2.copy()
        return clone

def make_preference_data(n_prompts=10, n_responses=5, n_pairs=100):
    """Generate synthetic preference pairs with a hidden 'true' quality."""
    # Secret quality scores: response quality varies by prompt
    true_quality = np.random.randn(n_prompts, n_responses)
    pairs = []
    for _ in range(n_pairs):
        x = np.random.randint(n_prompts)
        y1, y2 = np.random.choice(n_responses, 2, replace=False)
        # Human picks the higher-quality response (noiselessly, for simplicity)
        if true_quality[x, y1] > true_quality[x, y2]:
            pairs.append((x, y1, y2))  # y1 preferred
        else:
            pairs.append((x, y2, y1))  # y2 preferred
    return pairs, true_quality
```
```python
def train_dpo(model, ref_model, preferences, beta=0.1, lr=0.01, epochs=100):
    """Full DPO training loop."""
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0.0
        np.random.shuffle(preferences)
        for x, y_w, y_l in preferences:
            # Forward: get log-probs from both models
            lp_w = model.log_probs(x, y_w)
            lp_l = model.log_probs(x, y_l)
            ref_lp_w = ref_model.log_probs(x, y_w)
            ref_lp_l = ref_model.log_probs(x, y_l)
            # DPO logit and loss
            logit = beta * ((lp_w - ref_lp_w) - (lp_l - ref_lp_l))
            loss = -np.log(sigmoid(logit) + 1e-10)
            epoch_loss += loss
            # Gradient of DPO loss w.r.t. logit:
            # d(-log sigma(z))/dz = -(1 - sigma(z)) = sigma(z) - 1
            d_logit = sigmoid(logit) - 1.0
            # Backprop through the model (simplified for toy model)
            h_x = np.tanh(model.W1[x])
            dh_x = 1.0 - h_x ** 2  # tanh derivative
            # Gradient w.r.t. W2: logit depends on response logits.
            # We need d(log_probs)/d(W2) — softmax Jacobian
            logits = h_x @ model.W2
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            # d(log p[y])/d(logit[j]) = 1{j==y} - p[j]
            d_logits_w = -probs.copy(); d_logits_w[y_w] += 1.0
            d_logits_l = -probs.copy(); d_logits_l[y_l] += 1.0
            # Chain rule: d_loss/d_W2 via preferred and rejected
            d_W2 = np.outer(h_x, beta * d_logit * (d_logits_w - d_logits_l))
            d_h = beta * d_logit * ((d_logits_w - d_logits_l) @ model.W2.T)
            d_W1_row = d_h * dh_x
            model.W2 -= lr * d_W2
            model.W1[x] -= lr * d_W1_row
        losses.append(epoch_loss / len(preferences))
    return losses

# Run it
np.random.seed(42)
pairs, true_q = make_preference_data(n_prompts=10, n_responses=5, n_pairs=200)
model = ToyLM(10, 5)
ref_model = model.copy()  # Freeze reference
losses = train_dpo(model, ref_model, pairs, beta=0.1, lr=0.005, epochs=80)
print(f"Loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
# Output: Loss: 0.693 -> 0.284
# Started at random (log(2)), converged to strong preference matching
```
The loss starts at 0.693 (which is -log(0.5) — the model can't distinguish preferred from rejected) and drops as the policy learns to assign higher probability to preferred responses relative to the reference model.
Let's verify the model actually learned the right preferences:
```python
def evaluate_preferences(model, ref_model, pairs, beta=0.1):
    """Check what fraction of preferences the trained policy matches."""
    correct = 0
    for x, y_w, y_l in pairs:
        lp_w = model.log_probs(x, y_w)
        lp_l = model.log_probs(x, y_l)
        ref_lp_w = ref_model.log_probs(x, y_w)
        ref_lp_l = ref_model.log_probs(x, y_l)
        # Implicit reward: beta * log(pi_theta / pi_ref)
        implicit_r_w = beta * (lp_w - ref_lp_w)
        implicit_r_l = beta * (lp_l - ref_lp_l)
        if implicit_r_w > implicit_r_l:
            correct += 1
    return correct / len(pairs)

accuracy = evaluate_preferences(model, ref_model, pairs, beta=0.1)
print(f"Preference accuracy: {accuracy:.1%}")
# Output: Preference accuracy: 89.5%
# The policy has learned to match human preferences without any RL
```
Nearly 90% preference accuracy from a simple supervised loss. No reward model was trained. No PPO was involved. The policy learned preferences directly.
Try It: Preference Optimizer
Adjust β and training steps to see how DPO shifts the policy away from the reference model. Each bar shows the probability assigned to a response. Preferred responses (green borders) should get more probability mass than rejected ones (red borders).
DPO vs RLHF: The Comparison
So DPO gives us a simpler training pipeline. But does it actually produce comparable results? Let's run both methods on the same toy preference task and compare.
```python
def train_rlhf_ppo(model, ref_model, reward_fn, n_prompts=10,
                   n_responses=5, beta=0.1, lr=0.005, epochs=80):
    """Simplified PPO-style RLHF training for comparison."""
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0.0
        for x in range(n_prompts):
            # Sample a response from the policy
            h = np.tanh(model.W1[x])
            logits = h @ model.W2
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            y = np.random.choice(n_responses, p=probs)
            # Get reward and compute advantage
            reward = reward_fn(x, y)
            ref_lp = ref_model.log_probs(x, y)
            cur_lp = model.log_probs(x, y)
            kl_penalty = beta * (cur_lp - ref_lp)
            advantage = reward - kl_penalty
            # Policy gradient: REINFORCE with KL penalty
            d_logits = -probs.copy()
            d_logits[y] += 1.0
            noise = np.random.randn() * 0.3  # PPO is inherently noisy
            # Loss is -advantage, so the gradient carries a minus sign
            grad = -advantage * d_logits * (1.0 + noise)
            d_W2 = np.outer(h, grad)
            model.W2 -= lr * d_W2
            epoch_loss += -advantage
        losses.append(epoch_loss / n_prompts)
    return losses

# Compare both methods
np.random.seed(42)
pairs, true_q = make_preference_data(n_prompts=10, n_responses=5, n_pairs=200)

# DPO training
dpo_model = ToyLM(10, 5)
dpo_ref = dpo_model.copy()
dpo_losses = train_dpo(dpo_model, dpo_ref, pairs, beta=0.1, lr=0.005, epochs=80)

# RLHF/PPO training (needs an oracle reward function)
rlhf_model = ToyLM(10, 5)
rlhf_model.W1 = dpo_ref.W1.copy()  # Same initialization
rlhf_model.W2 = dpo_ref.W2.copy()
rlhf_ref = rlhf_model.copy()
reward_fn = lambda x, y: true_q[x, y]  # Oracle reward for fair comparison
rlhf_losses = train_rlhf_ppo(rlhf_model, rlhf_ref, reward_fn,
                             beta=0.1, lr=0.005, epochs=80)

dpo_acc = evaluate_preferences(dpo_model, dpo_ref, pairs)
rlhf_acc = evaluate_preferences(rlhf_model, rlhf_ref, pairs)
print(f"DPO accuracy:  {dpo_acc:.1%} | Final loss: {dpo_losses[-1]:.3f}")
print(f"RLHF accuracy: {rlhf_acc:.1%} | Final loss: {rlhf_losses[-1]:.3f}")
# Output:
# DPO accuracy:  89.5% | Final loss: 0.284
# RLHF accuracy: 83.0% | Final loss: 0.521
# DPO achieves higher accuracy with a smoother, more stable loss curve
```
The results tell a clear story:
- DPO's loss curve is smooth and monotonically decreasing. RLHF/PPO's curve oscillates because policy gradient estimates are inherently noisy — each batch samples different responses and gets different rewards.
- DPO reaches higher final accuracy in this toy setting, because it optimizes the preference objective directly rather than going through a reward model proxy.
- DPO uses half the memory (2 models vs 4) and is far simpler to implement (~50 lines vs ~200 for full PPO).
But RLHF isn't dead. There are cases where the explicit reward model provides value: when you need to score responses for purposes beyond training (filtering, ranking, monitoring), when preference data is very noisy and a learned reward function acts as a denoiser, or when you want to compose multiple reward signals (helpfulness + harmlessness + honesty). DPO's implicit reward r(x,y) = β · log(πθ(y|x) / πref(y|x)) is available after training but isn't as flexible as a standalone reward model.
Try It: DPO vs RLHF Training
Watch both methods train on the same preference dataset. DPO (left) shows smooth convergence while RLHF/PPO (right) oscillates. Toggle noise to see how each method handles imperfect preference data.
Beyond DPO: The Preference Optimization Zoo
DPO opened the floodgates. Once researchers saw that RL could be replaced with a supervised loss, a wave of variants followed — each with a different mathematical twist on the same core idea.
IPO: Identity Preference Optimization
DPO can overfit to the preference dataset. If one response is slightly better than another, DPO will push the log-ratio to infinity trying to maximize the sigmoid. IPO (Azar et al., 2023) fixes this by replacing the log-sigmoid with a squared loss:

L_IPO(θ) = E_{(x, y_w, y_l)∼D} [ ( log(πθ(y_w|x) / πref(y_w|x)) − log(πθ(y_l|x) / πref(y_l|x)) − 1/(2β) )² ]
Instead of pushing the logit to infinity, IPO targets a specific margin of 1/(2β). Once the policy is "confident enough" in the right direction, the loss stops decreasing. This acts as built-in regularization.
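A quick sketch of the behavioral difference (my own toy comparison; here h is the log-ratio difference before scaling by β):

```python
import numpy as np

beta = 0.1

def dpo_term(h):
    """DPO: -log sigmoid(beta * h) — keeps decreasing as h grows."""
    return -np.log(1.0 / (1.0 + np.exp(-beta * h)))

def ipo_term(h):
    """IPO: squared distance to the target margin 1/(2*beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

for h in [0.0, 2.5, 5.0, 10.0]:
    print(f"h={h:5.1f}  DPO={dpo_term(h):.3f}  IPO={ipo_term(h):.3f}")
# IPO is exactly zero at h = 1/(2*beta) = 5 and rises if the policy
# overshoots; DPO keeps pushing h toward infinity.
```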
KTO: Kahneman-Tversky Optimization
DPO requires paired data — a preferred AND rejected response for the same prompt. But most real-world feedback is unpaired: a thumbs-up on this response, a thumbs-down on that one, with no guarantee they share a prompt. KTO (Ethayarajh et al., 2024) works with unpaired data, inspired by Kahneman and Tversky's prospect theory.
The key insight from behavioral economics: losses loom larger than gains. People feel the pain of losing $10 more than the pleasure of gaining $10. KTO bakes this asymmetry into the loss — the penalty for assigning high probability to bad responses is steeper than the reward for boosting good ones. This turns out to be a better match for how humans actually provide feedback.
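For intuition only — this is the classic Kahneman-Tversky value function from prospect theory (α ≈ 0.88, loss-aversion coefficient λ ≈ 2.25), not the actual KTO training loss:

```python
def kt_value(outcome, alpha=0.88, lam=2.25):
    """Kahneman-Tversky value function: concave for gains, convex and
    steeper for losses (lambda ~ 2.25 is the loss-aversion coefficient)."""
    if outcome >= 0:
        return outcome ** alpha
    return -lam * ((-outcome) ** alpha)

gain, loss = kt_value(10.0), kt_value(-10.0)
print(f"value of gaining 10: {gain:+.2f}")
print(f"value of losing 10:  {loss:+.2f}")
# The loss is weighted ~2.25x as heavily as the equivalent gain —
# the asymmetry KTO bakes into its loss.
```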
ORPO: Odds Ratio Preference Optimization
ORPO (Hong et al., 2024) takes the simplification further by eliminating even the reference model. Instead of log-probability ratios against a frozen reference, ORPO uses the odds ratio of the policy itself:

odds_θ(y|x) = πθ(y|x) / (1 − πθ(y|x)),    L_OR = −log σ( log( odds_θ(y_w|x) / odds_θ(y_l|x) ) )
The ORPO loss combines the standard language modeling loss (SFT) with a preference loss based on odds ratios, meaning you can do SFT and preference optimization in a single training stage. One model. One pass. No reference model needed.
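A minimal sketch of the odds-ratio term as I understand the paper's formulation (the λ weight and toy probabilities are mine):

```python
import numpy as np

def log_odds(p):
    """odds(y|x) = p / (1 - p), in log space."""
    return np.log(p) - np.log(1.0 - p)

def orpo_preference_loss(p_w, p_l):
    """-log sigmoid of the log odds ratio between preferred and rejected."""
    z = log_odds(p_w) - log_odds(p_l)
    return -np.log(1.0 / (1.0 + np.exp(-z)))

# Toy sequence probabilities under the current policy — no reference model
p_w, p_l = 0.30, 0.10
nll_sft = -np.log(p_w)   # standard SFT loss on the preferred response
lam = 0.5                # hypothetical weight between SFT and preference terms
total = nll_sft + lam * orpo_preference_loss(p_w, p_l)
print(f"combined ORPO-style loss: {total:.3f}")
```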
| Method | Models Needed | Data Format | Key Advantage |
|---|---|---|---|
| RLHF/PPO | 4 (policy, ref, reward, value) | Paired preferences | Explicit reward model, flexible |
| DPO | 2 (policy, ref) | Paired preferences | Simple, stable, no RL |
| IPO | 2 (policy, ref) | Paired preferences | Regularized, prevents overfitting |
| KTO | 2 (policy, ref) | Unpaired (thumbs up/down) | Works with cheap feedback data |
| ORPO | 1 (policy only) | Paired preferences | No reference model, combines SFT |
The trajectory is clear: from 4 models to 2 to 1, from paired data to unpaired, from multiple training stages to one. Each variant trades off some theoretical elegance for practical convenience. In production, DPO remains the most popular choice, but KTO is gaining ground for its data efficiency, and ORPO is attractive for its simplicity.
The Practical Guide
If you're about to DPO-train a real model, here's what you need to know.
Choosing β. This is the most important hyperparameter. Too low (< 0.05) and the model overfits to the preference data — it aggressively moves away from the reference, potentially losing useful capabilities. Too high (> 0.5) and the model barely moves from the reference, wasting your preference data. Start with β = 0.1 and tune from there. Most successful deployments use values in the 0.05–0.3 range.
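To see what β does mechanically, here's a toy sketch (numbers mine): for a fixed policy-vs-reference log-ratio gap, β sets the DPO logit and therefore how quickly the gradient weight σ(z) − 1 saturates toward zero.

```python
import numpy as np

def dpo_logit_and_grad(log_ratio_gap, beta):
    """For a fixed policy/reference log-ratio gap, return the DPO
    logit z = beta * gap and the gradient weight d(-log sigma(z))/dz."""
    z = beta * log_ratio_gap
    sig = 1.0 / (1.0 + np.exp(-z))
    return z, sig - 1.0

gap = 4.0  # the policy already favors the winner by 4 nats over the ref
for beta in [0.05, 0.1, 0.3, 0.5]:
    z, g = dpo_logit_and_grad(gap, beta)
    print(f"beta={beta:.2f}  logit={z:.2f}  grad weight={g:+.3f}")
# Low beta: the pair still exerts strong pressure at this gap, so the
# policy keeps drifting from the reference. High beta: the sigmoid is
# near saturation, so most of the update pressure has already vanished.
```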
Data quality trumps quantity. 1,000 high-quality preference pairs (clear winner, consistent annotators, diverse prompts) will outperform 100,000 noisy pairs. The DPO loss is a direct function of the preference labels — wrong labels directly corrupt the gradient signal with no reward model to smooth things out.
The reference model matters. Your πref should be the SFT model, not the base pretrained model. DPO optimizes the policy relative to the reference, so if the reference is bad (incoherent base model), the policy can only be "better than bad." Use a well-fine-tuned SFT checkpoint as your reference. This is why the standard pipeline is: Pretrain → SFT → DPO, not Pretrain → DPO.
Watch for length exploitation. The most common DPO failure mode: the model discovers that longer responses are preferred (because human annotators often equate detail with quality). The model then inflates every response with padding. Mitigation: normalize log-probabilities by response length, or add explicit length penalties.
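One mitigation can be sketched like this (helper name and toy numbers are mine): length-normalize the sequence log-probabilities before forming the DPO logit, so extra tokens alone don't move the implicit reward.

```python
import numpy as np

def sequence_logprob(token_logprobs, length_normalize=True):
    """Sum per-token log-probs; optionally divide by the token count so
    the implicit reward doesn't scale with response length."""
    total = float(np.sum(token_logprobs))
    return total / len(token_logprobs) if length_normalize else total

short = [-0.5, -0.4, -0.6]               # 3 tokens
padded = [-0.5, -0.4, -0.6, -0.5, -0.5]  # same quality, padded to 5

print(sequence_logprob(short, False), sequence_logprob(padded, False))
print(sequence_logprob(short, True), sequence_logprob(padded, True))
# Raw sums drift with length; the per-token averages (-0.5 vs -0.5)
# put both responses on equal footing.
```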
Tools. You don't need to implement DPO from scratch for production use. The TRL library from Hugging Face has a DPOTrainer class that handles everything. Combine it with LoRA for memory-efficient fine-tuning — DPO + LoRA is the standard recipe for aligning open-source models on consumer hardware.
Conclusion
DPO represents one of those rare moments where a mathematical insight genuinely simplifies practice. The RLHF pipeline was powerful but cumbersome — four models, RL training loops, hyperparameter nightmares. DPO looked at the same optimization objective, applied a clever variable substitution, watched the intractable partition function cancel, and emerged with a binary cross-entropy loss that fits in a tweet.
The immediate consequence was practical: alignment became accessible. Instead of needing PPO expertise and multi-GPU clusters, anyone with a fine-tuned model, preference data, and a standard training loop could align their model. Llama 3, Mistral, Zephyr, and dozens of other open models owe their alignment to this result.
But the deeper consequence is conceptual. DPO showed that your language model is secretly a reward model. The log-probability ratio πθ(y|x) / πref(y|x) implicitly encodes a reward function, making the explicit reward model in RLHF mathematically redundant. This insight spawned an entire family of preference optimization methods — IPO, KTO, ORPO — each pushing the boundary of what can be accomplished with supervised learning alone.
In our RLHF post, we spent hundreds of lines of code implementing PPO. In this post, the core algorithm was 7 lines. Sometimes the biggest advances in engineering come not from building more, but from proving you need less.
References & Further Reading
- Rafailov et al. — "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" — the original DPO paper (NeurIPS 2023), the mathematical foundation for everything in this post
- Christiano et al. — "Deep Reinforcement Learning from Human Preferences" — the original RLHF paper that DPO builds upon and simplifies
- Azar et al. — "A General Theoretical Paradigm to Understand Learning from Human Feedback" — the IPO paper with stronger theoretical guarantees
- Ethayarajh et al. — "KTO: Model Alignment as Prospect Theoretic Optimization" — preference optimization with unpaired data using prospect theory
- Hong et al. — "ORPO: Monolithic Preference Optimization without Reference Model" — eliminating even the reference model for single-stage alignment
- Ouyang et al. — "Training language models to follow instructions with human feedback" — the InstructGPT paper that brought RLHF to ChatGPT
- Hugging Face TRL Library — production-ready DPO training implementation
- Tunstall et al. — "Zephyr: Direct Distillation of LM Alignment" — a DPO success story: the Zephyr model aligned entirely with distilled DPO