
Imitation Learning from Scratch

1. Learning Without Rewards

Every reinforcement learning algorithm you've seen — Q-learning, policy gradients, PPO — needs a reward function. But what if you don't have one? What if, instead of a score telling you how well you're doing, you just have a human showing you what to do?

This is imitation learning: learning a policy directly from expert demonstrations, without ever defining what "good" means numerically. The expert performs the task, the learner watches, and the goal is to reproduce the expert's behavior in new situations.

It's more common than you might think. Tesla's self-driving system learns from millions of hours of human driving. Robots learn to fold laundry by watching human demonstrations. And every time you fine-tune a language model on human-written text (supervised fine-tuning, or SFT), you're doing imitation learning — the "expert" is the human writer, the "state" is the context so far, and the "action" is the next token.

The naïve approach — just treat it as supervised learning — works surprisingly well at first and fails surprisingly badly later. Understanding why it fails, and what to do about it, is the core of this post. We'll build from behavioral cloning through the distributional shift problem, the DAgger algorithm that fixes it, and arrive at inverse reinforcement learning — recovering the expert's hidden reward function from their behavior alone.

2. Behavioral Cloning

The simplest form of imitation learning is behavioral cloning (BC): treat demonstration data as a supervised learning dataset. The expert gives us trajectories of (state, action) pairs. We train a neural network πθ(a|s) to predict the expert's action given the state, minimizing:

L(θ) = −E(s,a)∼Dexpert[log πθ(a|s)]

For discrete actions, this is cross-entropy loss. For continuous actions, it's regression (MSE or Gaussian NLL). That's it — no Bellman equations, no temporal difference learning, no reward shaping. Just supervised learning on state-action pairs.
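The two loss forms can be written out in a few lines. A minimal sketch (the logits and expert labels below are made up for illustration; any softmax classifier loss works the same way):

```python
import numpy as np

def bc_loss_discrete(logits, expert_actions):
    """Cross-entropy BC loss: -mean log pi(a_expert | s)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(expert_actions)), expert_actions].mean()

def bc_loss_continuous(pred_actions, expert_actions):
    """MSE BC loss for continuous actions."""
    return ((pred_actions - expert_actions) ** 2).mean()

# Toy example: 3 states, 4 discrete actions
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 2.0, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 2.0]])
expert = np.array([0, 1, 3])  # expert picked the high-logit action each time
print(bc_loss_discrete(logits, expert))  # low loss: policy agrees with expert
```

Swapping in mismatched expert labels drives the loss up, exactly as a misclassifying classifier's cross-entropy would.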

Let's implement BC on a simple 2D navigation task where the expert follows a smooth curved path:

import numpy as np

def expert_policy(state):
    """Expert follows a sinusoidal path: target_y = sin(x)."""
    x, y = state
    target_y = np.sin(x * 2)
    dy = np.clip(target_y - y, -0.3, 0.3)
    return np.array([0.2, dy])  # constant forward speed, corrective steering

def collect_expert_demos(n_trajectories=20, horizon=30):
    """Collect expert demonstrations."""
    states, actions = [], []
    for _ in range(n_trajectories):
        s = np.array([0.0, np.random.randn() * 0.1])
        for t in range(horizon):
            a = expert_policy(s)
            states.append(s.copy())
            actions.append(a.copy())
            s = s + a + np.random.randn(2) * 0.02  # small noise
    return np.array(states), np.array(actions)

# Simple linear policy for behavioral cloning
class LinearBC:
    def __init__(self, state_dim, action_dim):
        self.W = np.zeros((action_dim, state_dim))
        self.b = np.zeros(action_dim)

    def predict(self, state):
        return self.W @ state + self.b

    def train(self, states, actions, lr=0.1, epochs=200):
        for _ in range(epochs):
            pred = states @ self.W.T + self.b
            error = pred - actions
            self.W -= lr * (error.T @ states) / len(states)
            self.b -= lr * error.mean(axis=0)

# Collect data and train
np.random.seed(42)
states, actions = collect_expert_demos()
bc = LinearBC(2, 2)
bc.train(states, actions)

# Evaluate: roll out the BC policy
s = np.array([0.0, 0.0])
bc_trajectory = [s.copy()]
for t in range(30):
    a = bc.predict(s)
    s = s + a + np.random.randn(2) * 0.02
    bc_trajectory.append(s.copy())
bc_traj = np.array(bc_trajectory)

print(f"BC final position: ({bc_traj[-1, 0]:.2f}, {bc_traj[-1, 1]:.2f})")
print(f"Expert final pos:  ({6.00:.2f}, {np.sin(6.0 * 2):.2f})")
# BC final position: (6.07, -0.78)
# Expert final pos:  (6.00, -0.54)

The BC agent reaches roughly the right area — the x position is close (6.07 vs 6.00), though it overshoots in y. On the training distribution — states the expert visited — it works reasonably well. But there's a subtle, devastating problem lurking beneath this apparent success.

3. The Distributional Shift Problem

Here's the core insight that makes imitation learning its own field, not just a footnote in supervised learning: behavioral cloning trains on the expert's state distribution but tests on its own.

Think about it. The expert visits states s1, s2, ..., sT along a smooth trajectory. BC learns π(a|s) from these states. But at test time, the BC agent makes a small error at step 1, which puts it in a slightly different state at step 2 — a state the expert never visited. The policy has no training signal for this state and makes a larger error, pushing it further off the expert's path. By step 10, the agent is in completely unfamiliar territory.

Ross, Gordon, and Bagnell formalized this in their landmark 2011 paper. If the per-step error of the BC policy is ε, then over a horizon of T steps:

Behavioral Cloning error: O(T²ε) — errors compound quadratically

Online methods (DAgger): O(Tε) — errors grow only linearly

The quadratic growth is devastating for long horizons. It's like learning to drive by watching dashcam footage — you learn what to do from the expert's perspective, but the moment you steer slightly wrong, you're in an unfamiliar state with no training signal, and things spiral from there.
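Plugging numbers into the two bounds makes the gap concrete (a back-of-the-envelope sketch; the constants hidden in the big-O are ignored here):

```python
eps = 0.01  # assumed per-step error rate
for T in [10, 50, 100, 500]:
    bc_bound = T**2 * eps   # behavioral cloning: O(T^2 eps)
    da_bound = T * eps      # DAgger-style online learning: O(T eps)
    print(f"T={T:4d}   BC bound ~ {bc_bound:7.1f}   DAgger bound ~ {da_bound:5.1f}")
```

At T = 500 the BC bound is 500 times worse than the online one — the ratio between the two bounds is always T itself.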

def rollout(policy_fn, horizon=30, noise=0.02):
    """Roll out a policy and measure error vs expert."""
    s = np.array([0.0, 0.0])
    errors = []
    for t in range(horizon):
        expert_a = expert_policy(s)
        agent_a = policy_fn(s)
        errors.append(np.linalg.norm(agent_a - expert_a))
        s = s + agent_a + np.random.randn(2) * noise
    return errors

# Compare BC errors at different horizons
np.random.seed(99)
bc_errors = rollout(bc.predict, horizon=50, noise=0.03)
cumulative = np.cumsum(bc_errors)

print("Cumulative BC error at different horizons:")
for t in [10, 20, 30, 40, 50]:
    print(f"  T={t:2d}: cumulative error = {cumulative[t-1]:.3f}")
# Cumulative BC error at different horizons:
#   T=10: cumulative error = 2.249
#   T=20: cumulative error = 5.132
#   T=30: cumulative error = 7.746
#   T=40: cumulative error = 11.542
#   T=50: cumulative error = 15.494

The cumulative error accelerates: each 10-step segment accumulates more error than the previous one (2.2, then 2.9, then 2.6, then 3.8, then 4.0). This upward trend reflects the compounding effect predicted by the O(T²) bound. With a simple linear model the growth is moderate, but with more expressive models (neural networks in practice), the compounding becomes devastating. This isn't a bug — it's a fundamental property of behavioral cloning.

4. DAgger — Dataset Aggregation

If the problem is that BC never trains on its own states, the fix is elegant: let the learner act, then ask the expert what it should have done.

This is DAgger (Dataset Aggregation), introduced by Ross, Gordon, and Bagnell in 2011. The algorithm is beautifully simple:

  1. Initialize dataset D with expert demonstrations
  2. Train policy πθ on D
  3. Roll out πθ to collect new states s1, ..., sT
  4. Query the expert for the correct action at each st
  5. Add (st, aexpert(st)) to D
  6. Go to step 2

Each round adds training data from the learner's actual state distribution. After a few rounds, the policy has seen both the expert's states and its own off-policy states, closing the distribution gap. DAgger reduces to no-regret online learning, guaranteeing that after enough rounds, the policy matches the best in its function class.

The practical cost: DAgger requires an interactive expert who can label new states on demand. You can't just use a dataset of recorded demonstrations — you need the expert (or a simulator acting as one) in the loop.

def dagger(expert_fn, n_rounds=8, horizon=30, demos_per_round=5):
    """DAgger: iteratively aggregate expert data on learner states."""
    # Round 0: collect expert demos
    all_states, all_actions = collect_expert_demos(n_trajectories=demos_per_round)
    policy = LinearBC(2, 2)
    round_errors = []

    for r in range(n_rounds):
        # Train on all aggregated data
        policy.train(all_states, all_actions)

        # Roll out current policy, query expert at visited states
        new_states, new_actions = [], []
        total_err = 0
        for _ in range(demos_per_round):
            s = np.array([0.0, np.random.randn() * 0.1])
            for t in range(horizon):
                agent_a = policy.predict(s)
                expert_a = expert_fn(s)  # query expert at learner's state
                new_states.append(s.copy())
                new_actions.append(expert_a)
                total_err += np.linalg.norm(agent_a - expert_a)
                s = s + agent_a + np.random.randn(2) * 0.02
        avg_err = total_err / (demos_per_round * horizon)
        round_errors.append(avg_err)

        # Aggregate new data
        all_states = np.vstack([all_states, np.array(new_states)])
        all_actions = np.vstack([all_actions, np.array(new_actions)])

    return policy, round_errors

np.random.seed(42)
_, dagger_errors = dagger(expert_policy)
print("DAgger average error per round:")
for r, e in enumerate(dagger_errors):
    print(f"  Round {r+1}: {e:.4f}")
# DAgger average error per round:
#   Round 1: 0.2535
#   Round 2: 0.2482
#   Round 3: 0.2464
#   Round 4: 0.2499
#   Round 5: 0.2468
#   Round 6: 0.2468
#   Round 7: 0.2446
#   Round 8: 0.2439

The error decreases with each DAgger round as the policy sees more of its own states with expert corrections. The improvement is modest here because a linear model has limited capacity — it can't fully capture the nonlinear expert, so both BC and DAgger converge to similar approximations. In practice with neural networks (where overfitting to the training distribution is the bottleneck), DAgger's improvement is dramatic. The key insight remains: by training on the learner's own states, DAgger prevents the compounding drift that plagues pure behavioral cloning.

5. Inverse Reinforcement Learning

Behavioral cloning and DAgger copy the expert's actions. But there's a deeper question: can we recover the expert's intent? What reward function were they implicitly optimizing?

This is Inverse Reinforcement Learning (IRL), introduced in its modern form by Abbeel and Ng in 2004. Instead of learning π(a|s) directly, we learn a reward function R(s) such that the expert's behavior is optimal under R. A reward function is more portable than a policy — it can transfer to new environments, new dynamics, new agent morphologies.

The challenge is ambiguity: many reward functions explain the same behavior. The zero reward R(s) = 0 makes any policy "optimal." We need to constrain the problem. Abbeel and Ng's insight: represent rewards as linear combinations of features, R(s) = w · φ(s), and find weights w such that the policy's feature expectations match the expert's:

Eπ*[Σt φ(st)] ≈ Eπexpert[Σt φ(st)]

def feature_irl(expert_trajectories, feature_fn, n_features,
                grid_size=5, n_iters=100, lr=0.2):
    """Feature-matching IRL: learn reward weights from expert demos."""
    # Compute expert feature expectations
    expert_feat = np.zeros(n_features)
    for traj in expert_trajectories:
        for s in traj:
            expert_feat += feature_fn(s)
    expert_feat /= len(expert_trajectories)

    # Learn reward weights by matching feature expectations
    w = np.zeros(n_features)
    for iteration in range(n_iters):
        R = np.zeros((grid_size, grid_size))
        for r in range(grid_size):
            for c in range(grid_size):
                R[r, c] = w @ feature_fn((r, c))

        # Boltzmann state distribution: P(s) ∝ exp(R(s))
        flat_R = R.flatten()
        probs = np.exp(flat_R - flat_R.max())
        probs /= probs.sum()

        # Expected features under current reward
        policy_feat = np.zeros(n_features)
        for r in range(grid_size):
            for c in range(grid_size):
                policy_feat += probs[r * grid_size + c] * feature_fn((r, c))

        # Gradient: expert features - policy features
        w += lr * (expert_feat - policy_feat)

    # Recompute final reward grid
    R = np.zeros((grid_size, grid_size))
    for r in range(grid_size):
        for c in range(grid_size):
            R[r, c] = w @ feature_fn((r, c))
    return w, R

# Setup: 5x5 grid, expert navigates to (4,4)
def make_features(grid_size=5):
    """Features: row, column, proximity to goal."""
    goal = (grid_size-1, grid_size-1)
    max_d = np.sqrt(2) * (grid_size - 1)
    def feat(s):
        d = np.sqrt((s[0]-goal[0])**2 + (s[1]-goal[1])**2)
        return np.array([s[0]/(grid_size-1), s[1]/(grid_size-1), 1 - d/max_d])
    return feat, 3

feat_fn, n_feat = make_features()

# Expert trajectories: go from (0,0) to (4,4)
expert_trajs = [
    [(0,0),(1,1),(2,2),(3,3),(4,4)],
    [(0,0),(0,1),(1,2),(2,3),(3,4),(4,4)],
    [(0,0),(1,0),(2,1),(3,2),(4,3),(4,4)],
]

np.random.seed(42)
w_learned, R_learned = feature_irl(expert_trajs, feat_fn, n_feat)
print("Learned reward weights:", np.round(w_learned, 3))
# Normalize for display
R_n = (R_learned - R_learned.min()) / (R_learned.max() - R_learned.min())
R_d = R_n * 6 - 5
print("\nRecovered reward grid (normalized):")
for r in range(5):
    print("  ", [f"{R_d[r,c]:+.2f}" for c in range(5)])
# Learned reward weights: [37.194 37.194 36.061]
#
# Recovered reward grid (normalized):
#   ['-5.00', '-4.27', '-3.58', '-2.95', '-2.41']
#   ['-4.27', '-3.50', '-2.77', '-2.12', '-1.55']
#   ['-3.58', '-2.77', '-2.00', '-1.29', '-0.70']
#   ['-2.95', '-2.12', '-1.29', '-0.50', '+0.15']
#   ['-2.41', '-1.55', '-0.70', '+0.15', '+1.00']

The IRL algorithm recovers a reward function that peaks at (4,4) — the expert's goal — with a smooth gradient decreasing toward (0,0). All three feature weights are positive and large, meaning the algorithm learned "go to high rows, high columns, and stay close to the goal." But with different expert demonstrations, different reward functions could emerge — the ambiguity problem remains. We need a principled way to choose among valid reward functions.

6. Maximum Entropy IRL

Ziebart et al. (2008) resolved the reward ambiguity with an elegant principle: among all reward functions consistent with the expert, choose the one that makes the expert's behavior most likely while being maximally uncertain about everything else. This is the maximum entropy principle.

Under MaxEnt IRL, the expert's trajectory distribution is a Boltzmann distribution:

P(τ) ∝ exp(Σt R(st))

High-reward trajectories are exponentially more probable. The log-likelihood gradient has a beautiful form:

w L = Eexpert[φ] − Epolicy[φ]

We push the reward weights to make the expert's feature expectations more likely and the policy's current feature expectations less likely. This is identical in spirit to contrastive divergence in energy-based models — the expert is the data, the policy is the model.
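Before the full algorithm, the Boltzmann trajectory distribution itself is worth a toy sanity check: under P(τ) ∝ exp(Σt R(st)), the odds between two trajectories depend only on the difference of their returns. A small sketch with hypothetical returns:

```python
import numpy as np

returns = np.array([3.0, 2.0, 1.0])  # hypothetical returns for 3 trajectories
# Boltzmann distribution over trajectories, shifted for numerical stability
probs = np.exp(returns - returns.max())
probs /= probs.sum()
print(np.round(probs, 3))
# Odds ratio depends only on the return gap: exp(3 - 1) = e^2
print(probs[0] / probs[2])  # ≈ 7.389
```

A return gap of 2 makes the better trajectory e² ≈ 7.4 times more likely — "exponentially more probable" in the literal sense.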

def maxent_irl(expert_trajectories, feature_fn, n_features,
               grid_size=5, gamma=0.9, n_iters=100, lr=0.1):
    """Maximum Entropy IRL with soft value iteration."""
    expert_feat = np.zeros(n_features)
    for traj in expert_trajectories:
        for t, s in enumerate(traj):
            expert_feat += (gamma ** t) * feature_fn(s)
    expert_feat /= len(expert_trajectories)

    w = np.zeros(n_features)
    actions = [(-1,0),(1,0),(0,-1),(0,1),(0,0)]
    for iteration in range(n_iters):
        # Compute reward for each state
        R = np.zeros((grid_size, grid_size))
        for r in range(grid_size):
            for c in range(grid_size):
                R[r, c] = w @ feature_fn((r, c))

        # Soft value iteration (Bellman backup with log-sum-exp)
        V = np.zeros((grid_size, grid_size))
        for _ in range(30):
            V_new = np.full_like(V, -1e9)
            for r in range(grid_size):
                for c in range(grid_size):
                    vals = []
                    for dr, dc in actions:
                        nr = max(0, min(grid_size-1, r+dr))
                        nc = max(0, min(grid_size-1, c+dc))
                        vals.append(R[r,c] + gamma * V[nr, nc])
                    max_v = max(vals)
                    V_new[r,c] = max_v + np.log(
                        sum(np.exp(v - max_v) for v in vals))
            V = V_new

        # Softmax policy rollouts for feature expectations
        policy_feat = np.zeros(n_features)
        np.random.seed(iteration * 17 + 7)
        for _ in range(30):
            s = (0, 0)
            for t in range(15):
                r, c = s
                q_vals, next_states = [], []
                for dr, dc in actions:
                    nr = max(0, min(grid_size-1, r+dr))
                    nc = max(0, min(grid_size-1, c+dc))
                    q_vals.append(R[r,c] + gamma * V[nr, nc])
                    next_states.append((nr, nc))
                q_arr = np.array(q_vals)
                probs = np.exp(q_arr - q_arr.max())
                probs /= probs.sum()
                idx = np.random.choice(len(actions), p=probs)
                s = next_states[idx]
                policy_feat += (gamma ** t) * feature_fn(s)
        policy_feat /= 30
        w += lr * (expert_feat - policy_feat)

    # Recompute final reward grid
    R = np.zeros((grid_size, grid_size))
    for r in range(grid_size):
        for c in range(grid_size):
            R[r, c] = w @ feature_fn((r, c))
    return w, R

np.random.seed(42)
w_maxent, R_maxent = maxent_irl(expert_trajs, feat_fn, n_feat)
print("MaxEnt reward weights:", np.round(w_maxent, 3))
R_n = (R_maxent - R_maxent.min()) / (R_maxent.max() - R_maxent.min())
R_d = R_n * 6 - 5
print("\nMaxEnt recovered reward grid (normalized):")
for r in range(5):
    print("  ", [f"{R_d[r,c]:+.2f}" for c in range(5)])
# MaxEnt reward weights: [-0.413 -0.289  1.181]
#
# MaxEnt recovered reward grid (normalized):
#   ['-4.27', '-3.55', '-3.13', '-3.13', '-3.64']
#   ['-3.89', '-2.95', '-2.28', '-2.06', '-2.48']
#   ['-3.82', '-2.62', '-1.63', '-1.07', '-1.32']
#   ['-4.15', '-2.74', '-1.41', '-0.32', '-0.16']
#   ['-5.00', '-3.50', '-2.00', '-0.50', '+1.00']

MaxEnt IRL produces a different reward landscape: the proximity-to-goal feature dominates (weight 1.181), while the row and column features have small negative weights. This means MaxEnt learned that closeness to (4,4) is what matters, rather than just "go to high rows and columns." The maximum entropy principle resolves the ambiguity — it picks the reward function that makes the expert's behavior most likely while being maximally noncommittal about everything else.

This is the intellectual ancestor of modern reward modeling in RLHF: learn what humans want from their behavior, then optimize a policy to maximize that learned reward.

7. Modern Connections

Imitation learning's ideas permeate modern AI, often under different names:

  - Supervised fine-tuning (SFT) of language models is behavioral cloning: the human writer is the expert, the context so far is the state, and the next token is the action.
  - Reward modeling in RLHF is inverse reinforcement learning in spirit: infer what humans want from their behavior, then optimize a policy against the learned reward.
  - Collecting corrections on the states a deployed model actually visits — rather than only on pre-recorded data — echoes DAgger's interactive-expert loop.

For deep dives into these downstream techniques, see our posts on RLHF from scratch and DPO from scratch.

8. Conclusion

We've built the full imitation learning progression: behavioral cloning as simple supervised learning, the distributional shift problem that makes BC fail catastrophically over long horizons (O(T²) errors), DAgger's elegant fix of querying the expert on the learner's own states, and inverse reinforcement learning's deeper ambition of recovering the expert's hidden reward function, with MaxEnt IRL providing a principled resolution to reward ambiguity.

The central insight is that learning from demonstrations is not just supervised learning. The sequential nature of decision-making creates a unique challenge — distributional shift — that has no analogue in i.i.d. classification. And the opportunity to recover intent rather than just behavior opens the door to transfer, generalization, and ultimately alignment.

Every fine-tuned LLM begins with behavioral cloning. Every RLHF pipeline begins with inverse RL. Imitation learning isn't a niche topic — it's the foundation of modern AI alignment.

References & Further Reading

  - Ross, Gordon & Bagnell (2011). "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning." AISTATS. (DAgger)
  - Abbeel & Ng (2004). "Apprenticeship Learning via Inverse Reinforcement Learning." ICML.
  - Ziebart, Maas, Bagnell & Dey (2008). "Maximum Entropy Inverse Reinforcement Learning." AAAI.
