
Reinforcement Learning from Scratch: Teaching Machines to Learn from Consequences

The Third Paradigm

Every other post in this series hands the model a dataset. Supervised learning gets labels: "this image is a cat." Self-supervised learning manufactures its own labels from augmentations and masking. Generative models learn the shape of a distribution. But there's a third paradigm we haven't touched — one where there is no dataset at all.

In reinforcement learning, an agent wanders through a world, takes actions, and occasionally stumbles into a reward. Nobody tells it the right answer. Nobody hands it a gradient pointing toward "cat." It has to figure out, from sparse and delayed consequences, which of its past decisions were good and which were terrible. The data isn't given — the agent creates it through interaction.

This isn't just academic completeness. If you read the RLHF post in this series, you saw terms like policy, advantage, PPO, and KL penalty flying around. That post assumed you understood the RL machinery underneath. This post provides those foundations. By the end, every equation in the RLHF post will feel like an old friend.

We'll build the entire framework from scratch in pure Python and NumPy: Markov decision processes, value functions, Q-learning, exploration strategies, policy gradients, and actor-critic methods. Along the way, you'll watch an agent learn to navigate a gridworld, see three exploration strategies compete on a bandit problem, and train a policy network to balance a pole.

Markov Decision Processes — The Language of RL

Before an agent can learn anything, we need a mathematical framework for "an agent interacting with a world." That framework is the Markov Decision Process (MDP), and every RL algorithm operates within it.

An MDP is defined by five components:

- S, the set of states the world can be in
- A, the set of actions the agent can take
- P(s′|s, a), the transition probabilities: where each action leads from each state
- R(s, a), the reward function: the scalar feedback signal
- γ ∈ [0, 1), the discount factor: how much the future matters relative to now

The Markov property is the key assumption: the future depends only on the current state, not on how we got here. The entire history of the episode is compressed into the current state. This makes the math tractable — we only need to reason about "where am I now?", not "what was my full trajectory?"

The agent's goal is to find a policy π(a|s) — a mapping from states to action probabilities — that maximizes the expected return:

G_t = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{k=0}^{∞} γ^k r_{t+k}

This is the discounted sum of all future rewards. The discount factor γ ensures this sum converges (it's a geometric series) and encodes a preference for sooner rewards over later ones — a dollar today beats a dollar tomorrow.
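A couple of lines make the discounting concrete; the reward sequences here are invented for illustration:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t at t=0: sum of gamma^k * r_k over the episode."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0]))        # reward two steps away: ~0.98
print(discounted_return([0.0, 0.0, 1.0], 0.5))   # heavier discounting: 0.25
```

The same goal reached later is worth less, which is exactly the pressure that will push the gridworld agent toward the shortest path.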

Let's build a concrete MDP — a simple gridworld:

import numpy as np

class Gridworld:
    """A 5x5 grid: agent navigates from start to goal, avoiding walls."""
    def __init__(self):
        self.size = 5
        self.start = (0, 0)
        self.goal = (4, 4)
        self.walls = {(1, 1), (2, 1), (3, 1), (1, 3), (2, 3)}
        self.actions = {0: (-1, 0), 1: (1, 0),   # up, down
                        2: (0, -1), 3: (0, 1)}    # left, right
        self.state = self.start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.actions[action]
        nr, nc = self.state[0] + dr, self.state[1] + dc
        # Stay in place if hitting a wall or going out of bounds
        if (0 <= nr < self.size and 0 <= nc < self.size
                and (nr, nc) not in self.walls):
            self.state = (nr, nc)
        reward = 1.0 if self.state == self.goal else -0.01
        done = self.state == self.goal
        return self.state, reward, done

    def render(self):
        for r in range(self.size):
            row = ""
            for c in range(self.size):
                if (r, c) == self.state:    row += " A "
                elif (r, c) == self.goal:   row += " G "
                elif (r, c) in self.walls:  row += " # "
                else:                       row += " . "
            print(row)

This is a complete MDP. States are (row, col) tuples. Actions are integers 0-3. The transition is deterministic — move in the chosen direction unless you'd hit a wall. Reward is +1 at the goal, -0.01 per step. The agent's job: find the shortest path from (0,0) to (4,4) around the walls.

Value Functions and the Bellman Equation

Before we can learn a good policy, we need to measure how good a state is. The state-value function Vπ(s) tells us the expected return from state s if we follow policy π:

V_π(s) = E_π[G_t | S_t = s] = E_π[r_t + γ V_π(S_{t+1}) | S_t = s]

Read that second form carefully — it's recursive. The value of a state equals the immediate reward plus the discounted value of wherever you end up. This is the Bellman equation, and it's the beating heart of RL.

There's also the action-value function Qπ(s, a) — the expected return from state s if we take action a and then follow π. This is the Q the RLHF post computes. While V tells you "how good is this state?", Q tells you "how good is this specific action in this state?"
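The two are linked by the policy itself: V is the policy-weighted average of Q. A quick check with made-up numbers:

```python
import numpy as np

q_values = np.array([1.0, 0.5, -0.2, 0.3])  # Q(s, a) for 4 actions (illustrative)
policy = np.array([0.7, 0.1, 0.1, 0.1])     # pi(a|s), must sum to 1

# V(s) = sum_a pi(a|s) * Q(s, a)
v = policy @ q_values
print(v)  # 0.7*1.0 + 0.1*0.5 + 0.1*(-0.2) + 0.1*0.3 ≈ 0.76
```

A greedy policy puts all its mass on the best action, collapsing V(s) to max_a Q(s, a), the max that appears in the optimality equation below.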

When we ask for the optimal value function V*(s) — the best return achievable from any policy — the Bellman equation becomes the Bellman optimality equation:

V*(s) = max_a [R(s, a) + γ Σ_{s′} P(s′|s, a) V*(s′)]

The intuition: the value of a state under the best possible policy is the best action's reward plus the discounted value of the resulting state. Pick the best action at every step, and you get the optimal value.

Value iteration exploits this equation directly. Start with V(s) = 0 everywhere, then repeatedly apply the Bellman update until the values converge:

def value_iteration(env, gamma=0.99, theta=1e-6):
    """Find optimal value function by iterating the Bellman equation."""
    V = np.zeros((env.size, env.size))

    while True:
        delta = 0
        for r in range(env.size):
            for c in range(env.size):
                if (r, c) in env.walls or (r, c) == env.goal:
                    continue
                old_v = V[r, c]
                # Try each action, pick the best
                action_values = []
                for a in range(4):
                    dr, dc = env.actions[a]
                    nr = max(0, min(env.size - 1, r + dr))
                    nc = max(0, min(env.size - 1, c + dc))
                    if (nr, nc) in env.walls:
                        nr, nc = r, c  # bounce back
                    reward = 1.0 if (nr, nc) == env.goal else -0.01
                    action_values.append(reward + gamma * V[nr, nc])
                V[r, c] = max(action_values)
                delta = max(delta, abs(old_v - V[r, c]))
        if delta < theta:
            break

    # Extract policy: at each state, pick the best action
    policy = np.zeros((env.size, env.size), dtype=int)
    for r in range(env.size):
        for c in range(env.size):
            if (r, c) in env.walls or (r, c) == env.goal:
                continue
            best_a, best_val = 0, -float('inf')
            for a in range(4):
                dr, dc = env.actions[a]
                nr = max(0, min(env.size - 1, r + dr))
                nc = max(0, min(env.size - 1, c + dc))
                if (nr, nc) in env.walls:
                    nr, nc = r, c
                reward = 1.0 if (nr, nc) == env.goal else -0.01
                val = reward + gamma * V[nr, nc]
                if val > best_val:
                    best_a, best_val = a, val
            policy[r, c] = best_a
    return V, policy

env = Gridworld()
V, policy = value_iteration(env)
# V now contains the optimal value at each cell
# policy[r, c] gives the best action (0=up, 1=down, 2=left, 3=right)

After convergence, the value function forms a gradient pointing toward the goal — cells near the goal have high values (~1.0), cells far away have lower values, and the optimal policy at each cell simply follows the uphill direction. This is the same principle behind gradient ascent, except in state space instead of parameter space.

But there's a catch: value iteration requires knowing P(s′|s, a) — the full transition model. We had to hardcode the gridworld physics. What if we don't know how the world works?

Q-Learning — Learning Without a Model

Value iteration assumes we know the rules of the game — every transition probability, every reward. But in most real problems, the agent doesn't have a manual. It has to learn from experience, one step at a time.

Q-learning (Watkins, 1989) solves this by learning the action-value function Q(s, a) directly from interactions. No model needed. The update rule is beautifully simple:

Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]

Let's unpack this. The term in brackets is the temporal difference (TD) error:

δ = r + γ max_{a′} Q(s′, a′) − Q(s, a)

This is the surprise signal. The agent predicted Q(s, a) for this state-action pair, but the actual outcome was r + γ max Q(s′, a′). If δ > 0, the outcome was better than expected — increase Q. If δ < 0, worse than expected — decrease Q. The learning rate α controls how much to adjust.

If you've read the loss functions post, this should look familiar. The TD error is essentially a loss — we're minimizing the gap between our prediction Q(s, a) and the "target" r + γ max Q(s′, a′). The difference from supervised learning? The target moves as Q improves. We're chasing a moving goalpost.
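A single update with concrete numbers (all invented) shows the moving parts:

```python
import numpy as np

alpha, gamma = 0.1, 0.99
q_sa = 0.5                                # current estimate Q(s, a)
reward = -0.01                            # observed step reward
q_next = np.array([0.2, 0.8, 0.1, 0.4])   # current estimates Q(s', .)

td_target = reward + gamma * q_next.max()  # -0.01 + 0.99 * 0.8 = 0.782
td_error = td_target - q_sa                # 0.282: better than predicted
q_sa = q_sa + alpha * td_error
print(round(q_sa, 4))                      # 0.5282: nudged toward the target
```

Only a fraction α of the surprise is applied, so one lucky transition can't overwrite the estimate.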

Q-learning has two crucial properties. It's model-free — no transition probabilities needed, just interact and observe. And it's off-policy — it can learn the optimal policy even while following a different one (like a random explorer). Here's the full training loop:

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Train a Q-table on the gridworld using Q-learning."""
    # Q-table: state (r, c) x action (4 directions)
    Q = np.zeros((env.size, env.size, 4))

    for episode in range(episodes):
        state = env.reset()
        done = False

        while not done:
            r, c = state
            # Epsilon-greedy: explore with prob epsilon, exploit otherwise
            if np.random.random() < epsilon:
                action = np.random.randint(4)
            else:
                action = np.argmax(Q[r, c])

            next_state, reward, done = env.step(action)
            nr, nc = next_state

            # The Q-learning update
            td_target = reward + gamma * np.max(Q[nr, nc]) * (1 - done)
            td_error = td_target - Q[r, c, action]
            Q[r, c, action] += alpha * td_error

            state = next_state

    return Q

env = Gridworld()
Q = q_learning(env, episodes=1000)
# Extract the learned policy
policy = np.argmax(Q, axis=2)  # Best action at each state
# action_names = {0: '↑', 1: '↓', 2: '←', 3: '→'}

After 1000 episodes of bumbling around, the Q-table converges to the optimal action-value function. At each cell, argmax Q[r, c] gives the best action. The agent has learned the shortest path to the goal — without ever being told how the gridworld works.
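To see the result, follow the argmax action from cell to cell. The helper below is a hypothetical utility, not part of the environment; it skips wall handling, since a converged policy never walks into one:

```python
import numpy as np

def greedy_path(Q, start, goal, actions, size, max_steps=50):
    """Follow argmax-Q actions from start; return the visited cells."""
    state, path = start, [start]
    for _ in range(max_steps):
        if state == goal:
            break
        r, c = state
        dr, dc = actions[int(np.argmax(Q[r, c]))]
        state = (min(size - 1, max(0, r + dr)),
                 min(size - 1, max(0, c + dc)))
        path.append(state)
    return path

# e.g. greedy_path(Q, env.start, env.goal, env.actions, env.size)
# should trace the shortest route from (0, 0) to (4, 4)
```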

Try It: Gridworld Q-Learning

Watch a Q-learning agent discover the optimal path through a 7×7 grid. Each cell shows four directional arrows — brighter green means higher Q-value. The agent (blue) learns which directions lead to the goal (gold).


The Exploration-Exploitation Dilemma

You've probably noticed the epsilon parameter lurking in the Q-learning code. This controls a fundamental tension in RL: exploration vs. exploitation.

Should the agent go to the restaurant it knows is good (exploit), or try the new place that might be better (explore)? Exploit too much and you miss better options. Explore too much and you waste time on bad choices. Every RL agent must navigate this tradeoff.

Three classic strategies:

ε-greedy: With probability ε, take a random action. Otherwise, take the greedy (best-known) action. Simple and widely used. The downside: it explores uniformly — a clearly terrible action gets the same exploration probability as a promising one.

Softmax (Boltzmann) exploration: Choose actions proportionally to their estimated value using — you guessed it — the softmax function:

P(a) = exp(Q(s, a) / τ) / Σ_{a′} exp(Q(s, a′) / τ)

If you've read the softmax and temperature post, this is exactly the same mechanism. Temperature τ controls the tradeoff: high τ → uniform (explore), low τ → peaked (exploit). The agent explores more where Q-values are uncertain, less where one action is clearly dominant.

Upper Confidence Bound (UCB): Be optimistic about actions you haven't tried much. Select the action that maximizes Q(s, a) + c ⋅ √(ln(t) / N(s, a)), where N(s, a) counts how many times we've tried action a in state s. Rarely-tried actions get a bonus that shrinks as we gather more data. This naturally balances exploration (try the uncertain) with exploitation (prefer the known-good).

def epsilon_greedy(Q, state, epsilon):
    """Random action with prob epsilon, greedy otherwise."""
    if np.random.random() < epsilon:
        return np.random.randint(len(Q[state]))
    return np.argmax(Q[state])

def softmax_explore(Q, state, tau=1.0):
    """Sample action proportionally to exp(Q/tau)."""
    q = Q[state]
    q_shifted = q - np.max(q)  # numerical stability
    probs = np.exp(q_shifted / tau) / np.sum(np.exp(q_shifted / tau))
    return np.random.choice(len(q), p=probs)

def ucb_select(Q, N, state, t, c=2.0):
    """Select action maximizing Q + exploration bonus."""
    n = N[state]
    # Avoid division by zero for untried actions (infinite bonus)
    bonus = c * np.sqrt(np.log(t + 1) / (n + 1e-8))
    return np.argmax(Q[state] + bonus)

# --- Multi-armed bandit comparison ---
n_arms = 5
true_rewards = np.random.randn(n_arms) * 0.5 + 1.0  # hidden means
n_pulls = 1000

STATE = slice(None)  # bandits have a single state, so index the whole table

for strategy_name, select_fn in [("e-greedy", "eg"),
                                 ("softmax", "sm"),
                                 ("UCB", "ucb")]:
    Q_est = np.zeros(n_arms)
    N_count = np.zeros(n_arms)
    total_reward = 0

    for t in range(n_pulls):
        if select_fn == "eg":
            arm = epsilon_greedy(Q_est, STATE, epsilon=0.1)
        elif select_fn == "sm":
            arm = softmax_explore(Q_est, STATE, tau=0.5)
        else:
            arm = ucb_select(Q_est, N_count, STATE, t)

        reward = np.random.randn() * 0.3 + true_rewards[arm]
        N_count[arm] += 1
        Q_est[arm] += (reward - Q_est[arm]) / N_count[arm]  # running mean
        total_reward += reward

    best_possible = true_rewards.max() * n_pulls
    regret = best_possible - total_reward
    print(f"{strategy_name:>10}: reward={total_reward:.1f}, regret={regret:.1f}")

The multi-armed bandit is the purest form of the exploration-exploitation problem — no states, no transitions, just pick an arm and get a reward. UCB typically wins in theory (provably optimal regret bounds), but ε-greedy's simplicity makes it the default in practice. And softmax gives you the temperature knob — the same one that controls creativity in language model decoding.
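One practical refinement: rather than a fixed ε, most implementations anneal it, exploring heavily while the Q-values are still noise and exploiting once they firm up. A typical exponential schedule (the constants here are illustrative, not canonical):

```python
def epsilon_schedule(episode, eps_start=1.0, eps_end=0.05, decay=0.995):
    """Exponential decay from eps_start toward an eps_end floor."""
    return eps_end + (eps_start - eps_end) * decay ** episode

for ep in [0, 100, 500, 2000]:
    print(ep, round(epsilon_schedule(ep), 3))
# epsilon falls from 1.0 toward the 0.05 floor as training progresses
```

Keeping a nonzero floor matters: with ε = 0 exactly, the agent can lock onto a suboptimal path and never see the evidence against it.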

Try It: Exploration Strategies

Five slot machines with hidden reward distributions. Three strategies compete in real time — watch which one finds the best arm fastest and accumulates the most reward. Regret measures the cost of not always picking the best arm.


Policy Gradients — Learning the Policy Directly

Q-learning stores a value for every (state, action) pair in a table. That works for a 5×5 grid with 4 actions (100 entries). But what about Atari (millions of pixel states)? Or a robot with continuous joint angles? The Q-table explodes.

Policy gradient methods take a different approach: instead of learning values and deriving a policy, learn the policy directly as a neural network πθ(a|s) that outputs action probabilities given a state. This is the πθ from the RLHF post — now you'll see where it comes from.

The objective is straightforward: maximize expected cumulative reward:

J(θ) = E_{π_θ}[Σ_t γ^t r_t]

The magic is in the gradient. The policy gradient theorem tells us:

∇_θ J(θ) = E[∇_θ log π_θ(a_t|s_t) ⋅ G_t]

This is remarkable. To increase expected reward, adjust the policy parameters in the direction of ∇_θ log π weighted by the return G_t. Actions that led to high return get their probability increased. Actions that led to low return get decreased. The gradient of the log probability, ∇ log π, shows up here and everywhere in the RLHF post — it's the fundamental currency of policy optimization.

The REINFORCE algorithm (Williams, 1992) implements this directly: run a complete episode, compute returns, update the policy:

class PolicyNetwork:
    """A simple 2-layer policy network for discrete actions."""
    def __init__(self, state_dim, hidden_dim, n_actions, lr=0.01):
        # He initialization (sqrt(2/fan_in)), scaled for the ReLU that follows
        self.W1 = np.random.randn(state_dim, hidden_dim) * np.sqrt(2.0 / state_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, n_actions) * np.sqrt(2.0 / hidden_dim)
        self.b2 = np.zeros(n_actions)
        self.lr = lr

    def forward(self, state):
        """Forward pass: state -> action probabilities."""
        self.z1 = state @ self.W1 + self.b1
        self.h1 = np.maximum(0, self.z1)  # ReLU
        self.logits = self.h1 @ self.W2 + self.b2
        # Softmax for action probabilities
        exp_logits = np.exp(self.logits - np.max(self.logits))
        self.probs = exp_logits / exp_logits.sum()
        return self.probs

    def select_action(self, state):
        """Sample an action from the policy distribution."""
        probs = self.forward(state)
        action = np.random.choice(len(probs), p=probs)
        return action

    def update(self, states, actions, returns):
        """REINFORCE update: increase prob of actions with high returns."""
        for state, action, G in zip(states, actions, returns):
            probs = self.forward(state)
            # Gradient of log pi(a|s) for the chosen action
            # d/d(logits) log(softmax(logits)[a]) = (one_hot(a) - probs)
            grad_logits = -probs.copy()
            grad_logits[action] += 1.0  # one_hot - softmax

            # Scale by return: good returns -> increase this action's prob
            grad_logits *= G

            # Backprop through the network
            grad_W2 = self.h1.reshape(-1, 1) @ grad_logits.reshape(1, -1)
            grad_b2 = grad_logits
            grad_h1 = grad_logits @ self.W2.T
            grad_z1 = grad_h1 * (self.z1 > 0)  # ReLU gradient
            grad_W1 = state.reshape(-1, 1) @ grad_z1.reshape(1, -1)
            grad_b1 = grad_z1

            # Gradient ascent (we want to maximize return)
            self.W2 += self.lr * grad_W2
            self.b2 += self.lr * grad_b2
            self.W1 += self.lr * grad_W1
            self.b1 += self.lr * grad_b1

The key line is grad_logits[action] += 1.0 — this computes the gradient of log π(a|s) with respect to the logits. It's the same softmax derivative from the softmax post, now serving a completely different purpose: steering a policy instead of classifying images.
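What the class doesn't include is the driver. REINFORCE needs the returns-to-go for each step, computed backward through the episode, plus a rollout loop. The sketch below assumes a Gym-style environment whose `reset()` returns a state and whose `step(action)` returns `(state, reward, done)`; the environment itself is not defined here:

```python
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """Returns-to-go: G_t = r_t + gamma * G_{t+1}, filled in backward."""
    returns = np.zeros(len(rewards))
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

def train_episode(env, policy, gamma=0.99, max_steps=500):
    """Collect one full episode, then apply a single REINFORCE update."""
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(max_steps):
        action = policy.select_action(state)
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    policy.update(states, actions, compute_returns(rewards, gamma))
    return sum(rewards)
```

Note the backward recursion: each G_t reuses G_{t+1}, so the whole vector costs O(T) instead of O(T²).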

Baselines and Advantage — Reducing Variance

REINFORCE works in principle, but it has a nasty problem: high variance. If every episode produces a positive return (common when rewards are non-negative), then every action's probability increases — good actions a lot, bad actions a little. The signal is noisy because we're using raw returns that swing wildly between episodes.

The fix: subtract a baseline b from the return. Instead of weighting by Gt, weight by Gt − b. This doesn't change the expected gradient (it's still unbiased) but dramatically reduces variance.
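You can verify both claims numerically on a one-state, two-action problem: score-function gradient estimates with and without a mean-return baseline agree on average but differ wildly in spread. Everything below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
p1 = 0.5                                    # pi(action 1) under a uniform policy
a = rng.integers(0, 2, size=n)              # sampled actions
G = rng.normal(5.0, 1.0, size=n) + a        # action 1 pays +1 more on average

# Per-sample gradient of log pi w.r.t. action 1's logit: 1[a=1] - p1
score = (a == 1).astype(float) - p1

g_plain = G * score                         # REINFORCE weight: raw return
g_baseline = (G - G.mean()) * score         # weight: return minus baseline

print(g_plain.mean(), g_baseline.mean())    # both estimate the same gradient
print(g_plain.var(), g_baseline.var())      # baseline variance is far smaller
```

Both estimators center on the true gradient (about 0.25 here), but the baseline version's variance is dramatically smaller: the same lesson that motivates the critic below.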

The best baseline is the value function V(s) — the expected return from state s. Then Gt − V(st) is the advantage:

A_t = G_t − V(s_t)

The advantage measures "how much better was this action than average?" If A_t > 0, the action was better than expected — increase its probability. If A_t < 0, worse than expected — decrease it. If you read the RLHF post, this is exactly the advantages variable in the PPO update. Now you know what it means.

This leads to the actor-critic architecture: two networks working together. The actor (policy network πθ) decides what to do. The critic (value network Vφ) evaluates how good the decision was. The critic provides the baseline that reduces the actor's gradient variance:

class ActorCritic:
    """Actor (policy) + Critic (value function) for variance reduction."""
    def __init__(self, state_dim, hidden_dim, n_actions, lr=0.005):
        # Shared first layer, separate heads
        s = np.sqrt(2.0 / state_dim)
        h = np.sqrt(2.0 / hidden_dim)
        self.W1 = np.random.randn(state_dim, hidden_dim) * s
        self.b1 = np.zeros(hidden_dim)
        # Actor head: outputs action probabilities
        self.W_actor = np.random.randn(hidden_dim, n_actions) * h
        self.b_actor = np.zeros(n_actions)
        # Critic head: outputs state value (scalar)
        self.W_critic = np.random.randn(hidden_dim, 1) * h
        self.b_critic = np.zeros(1)
        self.lr = lr

    def forward(self, state):
        self.z1 = state @ self.W1 + self.b1
        self.h1 = np.maximum(0, self.z1)
        # Actor: action probabilities
        logits = self.h1 @ self.W_actor + self.b_actor
        exp_l = np.exp(logits - np.max(logits))
        self.probs = exp_l / exp_l.sum()
        # Critic: state value
        self.value = (self.h1 @ self.W_critic + self.b_critic)[0]
        return self.probs, self.value

    def update(self, states, actions, returns):
        for state, action, G in zip(states, actions, returns):
            probs, value = self.forward(state)
            advantage = G - value  # How much better than expected?

            # Actor update: policy gradient scaled by advantage
            grad_logits = -probs.copy()
            grad_logits[action] += 1.0
            grad_logits *= advantage

            # Critic update: minimize (G - V(s))^2
            critic_grad = -2.0 * advantage  # d/dV (G - V)^2

            # Backprop actor head
            grad_W_actor = self.h1.reshape(-1, 1) @ grad_logits.reshape(1, -1)
            grad_h1_actor = grad_logits @ self.W_actor.T

            # Backprop critic head (direction of increasing critic *loss*)
            grad_W_critic = self.h1.reshape(-1, 1) * critic_grad
            grad_h1_critic = (self.W_critic * critic_grad).flatten()

            # Shared layer: ascend the actor objective, descend the critic
            # loss (hence the minus sign on the critic term)
            grad_h1 = grad_h1_actor - 0.5 * grad_h1_critic
            grad_z1 = grad_h1 * (self.z1 > 0)

            # Gradient ascent for the actor, descent for the critic
            self.W_actor += self.lr * grad_W_actor
            self.b_actor += self.lr * grad_logits
            self.W_critic -= self.lr * 0.5 * grad_W_critic
            self.b_critic -= self.lr * 0.5 * critic_grad
            self.W1 += self.lr * state.reshape(-1, 1) @ grad_z1.reshape(1, -1)
            self.b1 += self.lr * grad_z1

The actor-critic architecture is the backbone of modern RL. In the RLHF post, the language model is the actor and a separate value network is the critic. PPO, the algorithm used to fine-tune ChatGPT, is an actor-critic method. We're building toward it.

Try It: Policy Gradient Balancer

A pole balances on a cart. The policy network learns to push left or right to keep it upright. Toggle between REINFORCE (high variance, learns slowly) and Actor-Critic (lower variance, learns faster). Watch the episode length grow as the agent improves.


From Policy Gradients to PPO — The Bridge to RLHF

We've built the full toolkit: Q-learning for tabular problems, REINFORCE for simple policy optimization, and actor-critic for variance reduction. But there's one more piece needed to connect to RLHF: Proximal Policy Optimization (PPO).

The problem with vanilla policy gradients: large updates can be catastrophic. If the policy changes too much in one step, it collects bad data under the new policy, which makes the next update worse, which makes the data worse — a death spiral. We need to constrain how much the policy can change.

PPO (Schulman et al., 2017) solves this elegantly. Compute the ratio between the new and old policy:

r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)

If the ratio is 1.0, the policy hasn't changed. If it's 2.0, the new policy is twice as likely to take this action. PPO clips this ratio to stay within [1 − ε, 1 + ε] (typically ε = 0.2):

L^CLIP = min(r_t ⋅ A_t, clip(r_t, 1−ε, 1+ε) ⋅ A_t)

This is the exact same clipped objective from the RLHF post. Now you know where it comes from — it's the natural solution to the "don't change too much" problem in policy optimization.

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """PPO's clipped surrogate objective.

    log_probs_new: log pi_theta(a|s) for each token/action
    log_probs_old: log pi_old(a|s) from data collection
    advantages:    A_t = G_t - V(s_t)
    """
    # Policy ratio: how much did the policy change?
    ratio = np.exp(log_probs_new - log_probs_old)

    # Unclipped objective: ratio * advantage
    unclipped = ratio * advantages

    # Clipped objective: keep ratio in [1-eps, 1+eps]
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages

    # Take the minimum — pessimistic bound prevents overconfident updates.
    # This is an objective to *maximize*; negate it if your optimizer minimizes.
    objective = np.minimum(unclipped, clipped).mean()
    return objective

The RLHF objective adds one more ingredient — a KL divergence penalty that prevents the fine-tuned model πθ from drifting too far from the reference model πref:

max_θ E[r(x, y)] − β ⋅ D_KL(π_θ || π_ref)

The KL penalty is exploration control at a higher level — it keeps the language model close to its pre-trained distribution, just as PPO's clipping keeps the policy close to its previous version. Every concept in this equation was built from scratch in this post: the policy πθ, the reward r, the advantage hidden inside PPO, and the KL constraint. The RLHF post applies them to language models.
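The KL term itself costs almost nothing to estimate: PPO already tracks per-token log-probs for the ratio, and the mean log-difference under samples from π_θ is a simple Monte Carlo estimate of D_KL(π_θ || π_ref). A minimal sketch (the function name and β value are illustrative):

```python
import numpy as np

def kl_penalty(logp_policy, logp_ref, beta=0.1):
    """Estimate beta * KL(pi_theta || pi_ref) from per-token log-probs of
    tokens sampled from pi_theta: KL ~= mean(log pi_theta - log pi_ref)."""
    return beta * np.mean(logp_policy - logp_ref)

# Identical models: no penalty
lp = np.log(np.array([0.7, 0.2, 0.1]))
print(kl_penalty(lp, lp))            # 0.0
# Policy gives its own samples more mass than the reference: positive penalty
print(kl_penalty(lp, lp - 0.4) > 0)  # True
```

Subtracting this penalty from the reward is what keeps the fine-tuned model's text anchored to the reference distribution.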

Where Reinforcement Learning Fits in the Series

RL touches nearly every other topic we've covered: softmax and temperature return as exploration strategies, TD errors are losses with moving targets, and policy gradients are backpropagation applied to decisions instead of labels.

Reinforcement learning is the paradigm that closes the loop. Supervised learning tells a model what the right answer is. Self-supervised learning lets a model invent its own puzzles. But RL hands a model an environment and says: figure it out. From the Bellman equation to PPO, the ideas in this post power everything from game-playing agents to the RLHF pipeline that fine-tunes the language models we use every day. The next time you see a chatbot that's surprisingly good at following instructions, remember: underneath the hood, it's a policy network, optimized by an actor-critic algorithm, balancing exploration and exploitation — exactly the machinery we just built from scratch.

References & Further Reading