Reinforcement Learning from Scratch: Teaching Machines to Learn from Consequences
The Third Paradigm
Every other post in this series hands the model a dataset. Supervised learning gets labels: "this image is a cat." Self-supervised learning manufactures its own labels from augmentations and masking. Generative models learn the shape of a distribution. But there's a third paradigm we haven't touched — one where there is no dataset at all.
In reinforcement learning, an agent wanders through a world, takes actions, and occasionally stumbles into a reward. Nobody tells it the right answer. Nobody hands it a gradient pointing toward "cat." It has to figure out, from sparse and delayed consequences, which of its past decisions were good and which were terrible. The data isn't given — the agent creates it through interaction.
This isn't just academic completeness. If you read the RLHF post in this series, you saw terms like policy, advantage, PPO, and KL penalty flying around. That post assumed you understood the RL machinery underneath. This post provides those foundations. By the end, every equation in the RLHF post will feel like an old friend.
We'll build the entire framework from scratch in pure Python and NumPy: Markov decision processes, value functions, Q-learning, exploration strategies, policy gradients, and actor-critic methods. Along the way, you'll watch an agent learn to navigate a gridworld, see three exploration strategies compete on a bandit problem, and train a policy network to balance a pole.
Markov Decision Processes — The Language of RL
Before an agent can learn anything, we need a mathematical framework for "an agent interacting with a world." That framework is the Markov Decision Process (MDP), and every RL algorithm operates within it.
An MDP is defined by five components:
- States (S) — what the agent can observe. A gridworld position, a game screen, the joint angles of a robot arm.
- Actions (A) — what the agent can do. Move up, move right, apply 3 newtons of torque.
- Transition function P(s′|s, a) — how the world responds. Given a state and action, what's the probability of ending up in each next state? This can be deterministic (chess board moves) or stochastic (dice rolls).
- Reward function R(s, a, s′) — the signal that guides learning. +1 for reaching the goal, -1 for falling off a cliff, -0.01 for each step (to encourage efficiency).
- Discount factor γ ∈ [0, 1) — how much the agent cares about future vs. immediate rewards. A reward of 1.0 received 10 steps from now is worth γ¹⁰ today. With γ = 0.99, that's about 0.90. With γ = 0.5, it's only about 0.001.
The Markov property is the key assumption: the future depends only on the current state, not on how we got here. The entire history of the episode is compressed into the current state. This makes the math tractable — we only need to reason about "where am I now?", not "what was my full trajectory?"
The agent's goal is to find a policy π(a|s) — a mapping from states to action probabilities — that maximizes the expected return:

Gₜ = rₜ₊₁ + γ·rₜ₊₂ + γ²·rₜ₊₃ + γ³·rₜ₊₄ + ⋯
This is the discounted sum of all future rewards. The discount factor γ ensures this sum converges (it's a geometric series) and encodes a preference for sooner rewards over later ones — a dollar today beats a dollar tomorrow.
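The geometric-series claim is easy to check numerically. A quick sketch — the `discounted_return` helper is ours, invented for illustration, not part of the gridworld code below:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode."""
    return sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards))

# A constant stream of reward 1.0 sums to 1/(1 - gamma); 500 steps gets close.
print(discounted_return(np.ones(500), gamma=0.99))    # approaches 1/(1 - 0.99) = 100
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With γ = 0.99 the agent effectively "sees" about 100 steps into the future; with γ = 0.5, only a handful.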
Let's build a concrete MDP — a simple gridworld:
import numpy as np

class Gridworld:
    """A 5x5 grid: agent navigates from start to goal, avoiding walls."""

    def __init__(self):
        self.size = 5
        self.start = (0, 0)
        self.goal = (4, 4)
        self.walls = {(1, 1), (2, 1), (3, 1), (1, 3), (2, 3)}
        self.actions = {0: (-1, 0), 1: (1, 0),   # up, down
                        2: (0, -1), 3: (0, 1)}   # left, right
        self.state = self.start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.actions[action]
        nr, nc = self.state[0] + dr, self.state[1] + dc
        # Stay in place if hitting a wall or going out of bounds
        if (0 <= nr < self.size and 0 <= nc < self.size
                and (nr, nc) not in self.walls):
            self.state = (nr, nc)
        reward = 1.0 if self.state == self.goal else -0.01
        done = self.state == self.goal
        return self.state, reward, done

    def render(self):
        for r in range(self.size):
            row = ""
            for c in range(self.size):
                if (r, c) == self.state: row += " A "
                elif (r, c) == self.goal: row += " G "
                elif (r, c) in self.walls: row += " # "
                else: row += " . "
            print(row)
This is a complete MDP. States are (row, col) tuples. Actions are integers 0-3. The transition is deterministic — move in the chosen direction unless you'd hit a wall. Reward is +1 at the goal, -0.01 per step. The agent's job: find the shortest path from (0,0) to (4,4) around the walls.
Value Functions and the Bellman Equation
Before we can learn a good policy, we need to measure how good a state is. The state-value function Vπ(s) tells us the expected return from state s if we follow policy π:

Vπ(s) = E[Gₜ | sₜ = s] = E[rₜ₊₁ + γ·Vπ(sₜ₊₁) | sₜ = s]
Read that second form carefully — it's recursive. The value of a state equals the immediate reward plus the discounted value of wherever you end up. This is the Bellman equation, and it's the beating heart of RL.
There's also the action-value function Qπ(s, a) — the expected return from state s if we take action a and then follow π. This is the Q the RLHF post computes. While V tells you "how good is this state?", Q tells you "how good is this specific action in this state?"
When we ask for the optimal value function V*(s) — the best return achievable by any policy — the Bellman equation becomes the Bellman optimality equation:

V*(s) = maxₐ Σₛ′ P(s′|s, a)·[R(s, a, s′) + γ·V*(s′)]
The intuition: the value of a state under the best possible policy is the best action's reward plus the discounted value of the resulting state. Pick the best action at every step, and you get the optimal value.
Value iteration exploits this equation directly. Start with V(s) = 0 everywhere, then repeatedly apply the Bellman update until the values converge:
def value_iteration(env, gamma=0.99, theta=1e-6):
    """Find optimal value function by iterating the Bellman equation."""
    V = np.zeros((env.size, env.size))
    while True:
        delta = 0
        for r in range(env.size):
            for c in range(env.size):
                if (r, c) in env.walls or (r, c) == env.goal:
                    continue
                old_v = V[r, c]
                # Try each action, pick the best
                action_values = []
                for a in range(4):
                    dr, dc = env.actions[a]
                    nr = max(0, min(env.size - 1, r + dr))
                    nc = max(0, min(env.size - 1, c + dc))
                    if (nr, nc) in env.walls:
                        nr, nc = r, c  # bounce back
                    reward = 1.0 if (nr, nc) == env.goal else -0.01
                    action_values.append(reward + gamma * V[nr, nc])
                V[r, c] = max(action_values)
                delta = max(delta, abs(old_v - V[r, c]))
        if delta < theta:
            break

    # Extract policy: at each state, pick the best action
    policy = np.zeros((env.size, env.size), dtype=int)
    for r in range(env.size):
        for c in range(env.size):
            if (r, c) in env.walls or (r, c) == env.goal:
                continue
            best_a, best_val = 0, -float('inf')
            for a in range(4):
                dr, dc = env.actions[a]
                nr = max(0, min(env.size - 1, r + dr))
                nc = max(0, min(env.size - 1, c + dc))
                if (nr, nc) in env.walls:
                    nr, nc = r, c
                reward = 1.0 if (nr, nc) == env.goal else -0.01
                val = reward + gamma * V[nr, nc]
                if val > best_val:
                    best_a, best_val = a, val
            policy[r, c] = best_a
    return V, policy

env = Gridworld()
V, policy = value_iteration(env)
# V now contains the optimal value at each cell
# policy[r, c] gives the best action (0=up, 1=down, 2=left, 3=right)
After convergence, the value function forms a gradient pointing toward the goal — cells near the goal have high values (~1.0), cells far away have lower values, and the optimal policy at each cell simply follows the uphill direction. This is the same principle behind gradient ascent, except in state space instead of parameter space.
But there's a catch: value iteration requires knowing P(s′|s, a) — the full transition model. We had to hardcode the gridworld physics. What if we don't know how the world works?
Q-Learning — Learning Without a Model
Value iteration assumes we know the rules of the game — every transition probability, every reward. But in most real problems, the agent doesn't have a manual. It has to learn from experience, one step at a time.
Q-learning (Watkins, 1989) solves this by learning the action-value function Q(s, a) directly from interactions. No model needed. The update rule is beautifully simple:

Q(s, a) ← Q(s, a) + α·[r + γ·maxₐ′ Q(s′, a′) − Q(s, a)]
Let's unpack this. The term in brackets is the temporal difference (TD) error:

δ = r + γ·maxₐ′ Q(s′, a′) − Q(s, a)
This is the surprise signal. The agent predicted Q(s, a) for this state-action pair, but the actual outcome was r + γ max Q(s′, a′). If δ > 0, the outcome was better than expected — increase Q. If δ < 0, worse than expected — decrease Q. The learning rate α controls how much to adjust.
If you've read the loss functions post, this should look familiar. The TD error is essentially a loss — we're minimizing the gap between our prediction Q(s, a) and the "target" r + γ max Q(s′, a′). The difference from supervised learning? The target moves as Q improves. We're chasing a moving goalpost.
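To make the moving-target idea concrete, here is one TD update on a single transition, with made-up numbers (all values here are illustrative, not taken from the gridworld):

```python
alpha, gamma = 0.5, 0.9

q_sa = 0.0        # current estimate Q(s, a)
reward = 1.0      # observed reward r
max_q_next = 2.0  # max_a' Q(s', a') under current estimates

td_target = reward + gamma * max_q_next  # 1.0 + 0.9 * 2.0 = 2.8
td_error = td_target - q_sa              # 2.8 - 0.0 = 2.8 (pleasant surprise)
q_sa += alpha * td_error                 # 0.0 + 0.5 * 2.8 = 1.4
print(q_sa)  # 1.4 -- halfway to the target; repeat and it keeps chasing
```

Notice that `td_target` itself contains a Q-estimate, so as Q improves, the target the next update chases will have moved.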
Q-learning has two crucial properties. It's model-free — no transition probabilities needed, just interact and observe. And it's off-policy — it can learn the optimal policy even while following a different one (like a random explorer). Here's the full training loop:
def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Train a Q-table on the gridworld using Q-learning."""
    # Q-table: state (r, c) x action (4 directions)
    Q = np.zeros((env.size, env.size, 4))
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            r, c = state
            # Epsilon-greedy: explore with prob epsilon, exploit otherwise
            if np.random.random() < epsilon:
                action = np.random.randint(4)
            else:
                action = np.argmax(Q[r, c])
            next_state, reward, done = env.step(action)
            nr, nc = next_state
            # The Q-learning update
            td_target = reward + gamma * np.max(Q[nr, nc]) * (1 - done)
            td_error = td_target - Q[r, c, action]
            Q[r, c, action] += alpha * td_error
            state = next_state
    return Q

env = Gridworld()
Q = q_learning(env, episodes=1000)

# Extract the learned policy
policy = np.argmax(Q, axis=2)  # Best action at each state
# action_names = {0: '↑', 1: '↓', 2: '←', 3: '→'}
After 1000 episodes of bumbling around, the Q-table converges to the optimal action-value function. At each cell, argmax Q[r, c] gives the best action. The agent has learned the shortest path to the goal — without ever being told how the gridworld works.
Try It: Gridworld Q-Learning
Watch a Q-learning agent discover the optimal path through a 7×7 grid. Each cell shows four directional arrows — brighter green means higher Q-value. The agent (blue) learns which directions lead to the goal (gold).
The Exploration-Exploitation Dilemma
You've probably noticed the epsilon parameter lurking in the Q-learning code. This controls a fundamental tension in RL: exploration vs. exploitation.
Should the agent go to the restaurant it knows is good (exploit), or try the new place that might be better (explore)? Exploit too much and you miss better options. Explore too much and you waste time on bad choices. Every RL agent must navigate this tradeoff.
Three classic strategies:
ε-greedy: With probability ε, take a random action. Otherwise, take the greedy (best-known) action. Simple and widely used. The downside: it explores uniformly — a clearly terrible action gets the same exploration probability as a promising one.
Softmax (Boltzmann) exploration: Choose actions proportionally to their estimated value using — you guessed it — the softmax function:

π(a|s) = exp(Q(s, a)/τ) / Σₐ′ exp(Q(s, a′)/τ)
If you've read the softmax and temperature post, this is exactly the same mechanism. Temperature τ controls the tradeoff: high τ → uniform (explore), low τ → peaked (exploit). The agent explores more where Q-values are uncertain, less where one action is clearly dominant.
Upper Confidence Bound (UCB): Be optimistic about actions you haven't tried much. Select the action that maximizes Q(s, a) + c ⋅ √(ln(t) / N(s, a)), where N(s, a) counts how many times we've tried action a in state s. Rarely-tried actions get a bonus that shrinks as we gather more data. This naturally balances exploration (try the uncertain) with exploitation (prefer the known-good).
def epsilon_greedy(Q, state, epsilon):
    """Random action with prob epsilon, greedy otherwise."""
    if np.random.random() < epsilon:
        return np.random.randint(len(Q[state]))
    return np.argmax(Q[state])

def softmax_explore(Q, state, tau=1.0):
    """Sample action proportionally to exp(Q/tau)."""
    q = Q[state]
    q_shifted = q - np.max(q)  # numerical stability
    probs = np.exp(q_shifted / tau) / np.sum(np.exp(q_shifted / tau))
    return np.random.choice(len(q), p=probs)

def ucb_select(Q, N, state, t, c=2.0):
    """Select action maximizing Q + exploration bonus."""
    n = N[state]
    # The 1e-8 avoids division by zero; untried actions get a huge bonus
    bonus = c * np.sqrt(np.log(t + 1) / (n + 1e-8))
    return np.argmax(Q[state] + bonus)

# --- Multi-armed bandit comparison ---
n_arms = 5
true_rewards = np.random.randn(n_arms) * 0.5 + 1.0  # hidden means
n_pulls = 1000

for strategy_name, select_fn in [("e-greedy", "eg"),
                                 ("softmax", "sm"),
                                 ("UCB", "ucb")]:
    Q_est = np.zeros(n_arms)
    N_count = np.zeros(n_arms)
    total_reward = 0
    for t in range(n_pulls):
        # slice(None) indexes the whole 1-D array: the bandit is one big "state"
        if select_fn == "eg":
            arm = epsilon_greedy(Q_est, slice(None), epsilon=0.1)
        elif select_fn == "sm":
            arm = softmax_explore(Q_est, slice(None), tau=0.5)
        else:
            arm = ucb_select(Q_est, N_count, slice(None), t)
        reward = np.random.randn() * 0.3 + true_rewards[arm]
        N_count[arm] += 1
        Q_est[arm] += (reward - Q_est[arm]) / N_count[arm]  # running mean
        total_reward += reward
    best_possible = true_rewards.max() * n_pulls
    regret = best_possible - total_reward
    print(f"{strategy_name:>10}: reward={total_reward:.1f}, regret={regret:.1f}")
The multi-armed bandit is the purest form of the exploration-exploitation problem — no states, no transitions, just pick an arm and get a reward. UCB typically wins in theory (provably optimal regret bounds), but ε-greedy's simplicity makes it the default in practice. And softmax gives you the temperature knob — the same one that controls creativity in language model decoding.
Try It: Exploration Strategies
Five slot machines with hidden reward distributions. Three strategies compete in real time — watch which one finds the best arm fastest and accumulates the most reward. Regret measures the cost of not always picking the best arm.
Policy Gradients — Learning the Policy Directly
Q-learning stores a value for every (state, action) pair in a table. That works for a 5×5 grid with 4 actions (100 entries). But what about Atari (millions of pixel states)? Or a robot with continuous joint angles? The Q-table explodes.
Policy gradient methods take a different approach: instead of learning values and deriving a policy, learn the policy directly as a neural network πθ(a|s) that outputs action probabilities given a state. This is the πθ from the RLHF post — now you'll see where it comes from.
The objective is straightforward: maximize expected cumulative reward:

J(θ) = E[Σₜ γᵗ·rₜ]   (with trajectories sampled by following πθ)
The magic is in the gradient. The policy gradient theorem tells us:

∇θ J(θ) = E[Σₜ ∇θ log πθ(aₜ|sₜ)·Gₜ]
This is remarkable. To increase expected reward, adjust the policy parameters in the direction of ∇ log π weighted by the return Gt. Actions that led to high return get their probability increased. Actions that led to low return get decreased. The log-probability gradient ∇ log π shows up here and everywhere in the RLHF post — it's the fundamental currency of policy optimization.
The REINFORCE algorithm (Williams, 1992) implements this directly: run a complete episode, compute returns, update the policy:
class PolicyNetwork:
    """A simple 2-layer policy network for discrete actions."""

    def __init__(self, state_dim, hidden_dim, n_actions, lr=0.01):
        # He initialization (the sqrt(2/fan_in) scaling suits ReLU layers)
        self.W1 = np.random.randn(state_dim, hidden_dim) * np.sqrt(2.0 / state_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, n_actions) * np.sqrt(2.0 / hidden_dim)
        self.b2 = np.zeros(n_actions)
        self.lr = lr

    def forward(self, state):
        """Forward pass: state -> action probabilities."""
        self.z1 = state @ self.W1 + self.b1
        self.h1 = np.maximum(0, self.z1)  # ReLU
        self.logits = self.h1 @ self.W2 + self.b2
        # Softmax for action probabilities
        exp_logits = np.exp(self.logits - np.max(self.logits))
        self.probs = exp_logits / exp_logits.sum()
        return self.probs

    def select_action(self, state):
        """Sample an action from the policy distribution."""
        probs = self.forward(state)
        action = np.random.choice(len(probs), p=probs)
        return action

    def update(self, states, actions, returns):
        """REINFORCE update: increase prob of actions with high returns."""
        for state, action, G in zip(states, actions, returns):
            probs = self.forward(state)
            # Gradient of log pi(a|s) for the chosen action
            # d/d(logits) log(softmax(logits)[a]) = one_hot(a) - probs
            grad_logits = -probs.copy()
            grad_logits[action] += 1.0  # one_hot - softmax
            # Scale by return: good returns -> increase this action's prob
            grad_logits *= G
            # Backprop through the network
            grad_W2 = self.h1.reshape(-1, 1) @ grad_logits.reshape(1, -1)
            grad_b2 = grad_logits
            grad_h1 = grad_logits @ self.W2.T
            grad_z1 = grad_h1 * (self.z1 > 0)  # ReLU gradient
            grad_W1 = state.reshape(-1, 1) @ grad_z1.reshape(1, -1)
            grad_b1 = grad_z1
            # Gradient ascent (we want to maximize return)
            self.W2 += self.lr * grad_W2
            self.b2 += self.lr * grad_b2
            self.W1 += self.lr * grad_W1
            self.b1 += self.lr * grad_b1
The key line is grad_logits[action] += 1.0 — this computes the gradient of log π(a|s) with respect to the logits. It's the same softmax derivative from the softmax post, now serving a completely different purpose: steering a policy instead of classifying images.
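One piece the class leaves out is turning an episode's reward list into the returns Gt that update expects. A minimal sketch — the `compute_returns` helper and its normalization step are our additions, a common practical trick rather than something the class above requires:

```python
import numpy as np

def compute_returns(rewards, gamma=0.99, normalize=True):
    """Discounted returns-to-go: G_t = r_t + gamma * G_{t+1}, computed backward."""
    returns = np.zeros(len(rewards))
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    if normalize:  # zero-mean, unit-variance returns act as a crude baseline
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns

# Sparse reward at the episode's end propagates backward, discounted each step
print(compute_returns([0.0, 0.0, 1.0], gamma=0.5, normalize=False))  # G = [0.25, 0.5, 1.0]
```

The backward pass is O(T), and the normalization foreshadows the next section: subtracting the mean return is the simplest possible baseline.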
Baselines and Advantage — Reducing Variance
REINFORCE works in principle, but it has a nasty problem: high variance. If every episode produces a positive return (common when rewards are non-negative), then every action's probability increases — good actions a lot, bad actions a little. The signal is noisy because we're using raw returns that swing wildly between episodes.
The fix: subtract a baseline b from the return. Instead of weighting by Gt, weight by Gt − b. This doesn't change the expected gradient (it's still unbiased) but dramatically reduces variance.
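You can verify both claims — unchanged expected gradient, reduced variance — by enumerating a tiny one-step example exactly. The two-action setup and its reward values are invented for illustration:

```python
import numpy as np

# Uniform policy over two actions with fixed rewards 1.0 and 2.0
probs = np.array([0.5, 0.5])
rewards = np.array([1.0, 2.0])

def grad_stats(baseline):
    """Exact mean/variance of the per-sample REINFORCE gradient w.r.t. logit 0."""
    # d/d(logit_0) log pi(a) = one_hot(a)[0] - pi(0)
    grad_log = np.array([1.0, 0.0]) - probs[0]
    samples = grad_log * (rewards - baseline)  # one gradient sample per action
    mean = np.sum(probs * samples)             # exact expectation over actions
    var = np.sum(probs * samples**2) - mean**2
    return mean, var

print(grad_stats(0.0))  # no baseline:        mean -0.25, variance 0.5625
print(grad_stats(1.5))  # baseline = E[r]:    mean -0.25, variance 0.0
```

The mean is identical in both cases, but with the right baseline the variance collapses — in this toy case all the way to zero, because the baseline exactly cancels the shared part of the returns.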
The best baseline is the value function V(s) — the expected return from state s. Then Gt − V(st) is the advantage:

Aₜ = Gₜ − V(sₜ)
The advantage measures "how much better was this action than average?" If At > 0, the action was better than expected — increase its probability. If At < 0, worse than expected — decrease it. If you read the RLHF post, this is exactly the advantages variable in the PPO update. Now you know what it means.
This leads to the actor-critic architecture: two networks working together. The actor (policy network πθ) decides what to do. The critic (value network Vφ) evaluates how good the decision was. The critic provides the baseline that reduces the actor's gradient variance:
class ActorCritic:
    """Actor (policy) + Critic (value function) for variance reduction."""

    def __init__(self, state_dim, hidden_dim, n_actions, lr=0.005):
        # Shared first layer, separate heads
        s = np.sqrt(2.0 / state_dim)
        h = np.sqrt(2.0 / hidden_dim)
        self.W1 = np.random.randn(state_dim, hidden_dim) * s
        self.b1 = np.zeros(hidden_dim)
        # Actor head: outputs action probabilities
        self.W_actor = np.random.randn(hidden_dim, n_actions) * h
        self.b_actor = np.zeros(n_actions)
        # Critic head: outputs state value (scalar)
        self.W_critic = np.random.randn(hidden_dim, 1) * h
        self.b_critic = np.zeros(1)
        self.lr = lr

    def forward(self, state):
        self.z1 = state @ self.W1 + self.b1
        self.h1 = np.maximum(0, self.z1)
        # Actor: action probabilities
        logits = self.h1 @ self.W_actor + self.b_actor
        exp_l = np.exp(logits - np.max(logits))
        self.probs = exp_l / exp_l.sum()
        # Critic: state value
        self.value = (self.h1 @ self.W_critic + self.b_critic)[0]
        return self.probs, self.value

    def update(self, states, actions, returns):
        for state, action, G in zip(states, actions, returns):
            probs, value = self.forward(state)
            advantage = G - value  # How much better than expected?
            # Actor update: policy gradient scaled by advantage
            grad_logits = -probs.copy()
            grad_logits[action] += 1.0
            grad_logits *= advantage
            # Critic update: minimize (G - V(s))^2
            critic_grad = -2.0 * advantage  # d/dV (G - V)^2
            # Backprop actor head
            grad_W_actor = self.h1.reshape(-1, 1) @ grad_logits.reshape(1, -1)
            grad_h1_actor = grad_logits @ self.W_actor.T
            # Backprop critic head
            grad_W_critic = self.h1.reshape(-1, 1) * critic_grad
            grad_h1_critic = (self.W_critic * critic_grad).flatten()
            # Combined gradient through shared layer: ascend the actor objective,
            # descend the (scaled) critic loss -- hence the minus sign
            grad_h1 = grad_h1_actor - 0.5 * grad_h1_critic
            grad_z1 = grad_h1 * (self.z1 > 0)
            # Gradient ascent for actor, descent for critic
            self.W_actor += self.lr * grad_W_actor
            self.b_actor += self.lr * grad_logits
            self.W_critic -= self.lr * 0.5 * grad_W_critic
            self.b_critic -= self.lr * 0.5 * critic_grad
            self.W1 += self.lr * state.reshape(-1, 1) @ grad_z1.reshape(1, -1)
            self.b1 += self.lr * grad_z1
The actor-critic architecture is the backbone of modern RL. In the RLHF post, the language model is the actor and a separate value network is the critic. PPO, the algorithm used to fine-tune ChatGPT, is an actor-critic method. We're building toward it.
Try It: Policy Gradient Balancer
A pole balances on a cart. The policy network learns to push left or right to keep it upright. Toggle between REINFORCE (high variance, learns slowly) and Actor-Critic (lower variance, learns faster). Watch the episode length grow as the agent improves.
From Policy Gradients to PPO — The Bridge to RLHF
We've built the full toolkit: Q-learning for tabular problems, REINFORCE for simple policy optimization, and actor-critic for variance reduction. But there's one more piece needed to connect to RLHF: Proximal Policy Optimization (PPO).
The problem with vanilla policy gradients: large updates can be catastrophic. If the policy changes too much in one step, it collects bad data under the new policy, which makes the next update worse, which makes the data worse — a death spiral. We need to constrain how much the policy can change.
PPO (Schulman et al., 2017) solves this elegantly. Compute the ratio between the new and old policy:

rₜ(θ) = πθ(aₜ|sₜ) / πold(aₜ|sₜ)
If the ratio is 1.0, the policy hasn't changed. If it's 2.0, the new policy is twice as likely to take this action. PPO clips this ratio to stay within [1 − ε, 1 + ε] (typically ε = 0.2):

L_CLIP(θ) = E[min(rₜ(θ)·Aₜ, clip(rₜ(θ), 1 − ε, 1 + ε)·Aₜ)]
This is the exact same clipped objective from the RLHF post. Now you know where it comes from — it's the natural solution to the "don't change too much" problem in policy optimization.
def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """PPO's clipped surrogate objective (maximize this; negate for a
    loss to hand to a minimizer).

    log_probs_new: log pi_theta(a|s) for each token/action
    log_probs_old: log pi_old(a|s) from data collection
    advantages: A_t = G_t - V(s_t)
    """
    # Policy ratio: how much did the policy change?
    ratio = np.exp(log_probs_new - log_probs_old)
    # Unclipped objective: ratio * advantage
    unclipped = ratio * advantages
    # Clipped objective: keep ratio in [1-eps, 1+eps]
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # Take the minimum — pessimistic bound prevents overconfident updates
    objective = np.minimum(unclipped, clipped).mean()
    return objective
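A few numbers make the clipping behavior visible. The ratios and advantages below are arbitrary, chosen only to exercise each branch of the objective:

```python
import numpy as np

epsilon = 0.2
ratios = np.array([0.5, 1.0, 1.8])      # policy shrank / unchanged / grew
advantages = np.array([1.0, 1.0, 1.0])  # every action looked good here

unclipped = ratios * advantages
clipped = np.clip(ratios, 1 - epsilon, 1 + epsilon) * advantages
objective = np.minimum(unclipped, clipped)
print(objective)  # per-sample values 0.5, 1.0, 1.2 -- the 1.8 ratio is capped at 1.2
```

Note the asymmetry of the min: with a positive advantage, a ratio that grew past 1 + ε is capped (no extra credit for overshooting), but a ratio that shrank is penalized in full — the pessimistic bound only ever removes incentive to move far from the old policy.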
The RLHF objective adds one more ingredient — a KL divergence penalty that prevents the fine-tuned model πθ from drifting too far from the reference model πref:

maximize E[r(x, y)] − β·KL(πθ(·|x) ‖ πref(·|x))
The KL penalty is exploration control at a higher level — it keeps the language model close to its pre-trained distribution, just as PPO's clipping keeps the policy close to its previous version. Every concept in this equation was built from scratch in this post: the policy πθ, the reward r, the advantage hidden inside PPO, and the KL constraint. The RLHF post applies them to language models.
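In practice the KL term is often estimated per token from the two models' log-probabilities on the sampled sequence. A sketch with invented numbers — real RLHF code would pull `logp_theta` and `logp_ref` from the actual policy and reference models:

```python
import numpy as np

beta = 0.1
# Log-probs each model assigns to the sampled tokens (made-up values)
logp_theta = np.array([-1.0, -0.5, -2.0])  # fine-tuned policy
logp_ref = np.array([-1.2, -1.0, -1.9])    # frozen reference model

# Simple per-token KL estimate: log pi_theta - log pi_ref at the sampled tokens
kl_per_token = logp_theta - logp_ref
reward_from_rm = 0.8  # scalar score from the reward model
shaped_reward = reward_from_rm - beta * kl_per_token.sum()
print(shaped_reward)  # 0.8 - 0.1 * 0.6 = 0.74
```

The shaped reward is what PPO actually optimizes: the reward model's score, docked in proportion to how far the policy's token probabilities have drifted from the reference.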
Where Reinforcement Learning Fits in the Series
RL touches nearly every other topic we've covered. Here's the map:
- RLHF from Scratch — This post provides the foundation. Policy, advantage, PPO, KL penalty — all defined here, applied to language models there.
- Loss Functions — The TD error in Q-learning is a loss, but with a moving target. Supervised losses have a fixed answer; RL's "answer" changes as the agent improves.
- Optimizers — Same Adam optimizer under the hood, but maximizing expected return instead of minimizing a fixed loss.
- Softmax & Temperature — Boltzmann exploration IS softmax with temperature applied to Q-values. Same mechanism, new context: action selection instead of classification.
- Decoding Strategies — Language model sampling is a policy. Temperature, top-k, top-p are all exploration strategies over the token action space.
- Micrograd — Policy gradients require backpropagation — the same chain rule from micrograd, applied to expected reward instead of a training loss.
- GANs — Both RL and GANs learn from non-stationary signals. The discriminator changes as the generator improves, just as the apparent reward distribution shifts as the policy improves.
- Embeddings — State representations in RL are embeddings — compressed representations of the environment that the agent learns to act on.
- Knowledge Distillation — The actor-critic architecture shares the teacher-student dynamic: the critic teaches the actor by providing value estimates as a learning signal.
Reinforcement learning is the paradigm that closes the loop. Supervised learning tells a model what the right answer is. Self-supervised learning lets a model invent its own puzzles. But RL hands a model an environment and says: figure it out. From the Bellman equation to PPO, the ideas in this post power everything from game-playing agents to the RLHF pipeline that fine-tunes the language models we use every day. The next time you see a chatbot that's surprisingly good at following instructions, remember: underneath the hood, it's a policy network, optimized by an actor-critic algorithm, balancing exploration and exploitation — exactly the machinery we just built from scratch.
References & Further Reading
- Sutton & Barto — Reinforcement Learning: An Introduction (2nd ed., 2018) — The definitive textbook, freely available online. Everything in this post and much more.
- Watkins & Dayan (1992) — "Q-Learning" — The original paper introducing Q-learning as a model-free RL algorithm.
- Williams (1992) — "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" — The REINFORCE algorithm — the first policy gradient method.
- Schulman et al. (2017) — "Proximal Policy Optimization Algorithms" — PPO, the algorithm behind RLHF fine-tuning of language models.
- David Silver's RL Course (UCL/DeepMind) — Outstanding video lectures covering all core RL concepts with clear mathematical derivations.
- OpenAI Spinning Up — Practical deep RL tutorial with clean, well-documented implementations.
- Lilian Weng — "A (Long) Peek into Reinforcement Learning" — Thorough and clearly written blog post overview of the RL landscape.