Bandit Algorithms from Scratch: The Explore-Exploit Dilemma That Powers Modern AI
The Multi-Armed Bandit Problem
You walk into a casino with 10 slot machines. Each has a different payout probability, but you don't know which is which. You have 1,000 pulls. How do you maximize your winnings?
If you always pull the machine that's paid out best so far, you might get stuck on a mediocre one that happened to win early. If you pull machines at random to learn more, you're wasting pulls on bad ones. This tension between exploitation (choosing what looks best) and exploration (trying uncertain options to learn more) is the multi-armed bandit problem — and it's everywhere.
Netflix deciding which thumbnail to show you? Bandit problem. A clinical trial allocating patients to treatments? Bandit problem. Your ad platform choosing which creative to serve this millisecond? Bandit problem. Any time an algorithm must make sequential decisions under uncertainty, it's facing the explore-exploit dilemma.
Formally, we have K arms (options), each with an unknown reward distribution with mean μk. At each time step t, we pick an arm and observe a reward. Our goal is to minimize cumulative regret — the gap between what the best arm would have earned and what we actually earned:
RT = T · μ* − Σt=1..T rt, where μ* = maxk μk
The multi-armed bandit is also the simplest possible reinforcement learning problem — there are no states, no transitions, just actions and rewards. Everything in RL builds on solving this fundamental dilemma first.
Let's start with the most naive approach and watch it fail.
```python
import numpy as np

def bernoulli_bandit(K, seed=42):
    """Create a K-armed bandit with random Bernoulli reward probabilities."""
    rng = np.random.RandomState(seed)
    probs = rng.uniform(0.1, 0.9, size=K)
    return probs

def pull_arm(probs, arm, rng):
    """Pull an arm and get a Bernoulli reward (0 or 1)."""
    return 1 if rng.random() < probs[arm] else 0

def run_greedy(probs, T=1000, seed=0):
    """Pure greedy: always pick the arm with highest observed average."""
    K = len(probs)
    rng = np.random.RandomState(seed)
    counts = np.zeros(K)
    values = np.zeros(K)
    total_reward = 0
    # Pull each arm once to initialize
    for arm in range(K):
        reward = pull_arm(probs, arm, rng)
        counts[arm] = 1
        values[arm] = reward
        total_reward += reward
    # Then always exploit
    for t in range(K, T):
        arm = np.argmax(values)
        reward = pull_arm(probs, arm, rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    best_arm = np.argmax(probs)
    regret = T * probs[best_arm] - total_reward
    return counts, values, regret, best_arm

probs = bernoulli_bandit(K=5, seed=42)
print(f"True arm probabilities: {probs.round(3)}")
print(f"Best arm: {np.argmax(probs)} (p={probs.max():.3f})\n")

counts, values, regret, best = run_greedy(probs, T=2000)
print(f"Greedy chose arm {np.argmax(counts)} most often ({int(counts.max())} pulls)")
print(f"Pull distribution: {counts.astype(int)}")
print(f"Cumulative regret: {regret:.1f}")
```
The greedy agent pulls each arm once, then locks onto whichever arm happened to pay out during that initial round. If arm 3 is truly best but arm 1 won on its first pull, greedy commits to arm 1 and may never return. Across repeated runs, greedy locks onto a suboptimal arm a substantial fraction of the time — a catastrophic failure rate for such a simple problem.
The root cause is clear: greedy never goes back to check if its initial impression was wrong. We need strategies that explore.
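You can measure that failure rate directly. The sketch below re-implements the greedy loop so it's self-contained, then reruns it across many random seeds on the same bandit and counts how often greedy's most-pulled arm isn't the true best arm (the exact percentage will vary with the bandit instance):

```python
import numpy as np

def greedy_picks_best(probs, T=2000, seed=0):
    """Run pure greedy (one initial pull per arm, then always exploit)
    and report whether its most-pulled arm is the true best arm."""
    K = len(probs)
    rng = np.random.RandomState(seed)
    counts = np.ones(K)
    # Initialize each arm's estimate from a single pull
    values = np.array([1.0 if rng.random() < p else 0.0 for p in probs])
    for t in range(K, T):
        arm = np.argmax(values)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return np.argmax(counts) == np.argmax(probs)

rng = np.random.RandomState(42)
probs = rng.uniform(0.1, 0.9, size=5)
n_runs = 200
failures = sum(not greedy_picks_best(probs, seed=s) for s in range(n_runs))
print(f"Greedy missed the best arm in {failures}/{n_runs} runs "
      f"({100 * failures / n_runs:.0f}%)")
```

Note the mechanism: whenever the best arm's first pull returns 0 while some other arm's returns 1, greedy exploits the other arm and never revisits the zero-valued one.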
Epsilon-Greedy — The Simplest Explorer
The most intuitive fix is embarrassingly simple: most of the time, exploit (pick the best-looking arm), but with some small probability ε, explore (pick a random arm). This is ε-greedy, and despite its simplicity, it dominates pure greedy.
The key trade-off: a large ε (say 0.3) explores aggressively but wastes too many pulls on random arms even after you've identified the best one. A small ε (say 0.01) converges quickly but might not explore enough early on to find the true best arm.
A natural fix is to decay ε over time. Start with lots of exploration, then gradually focus on exploitation as your estimates improve. A classic schedule is εt = 1/√t — this explores heavily early on and tapers off, achieving sublinear regret.
```python
def run_epsilon_greedy(probs, T=5000, epsilon=0.1, decay=False, seed=0):
    """Epsilon-greedy with optional decay schedule."""
    K = len(probs)
    rng = np.random.RandomState(seed)
    counts = np.zeros(K)
    values = np.zeros(K)
    regret_history = np.zeros(T)
    cumulative_regret = 0
    best_mean = probs.max()
    for t in range(T):
        eps = 1.0 / np.sqrt(t + 1) if decay else epsilon
        if rng.random() < eps:
            arm = rng.randint(K)
        else:
            arm = np.argmax(values)
        reward = pull_arm(probs, arm, rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        cumulative_regret += best_mean - probs[arm]
        regret_history[t] = cumulative_regret
    return regret_history, counts

probs = bernoulli_bandit(K=5, seed=42)
reg_fixed_10, _ = run_epsilon_greedy(probs, T=5000, epsilon=0.1, seed=0)
reg_fixed_01, _ = run_epsilon_greedy(probs, T=5000, epsilon=0.01, seed=0)
reg_decay, _ = run_epsilon_greedy(probs, T=5000, decay=True, seed=0)
print(f"eps=0.10 fixed -> final regret: {reg_fixed_10[-1]:.1f}")
print(f"eps=0.01 fixed -> final regret: {reg_fixed_01[-1]:.1f}")
print(f"eps=1/sqrt(t)  -> final regret: {reg_decay[-1]:.1f}")
```
But ε-greedy has a fundamental flaw: when it explores, it picks a random arm uniformly. If you've pulled arm 2 a thousand times and know its payout is 0.15, ε-greedy will still waste pulls on it. What if we could explore smarter — directing exploration toward arms we're genuinely uncertain about?
Upper Confidence Bound — Optimism in the Face of Uncertainty
The Upper Confidence Bound (UCB) algorithm is built on a beautiful principle: be optimistic about uncertain arms. Instead of picking the arm with the highest estimated reward, pick the arm with the highest optimistic estimate — the estimated reward plus a bonus for uncertainty.
The UCB1 formula (Auer, Cesa-Bianchi & Fischer, 2002) selects the arm that maximizes:
At = argmaxk [ Q̂k + c · √(ln t / Nk) ]
Let's unpack this. Q̂k is the estimated average reward of arm k — that's exploitation. The second term, c · √(ln t / Nk), is the confidence bonus — that's exploration. Notice what happens:
- If you haven't pulled arm k much (Nk is small), the bonus is large → you're drawn to try it
- If you've pulled it many times (Nk is large), the bonus shrinks → you rely on the estimated reward
- As total time t grows, ln t grows slowly → the bonus rises just enough to ensure every arm gets revisited occasionally
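A toy calculation makes the bonus concrete (the estimates and counts below are made up for illustration): two arms with identical estimated rewards but different pull counts get different UCB scores, and the under-explored arm wins.

```python
import numpy as np

# Hypothetical state after t=100 total pulls: both arms estimate ~0.5,
# but arm 0 has been pulled 90 times and arm 1 only 10 times.
values = np.array([0.5, 0.5])
counts = np.array([90, 10])
t = 100
c = np.sqrt(2)

bonus = c * np.sqrt(np.log(t) / counts)
ucb = values + bonus
print(f"Exploration bonus: {bonus.round(3)}")  # arm 1's bonus is ~3x larger
print(f"UCB scores:        {ucb.round(3)}")
print(f"UCB1 picks arm {np.argmax(ucb)}")
```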
The theoretical guarantee is striking. UCB1 achieves expected regret bounded by:
E[RT] ≤ Σi: Δi>0 (8 ln T) / Δi + (1 + π²/3) · Σj Δj
where Δi = μ* − μi is how much worse arm i is than the best. The regret grows only logarithmically in T — you pay a fixed cost to identify bad arms, then mostly pull the best one forever. And there's no ε to tune: UCB automatically balances exploration and exploitation.
If you've read our post on Bayesian optimization, this will look familiar — the acquisition functions there (GP-UCB, Expected Improvement) are bandit algorithms applied to function optimization.
```python
def run_ucb1(probs, T=5000, c=np.sqrt(2), seed=0):
    """UCB1: always pick the arm with highest upper confidence bound."""
    K = len(probs)
    rng = np.random.RandomState(seed)
    counts = np.zeros(K)
    values = np.zeros(K)
    regret_history = np.zeros(T)
    cumulative_regret = 0
    best_mean = probs.max()
    # Pull each arm once
    for arm in range(K):
        reward = pull_arm(probs, arm, rng)
        counts[arm] = 1
        values[arm] = reward
        cumulative_regret += best_mean - probs[arm]
        regret_history[arm] = cumulative_regret
    for t in range(K, T):
        ucb_values = values + c * np.sqrt(np.log(t) / counts)
        arm = np.argmax(ucb_values)
        reward = pull_arm(probs, arm, rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        cumulative_regret += best_mean - probs[arm]
        regret_history[t] = cumulative_regret
    return regret_history, counts

probs = bernoulli_bandit(K=5, seed=42)
reg_ucb, counts_ucb = run_ucb1(probs, T=5000)
print(f"UCB1 final regret: {reg_ucb[-1]:.1f}")
print(f"Pull distribution: {counts_ucb.astype(int)}")
print(f"Best arm gets {int(counts_ucb[np.argmax(probs)])} of 5000 pulls")
```
Watch how UCB1 distributes its pulls: the best arm gets the vast majority, while suboptimal arms get just a handful of pulls — enough to confirm they're worse, and no more. Compare this to ε-greedy, which keeps wasting random pulls on every arm equally.
Thompson Sampling — The Bayesian Bandit
There's an even more elegant approach that predates UCB by 69 years. In 1933, William Thompson proposed a Bayesian solution: instead of computing confidence bounds, maintain a probability distribution (a posterior) over each arm's true payout, then sample from each posterior and pull whichever arm produced the highest sample.
For Bernoulli bandits, this is wonderfully clean. Each arm's payout probability is unknown, so we model it with a Beta distribution. We start with Beta(1, 1) — a uniform prior expressing total ignorance. Every time we pull an arm and see a success, we add 1 to α. Every failure, we add 1 to β. The posterior is always Beta(α, β), which is the beauty of conjugate priors (see our Bayesian inference post for the details).
The decision rule is delightfully simple: sample a value from each arm's Beta posterior, then pull the arm with the highest sample. Why does this explore? If an arm's posterior is wide (we're uncertain), its samples will occasionally be very high, causing us to try it. If a posterior is narrow and peaked at a low value (we're confident it's bad), its samples will almost never win. Exploration happens exactly when and where it's needed, driven by uncertainty rather than random chance.
```python
def run_thompson(probs, T=5000, seed=0):
    """Thompson Sampling with Beta posteriors for Bernoulli bandits."""
    K = len(probs)
    rng = np.random.RandomState(seed)
    alpha = np.ones(K)  # Beta prior: starts at Beta(1,1) = uniform
    beta_param = np.ones(K)
    counts = np.zeros(K)
    regret_history = np.zeros(T)
    cumulative_regret = 0
    best_mean = probs.max()
    for t in range(T):
        # Sample from each arm's posterior
        samples = rng.beta(alpha, beta_param)
        arm = np.argmax(samples)
        # Pull arm and observe reward
        reward = pull_arm(probs, arm, rng)
        # Update posterior
        alpha[arm] += reward
        beta_param[arm] += 1 - reward
        counts[arm] += 1
        cumulative_regret += best_mean - probs[arm]
        regret_history[t] = cumulative_regret
    return regret_history, counts, alpha, beta_param

probs = bernoulli_bandit(K=5, seed=42)
reg_ts, counts_ts, final_alpha, final_beta = run_thompson(probs, T=5000)
print(f"Thompson Sampling final regret: {reg_ts[-1]:.1f}")
print(f"Pull distribution: {counts_ts.astype(int)}")
print(f"\nPosterior parameters (alpha, beta) per arm:")
for i in range(len(probs)):
    mean_est = final_alpha[i] / (final_alpha[i] + final_beta[i])
    print(f"  Arm {i}: Beta({final_alpha[i]:.0f}, {final_beta[i]:.0f})"
          f" -> mean={mean_est:.3f} (true={probs[i]:.3f})")
```
Look at the posterior parameters: the best arm has a large α and small β (many successes), giving a tight posterior peaked at its true probability. Suboptimal arms have small α values — Thompson barely wasted pulls on them. The posteriors converge to the true probabilities, and exploration naturally dies off as uncertainty shrinks.
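A nice byproduct of the Bayesian view: the probability that each arm is best can be read off by sampling all posteriors jointly and counting wins — and that is exactly the probability with which Thompson Sampling pulls each arm. A sketch with hypothetical Beta parameters standing in for a mid-run state:

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical posteriors after some pulls: (alpha, beta) per arm
alpha = np.array([3.0, 40.0, 2.0])
beta = np.array([5.0, 12.0, 9.0])

# Monte Carlo: draw from every posterior jointly, count how often each arm wins
samples = rng.beta(alpha, beta, size=(100_000, len(alpha)))
p_best = np.bincount(np.argmax(samples, axis=1), minlength=len(alpha)) / len(samples)
print(f"Posterior means: {(alpha / (alpha + beta)).round(3)}")
print(f"P(arm is best):  {p_best.round(3)}")
```

Arm 1's tight, high posterior wins nearly every draw, while the wide posterior of arm 0 still wins occasionally — that residual win rate is Thompson's remaining exploration.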
In 2012, Kaufmann, Korda, and Munos proved what practitioners had long suspected: Thompson Sampling is asymptotically optimal. It matches the Lai-Robbins lower bound (more on that next), making it both the oldest and one of the best bandit algorithms known. Sometimes the simplest Bayesian idea is also the theoretically optimal one.
Regret Analysis — How Good Can We Get?
We've now seen four strategies: greedy, ε-greedy, UCB1, and Thompson Sampling. How do they compare, and is there a theoretical limit to how well any algorithm can do?
In 1985, Lai and Robbins proved a remarkable lower bound: no consistent algorithm can achieve expected regret better than:
lim inf RT / ln T ≥ Σk: μk < μ* Δk / KL(μk, μ*)
where KL(μk, μ*) is the Kullback-Leibler divergence between the reward distributions of arm k and the optimal arm. This is profound: the difficulty of distinguishing a bad arm from the best arm is measured by KL divergence — the information-theoretic distance between their distributions. The harder two arms are to tell apart (small KL), the more samples you need, and the more regret you inevitably accumulate.
Exploration isn't a cost you can engineer away. It's the price of learning, and information theory sets the floor.
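You can compute this floor directly. For Bernoulli arms, KL(p, q) = p ln(p/q) + (1−p) ln((1−p)/(1−q)), and the bound's constant is the sum of Δk / KL(μk, μ*) over the suboptimal arms. A sketch on a made-up bandit (the arm means are invented for illustration):

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

mus = np.array([0.2, 0.5, 0.55, 0.6])  # hypothetical arm means
mu_star = mus.max()

# Lai-Robbins: regret / ln T is bounded below by sum of gap / KL per bad arm
constant = sum((mu_star - mu) / bernoulli_kl(mu, mu_star)
               for mu in mus if mu < mu_star)
print(f"Lai-Robbins constant: {constant:.2f}")
print(f"Regret floor at T=10,000: ~{constant * np.log(10_000):.0f}")
```

Notice which arm dominates the sum: the 0.55 arm, closest to the best, contributes far more than the obviously-bad 0.2 arm. Nearly-equal arms are the expensive ones to rule out.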
Here's how our algorithms stack up:
- Pure Greedy: O(T) linear regret — never learns, locks onto first winner
- ε-Greedy (fixed): O(εT) linear regret — keeps exploring forever, even when unnecessary
- ε-Greedy (decaying): O(√T) sublinear — better, but still not optimal
- UCB1: O(√(KT ln T)) gap-independent — logarithmic in gap-dependent form, but with a constant of 8 vs. Lai-Robbins' optimal
- Thompson Sampling: Matches the Lai-Robbins bound — asymptotically optimal
```python
probs = bernoulli_bandit(K=5, seed=42)
T = 10000

# Run all four algorithms on the same bandit
_, _, greedy_regret, _ = run_greedy(probs, T=T, seed=0)
reg_eps, _ = run_epsilon_greedy(probs, T=T, epsilon=0.1, seed=0)
reg_ucb, _ = run_ucb1(probs, T=T, seed=0)
reg_ts, _, _, _ = run_thompson(probs, T=T, seed=0)

print(f"Cumulative regret after {T:,} steps:")
print(f"  Greedy:         {greedy_regret:.1f}")
print(f"  Epsilon-greedy: {reg_eps[-1]:.1f}")
print(f"  UCB1:           {reg_ucb[-1]:.1f}")
print(f"  Thompson:       {reg_ts[-1]:.1f}")

print(f"\nAsymptotic rates:")
print(f"  Greedy:     O(T)              -- linear, catastrophic")
print(f"  Eps-greedy: O(eps * T)        -- linear, proportional to eps")
print(f"  UCB1:       O(sqrt(KT log T)) -- sublinear, logarithmic")
print(f"  Thompson:   Lai-Robbins       -- optimal, can't do better")
```
The gap between linear and logarithmic regret is enormous in practice. After 10,000 steps, greedy might accumulate 500+ regret while Thompson Sampling sits at 30-50. That's the difference between wasting a third of your budget and wasting 1%.
Contextual Bandits — When Arms Have Features
Real-world bandits rarely exist in a vacuum. When Netflix chooses a thumbnail, it knows who you are. When an ad platform picks a creative, it knows the page content. When a clinical trial assigns a treatment, it knows the patient's medical history. The contextual bandit extends our framework: at each step, we observe a context (features) before choosing an arm.
The most elegant contextual algorithm is LinUCB (Li et al., 2010), which models the expected reward as a linear function of context: E[r | x] = θᵀx. For each arm, we maintain a ridge regression estimate of θ and add a UCB-style bonus based on the uncertainty of our prediction.
The algorithm keeps a matrix A (accumulating outer products of context vectors) and a vector b (accumulating reward-weighted contexts). The UCB score for each arm is θ̂ᵀx + α · √(xᵀA⁻¹x) — our reward estimate plus an exploration bonus proportional to how uncertain we are about this particular context.
```python
def run_linucb(n_articles, d_features, T=1000, alpha=1.0, seed=0):
    """LinUCB for contextual bandits with linear reward model."""
    rng = np.random.RandomState(seed)
    # True (hidden) article reward weights
    true_theta = rng.randn(n_articles, d_features) * 0.5
    # Per-arm models: A = d x d matrix, b = d vector
    A = [np.eye(d_features) for _ in range(n_articles)]
    b = [np.zeros(d_features) for _ in range(n_articles)]
    total_reward = 0
    optimal_reward = 0
    for t in range(T):
        # Observe context (user features)
        context = rng.randn(d_features)
        context = context / np.linalg.norm(context)
        # Compute UCB for each article
        ucb_scores = np.zeros(n_articles)
        for arm in range(n_articles):
            A_inv = np.linalg.inv(A[arm])
            theta_hat = A_inv @ b[arm]
            bonus = alpha * np.sqrt(context @ A_inv @ context)
            ucb_scores[arm] = theta_hat @ context + bonus
        chosen = np.argmax(ucb_scores)
        # Observe reward (linear + noise)
        true_rewards = [true_theta[a] @ context for a in range(n_articles)]
        noise = rng.randn() * 0.1
        reward = true_rewards[chosen] + noise
        # Update chosen arm's model
        A[chosen] += np.outer(context, context)
        b[chosen] += reward * context
        total_reward += reward
        optimal_reward += max(true_rewards)
    regret = optimal_reward - total_reward
    print(f"LinUCB after {T} rounds:")
    print(f"  Total reward:    {total_reward:.1f}")
    print(f"  Optimal reward:  {optimal_reward:.1f}")
    print(f"  Regret:          {regret:.1f}")
    print(f"  Avg regret/step: {regret/T:.4f}")

run_linucb(n_articles=5, d_features=4, T=2000, alpha=1.0)
```
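One practical note on the implementation above: inverting A from scratch every round costs O(d³) per arm. Since each update is a rank-one change (A ← A + xxᵀ), the inverse can instead be maintained incrementally with the Sherman-Morrison formula in O(d²). A sketch of that update, checked against direct inversion:

```python
import numpy as np

def sherman_morrison_update(A_inv, x):
    """Return inv(A + outer(x, x)) given A_inv = inv(A), in O(d^2).
    Sherman-Morrison: inv(A + xx^T) = A_inv - (A_inv x)(A_inv x)^T / (1 + x^T A_inv x),
    using symmetry of A."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

# Verify: apply ten rank-one updates and compare to a fresh inverse
rng = np.random.RandomState(0)
d = 4
A = np.eye(d)
A_inv = np.eye(d)
for _ in range(10):
    x = rng.randn(d)
    A += np.outer(x, x)
    A_inv = sherman_morrison_update(A_inv, x)
print(f"Max deviation from direct inverse: "
      f"{np.abs(A_inv - np.linalg.inv(A)).max():.2e}")
```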
LinUCB was deployed on Yahoo's front page "Today Module" and achieved a 12.5% click-through-rate improvement over context-free bandits. This connected bandits to the recommender systems world, where the explore-exploit trade-off determines what content millions of users see.
Real-World Applications — Bandits Everywhere
The bandit framework extends far beyond slot machines:
A/B Testing vs. Adaptive Experiments. Traditional A/B tests split traffic 50/50 between variants for a fixed duration — even when one variant is clearly winning after the first day. Bandit-based "adaptive experiments" dynamically shift traffic toward the winning variant, reducing the cost of testing. You still get valid statistical conclusions, but fewer users see the worse option. This connects to our discussion of evaluation methodology — bandits let you evaluate and optimize simultaneously.
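A two-variant sketch of this idea, using the Beta-Bernoulli Thompson update from earlier (the conversion rates here are invented): early on traffic splits roughly evenly, and as evidence accumulates the better variant absorbs most of it.

```python
import numpy as np

rng = np.random.RandomState(1)
rates = [0.08, 0.16]  # hypothetical conversion rates: variant B is better
alpha = np.ones(2)    # Beta(1,1) priors per variant
beta = np.ones(2)
pulls = np.zeros(2, dtype=int)

for phase in range(4):
    for _ in range(2500):  # 2,500 visitors per phase
        # Thompson Sampling decides which variant this visitor sees
        arm = int(np.argmax(rng.beta(alpha, beta)))
        converted = rng.random() < rates[arm]
        alpha[arm] += converted
        beta[arm] += 1 - converted
        pulls[arm] += 1
    share_b = pulls[1] / pulls.sum()
    print(f"After {pulls.sum():>6,} visitors: "
          f"{100 * share_b:.0f}% of traffic on variant B")
```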
Clinical Trials. The explore-exploit trade-off has an ethical dimension in medicine. Traditional trials randomly assign patients to treatment and control groups, which means half the patients receive the inferior treatment even after strong evidence accumulates. Adaptive trial designs use bandit algorithms to allocate more patients to the treatment that appears better while still maintaining statistical rigor.
Hyperparameter Tuning. Hyperband — one of the most popular hyperparameter optimization methods — is essentially a bandit algorithm over configurations. It uses successive halving to quickly discard unpromising hyperparameter settings, treating each configuration as an arm. Our Bayesian optimization post covers the GP-UCB approach, which is literally a bandit algorithm over a continuous function space.
LLM Routing. Modern AI systems increasingly use bandit algorithms to route queries to different models. Easy questions go to fast, cheap models; hard questions go to expensive, powerful ones. The system learns which model works best for each query type — a contextual bandit problem where the context is the query and the arms are models.
The multi-armed bandit is one of those rare problems where the mathematics is beautiful, the algorithms are simple to implement, and the practical impact is enormous. From a casino thought experiment to a framework that powers billions of daily decisions across the internet — all because someone asked: should I try something new, or stick with what works?
Play with the demos below to build intuition. Watch how greedy locks onto its first impression, how ε-greedy wastes effort on known-bad arms, and how Thompson Sampling's posteriors gracefully narrow onto the truth.
Try It: The Bandit Casino
Select an algorithm and watch it play, or click arms yourself to compete. Each arm has a hidden payout probability — can you beat the algorithms?
Try It: Posterior Evolution Viewer
Watch Thompson Sampling's Beta posteriors evolve in real-time. Each curve represents the algorithm's belief about an arm's true payout. Dashed lines show the hidden true probabilities.
References & Further Reading
- William Thompson — "On the Likelihood that One Unknown Probability Exceeds Another" (1933) — The original Thompson Sampling paper, 69 years ahead of its time
- Auer, Cesa-Bianchi & Fischer — "Finite-time Analysis of the Multiarmed Bandit Problem" (2002) — The UCB1 paper with the first tight finite-time regret bounds
- Li, Chu, Langford & Schapire — "A Contextual-Bandit Approach to Personalized News Article Recommendation" (2010) — LinUCB deployed at Yahoo with real-world CTR improvements
- Kaufmann, Korda & Munos — "Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis" (2012) — Proved Thompson Sampling matches the Lai-Robbins lower bound
- Russo, Van Roy, Kazerouni, Osband & Wen — "A Tutorial on Thompson Sampling" (2018) — Comprehensive modern tutorial covering theory and applications
- Aleksandrs Slivkins — "Introduction to Multi-Armed Bandits" (2019) — Excellent monograph covering the full landscape of bandit algorithms
- Lattimore & Szepesvári — "Bandit Algorithms" (2020) — The definitive textbook, freely available online at banditalgs.com
- DadOps — Reinforcement Learning from Scratch — Bandits are the simplest RL problem; this post covers the full MDP framework
- DadOps — Bayesian Inference from Scratch — The Beta-Bernoulli conjugate prior that powers Thompson Sampling
- DadOps — Information Theory from Scratch — KL divergence appears in the Lai-Robbins lower bound as the fundamental cost of learning