Online Learning from Scratch
1. The Online Learning Game
Most machine learning assumes your data is i.i.d. — independent samples drawn from a nice, well-behaved distribution. But what happens when the data fights back?
A spammer adapts to your filter in real time. A market reacts to your trades before you can blink. A user's preferences shift faster than your model can retrain. In these settings, the i.i.d. assumption isn't just wrong — it's dangerous.
Online learning throws away the i.i.d. assumption entirely. It frames learning as a repeated game between a learner and an adversary. Each round t = 1, 2, …, T:
- The learner picks an action wt (a prediction, a weight vector, a distribution over experts)
- The adversary reveals a loss function ℓt
- The learner suffers loss ℓt(wt)
The adversary can be anything — nature, a strategic opponent, or even the worst-case sequence designed to maximize your pain. The learner's goal is to minimize regret: the gap between their cumulative loss and the loss of the best fixed action in hindsight.
RT = Σt=1..T ℓt(wt) − minw Σt=1..T ℓt(w)
If regret is sublinear — say O(√T) — then the per-round regret RT/T → 0 as T grows. The learner converges to the performance of the best fixed strategy, even though it never knew what that strategy was. And crucially, this guarantee holds no matter what the adversary does.
This full-information setting is different from the bandit setting, where you only see the loss of the action you chose. Here the learner observes the entire loss vector — every expert's loss, every action's outcome — and more information means tighter bounds.
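To make the regret definition concrete, here is a tiny sketch with a made-up loss table (all numbers purely illustrative): a learner that alternates between actions 0 and 1, compared against the best single column in hindsight.

```python
import numpy as np

# Hypothetical 4-round game with 3 actions; rows = rounds, cols = actions
losses = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.5],
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.5],
])
played = [0, 1, 0, 1]  # the learner's (unlucky) choices each round

learner_loss = sum(losses[t, a] for t, a in enumerate(played))
best_fixed = losses.sum(axis=0).min()  # best single action in hindsight
regret = learner_loss - best_fixed
print(f"learner: {learner_loss:.1f}, best fixed: {best_fixed:.1f}, regret: {regret:.1f}")
```

Here the learner manages to pick the worst action every round, so its regret (1.6) is large; any sublinear-regret algorithm would be forced toward column 1 as the rounds accumulate.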
2. Prediction with Expert Advice
The simplest online learning problem: you have N experts, each making a prediction every round. You need to combine their advice into a single decision. Which experts should you trust?
The Weighted Majority algorithm is beautifully simple. Start by trusting everyone equally. Each round, make the majority-weighted prediction. Then penalize the experts who were wrong by shrinking their weights:
wt+1(i) = wt(i) · (1 − ε) if expert i was wrong
The parameter ε ∈ (0, 1) controls how aggressively you punish mistakes. Too high and you overreact to noise; too low and you're slow to abandon bad experts. Tuned well, the deterministic algorithm makes at most roughly twice the best expert's mistakes plus an O(log N) overhead, and the randomized variant — which follows a single expert sampled in proportion to its weight — achieves regret O(√(T log N)). The dependence on the number of experts is only logarithmic, which means you can handle thousands of experts with barely any extra cost.
Let's implement it. We create 8 experts with varying accuracy, run 200 rounds with random labels, and watch the algorithm track the best expert:
import numpy as np

np.random.seed(42)
T, N = 200, 8
accuracies = [0.85, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45]

# Ground truth and expert predictions
labels = np.random.randint(0, 2, T)
expert_preds = np.zeros((T, N), dtype=int)
for i, acc in enumerate(accuracies):
    correct = np.random.random(T) < acc
    expert_preds[:, i] = np.where(correct, labels, 1 - labels)

# Weighted Majority Algorithm
weights = np.ones(N)
epsilon = 0.3
algo_loss = 0
for t in range(T):
    # Weighted majority vote
    vote = np.zeros(2)
    for c in range(2):
        vote[c] = weights[expert_preds[t] == c].sum()
    pred = np.argmax(vote)
    # Update: penalize wrong experts
    algo_loss += int(pred != labels[t])
    wrong = expert_preds[t] != labels[t]
    weights[wrong] *= (1 - epsilon)

best = min(np.sum(expert_preds[:, i] != labels) for i in range(N))
print(f"Algorithm mistakes: {algo_loss}")
print(f"Best expert mistakes: {best}")
print(f"Regret: {algo_loss - best}")
print(f"Per-round regret: {(algo_loss - best) / T:.4f}")
# Algorithm: ~42 mistakes, Best expert: ~30, Regret: ~12
The key insight: even without knowing in advance which expert is best, the algorithm's regret stays small. As T grows, the per-round regret vanishes. Weights encode accumulated trust — experts who fail lose credibility exponentially fast.
3. The Multiplicative Weights Update
Weighted Majority handles binary right/wrong outcomes. But what about continuous losses in [0, 1]? The Multiplicative Weights Update (MW) generalizes elegantly. Instead of zeroing out wrong experts, we downweight them proportionally to their loss:
wt+1(i) ∝ wt(i) · exp(−η · ℓt(i))
The learning rate η controls sensitivity. The fundamental theorem of MW gives a tight regret bound:
Regret ≤ (ln N)/η + ηT/2
Setting η = √(2 ln N / T) minimizes this to √(2T ln N) = O(√(T ln N)). The bound is tight up to constants — there is an adversary that forces Ω(√(T ln N)) regret on any algorithm — so MW is minimax optimal.
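A quick numeric check of the tradeoff, with toy values of T and N chosen only for illustration: the analytic η matches a brute-force grid search, and the bound at that η equals √(2T ln N).

```python
import numpy as np

# Toy horizon and expert count, for illustration only
T, N = 1000, 10

def bound(eta):
    # Regret bound (ln N)/eta + eta*T/2 as a function of the learning rate
    return np.log(N) / eta + eta * T / 2

eta_star = np.sqrt(2 * np.log(N) / T)         # the analytic minimizer
etas = np.linspace(1e-3, 1.0, 10001)
numeric_best = etas[np.argmin(bound(etas))]   # brute-force minimizer on a grid

print(f"analytic eta*: {eta_star:.4f}, grid minimizer: {numeric_best:.4f}")
print(f"bound at eta*: {bound(eta_star):.2f}, sqrt(2 T ln N): {np.sqrt(2 * T * np.log(N)):.2f}")
```

The first term of the bound falls in η while the second rises, so the minimum sits where they balance — a standard AM-GM argument.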
The reach of MW extends far beyond expert advice. AdaBoost is MW applied to training examples. In game theory, MW converges to Nash equilibria. In combinatorial optimization, MW solves linear programs approximately. It's one of the most versatile algorithms in computer science.
Let's see MW handle an adversarial scenario where the "best" expert shifts every 100 rounds:
import numpy as np

np.random.seed(123)
T, N = 500, 5
eta = np.sqrt(2 * np.log(N) / T)  # optimal learning rate

# Adversarial losses: each expert is "best" for 100 rounds
losses = np.random.uniform(0.3, 0.7, (T, N))
for phase in range(5):
    s, e = phase * 100, (phase + 1) * 100
    losses[s:e, phase] = np.random.uniform(0.0, 0.2, 100)

# Multiplicative Weights Update
weights = np.ones(N) / N
cumulative_algo = 0.0
for t in range(T):
    # Expected loss under current distribution
    cumulative_algo += weights @ losses[t]
    # Exponential update + renormalize
    weights *= np.exp(-eta * losses[t])
    weights /= weights.sum()

best_expert = min(losses[:, i].sum() for i in range(N))
regret = cumulative_algo - best_expert
bound = np.sqrt(2 * T * np.log(N))  # (ln N)/eta + eta*T/2 at the optimal eta
print(f"Optimal eta: {eta:.4f}")
print(f"Algorithm loss: {cumulative_algo:.1f}")
print(f"Best expert: {best_expert:.1f}")
print(f"Regret: {regret:.1f}")
print(f"Theory bound: {bound:.1f}")
# Actual regret is well within the sqrt(2 T ln N) bound!
Notice how the algorithm's regret stays well within the theoretical bound, even though the best expert keeps changing. MW doesn't need to know who's best right now — it maintains a distribution that hedges against uncertainty.
Try It: Expert Advice Arena
4. Online Gradient Descent
Expert advice works when you have a finite set of strategies. But what if your action space is continuous — say, all weight vectors w ∈ Rd? You can't enumerate every possible model.
Online Gradient Descent (OGD) extends online learning to convex optimization. The update is exactly what you'd expect — take a gradient step and project back onto the feasible set:
wt+1 = ΠK(wt − ηt ∇ℓt(wt))
With learning rate ηt = O(1/√t), OGD achieves O(√T) regret for convex losses. For strongly convex losses (like squared loss with regularization), the bound improves to O(log T) — exponentially better.
Here's a deep connection that most textbooks gloss over: SGD is just OGD where the "adversary" is random sampling from the dataset. When you run stochastic gradient descent on a training set, you're playing an online learning game where each mini-batch is one round, and the convergence guarantees of SGD come directly from OGD's regret bounds. This is one reason SGD is so robust: its regret guarantee holds even against adversarial data, so i.i.d. samples are an easy special case.
Let's implement OGD for online linear regression, where data arrives one point at a time:
import numpy as np

np.random.seed(7)
d, T = 5, 1000
w_star = np.random.randn(d)  # unknown optimal weights
w = np.zeros(d)
cumulative_loss, best_loss = 0.0, 0.0
for t in range(1, T + 1):
    # Adversary reveals data point
    x = np.random.randn(d)
    y = w_star @ x + 0.1 * np.random.randn()
    # Learner's prediction and loss
    pred = w @ x
    loss = (pred - y) ** 2
    cumulative_loss += loss
    best_loss += (w_star @ x - y) ** 2
    # OGD update: gradient of squared loss
    grad = 2 * (pred - y) * x
    eta = 1.0 / np.sqrt(t)
    w -= eta * grad

regret = cumulative_loss - best_loss
print(f"Cumulative loss (OGD): {cumulative_loss:.1f}")
print(f"Best-in-hindsight: {best_loss:.1f}")
print(f"Regret: {regret:.1f}")
print(f"Regret / sqrt(T): {regret / np.sqrt(T):.2f}")
print(f"||w - w*||: {np.linalg.norm(w - w_star):.4f}")
# Regret scales as O(sqrt(T)), w converges to w_star
The regret grows as O(√T) — confirming the theory. And the weight vector converges toward the true optimum, even though we never had access to the full dataset at once.
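The O(log T) rate for strongly convex losses can be sanity-checked empirically, too. A minimal sketch, under assumptions not made above: the expected curvature of the squared loss here gives strong-convexity modulus λ ≈ 2, the step size is the standard ηt = 1/(λt) for this regime, and a crude norm projection keeps the early large steps stable.

```python
import numpy as np

np.random.seed(0)
d, T = 5, 2000
lam = 2.0            # assumed modulus: E[Hessian] = 2*E[x x^T] = 2I for standard normal x
w_star = np.random.randn(d)
w = np.zeros(d)
regret = 0.0
for t in range(1, T + 1):
    x = np.random.randn(d)
    y = w_star @ x + 0.1 * np.random.randn()
    regret += (w @ x - y) ** 2 - (w_star @ x - y) ** 2
    grad = 2 * (w @ x - y) * x
    w -= grad / (lam * t)         # eta_t = 1/(lam*t) instead of 1/sqrt(t)
    nrm = np.linalg.norm(w)
    if nrm > 5.0:                 # crude projection onto a ball, for early-round stability
        w *= 5.0 / nrm

print(f"Regret after {T} rounds: {regret:.1f} (log T = {np.log(T):.1f})")
```

Most of the regret accumulates in the first handful of rounds; after that, the rapidly shrinking step size pins the iterate near the comparator, which is the mechanism behind the logarithmic bound.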
5. Follow the Regularized Leader
Here's the beautiful unifying insight: multiplicative weights, online gradient descent, and many other algorithms are all instances of a single framework called Follow the Regularized Leader (FTRL).
The idea is elegant. At each round, play the action that minimizes your total past loss, plus a regularization term that prevents you from changing too aggressively:
wt = argminw ∈ K [Σs=1..t-1 ℓs(w) + R(w)]
The choice of regularizer R(w) determines which algorithm you recover:
- R(w) = ½||w||² (L2 / Euclidean) → recovers Online Gradient Descent
- R(w) = Σ wi log wi (negative entropy) → recovers Multiplicative Weights
- R(w) = ||w||₁ (L1) → produces sparse online solutions
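One instance of this correspondence can be checked exactly: for linear losses ℓt(w) = gt·w with no constraints, FTRL with R(w) = ||w||²/(2η) has the closed-form solution wt = −η Σs<t gs, which is precisely the iterate of constant-step gradient descent. A quick sketch with toy gradients (illustrative only):

```python
import numpy as np

np.random.seed(0)
T, d, eta = 50, 4, 0.1
grads = np.random.randn(T, d)   # toy linear losses: l_t(w) = g_t . w

# FTRL with R(w) = ||w||^2 / (2*eta): the argmin has a closed form each round
w_ftrl = np.array([-eta * grads[:t].sum(axis=0) for t in range(T + 1)])

# Constant-step gradient descent on the same linear losses
w = np.zeros(d)
w_gd = [w.copy()]
for g in grads:
    w = w - eta * g
    w_gd.append(w.copy())
w_gd = np.array(w_gd)

print("trajectories match:", np.allclose(w_ftrl, w_gd))
```

With constraints or a different regularizer the argmin no longer collapses to plain gradient descent — that divergence is exactly what the MW-vs-OGD comparison below explores on the simplex.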
FTRL isn't just theory — it runs in production at massive scale. Google's ad click-through rate prediction uses FTRL with L1 regularization to maintain sparse models over billions of features, processing millions of updates per second.
Let's verify the connection: running MW (entropic FTRL) and OGD (L2 FTRL) on the same problem and comparing their behavior:
import numpy as np

np.random.seed(42)
T, N = 300, 6
losses = np.random.uniform(0.3, 0.7, (T, N))
losses[:, 2] = np.random.uniform(0.0, 0.2, T)  # expert 2 best
eta = np.sqrt(2 * np.log(N) / T)

# FTRL with entropic regularizer -> Multiplicative Weights
w_mw = np.ones(N) / N
mw_cum = 0.0
# FTRL with L2 regularizer -> Online Gradient Descent
w_ogd = np.ones(N) / N
ogd_cum = 0.0
for t in range(T):
    # Play distributions, suffer expected loss
    mw_cum += w_mw @ losses[t]
    ogd_cum += w_ogd @ losses[t]
    # MW update (entropic regularizer)
    w_mw *= np.exp(-eta * losses[t])
    w_mw /= w_mw.sum()
    # OGD update (L2 regularizer); clip-at-zero + renormalize is an
    # approximate, not exact Euclidean, projection back onto the simplex
    w_ogd -= eta * losses[t]
    w_ogd = np.maximum(w_ogd, 0)
    if w_ogd.sum() > 0:
        w_ogd /= w_ogd.sum()
    else:
        w_ogd = np.ones(N) / N

best = min(losses[:, i].sum() for i in range(N))
bound = np.sqrt(2 * T * np.log(N))
print(f"MW (entropic FTRL) regret: {mw_cum - best:.1f}")
print(f"OGD (L2 FTRL) regret: {ogd_cum - best:.1f}")
print(f"Theoretical bound: {bound:.1f}")
print(f"\nMW final weights: {np.round(w_mw, 3)}")
print(f"OGD final weights: {np.round(w_ogd, 3)}")
# Both achieve sublinear regret; MW adapts faster on simplex
Both algorithms achieve sublinear regret, but their weight dynamics differ. MW makes multiplicative updates, which are natural on the probability simplex — weights can shrink exponentially fast but never become negative. OGD makes additive updates, requiring projection back onto the simplex. The entropic regularizer is geometrically "right" for distributions, which is why MW tends to outperform OGD in the expert setting.
6. From Online to Batch — Why This All Matters
Here's the punchline that connects online learning to everything you already do. The online-to-batch conversion says: if you have an online algorithm with sublinear regret, you automatically get a good batch learning algorithm for free.
The recipe is simple: run your online algorithm on streaming data, then average all the iterates. The averaged predictor w̄ = (1/T) Σ wt has expected loss within O(1/√T) of the optimal fixed predictor. This is the Polyak-Ruppert averaging technique, and it works because sublinear regret implies the iterates spend most of their time near the optimum.
This means that every time you train a convex model like logistic regression with SGD and average the iterates, you're performing an online-to-batch conversion, and its convergence guarantees come directly from online learning theory. Checkpoint averaging for neural networks is the same idea in spirit, though the formal guarantee requires convexity.
import numpy as np

np.random.seed(99)
n, d = 500, 3
X = np.random.randn(n, d)
w_true = np.array([1.0, -0.5, 0.3])
y = np.sign(X @ w_true + 0.2 * np.random.randn(n))

# Online GD on logistic loss, one sample at a time
w = np.zeros(d)
all_w = [w.copy()]
for t in range(n):
    margin = y[t] * (w @ X[t])
    # sigmoid(-margin): the derivative factor of the logistic loss; clip avoids overflow
    sig = 1.0 / (1.0 + np.exp(np.clip(margin, -30, 30)))
    grad = -y[t] * X[t] * sig
    eta = 1.0 / np.sqrt(t + 1)
    w = w - eta * grad
    all_w.append(w.copy())

# Online-to-batch: Polyak-Ruppert averaging
w_avg = np.mean(all_w, axis=0)
acc_last = np.mean(np.sign(X @ w) == y)
acc_avg = np.mean(np.sign(X @ w_avg) == y)
acc_opt = np.mean(np.sign(X @ w_true) == y)
print(f"Last iterate accuracy: {acc_last:.3f}")
print(f"Averaged (Polyak): {acc_avg:.3f}")
print(f"True-weight accuracy: {acc_opt:.3f}")
# Averaging smooths out SGD oscillations -> better generalization
The averaged iterate typically outperforms the last iterate because it smooths out the oscillations inherent in online updates. The last iterate bounces around the optimum; the average sits right on top of it.
Try It: Online vs Batch Decision Boundary
7. Adaptive Algorithms — When You Don't Know the Geometry
There's a catch with OGD: setting the optimal learning rate requires knowing the time horizon T in advance. In practice, you rarely know how long you'll be learning. The "doubling trick" (restart with doubled T when you run out of budget) works but wastes a constant factor.
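A sketch of the doubling trick on linear losses (the box constraint, the phase schedule, and the toy gradients are made-up illustration, not a reference implementation): each phase restarts the learner with the step size tuned to that phase's length, and phase lengths double until the stream ends.

```python
import numpy as np

def ogd_phase(losses_phase, d):
    """One restart: OGD with eta tuned to this phase's (now known) length."""
    eta = 1.0 / np.sqrt(len(losses_phase))
    w, total = np.zeros(d), 0.0
    for g in losses_phase:                    # linear losses: l(w) = g . w, so grad = g
        total += g @ w
        w = np.clip(w - eta * g, -1.0, 1.0)   # projection onto a box
    return w, total

np.random.seed(3)
d, T = 4, 1023                # 1023 = 1 + 2 + 4 + ... + 512
grads = np.random.randn(T, d)
cum, start, phase_len = 0.0, 0, 1
while start < T:
    w, loss = ogd_phase(grads[start:start + phase_len], d)
    cum += loss
    start += phase_len
    phase_len *= 2

print(f"cumulative loss over {T} rounds: {cum:.1f}")
```

Summing the per-phase √(phase length) regrets over the geometric schedule costs only a constant factor over knowing T up front, but each restart throws away everything learned so far, which is exactly the waste AdaGrad avoids.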
AdaGrad solves this elegantly by maintaining per-coordinate learning rates that adapt to the observed gradients:
ηt,i = η / √(Σs=1..t gs,i²), where gs,i is coordinate i of the round-s gradient and η is a base step size
Features that receive large, frequent gradients get smaller learning rates (they've already learned plenty). Features with rare, small gradients keep large learning rates (they need every update they can get). This is exactly right for sparse data where some features appear rarely — the rare features get amplified updates when they finally show up.
The regret bound adapts to the problem geometry: O(Σi √(Σt=1..T gt,i²)), a per-coordinate sum of gradient norms. If gradients are sparse, this is much tighter than uniform OGD's worst-case bound. And here's the connection that closes the loop: Adam and RMSProp are online learning algorithms with adaptive regularization. Every time you call optimizer.step() in PyTorch, you're running an online learning algorithm descended from this theory.
import numpy as np

np.random.seed(2026)
T, d = 1000, 20
w_true = np.zeros(d)
w_true[0], w_true[5], w_true[15] = 2.0, -1.5, 1.0

# Online squared-loss regression with sparse features
w_uniform, w_ada = np.zeros(d), np.zeros(d)
G = np.zeros(d)  # AdaGrad accumulator
loss_uniform, loss_ada = 0.0, 0.0
for t in range(1, T + 1):
    x = np.zeros(d)
    active = np.random.choice(d, 3, replace=False)
    x[active] = np.random.randn(3)
    y = w_true @ x + 0.1 * np.random.randn()
    # Uniform-eta OGD
    pred_u = w_uniform @ x
    loss_uniform += (pred_u - y) ** 2
    grad = 2 * (pred_u - y) * x
    w_uniform -= (0.1 / np.sqrt(t)) * grad
    # AdaGrad: per-coordinate adaptive eta
    pred_a = w_ada @ x
    loss_ada += (pred_a - y) ** 2
    grad_a = 2 * (pred_a - y) * x
    G += grad_a ** 2
    w_ada -= 0.5 * grad_a / (np.sqrt(G) + 1e-8)

print(f"Uniform OGD loss: {loss_uniform:.1f}")
print(f"AdaGrad loss: {loss_ada:.1f}")
print(f"Improvement: {(1 - loss_ada / loss_uniform) * 100:.1f}%")
top = np.argsort(np.abs(w_ada))[::-1][:5]
for idx in top:
    print(f"  w[{idx:2d}] = {w_ada[idx]:+.3f} (true: {w_true[idx]:+.1f})")
# AdaGrad excels on sparse problems with rare features
AdaGrad discovers the relevant features (0, 5, 15) and learns their weights accurately, while uniform OGD wastes learning rate on the 17 irrelevant features. This per-coordinate adaptivity is what makes Adam so effective for training language models, where gradient magnitudes vary wildly across embedding dimensions.
8. Conclusion
We started with a simple question: can you learn when the data fights back? The answer is a resounding yes.
- Multiplicative Weights achieves optimal O(√(T ln N)) regret for the expert setting — works against any adversary
- Online Gradient Descent extends this to continuous parameter spaces with O(√T) regret
- FTRL unifies both (and more) under one elegant framework where the regularizer determines the algorithm
- Online-to-batch conversion connects online learning to classical ML — SGD's guarantees come from regret bounds
- AdaGrad shows adaptive algorithms can exploit problem structure without knowing the geometry in advance
Every time you train a neural network with Adam, you're running an online learning algorithm. Every time you average checkpoints, you're performing an online-to-batch conversion. The theory we built from scratch in this post is the foundation underneath your daily ML practice — whether you knew it or not.
References & Further Reading
- Shai Shalev-Shwartz — Online Learning and Online Convex Optimization (2012) — Comprehensive survey connecting theory to practice
- Nicolò Cesa-Bianchi & Gábor Lugosi — Prediction, Learning, and Games (2006) — The definitive textbook on online prediction
- Elad Hazan — Introduction to Online Convex Optimization (2016) — Accessible monograph with modern perspective
- Yoav Freund & Robert Schapire — A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting (1997) — The paper that launched multiplicative weights in ML
- Martin Zinkevich — Online Convex Programming (2003) — The foundational OGD paper
- H. Brendan McMahan — Follow-the-Regularized-Leader and Mirror Descent (2011) — Unifies FTRL with mirror descent
- John Duchi, Elad Hazan & Yoram Singer — Adaptive Subgradient Methods (2011) — The original AdaGrad paper