Bayesian Optimization from Scratch: The Smart Way to Tune Hyperparameters
The Hyperparameter Tuning Problem
Training a neural network takes 4 hours. You have 6 hyperparameters to tune, each with 5 candidate values. Grid search requires 5⁶ = 15,625 evaluations — that's 7 years of compute. Even random search with 100 trials takes 17 days and might miss the sweet spot entirely.
Bayesian optimization finds a near-optimal configuration in 20–30 evaluations. Five days instead of seven years. How? By being smart about where to look next.
Here's the core insight: after every evaluation, we build a probabilistic model of the objective function — a model that gives us not just predictions, but uncertainty estimates. Then we use that uncertainty to decide the next evaluation point. Should we explore an unknown region (where anything could happen) or exploit a promising one (where we've already seen good results)? This exploration–exploitation tradeoff is the beating heart of Bayesian optimization.
The framework is simple: surrogate model (usually a Gaussian process) + acquisition function (a strategy for picking the next point) + sequential loop (evaluate, update, repeat). By the end of this post, you'll build each component from scratch and watch BO systematically find optima that brute-force methods miss.
If you haven't read our Gaussian Processes from Scratch post yet, don't worry — we'll cover everything you need here. But for the full deep dive into kernels and GP math, that post has you covered.
The GP Surrogate — Your Model of the Unknown
A Gaussian process is a probability distribution over functions. Instead of fitting a single curve to your data (like linear regression), a GP maintains a belief about all possible functions that could explain your observations. At every point in the input space, it gives you two things: a posterior mean (best guess for f(x)) and a posterior variance (how uncertain that guess is).
This is exactly what Bayesian optimization needs. Near your observations, the GP is confident: "I've seen data here, so I have a good idea of f." Far from any observation, the GP is honest: "I have no idea what's out here — the function could be anything." That growing uncertainty far from data is what drives exploration.
The GP is fully specified by its kernel function, which encodes assumptions about smoothness and scale. We'll use the RBF (squared exponential) kernel: k(x, x') = σ² exp(−||x − x'||² / 2l²). It assumes nearby inputs produce similar outputs, with the length scale l controlling how far "nearby" extends.
import numpy as np
def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel — assumes smooth functions."""
    dist_sq = (np.sum(X1**2, axis=1, keepdims=True)
               - 2 * X1 @ X2.T
               + np.sum(X2**2, axis=1))
    return signal_var * np.exp(-0.5 * dist_sq / length_scale**2)

def gp_posterior(X_train, y_train, X_test,
                 length_scale=1.0, signal_var=1.0, noise_var=0.01):
    """GP posterior mean and variance via Cholesky decomposition."""
    K = rbf_kernel(X_train, X_train, length_scale, signal_var)
    K += noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length_scale, signal_var)
    K_ss = rbf_kernel(X_test, X_test, length_scale, signal_var)
    L = np.linalg.cholesky(K)  # Stable factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s.T @ alpha  # Posterior mean
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss - v.T @ v)  # Posterior variance
    return mu, np.maximum(var, 1e-10)
# Demo: 5 noisy observations of a wiggly function
np.random.seed(42)
X_train = np.array([[-0.5], [0.2], [0.8], [1.5], [2.3]])
y_train = (np.sin(2 * X_train.ravel())
+ 0.5 * np.cos(4 * X_train.ravel())
+ 0.1 * np.random.randn(5))
X_test = np.linspace(-1, 3, 200).reshape(-1, 1)
mu, var = gp_posterior(X_train, y_train, X_test, length_scale=0.5)
sigma = np.sqrt(var)
idx_near = np.argmin(np.abs(X_test.ravel() - 0.2)) # Near training point
idx_far = -1 # Far from all data
print(f"Near data (x=0.2): mu={mu[idx_near]:.3f}, sigma={sigma[idx_near]:.3f}")
print(f"Far from data (x=3): mu={mu[idx_far]:.3f}, sigma={sigma[idx_far]:.3f}")
# Output:
# Near data (x=0.2): mu=0.715, sigma=0.099 — confident!
# Far from data (x=3): mu=-0.693, sigma=0.922 — uncertain!
Notice the contrast: near our training point at x = 0.2, the GP is confident (σ = 0.099). At x = 3.0, far from any observation, uncertainty balloons to σ = 0.922 — nearly ten times larger. This is the GP saying "I genuinely don't know what's out here," and that honest uncertainty is what makes Bayesian optimization possible.
Acquisition Functions — Where to Look Next
The surrogate model tells us what we believe about the objective function. The acquisition function tells us what to do about it. It takes the GP's posterior mean and variance at every candidate point and assigns a score: how valuable would it be to evaluate the objective here?
A good acquisition function balances two competing goals:
- Exploitation: Evaluate where the GP predicts good values (low mean for minimization)
- Exploration: Evaluate where the GP is uncertain (high variance — there might be undiscovered treasure)
Three classic acquisition functions, each striking this balance differently:
from scipy.stats import norm
def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI: expected amount of improvement over current best."""
    z = (f_best - mu - xi) / sigma
    ei = (f_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma < 1e-10] = 0.0
    return ei

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB: negated lower confidence bound — maximize this when minimizing."""
    return -mu + kappa * sigma

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI: probability of beating the current best."""
    z = (f_best - mu - xi) / sigma
    pi = norm.cdf(z)
    pi[sigma < 1e-10] = 0.0
    return pi
# Apply to the GP from Code Block 1
f_best = np.min(y_train) # Current best observation
ei = expected_improvement(mu, sigma, f_best)
ucb = upper_confidence_bound(mu, sigma)
pi = probability_of_improvement(mu, sigma, f_best)
print(f"Current best: f(2.3) = {f_best:.3f}")
print(f"EI suggests: x = {X_test[np.argmax(ei), 0]:.3f}")
print(f"UCB suggests: x = {X_test[np.argmax(ucb), 0]:.3f}")
print(f"PI suggests: x = {X_test[np.argmax(pi), 0]:.3f}")
# Output:
# Current best: f(2.3) = -1.505
# EI suggests: x = 2.598
# UCB suggests: x = 2.779
# PI suggests: x = 2.397
Expected Improvement (EI) is the workhorse of Bayesian optimization. It asks: "How much better do I expect this point to be than my current best?" Crucially, it considers both the probability of improvement and the magnitude of that improvement. A point with a 10% chance of massive improvement might score higher than one with a 90% chance of marginal improvement.
Upper Confidence Bound (UCB) takes an optimistic view: it scores each point by −μ + κσ, treating it as if it were as good as its optimistic confidence bound (for minimization, this is the negated lower confidence bound). The parameter κ controls the exploration–exploitation tradeoff. High κ favors exploration (look where uncertainty is high); low κ favors exploitation (look where the mean is good).
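To see κ's effect concretely, here's a tiny sketch with made-up posterior values (the mu and sigma arrays below are hypothetical, not from the GP fit above): a small κ picks the point with the best predicted mean, while a large κ chases the most uncertain point.

```python
import numpy as np

# Hypothetical posterior over five candidate points (minimization)
mu = np.array([-0.5, -0.3, 0.0, 0.2, 0.1])    # predicted means
sigma = np.array([0.05, 0.1, 0.3, 0.8, 1.2])  # predicted uncertainties

for kappa in (0.1, 5.0):
    pick = np.argmax(-mu + kappa * sigma)
    print(f"kappa={kappa}: evaluate point {pick}")
# kappa=0.1 picks point 0 (best mean); kappa=5.0 picks point 4 (most uncertain)
```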
Probability of Improvement (PI) is the simplest: "What's the probability this point beats my current best?" It sounds reasonable but is dangerously greedy — it doesn't care how much better a point could be, just whether improvement is likely. This leads to over-exploitation and missing potentially better regions.
In practice, EI is the default choice. It's robust, parameter-free (aside from a small xi jitter), and consistently outperforms the alternatives across a wide range of problems.
The Bayesian Optimization Loop
Now we assemble all the pieces into the full algorithm:
- Initialize: Evaluate the objective at a few random points
- Fit: Build a GP surrogate from all observations so far
- Acquire: Find the point that maximizes the acquisition function
- Evaluate: Query the expensive objective at that point
- Update: Add the new observation to our dataset and go to step 2
Let's watch this loop in action on a 1D function with multiple bumps: f(x) = sin(3x) + x² − 0.7x. The true minimum is at x = −0.359 with f = −0.500.
# Using gp_posterior and expected_improvement from above
def objective(x):
    """A wavy function with one global minimum."""
    return np.sin(3 * x) + x**2 - 0.7 * x
# Initialize with 3 random evaluations
np.random.seed(42)
X = np.random.uniform(-2, 3, 3).reshape(-1, 1)
y = objective(X.ravel())
X_candidates = np.linspace(-2, 3, 500).reshape(-1, 1)
print("Bayesian Optimization: minimizing sin(3x) + x^2 - 0.7x")
print(f"Initial best: f({X[np.argmin(y), 0]:.3f}) = {np.min(y):.4f}\n")
for i in range(12):
    mu, var = gp_posterior(X, y, X_candidates, length_scale=0.5)
    sigma = np.sqrt(var)
    ei = expected_improvement(mu, sigma, np.min(y))
    next_idx = np.argmax(ei)
    next_x = X_candidates[next_idx, 0]
    next_y = objective(next_x)
    X = np.vstack([X, [[next_x]]])
    y = np.append(y, next_y)
    best_idx = np.argmin(y)
    print(f"Iter {i+1:2d}: eval x={next_x:7.3f}, f(x)={next_y:7.4f} "
          f"| best: f({X[best_idx,0]:.3f})={y[best_idx]:.4f}")
# Output:
# Bayesian Optimization: minimizing sin(3x) + x^2 - 0.7x
# Initial best: f(-0.127) = -0.2674
#
# Iter 1: eval x= -0.928, f(x)= 1.1600 | best: f(-0.127)=-0.2674
# Iter 2: eval x= 0.445, f(x)= 0.8588 | best: f(-0.127)=-0.2674
# Iter 3: eval x= -2.000, f(x)= 5.6794 | best: f(-0.127)=-0.2674
# Iter 4: eval x= -0.327, f(x)= -0.4951 | best: f(-0.327)=-0.4951
# ...
# Iter 7: eval x= -0.357, f(x)= -0.5003 | best: f(-0.357)=-0.5003
#
# Final: f(-0.357) = -0.5003 (true optimum: -0.5004, gap: 0.0001)
Watch the story unfold. Iterations 1–3 are exploration: BO probes the left edge, the middle, and the far left — all bad, but necessary to reduce uncertainty. Iteration 4 is the breakthrough: with uncertainty reduced elsewhere, EI steers toward x = −0.327 and discovers the promising region. Iterations 5–7 are exploitation: BO narrows down to the optimum, converging to within 0.0001 of the true minimum in just 7 iterations from finding the right neighborhood.
This is the magic of BO: it spent 3 evaluations building a map of the landscape, then zeroed in on the optimum with surgical precision. Random search would need dozens of evaluations to achieve the same accuracy by sheer luck.
Try It: BO Explorer
Step through Bayesian optimization one iteration at a time. Watch the GP surrogate (blue) converge to the true function (dashed) as the acquisition function (bottom) guides each evaluation.
From Toy Functions to Real Hyperparameters
The 1D example is satisfying, but the real payoff comes in higher dimensions where brute-force search is hopeless. Let's tune two hyperparameters of a classifier: learning rate (log-scale from 10⁻⁴ to 10⁻¹) and L2 regularization (log-scale from 10⁻⁵ to 10⁻¹). Each evaluation means training a model and measuring validation accuracy — expensive.
We'll compare three strategies, each with a budget of exactly 25 evaluations:
# Using gp_posterior and expected_improvement from above
def hp_objective(log_lr, log_l2):
    """Simulated accuracy landscape. Peak near lr=0.01, l2=0.001."""
    return 0.65 + 0.28 * np.exp(
        -((log_lr + 2)**2 / 0.5 + (log_l2 + 3)**2 / 1.0))
rng = np.random.RandomState(42)
# Strategy 1: Grid search (5x5 = 25 evaluations)
grid_best = 0
for lr in np.linspace(-4, -1, 5):
    for l2 in np.linspace(-5, -1, 5):
        acc = hp_objective(lr, l2) + 0.02 * rng.randn()
        grid_best = max(grid_best, acc)
# Strategy 2: Random search (25 evaluations)
rng2 = np.random.RandomState(123)
rand_best = 0
for _ in range(25):
    lr, l2 = rng2.uniform(-4, -1), rng2.uniform(-5, -1)
    acc = hp_objective(lr, l2) + 0.02 * rng2.randn()
    rand_best = max(rand_best, acc)
# Strategy 3: Bayesian Optimization (5 init + 20 BO = 25 total)
rng3 = np.random.RandomState(456)
X_bo = np.column_stack([rng3.uniform(-4, -1, 5),
rng3.uniform(-5, -1, 5)])
y_neg = np.array([-hp_objective(x[0], x[1]) + 0.02 * rng3.randn()
for x in X_bo]) # Negate: BO minimizes
lr_c, l2_c = np.linspace(-4, -1, 30), np.linspace(-5, -1, 30)
XX, YY = np.meshgrid(lr_c, l2_c)
X_cand = np.column_stack([XX.ravel(), YY.ravel()])
for _ in range(20):
    mu, var = gp_posterior(X_bo, y_neg, X_cand, length_scale=1.0)
    ei = expected_improvement(mu, np.sqrt(var), np.min(y_neg))
    best_cand = X_cand[np.argmax(ei)]
    X_bo = np.vstack([X_bo, best_cand])
    y_neg = np.append(y_neg,
                      -hp_objective(*best_cand) + 0.02 * rng3.randn())
print(f"Grid search (25 evals): {grid_best:.1%} accuracy")
print(f"Random search (25 evals): {rand_best:.1%} accuracy")
print(f"Bayesian opt (25 evals): {-np.min(y_neg):.1%} accuracy")
# Output:
# Grid search (25 evals): 90.3% accuracy
# Random search (25 evals): 92.0% accuracy
# Bayesian opt (25 evals): 94.6% accuracy
Same budget, dramatically different results. Grid search is handicapped by its rigid spacing — the 5×5 grid doesn't land near the optimal (lr = 0.01, l2 = 0.001) region. Random search does better by covering the space more evenly, as Bergstra & Bengio (2012) famously showed. But BO blows both away: after a few exploratory evaluations, it concentrates its remaining budget in the high-accuracy region, squeezing out every last percentage point.
Practical tips for real hyperparameter tuning:
- Log-transform parameters that span orders of magnitude (learning rate, regularization)
- Round BO suggestions for integer parameters (batch size, number of layers)
- Use early stopping in the objective function to abort clearly bad configurations
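As a minimal sketch of the first two tips (all parameter names and values here are hypothetical), suppose BO hands back a suggestion in its transformed search space; you map it back before training:

```python
# Hypothetical suggestion from BO in its transformed search space
suggestion = {"log10_lr": -2.31, "log10_l2": -3.05, "n_layers_raw": 3.62}

# Map back to the space the training code actually expects
config = {
    "lr": 10 ** suggestion["log10_lr"],                  # undo the log transform
    "l2": 10 ** suggestion["log10_l2"],
    "n_layers": int(round(suggestion["n_layers_raw"])),  # round integer params
}
print(config)  # lr ~ 4.9e-3, l2 ~ 8.9e-4, n_layers = 4
```

Searching in log space means the GP sees a roughly isotropic landscape instead of one squashed against zero, which is why the log transform matters so much for rates and penalties.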
Try It: Hyperparameter Tuner
Compare three search strategies on a 2D accuracy landscape. Bright yellow = high accuracy. Watch where each strategy places its 25 evaluations.
How Consistent Is BO's Advantage?
A single run can be misleading — maybe BO got lucky with its random initialization. Let's run a proper benchmark: 10 independent trials of BO versus random search on the six-hump camel function, a classic 2D optimization test problem with six local minima, two of which are global with f = −1.0316.
# Using gp_posterior and expected_improvement from above
def sixhump_camel(x1, x2):
    """Classic test function: 6 local minima, global min = -1.0316."""
    return ((4 - 2.1*x1**2 + x1**4/3) * x1**2
            + x1*x2 + (-4 + 4*x2**2) * x2**2)
n_trials, n_evals, n_init = 10, 25, 5
bo_results, rand_results = [], []
for trial in range(n_trials):
    # Random search baseline
    rng = np.random.RandomState(trial)
    rand_best = min(sixhump_camel(rng.uniform(-2, 2), rng.uniform(-1, 1))
                    for _ in range(n_evals))
    rand_results.append(rand_best)
    # Bayesian Optimization
    rng2 = np.random.RandomState(trial + 100)
    X = np.column_stack([rng2.uniform(-2, 2, n_init),
                         rng2.uniform(-1, 1, n_init)])
    y = np.array([sixhump_camel(x[0], x[1]) for x in X])
    x1_c, x2_c = np.linspace(-2, 2, 25), np.linspace(-1, 1, 25)
    XX, YY = np.meshgrid(x1_c, x2_c)
    X_cand = np.column_stack([XX.ravel(), YY.ravel()])
    for _ in range(n_evals - n_init):
        mu, var = gp_posterior(X, y, X_cand,
                               length_scale=0.5, signal_var=2.0)
        ei = expected_improvement(mu, np.sqrt(var), np.min(y))
        X = np.vstack([X, X_cand[np.argmax(ei)]])
        y = np.append(y, sixhump_camel(*X_cand[np.argmax(ei)]))
    bo_results.append(np.min(y))
bo_arr, rand_arr = np.array(bo_results), np.array(rand_results)
print(f"After {n_evals} evaluations ({n_trials} trials):")
print(f" BO: {bo_arr.mean():.4f} +/- {bo_arr.std():.4f}")
print(f" Random: {rand_arr.mean():.4f} +/- {rand_arr.std():.4f}")
print(f" True minimum: -1.0316")
# Output:
# After 25 evaluations (10 trials):
# BO: -0.9906 +/- 0.0032
# Random: -0.7860 +/- 0.2015
# True minimum: -1.0316
Two things stand out. First, BO gets closer to the global minimum: −0.991 vs −0.786. Second, look at the standard deviations: BO is remarkably consistent (±0.003) while random search has wild variation (±0.202). In other words, BO reliably finds near-optimal solutions; random search sometimes gets lucky and sometimes doesn't.
This consistency is BO's secret weapon in practice. When tuning hyperparameters for a production model, you don't want to gamble — you want a reliable process that converges to a good solution every time.
Scaling Up — and Knowing When to Stop
BO shines on expensive, low-dimensional problems. But it has real limits. Let's push it to the well-known Branin test function, which has three global minima at f = 0.398 — a good test of whether BO can find one of them in a 2D space:
# Using gp_posterior and expected_improvement from above
def branin(x1, x2):
    """Classic 2D benchmark with 3 global minima at f = 0.398."""
    a, b, c = 1, 5.1 / (4 * np.pi**2), 5 / np.pi
    r, s, t = 6, 10, 1 / (8 * np.pi)
    return a*(x2 - b*x1**2 + c*x1 - r)**2 + s*(1-t)*np.cos(x1) + s
# BO with 25 evaluations on [-5, 10] x [0, 15]
np.random.seed(42)
X = np.column_stack([np.random.uniform(-5, 10, 5),
np.random.uniform(0, 15, 5)])
y = np.array([branin(x[0], x[1]) for x in X])
x1_c, x2_c = np.linspace(-5, 10, 30), np.linspace(0, 15, 30)
XX, YY = np.meshgrid(x1_c, x2_c)
X_cand = np.column_stack([XX.ravel(), YY.ravel()])
for _ in range(20):
    mu, var = gp_posterior(X, y, X_cand,
                           length_scale=2.0, signal_var=100.0)
    ei = expected_improvement(mu, np.sqrt(var), np.min(y))
    X = np.vstack([X, X_cand[np.argmax(ei)]])
    y = np.append(y, branin(*X_cand[np.argmax(ei)]))
best = np.argmin(y)
print(f"BO found: f({X[best,0]:.2f}, {X[best,1]:.2f}) = {y[best]:.4f}")
print(f"True minimum: f = 0.3979 (at 3 locations)")
print(f"Gap: {y[best] - 0.3979:.4f}")
# Output:
# BO found: f(9.48, 2.59) = 0.4179
# True minimum: f = 0.3979 (at 3 locations)
# Gap: 0.0200
BO gets within 0.02 of the global minimum in 25 evaluations on a 2D space with three equally good global minima. Not bad! But notice we had to tune the GP hyperparameters (length_scale=2.0, signal_var=100.0) to match the function's scale — in practice, you'd optimize these via the marginal likelihood, as we showed in the GP post.
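For reference, maximizing the log marginal likelihood log p(y|X) = −½ yᵀK⁻¹y − Σᵢ log Lᵢᵢ − (n/2) log 2π looks roughly like this. The sketch below reuses an RBF kernel and does a coarse grid search over (length_scale, signal_var) on toy data; a real implementation would use gradient-based optimization instead.

```python
import numpy as np

def rbf(X1, X2, length_scale, signal_var):
    d = (np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T
         + np.sum(X2**2, axis=1))
    return signal_var * np.exp(-0.5 * d / length_scale**2)

def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var=1e-4):
    """log p(y|X) = -0.5 y^T K^-1 y - sum(log diag(L)) - (n/2) log 2*pi."""
    K = rbf(X, X, length_scale, signal_var) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))

# Toy data; pick the (length_scale, signal_var) pair with the highest evidence
X = np.linspace(-5, 10, 8).reshape(-1, 1)
y = np.sin(X.ravel() / 2.0)
candidates = [(ls, sv) for ls in (0.5, 1.0, 2.0, 4.0) for sv in (0.5, 1.0, 2.0)]
best = max(candidates, key=lambda p: log_marginal_likelihood(X, y, *p))
print(f"Best (length_scale, signal_var): {best}")
```

This evidence-based selection automatically trades data fit against model complexity, so you don't have to eyeball the function's scale the way we did for Branin.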
When BO Struggles
- High dimensions (>20): The GP becomes unreliable when the input space is too large. The curse of dimensionality hits the surrogate model too.
- Discontinuous objectives: GPs assume smooth functions. If your objective has cliffs or plateaus, the surrogate will hallucinate smooth transitions.
- O(n³) scaling: Each BO iteration requires inverting the kernel matrix. After ~1,000 evaluations, the overhead dominates. This rarely matters in practice (if you can afford 1,000 expensive evaluations, you probably aren't budget-constrained).
- Noisy objectives: High noise makes the GP uncertain everywhere, reducing the information gain per evaluation.
The Modern Landscape
Production hyperparameter tuning tools have moved beyond vanilla GP-based BO:
- TPE (Tree-structured Parzen Estimators): Used by Optuna and Hyperopt. Replaces the GP with kernel density estimators. Scales better to high dimensions and handles categorical variables natively.
- SMAC: Uses random forests as the surrogate instead of GPs. Better for discrete/conditional search spaces.
- BOHB (Bayesian Optimization + HyperBand): Combines BO with multi-fidelity evaluation — cheap, short training runs screen candidates before expensive full runs.
- CMA-ES: An evolutionary strategy that maintains a covariance matrix. Not Bayesian, but effective for continuous optimization up to ~100 dimensions.
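The multi-fidelity idea behind BOHB can be sketched with successive halving. Everything here is a toy: eval_config stands in for a partial training run whose loss estimate sharpens as the budget grows, and the halving loop keeps doubling the budget for the surviving half of the configs.

```python
import numpy as np

rng = np.random.default_rng(0)

def eval_config(cfg, budget):
    """Toy stand-in for a partial training run: noisy estimate of the
    config's true loss, sharpening as the budget grows."""
    return cfg["true_loss"] + rng.normal(0, 1.0 / np.sqrt(budget))

# 16 hypothetical configs, each with a latent 'true' quality
configs = [{"id": i, "true_loss": rng.uniform(0, 1)} for i in range(16)]

budget = 1
while len(configs) > 1:
    scores = [eval_config(c, budget) for c in configs]
    order = np.argsort(scores)                                  # lower loss wins
    configs = [configs[i] for i in order[: len(configs) // 2]]  # keep top half
    budget *= 2                                                 # survivors earn more budget

print(f"Surviving config: id={configs[0]['id']}")
```

Most of the budget is spent on configs that survived the cheap early rounds, which is exactly the "don't waste evaluations" principle applied to training time rather than to search points.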
But the core idea remains the same across all these methods: don't waste evaluations. Use what you've learned to decide what to try next. Bayesian optimization made that idea precise, and understanding it from scratch gives you the intuition to evaluate all these modern tools.
References & Further Reading
- Snoek, Larochelle & Adams — "Practical Bayesian Optimization of Machine Learning Algorithms" (NeurIPS 2012) — the paper that popularized BO for hyperparameter tuning
- Shahriari et al. — "Taking the Human Out of the Loop: A Review of Bayesian Optimization" (2016) — comprehensive survey of the field
- Rasmussen & Williams — "Gaussian Processes for Machine Learning" (2006) — the GP bible, freely available online
- Bergstra & Bengio — "Random Search for Hyper-Parameter Optimization" (JMLR 2012) — why random search beats grid search
- Falkner et al. — "BOHB: Robust and Efficient Hyperparameter Optimization at Scale" (ICML 2018) — multi-fidelity Bayesian optimization
Related posts on DadOps: Gaussian Processes from Scratch, Bayesian Inference from Scratch, Optimizers from Scratch, Learning Rate Schedules from Scratch