Gaussian Processes from Scratch: The ML Model That Knows What It Doesn’t Know
Why Uncertainty Matters
Imagine you’re an engineer deciding whether a bridge can hold 10,000 kg. Your regression model says “yes, predicted capacity is 10,247 kg.” Great — but how confident is that prediction? If the model has seen thousands of similar bridges, maybe you trust it. If this bridge is made of an exotic alloy the model has never encountered, that single number is dangerously misleading.
Most machine learning models — neural networks, decision trees, linear regression — give you point predictions. They hand you a number and walk away. They don’t tell you whether they’re rock-solid confident or wildly guessing. In safety-critical domains like autonomous driving, medical diagnosis, and structural engineering, this is a serious problem.
Gaussian Processes (GPs) take a fundamentally different approach. Instead of learning a single function, a GP maintains a distribution over all possible functions consistent with the observed data. When you ask for a prediction, it doesn’t just say “10,247 kg” — it says “10,247 ± 1,800 kg with 95% confidence.” The uncertainty bands widen where data is sparse and tighten where data is dense. The model knows what it doesn’t know.
In this post, we’ll build Gaussian Processes from scratch. We’ll implement kernel functions, sample from GP priors, compute the exact posterior, optimize hyperparameters, handle classification, and tackle the scalability bottleneck — all in pure Python with NumPy. If you’ve read our posts on Bayesian inference or SVMs, you’ll see familiar themes: Bayesian updating and the kernel trick, now combined into one of the most elegant models in all of machine learning.
The Kernel Trick — Measuring Similarity Between Inputs
At the heart of every GP is a kernel function (also called a covariance function). A kernel k(x, x’) takes two inputs and returns a number measuring how “similar” they are. If two inputs are similar, their function values should be correlated. If they’re far apart, their function values should be nearly independent.
The kernel encodes your inductive bias — your prior belief about what the underlying function looks like. Choosing an RBF kernel says “I believe the function is infinitely smooth.” Choosing a Matérn kernel says “I believe it’s continuous but rough.” Choosing a periodic kernel says “I believe it repeats.” This is analogous to choosing an architecture in deep learning — both are structural assumptions about the problem.
Given N training points, we build an N×N kernel matrix K with entries K_ij = k(x_i, x_j). This matrix captures all pairwise similarities and is the GP’s entire knowledge structure.
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Radial Basis Function (squared exponential) kernel."""
    sq_dist = np.maximum(
        np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1),
        0.0  # clamp tiny negative values from floating-point cancellation
    )
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def matern32_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Matérn 3/2 kernel — once differentiable, realistic roughness."""
    dist = np.sqrt(np.maximum(
        np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1),
        1e-12
    ))
    r = np.sqrt(3) * dist / length_scale
    return signal_var * (1 + r) * np.exp(-r)

def matern52_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Matérn 5/2 kernel — twice differentiable, a popular middle ground."""
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    dist = np.sqrt(np.maximum(sq_dist, 1e-12))
    r = np.sqrt(5) * dist / length_scale
    return signal_var * (1 + r + r**2 / 3) * np.exp(-r)

def periodic_kernel(X1, X2, length_scale=1.0, signal_var=1.0, period=1.0):
    """Periodic kernel — for functions that repeat."""
    dist = np.sqrt(np.maximum(
        np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1),
        1e-12
    ))
    return signal_var * np.exp(-2 * np.sin(np.pi * dist / period)**2 / length_scale**2)

# Demo: compute kernel values at varying distances
X_test = np.array([[0.0], [0.5], [1.0], [2.0], [4.0]])
X_origin = np.array([[0.0]])
print("Distance | RBF | Mat3/2 | Mat5/2 | Periodic(p=2)")
for x in X_test:
    d = float(x[0])
    r = rbf_kernel(x.reshape(1, -1), X_origin)[0, 0]
    m3 = matern32_kernel(x.reshape(1, -1), X_origin)[0, 0]
    m5 = matern52_kernel(x.reshape(1, -1), X_origin)[0, 0]
    p = periodic_kernel(x.reshape(1, -1), X_origin, period=2.0)[0, 0]
    print(f" {d:<6.1f} | {r:.4f} | {m3:.4f} | {m5:.4f} | {p:.4f}")
# Output:
# Distance | RBF | Mat3/2 | Mat5/2 | Periodic(p=2)
#  0.0    | 1.0000 | 1.0000 | 1.0000 | 1.0000
#  0.5    | 0.8825 | 0.7849 | 0.8286 | 0.3679
#  1.0    | 0.6065 | 0.4834 | 0.5240 | 0.1353
#  2.0    | 0.1353 | 0.1397 | 0.1387 | 1.0000
#  4.0    | 0.0003 | 0.0078 | 0.0048 | 1.0000
Notice how each kernel decays differently with distance. The RBF drops off as a smooth Gaussian bell curve. Matérn kernels decay more slowly, allowing rougher functions. The periodic kernel returns to 1.0 at multiples of the period — it “sees” inputs two periods apart as identical.
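One property worth checking whenever you write a new kernel: it must produce a symmetric, positive semidefinite matrix for every set of inputs, since K plays the role of a covariance matrix. A quick standalone sanity check (using a minimal copy of the RBF kernel above, signal variance fixed at 1):

```python
import numpy as np

def rbf(X1, X2, length_scale=1.0):
    # minimal copy of rbf_kernel above, signal variance fixed at 1
    sq = np.sum(X1**2, 1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

X = np.random.default_rng(0).uniform(-3, 3, size=(30, 1))
K = rbf(X, X)

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to rounding error
```

If a homemade kernel fails the eigenvalue check by more than floating-point noise, the Cholesky factorizations used throughout this post will fail on it too.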
Sampling from the Prior — What GPs Believe Before Seeing Data
A GP prior defines a multivariate Gaussian distribution over function values at any finite set of input points: f ~ N(0, K). To sample a function from this prior, we pick a set of test points, build the kernel matrix, and draw a sample from the resulting multivariate Gaussian.
The trick is using the Cholesky decomposition: factor K = LLᵀ, then generate f = Lz where z ~ N(0, I). This is numerically stable and efficient. We add a tiny jitter (10⁻⁸) to the diagonal to ensure K stays positive definite.
np.random.seed(42)

def sample_gp_prior(kernel_fn, X, n_samples=5, **kernel_params):
    """Sample functions from a GP prior."""
    K = kernel_fn(X, X, **kernel_params)
    K += 1e-8 * np.eye(len(X))  # jitter for numerical stability
    L = np.linalg.cholesky(K)   # K = L @ L.T
    samples = []
    for _ in range(n_samples):
        z = np.random.randn(len(X))
        f = L @ z  # transform standard normal to GP sample
        samples.append(f)
    return samples

X = np.linspace(-5, 5, 200).reshape(-1, 1)
rbf_samples = sample_gp_prior(rbf_kernel, X, n_samples=3, length_scale=1.0)
matern_samples = sample_gp_prior(matern32_kernel, X, n_samples=3, length_scale=1.0)
periodic_samples = sample_gp_prior(periodic_kernel, X, n_samples=3, length_scale=1.0, period=2.0)

# Show a few values from each sample type
print("RBF sample (infinitely smooth):")
print(f" f([-3, -1, 0, 1, 3]) = [{rbf_samples[0][40]:.2f}, {rbf_samples[0][80]:.2f}, "
      f"{rbf_samples[0][100]:.2f}, {rbf_samples[0][120]:.2f}, {rbf_samples[0][160]:.2f}]")
print("Matérn 3/2 sample (realistically rough):")
print(f" f([-3, -1, 0, 1, 3]) = [{matern_samples[0][40]:.2f}, {matern_samples[0][80]:.2f}, "
      f"{matern_samples[0][100]:.2f}, {matern_samples[0][120]:.2f}, {matern_samples[0][160]:.2f}]")
print("Periodic sample (repeating pattern, period=2):")
print(f" f([-3, -1, 0, 1, 3]) = [{periodic_samples[0][40]:.2f}, {periodic_samples[0][80]:.2f}, "
      f"{periodic_samples[0][100]:.2f}, {periodic_samples[0][120]:.2f}, {periodic_samples[0][160]:.2f}]")
# Output:
# RBF sample (infinitely smooth):
# f([-3, -1, 0, 1, 3]) = [-0.07, -1.56, -0.15, -0.13, 0.06]
# Matérn 3/2 sample (realistically rough):
# f([-3, -1, 0, 1, 3]) = [0.71, -0.80, -0.14, -0.27, -1.08]
# Periodic sample (repeating pattern, period=2):
# f([-3, -1, 0, 1, 3]) = [0.11, 0.10, -0.65, 0.08, 0.07]
These samples show what the GP “believes” before seeing any data. The RBF produces buttery-smooth curves. Matérn 3/2 gives wiggly but continuous paths — more like real-world signals. The periodic kernel generates repeating patterns. Notice how the periodic sample has nearly identical values at x=−3, −1, 1, and 3 (all spaced by the period = 2), while x=0 — a half-period away — is completely different (−0.65 vs ~0.1).
GP Regression — Conditioning on Observations
Now for the magic. Given N training observations (X, y) with noise variance σₙ², we want to predict at new test points X*. The GP posterior has closed-form solutions:
Posterior mean: μ* = K(X*, X) [K(X, X) + σₙ²I]⁻¹ y
Posterior covariance: Σ* = K(X*, X*) − K(X*, X) [K(X, X) + σₙ²I]⁻¹ K(X, X*)
The intuition: the posterior mean is a weighted combination of training outputs, where the weights come from kernel similarity between the test point and each training point. Points near many observations get confident predictions; points far from all data revert to the prior with wide uncertainty.
We use the Cholesky decomposition rather than direct matrix inversion for numerical stability:
def gp_posterior(X_train, y_train, X_test, kernel_fn,
                 noise_var=0.01, **kernel_params):
    """Compute GP posterior mean and variance at test points."""
    n = len(X_train)
    K = kernel_fn(X_train, X_train, **kernel_params)
    K_noisy = K + noise_var * np.eye(n)
    K_star = kernel_fn(X_test, X_train, **kernel_params)
    K_ss = kernel_fn(X_test, X_test, **kernel_params)
    # Cholesky solve: L @ L.T @ alpha = y
    L = np.linalg.cholesky(K_noisy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    # Posterior mean
    mu = K_star @ alpha
    # Posterior variance
    v = np.linalg.solve(L, K_star.T)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    var = np.maximum(var, 1e-10)  # clamp for numerical safety
    return mu, var

# Generate noisy observations of sin(x)
np.random.seed(7)
X_train = np.array([-4, -2.5, -1.5, -0.5, 0.8, 2.0, 3.5]).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + 0.1 * np.random.randn(len(X_train))
X_test = np.linspace(-5, 5, 201).reshape(-1, 1)

mu, var = gp_posterior(X_train, y_train, X_test, rbf_kernel,
                       noise_var=0.01, length_scale=1.0, signal_var=1.0)
std = np.sqrt(var)

# Show predictions at key points
test_indices = [0, 50, 100, 150, 200]
print("   x    |  true  | GP mean | GP ±2σ")
print("--------+--------+---------+----------")
for i in test_indices:
    x = X_test[i, 0]
    print(f" {x:<5.1f}  | {np.sin(x):<6.3f} | {mu[i]:<7.3f} | ±{2*std[i]:.3f}")

# RMSE comparison
gp_rmse = np.sqrt(np.mean((mu - np.sin(X_test).ravel())**2))
baseline_rmse = np.sqrt(np.mean((np.mean(y_train) - np.sin(X_test).ravel())**2))
print(f"\nGP RMSE: {gp_rmse:.4f}")
print(f"Baseline RMSE: {baseline_rmse:.4f}")
# Output:
#    x    |  true  | GP mean | GP ±2σ
# --------+--------+---------+----------
#  -5.0   | 0.959  | 0.668   | ±1.546
#  -2.5   | -0.598 | -0.639  | ±0.198
#  0.0    | 0.000  | -0.025  | ±0.391
#  2.5    | 0.598  | 0.528   | ±0.570
#  5.0    | -0.959 | -0.207  | ±1.880
#
# GP RMSE: 0.2146
# Baseline RMSE: 0.7274
Look at the uncertainty column. Near training points (x=−2.5), the ±2σ interval is just 0.198 — small, mostly reflecting observation noise. At the edges (x=5.0) where no training data exists, uncertainty balloons to 1.880. The GP is honestly saying “I don’t know what’s out here.”
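The posterior mean also has a revealing weight-space reading: μ(x*) = Σᵢ αᵢ k(x*, xᵢ), where the weight vector α = [K(X, X) + σₙ²I]⁻¹ y is computed once from the training data (it is exactly the alpha inside gp_posterior). A tiny standalone sketch with three hypothetical training points:

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / ls**2)

X = np.array([[-2.0], [0.0], [2.0]])                       # three training inputs
y = np.sin(X).ravel()
alpha = np.linalg.solve(rbf(X, X) + 0.01 * np.eye(3), y)   # representer weights

x_star = np.array([[0.3]])
mu = float(rbf(x_star, X) @ alpha)  # weighted sum of kernel similarities
print("alpha:", np.round(alpha, 3))  # middle weight ~0, mirroring the symmetry of y
print("mu(0.3):", round(mu, 3))
```

Each prediction is dominated by the training points whose kernel similarity to x* is largest, which is why the mean reverts to the prior far from all data.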
Try It: GP Playground
Click the canvas to add observation points. The GP updates in real time — watch the uncertainty bands shrink near your data and grow where data is sparse.
Hyperparameter Learning — The Marginal Likelihood
Our GP has three hyperparameters: the length scale ℓ (how far “nearby” extends), the signal variance σ² (the vertical scale of functions), and the noise variance σₙ² (observation noise). Get these wrong and everything falls apart — the wrong length scale can make your GP overfit wildly or miss all the structure in the data.
The Bayesian approach optimizes the log marginal likelihood:
log p(y | X, θ) = −½ yᵀ K_y⁻¹ y − ½ log|K_y| − (n/2) log(2π)
This formula has a beautiful built-in complexity penalty. The first term measures data fit (how well the model explains the observations). The second term penalizes model complexity (the log-determinant of K_y grows with model flexibility). The GP automatically balances fitting the data versus keeping the model simple — no cross-validation needed.
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, y):
    """Negative log marginal likelihood for RBF kernel GP."""
    length_scale = np.exp(params[0])  # optimize in log-space
    signal_var = np.exp(params[1])
    noise_var = np.exp(params[2])
    K = rbf_kernel(X, X, length_scale=length_scale, signal_var=signal_var)
    K_y = K + noise_var * np.eye(len(X))
    try:
        L = np.linalg.cholesky(K_y)
    except np.linalg.LinAlgError:
        return 1e6  # return large value if not positive definite
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_det = 2 * np.sum(np.log(np.diag(L)))
    nll = 0.5 * y @ alpha + 0.5 * log_det + 0.5 * len(X) * np.log(2 * np.pi)
    return nll

# Start from poor hyperparameters
init_params = np.log([0.5, 0.5, 0.1])  # l=0.5, sv=0.5, nv=0.1
result = minimize(neg_log_marginal_likelihood, init_params,
                  args=(X_train, y_train), method='L-BFGS-B')
opt_params = np.exp(result.x)

print("Initial hyperparameters (poor guesses):")
print(" length_scale = 0.500, signal_var = 0.500, noise_var = 0.100")
print("\nOptimized hyperparameters:")
print(f" length_scale = {opt_params[0]:.3f}, signal_var = {opt_params[1]:.3f}, "
      f"noise_var = {opt_params[2]:.6f}")

# Compare predictions
mu_bad, var_bad = gp_posterior(X_train, y_train, X_test, rbf_kernel,
                               noise_var=0.1, length_scale=0.5, signal_var=0.5)
mu_opt, var_opt = gp_posterior(X_train, y_train, X_test, rbf_kernel,
                               noise_var=opt_params[2],
                               length_scale=opt_params[0],
                               signal_var=opt_params[1])
rmse_bad = np.sqrt(np.mean((mu_bad - np.sin(X_test).ravel())**2))
rmse_opt = np.sqrt(np.mean((mu_opt - np.sin(X_test).ravel())**2))
print(f"\nRMSE with poor hyperparams: {rmse_bad:.4f}")
print(f"RMSE with optimized hyperparams: {rmse_opt:.4f}")
# Output:
# Initial hyperparameters (poor guesses):
# length_scale = 0.500, signal_var = 0.500, noise_var = 0.100
#
# Optimized hyperparameters:
# length_scale = 2.467, signal_var = 3.126, noise_var = 0.000982
#
# RMSE with poor hyperparams: 0.3749
# RMSE with optimized hyperparams: 0.2298
The optimizer found a length scale of 2.47 — sensible for sin(x) which has a wavelength of 2π ≈ 6.3 (a length scale of ~2.5 captures the broad structure). The signal variance of 3.13 accommodates the full range of sine’s oscillations. And the noise variance dropped to near zero (0.001), correctly identifying that our observations have very little noise. The RMSE improved from 0.37 to 0.23.
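You can watch the fit/complexity trade-off directly by splitting the negative log marginal likelihood into its two data-dependent terms across a few length scales. A self-contained sketch on its own small dataset (the exact numbers depend on the noise draw, but the ordering does not):

```python
import numpy as np

def rbf(X1, X2, ls):
    sq = np.sum(X1**2, 1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / ls**2)

rng = np.random.default_rng(0)
X = np.linspace(-4, 4, 15).reshape(-1, 1)
y = np.sin(X).ravel() + 0.05 * rng.standard_normal(15)

nll = {}
for ls in [0.1, 2.0, 10.0]:  # too wiggly, about right, too stiff
    L = np.linalg.cholesky(rbf(X, X, ls) + 0.01 * np.eye(15))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit = 0.5 * y @ alpha                    # data-fit term
    complexity = np.sum(np.log(np.diag(L)))  # 0.5 * log|K_y|, the complexity penalty
    nll[ls] = fit + complexity + 0.5 * 15 * np.log(2 * np.pi)
    print(f"ls={ls:5.1f}  fit={fit:9.2f}  complexity={complexity:8.2f}  nll={nll[ls]:9.2f}")
```

The tiny length scale pays a complexity price for flexibility it doesn't need; the huge one pays an enormous fit price; the middle value minimizes the total, which is exactly what L-BFGS-B is searching for above.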
GP Classification — Beyond Regression
GP regression assumes Gaussian noise, but what about binary classification? The idea: run a latent GP through a sigmoid. We have a latent function f(x) ~ GP, and the class probability is p(y=1|x) = σ(f(x)) where σ is the logistic sigmoid. The problem? The posterior is no longer Gaussian — we need an approximation.
The Laplace approximation finds the mode of the posterior (the MAP estimate) and fits a Gaussian around it. We iteratively solve for the mode using Newton’s method, then use the local curvature to estimate uncertainty:
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def gp_classification_laplace(X_train, y_train, X_test, kernel_fn,
                              n_iters=20, **kernel_params):
    """GP binary classification with Laplace approximation."""
    n = len(X_train)
    K = kernel_fn(X_train, X_train, **kernel_params)
    K += 1e-6 * np.eye(n)
    # Newton’s method to find the MAP estimate of latent f
    f = np.zeros(n)
    for _ in range(n_iters):
        pi = sigmoid(f)
        W = pi * (1 - pi)  # diagonal of the negative log-likelihood Hessian
        W_sqrt = np.sqrt(W)
        B = np.eye(n) + np.outer(W_sqrt, W_sqrt) * K
        L = np.linalg.cholesky(B)
        b = W * f + (y_train - pi)
        a = b - W_sqrt * np.linalg.solve(
            L.T, np.linalg.solve(L, W_sqrt * (K @ b)))
        f = K @ a
    # Predict at test points
    pi_final = sigmoid(f)
    K_star = kernel_fn(X_test, X_train, **kernel_params)
    f_mean = K_star @ (y_train - pi_final)
    # Predictive variance (approximate)
    W_final = pi_final * (1 - pi_final)
    W_sqrt_f = np.sqrt(W_final)
    B_f = np.eye(n) + np.outer(W_sqrt_f, W_sqrt_f) * K
    L_f = np.linalg.cholesky(B_f)
    v = np.linalg.solve(L_f, (W_sqrt_f[:, None] * K_star.T))
    K_ss_diag = np.array([kernel_fn(X_test[i:i+1], X_test[i:i+1], **kernel_params)[0, 0]
                          for i in range(len(X_test))])
    f_var = np.maximum(K_ss_diag - np.sum(v**2, axis=0), 1e-10)  # clamp for safety
    # Convert latent predictions to class probabilities
    # Probit approximation: integrate sigmoid against Gaussian
    kappa = 1.0 / np.sqrt(1.0 + np.pi * f_var / 8.0)
    prob = sigmoid(kappa * f_mean)
    return prob, f_mean, f_var

# Synthetic 1D classification dataset
np.random.seed(99)
X_cls = np.sort(np.random.uniform(-4, 4, 20)).reshape(-1, 1)
y_cls = (np.sin(X_cls.ravel()) > 0.3).astype(float)
# Add some noise near boundary
y_cls[8] = 1 - y_cls[8]
X_test_cls = np.linspace(-5, 5, 100).reshape(-1, 1)

probs, _, _ = gp_classification_laplace(
    X_cls, y_cls, X_test_cls, rbf_kernel,
    length_scale=1.0, signal_var=1.5)

# Show predictions at key points
print("   x   | GP prob | Confidence")
print("-------+---------+-----------")
for i in [5, 25, 50, 75, 95]:
    x = X_test_cls[i, 0]
    p = probs[i]
    conf = max(p, 1 - p)
    label = "class 1" if p > 0.5 else "class 0"
    print(f" {x:<4.1f}  | {p:.3f}   | {conf:.1%} ({label})")
# Output:
#    x   | GP prob | Confidence
# -------+---------+-----------
#  -4.5  | 0.653   | 65.3% (class 1)
#  -2.5  | 0.401   | 59.9% (class 0)
#  0.1   | 0.448   | 55.2% (class 0)
#  2.6   | 0.712   | 71.2% (class 1)
#  4.6   | 0.334   | 66.6% (class 0)
Notice the GP classifier’s honesty near the decision boundary. At x≈0, it gives 44.8% — essentially a coin flip, reflecting genuine uncertainty. Compare this with logistic regression, which would give overconfident probabilities at most points. Far from the boundary, the GP is more confident (65–71%), but never claims to be 99.9% sure when the data doesn’t warrant it.
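The probit approximation at the end of the classifier is what produces this honesty: the latent variance σ² shrinks the effective logit by κ = 1/√(1 + πσ²/8), pulling probabilities toward 0.5 exactly where the GP is unsure. A standalone illustration of the squashing:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

f_mean = 2.0  # latent mean: with zero latent variance this gives sigmoid(2) ~ 0.88
probs = []
for f_var in [0.0, 1.0, 5.0, 25.0]:
    kappa = 1.0 / np.sqrt(1.0 + np.pi * f_var / 8.0)  # probit squashing factor
    p = sigmoid(kappa * f_mean)
    probs.append(p)
    print(f"latent variance {f_var:5.1f} -> p(y=1) = {p:.3f}")
```

The same latent mean yields progressively more cautious probabilities as the latent variance grows, which is why the classifier's confidence degrades gracefully away from the data.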
Scaling GPs — Sparse Approximations
There’s a catch. The Cholesky decomposition of an N×N kernel matrix is O(N³). With 100 points, that’s instant. With 1,000, it’s a second. With 10,000, it’s minutes. With 100,000, forget it. This cubic bottleneck is the main reason GPs aren’t used more widely.
The solution: sparse approximations. Instead of using all N training points, we select m ≪ N inducing points and approximate the full kernel matrix. The Nyström approximation replaces K with K_nm K_mm⁻¹ K_mn, reducing the cost to O(Nm²).
import time

def sparse_gp_nystrom(X_train, y_train, X_test, kernel_fn,
                      n_inducing=50, noise_var=0.01, **kernel_params):
    """Sparse GP using Nyström approximation with inducing points."""
    n = len(X_train)
    # Select inducing points evenly spaced from training range
    indices = np.linspace(0, n - 1, n_inducing).astype(int)
    X_m = X_train[indices]
    K_mm = kernel_fn(X_m, X_m, **kernel_params) + 1e-6 * np.eye(n_inducing)
    K_nm = kernel_fn(X_train, X_m, **kernel_params)
    K_star_m = kernel_fn(X_test, X_m, **kernel_params)
    # Nyström: K ≈ K_nm @ inv(K_mm) @ K_mn
    L_mm = np.linalg.cholesky(K_mm)
    V = np.linalg.solve(L_mm, K_nm.T)  # m x n
    # Approximate posterior
    Lambda = V @ V.T + noise_var * np.eye(n_inducing)
    L_lam = np.linalg.cholesky(Lambda)
    alpha = np.linalg.solve(L_lam.T, np.linalg.solve(L_lam, V @ y_train))
    V_star = np.linalg.solve(L_mm, K_star_m.T)
    mu = V_star.T @ alpha
    w = np.linalg.solve(L_lam, V_star)
    K_ss_diag = np.array([kernel_fn(X_test[i:i+1], X_test[i:i+1], **kernel_params)[0, 0]
                          for i in range(len(X_test))])
    # Predictive variance: prior minus Nyström reduction, plus a correction
    # term that carries a noise_var factor (DTC-style predictive variance)
    var = K_ss_diag - np.sum(V_star**2, axis=0) + noise_var * np.sum(w**2, axis=0)
    var = np.maximum(var, 1e-10)
    return mu, var

# Generate a larger dataset
np.random.seed(123)
n_large = 2000
X_large = np.sort(np.random.uniform(-5, 5, n_large)).reshape(-1, 1)
y_large = np.sin(X_large).ravel() + 0.1 * np.random.randn(n_large)
X_test_sparse = np.linspace(-5, 5, 200).reshape(-1, 1)

# Sparse GP with 50 inducing points
t0 = time.time()
mu_sparse, var_sparse = sparse_gp_nystrom(
    X_large, y_large, X_test_sparse, rbf_kernel,
    n_inducing=50, noise_var=0.01, length_scale=1.0, signal_var=1.0)
t_sparse = time.time() - t0

# Exact GP on same data
t0 = time.time()
mu_exact, var_exact = gp_posterior(
    X_large, y_large, X_test_sparse, rbf_kernel,
    noise_var=0.01, length_scale=1.0, signal_var=1.0)
t_exact = time.time() - t0

rmse_sparse = np.sqrt(np.mean((mu_sparse - np.sin(X_test_sparse).ravel())**2))
rmse_exact = np.sqrt(np.mean((mu_exact - np.sin(X_test_sparse).ravel())**2))

print(f"Dataset size: {n_large} points")
print("Inducing points: 50")
print(f"\n{'Method':<20} | {'RMSE':<8} | {'Time':<8}")
print(f"{'-'*20}-+-{'-'*8}-+-{'-'*8}")
print(f"{'Exact GP':<20} | {rmse_exact:.4f} | {t_exact:.3f}s")
print(f"{'Sparse GP (m=50)':<20} | {rmse_sparse:.4f} | {t_sparse:.3f}s")
print(f"\nSpeedup: {t_exact/t_sparse:.1f}x with "
      f"{abs(rmse_sparse - rmse_exact)/rmse_exact:.1%} RMSE difference")
# Output:
# Dataset size: 2000 points
# Inducing points: 50
#
# Method               | RMSE     | Time
# ---------------------+----------+---------
# Exact GP             | 0.0085 | 8.673s
# Sparse GP (m=50)     | 0.0085 | 0.030s
#
# Speedup: 292.3x with 0.0% RMSE difference
With just 50 inducing points (2.5% of the data), the sparse GP achieves identical accuracy at nearly 300× the speed. For even larger datasets, modern approaches like variational sparse GPs (Titsias 2009), random Fourier features, and GPyTorch can handle millions of points while preserving uncertainty estimates.
Try It: Kernel Explorer
Explore what different kernels “believe” about functions. The top shows 3 random function samples from the GP prior. The bottom shows the kernel matrix heatmap — brighter means higher similarity.
Where GPs Shine (and Where They Don’t)
Gaussian Processes are the ideal choice when you need calibrated uncertainty, have small-to-medium datasets, and the input dimensionality is moderate (typically <20). They’re the backbone of Bayesian optimization (where uncertainty guides exploration), active learning (where we query the most uncertain points), and scientific modeling (where honest error bars matter more than raw accuracy).
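The Bayesian optimization connection deserves a concrete sketch: an acquisition function such as the upper confidence bound (UCB) scores candidates by μ(x) + β·σ(x), so the next query trades off high predicted value against high uncertainty. Everything below is a hypothetical toy (the sin objective, the β = 2.0 exploration weight, the unit-variance RBF posterior), not a production recipe:

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / ls**2)

def posterior(X, y, Xs, noise=1e-4):
    # same Cholesky-based math as gp_posterior above, unit signal variance
    L = np.linalg.cholesky(rbf(X, X) + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(Xs, X)
    v = np.linalg.solve(L, Ks.T)
    var = np.maximum(1.0 - np.sum(v**2, axis=0), 1e-12)
    return Ks @ alpha, np.sqrt(var)

objective = np.sin                    # stand-in for an expensive black-box function
X = np.array([[-3.0], [0.0], [3.0]])  # three evaluations made so far
y = objective(X).ravel()
Xs = np.linspace(-4, 4, 401).reshape(-1, 1)

mu, sigma = posterior(X, y, Xs)
beta = 2.0                            # hypothetical exploration weight
ucb = mu + beta * sigma               # upper confidence bound acquisition
x_next = Xs[np.argmax(ucb), 0]        # query where optimism is highest
print(f"next query: x = {x_next:.2f}")
```

Expected improvement plays the same role with a different formula; either way, the acquisition is cheap to evaluate because it needs only the GP's mean and variance, which is exactly the quantity most other models can't provide.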
Their limitations are real: O(N³) scaling means exact GPs struggle beyond ~10,000 points, they assume smoothness dictated by the kernel choice, and they can’t easily learn hierarchical features like deep neural networks. But when the problem fits — and it does surprisingly often — nothing else gives you such elegant, well-calibrated predictions with built-in uncertainty quantification.
We’ve built the full pipeline: kernel functions that encode prior beliefs, GP priors that sample random functions, exact posterior inference via Cholesky decomposition, hyperparameter optimization via marginal likelihood, classification via the Laplace approximation, and sparse approximations that break the cubic bottleneck. That’s a complete GP toolkit, from scratch.
References & Further Reading
- Carl Edward Rasmussen & Christopher K. I. Williams — “Gaussian Processes for Machine Learning” — The definitive GP textbook, freely available online (MIT Press, 2006)
- David MacKay — “Information Theory, Inference and Learning Algorithms” — Chapter 45 gives a wonderful introduction to GPs (Cambridge University Press, 2003)
- Michalis Titsias — “Variational Learning of Inducing Variables in Sparse Gaussian Processes” — The foundational paper on variational sparse GPs (AISTATS 2009)
- Andrew Gordon Wilson & Hannes Nickisch — “Kernel Interpolation for Scalable Structured Gaussian Processes” — KISS-GP for scaling to large datasets (ICML 2015)
- James Hensman, Nicoló Fusi & Neil D. Lawrence — “Gaussian Processes for Big Data” — Stochastic variational inference for massive GP models (UAI 2013)