Gaussian Processes from Scratch: The ML Model That Knows What It Doesn’t Know
Why Uncertainty Matters
Imagine you’re an engineer deciding whether a bridge can hold 10,000 kg. Your regression model says “yes, predicted capacity is 10,247 kg.” Great — but how confident is that prediction? If the model has seen thousands of similar bridges, maybe you trust it. If this bridge is made of an exotic alloy the model has never encountered, that single number is dangerously misleading.
Most machine learning models — neural networks, decision trees, linear regression — give you point predictions. They hand you a number and walk away. They don’t tell you whether they’re rock-solid confident or wildly guessing. In safety-critical domains like autonomous driving, medical diagnosis, and structural engineering, this is a serious problem.
Gaussian Processes (GPs) take a fundamentally different approach. Instead of learning a single function, a GP maintains a distribution over all possible functions consistent with the observed data. When you ask for a prediction, it doesn’t just say “10,247 kg” — it says “10,247 ± 1,800 kg with 95% confidence.” The uncertainty bands widen where data is sparse and tighten where data is dense. The model knows what it doesn’t know.
In this post, we’ll build Gaussian Processes from scratch. We’ll implement kernel functions, sample from GP priors, compute the exact posterior, optimize hyperparameters, handle classification, and tackle the scalability bottleneck — all in pure Python with NumPy. If you’ve read our posts on Bayesian inference or SVMs, you’ll see familiar themes: Bayesian updating and the kernel trick, now combined into one of the most elegant models in all of machine learning.
The Kernel Trick — Measuring Similarity Between Inputs
At the heart of every GP is a kernel function (also called a covariance function). A kernel k(x, x’) takes two inputs and returns a number measuring how “similar” they are. If two inputs are similar, their function values should be correlated. If they’re far apart, their function values should be nearly independent.
The kernel encodes your inductive bias — your prior belief about what the underlying function looks like. Choosing an RBF kernel says “I believe the function is infinitely smooth.” Choosing a Matérn kernel says “I believe it’s continuous but rough.” Choosing a periodic kernel says “I believe it repeats.” This is analogous to choosing an architecture in deep learning — both are structural assumptions about the problem.
Given N training points, we build an N×N kernel matrix K with entries K_ij = k(x_i, x_j). This matrix captures all pairwise similarities and is the GP’s entire knowledge structure.
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Radial Basis Function (squared exponential) kernel."""
    sq_dist = np.maximum(
        np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1),
        0.0  # clamp tiny negative values from floating-point cancellation
    )
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def matern32_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Matérn 3/2 kernel — once differentiable, realistic roughness."""
    dist = np.sqrt(np.maximum(
        np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1),
        1e-12
    ))
    r = np.sqrt(3) * dist / length_scale
    return signal_var * (1 + r) * np.exp(-r)

def matern52_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Matérn 5/2 kernel — twice differentiable, a popular middle ground."""
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    dist = np.sqrt(np.maximum(sq_dist, 1e-12))
    r = np.sqrt(5) * dist / length_scale
    return signal_var * (1 + r + r**2 / 3) * np.exp(-r)

def periodic_kernel(X1, X2, length_scale=1.0, signal_var=1.0, period=1.0):
    """Periodic kernel — for functions that repeat."""
    dist = np.sqrt(np.maximum(
        np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1),
        1e-12
    ))
    return signal_var * np.exp(-2 * np.sin(np.pi * dist / period)**2 / length_scale**2)

# Demo: compute kernel values at varying distances
X_test = np.array([[0.0], [0.5], [1.0], [2.0], [4.0]])
X_origin = np.array([[0.0]])
print("Distance | RBF | Mat3/2 | Mat5/2 | Periodic(p=2)")
for x in X_test:
    d = float(x[0])
    r = rbf_kernel(x.reshape(1, -1), X_origin)[0, 0]
    m3 = matern32_kernel(x.reshape(1, -1), X_origin)[0, 0]
    m5 = matern52_kernel(x.reshape(1, -1), X_origin)[0, 0]
    p = periodic_kernel(x.reshape(1, -1), X_origin, period=2.0)[0, 0]
    print(f" {d:<6.1f} | {r:.4f} | {m3:.4f} | {m5:.4f} | {p:.4f}")
# Output:
# Distance | RBF | Mat3/2 | Mat5/2 | Periodic(p=2)
#  0.0    | 1.0000 | 1.0000 | 1.0000 | 1.0000
#  0.5    | 0.8825 | 0.7849 | 0.8286 | 0.3679
#  1.0    | 0.6065 | 0.4834 | 0.5240 | 0.1353
#  2.0    | 0.1353 | 0.1397 | 0.1387 | 1.0000
#  4.0    | 0.0003 | 0.0078 | 0.0048 | 1.0000
Notice how each kernel decays differently with distance. The RBF drops off as a smooth Gaussian bell curve. Matérn kernels decay more slowly, allowing rougher functions. The periodic kernel returns to 1.0 at multiples of the period — it “sees” inputs two periods apart as identical.
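One property worth checking whenever you write a new kernel: it must produce a symmetric, positive semidefinite matrix for every set of inputs, since K plays the role of a covariance matrix. A quick standalone sanity check (using a minimal copy of the RBF kernel above, signal variance fixed at 1):

```python
import numpy as np

def rbf(X1, X2, length_scale=1.0):
    # minimal copy of rbf_kernel above, signal variance fixed at 1
    sq = np.sum(X1**2, 1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

X = np.random.default_rng(0).uniform(-3, 3, size=(30, 1))
K = rbf(X, X)

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to rounding error
```

If a homemade kernel fails the eigenvalue check by more than floating-point noise, the Cholesky factorizations used throughout this post will fail on it too.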
Sampling from the Prior — What GPs Believe Before Seeing Data
A GP prior defines a multivariate Gaussian distribution over function values at any finite set of input points: f ~ N(0, K). To sample a function from this prior, we pick a set of test points, build the kernel matrix, and draw a sample from the resulting multivariate Gaussian.
The trick is using the Cholesky decomposition: factor K = LLᵀ, then generate f = Lz where z ~ N(0, I). This is numerically stable and efficient. We add a tiny jitter (10⁻⁸) to the diagonal to ensure K stays positive definite.
np.random.seed(42)

def sample_gp_prior(kernel_fn, X, n_samples=5, **kernel_params):
    """Sample functions from a GP prior."""
    K = kernel_fn(X, X, **kernel_params)
    K += 1e-8 * np.eye(len(X))  # jitter for numerical stability
    L = np.linalg.cholesky(K)   # K = L @ L.T
    samples = []
    for _ in range(n_samples):
        z = np.random.randn(len(X))
        f = L @ z  # transform standard normal to GP sample
        samples.append(f)
    return samples

X = np.linspace(-5, 5, 200).reshape(-1, 1)
rbf_samples = sample_gp_prior(rbf_kernel, X, n_samples=3, length_scale=1.0)
matern_samples = sample_gp_prior(matern32_kernel, X, n_samples=3, length_scale=1.0)
periodic_samples = sample_gp_prior(periodic_kernel, X, n_samples=3, length_scale=1.0, period=2.0)

# Show a few values from each sample type
print("RBF sample (infinitely smooth):")
print(f" f([-3, -1, 0, 1, 3]) = [{rbf_samples[0][40]:.2f}, {rbf_samples[0][80]:.2f}, "
      f"{rbf_samples[0][100]:.2f}, {rbf_samples[0][120]:.2f}, {rbf_samples[0][160]:.2f}]")
print("Matérn 3/2 sample (realistically rough):")
print(f" f([-3, -1, 0, 1, 3]) = [{matern_samples[0][40]:.2f}, {matern_samples[0][80]:.2f}, "
      f"{matern_samples[0][100]:.2f}, {matern_samples[0][120]:.2f}, {matern_samples[0][160]:.2f}]")
print("Periodic sample (repeating pattern, period=2):")
print(f" f([-3, -1, 0, 1, 3]) = [{periodic_samples[0][40]:.2f}, {periodic_samples[0][80]:.2f}, "
      f"{periodic_samples[0][100]:.2f}, {periodic_samples[0][120]:.2f}, {periodic_samples[0][160]:.2f}]")
# Output:
# RBF sample (infinitely smooth):
# f([-3, -1, 0, 1, 3]) = [-0.07, -1.56, -0.15, -0.13, 0.06]
# Matérn 3/2 sample (realistically rough):
# f([-3, -1, 0, 1, 3]) = [0.71, -0.80, -0.14, -0.27, -1.08]
# Periodic sample (repeating pattern, period=2):
# f([-3, -1, 0, 1, 3]) = [0.11, 0.10, -0.65, 0.08, 0.07]
These samples show what the GP “believes” before seeing any data. The RBF produces buttery-smooth curves. Matérn 3/2 gives wiggly but continuous paths — more like real-world signals. The periodic kernel generates repeating patterns. Notice how the periodic sample has nearly identical values at x=−3, −1, 1, and 3 (all spaced by the period = 2), while x=0 — a half-period away — is completely different (−0.65 vs ~0.1).
GP Regression — Conditioning on Observations
Now for the magic. Given N training observations (X, y) with noise variance σₙ², we want to predict at new test points X*. The GP posterior has closed-form solutions:
Posterior mean: μ* = K(X*, X) [K(X, X) + σₙ²I]⁻¹ y
Posterior covariance: Σ* = K(X*, X*) − K(X*, X) [K(X, X) + σₙ²I]⁻¹ K(X, X*)
The intuition: the posterior mean is a weighted combination of training outputs, where the weights come from kernel similarity between the test point and each training point. Points near many observations get confident predictions; points far from all data revert to the prior with wide uncertainty.
We use the Cholesky decomposition rather than direct matrix inversion for numerical stability:
def gp_posterior(X_train, y_train, X_test, kernel_fn,
                 noise_var=0.01, **kernel_params):
    """Compute GP posterior mean and variance at test points."""
    n = len(X_train)
    K = kernel_fn(X_train, X_train, **kernel_params)
    K_noisy = K + noise_var * np.eye(n)
    K_star = kernel_fn(X_test, X_train, **kernel_params)
    K_ss = kernel_fn(X_test, X_test, **kernel_params)
    # Cholesky solve: L @ L.T @ alpha = y
    L = np.linalg.cholesky(K_noisy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    # Posterior mean
    mu = K_star @ alpha
    # Posterior variance
    v = np.linalg.solve(L, K_star.T)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    var = np.maximum(var, 1e-10)  # clamp for numerical safety
    return mu, var

# Generate noisy observations of sin(x)
np.random.seed(7)
X_train = np.array([-4, -2.5, -1.5, -0.5, 0.8, 2.0, 3.5]).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + 0.1 * np.random.randn(len(X_train))
X_test = np.linspace(-5, 5, 201).reshape(-1, 1)

mu, var = gp_posterior(X_train, y_train, X_test, rbf_kernel,
                       noise_var=0.01, length_scale=1.0, signal_var=1.0)
std = np.sqrt(var)

# Show predictions at key points
test_indices = [0, 50, 100, 150, 200]
print("   x    |  true  | GP mean | GP ±2σ")
print("--------+--------+---------+----------")
for i in test_indices:
    x = X_test[i, 0]
    print(f" {x:<5.1f}  | {np.sin(x):<6.3f} | {mu[i]:<7.3f} | ±{2*std[i]:.3f}")

# RMSE comparison
gp_rmse = np.sqrt(np.mean((mu - np.sin(X_test).ravel())**2))
baseline_rmse = np.sqrt(np.mean((np.mean(y_train) - np.sin(X_test).ravel())**2))
print(f"\nGP RMSE: {gp_rmse:.4f}")
print(f"Baseline RMSE: {baseline_rmse:.4f}")
# Output:
#    x    |  true  | GP mean | GP ±2σ
# --------+--------+---------+----------
#  -5.0   | 0.959  | 0.668   | ±1.546
#  -2.5   | -0.598 | -0.639  | ±0.198
#  0.0    | 0.000  | -0.025  | ±0.391
#  2.5    | 0.598  | 0.528   | ±0.570
#  5.0    | -0.959 | -0.207  | ±1.880
#
# GP RMSE: 0.2146
# Baseline RMSE: 0.7274
Look at the uncertainty column. Near training points (x=−2.5), the ±2σ interval is just 0.198 — small, mostly reflecting observation noise. At the edges (x=5.0) where no training data exists, uncertainty balloons to 1.880. The GP is honestly saying “I don’t know what’s out here.”
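The posterior mean also has a revealing weight-space reading: μ(x*) = Σᵢ αᵢ k(x*, xᵢ), where the weight vector α = [K(X, X) + σₙ²I]⁻¹ y is computed once from the training data (it is exactly the alpha inside gp_posterior). A tiny standalone sketch with three hypothetical training points:

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / ls**2)

X = np.array([[-2.0], [0.0], [2.0]])                       # three training inputs
y = np.sin(X).ravel()
alpha = np.linalg.solve(rbf(X, X) + 0.01 * np.eye(3), y)   # representer weights

x_star = np.array([[0.3]])
mu = float(rbf(x_star, X) @ alpha)  # weighted sum of kernel similarities
print("alpha:", np.round(alpha, 3))  # middle weight ~0, mirroring the symmetry of y
print("mu(0.3):", round(mu, 3))
```

Each prediction is dominated by the training points whose kernel similarity to x* is largest, which is why the mean reverts to the prior far from all data.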
Try It: GP Playground
Click the canvas to add observation points. The GP updates in real time — watch the uncertainty bands shrink near your data and grow where data is sparse.
Hyperparameter Learning — The Marginal Likelihood
Our GP has three hyperparameters: the length scale ℓ (how far “nearby” extends), the signal variance σ² (the vertical scale of functions), and the noise variance σₙ² (observation noise). Get these wrong and everything falls apart — the wrong length scale can make your GP overfit wildly or miss all the structure in the data.
The Bayesian approach optimizes the log marginal likelihood:
log p(y | X, θ) = −½ yᵀ K_y⁻¹ y − ½ log|K_y| − (n/2) log(2π)
This formula has a beautiful built-in complexity penalty. The first term measures data fit (how well the model explains the observations). The second term penalizes model complexity (the log-determinant of K_y grows with model flexibility). The GP automatically balances fitting the data versus keeping the model simple — no cross-validation needed.
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, y):
    """Negative log marginal likelihood for RBF kernel GP."""
    length_scale = np.exp(params[0])  # optimize in log-space
    signal_var = np.exp(params[1])
    noise_var = np.exp(params[2])
    K = rbf_kernel(X, X, length_scale=length_scale, signal_var=signal_var)
    K_y = K + noise_var * np.eye(len(X))
    try:
        L = np.linalg.cholesky(K_y)
    except np.linalg.LinAlgError:
        return 1e6  # return large value if not positive definite
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_det = 2 * np.sum(np.log(np.diag(L)))
    nll = 0.5 * y @ alpha + 0.5 * log_det + 0.5 * len(X) * np.log(2 * np.pi)
    return nll

# Start from poor hyperparameters
init_params = np.log([0.5, 0.5, 0.1])  # l=0.5, sv=0.5, nv=0.1
result = minimize(neg_log_marginal_likelihood, init_params,
                  args=(X_train, y_train), method='L-BFGS-B')
opt_params = np.exp(result.x)

print("Initial hyperparameters (poor guesses):")
print(" length_scale = 0.500, signal_var = 0.500, noise_var = 0.100")
print("\nOptimized hyperparameters:")
print(f" length_scale = {opt_params[0]:.3f}, signal_var = {opt_params[1]:.3f}, "
      f"noise_var = {opt_params[2]:.6f}")

# Compare predictions
mu_bad, var_bad = gp_posterior(X_train, y_train, X_test, rbf_kernel,
                               noise_var=0.1, length_scale=0.5, signal_var=0.5)
mu_opt, var_opt = gp_posterior(X_train, y_train, X_test, rbf_kernel,
                               noise_var=opt_params[2],
                               length_scale=opt_params[0],
                               signal_var=opt_params[1])
rmse_bad = np.sqrt(np.mean((mu_bad - np.sin(X_test).ravel())**2))
rmse_opt = np.sqrt(np.mean((mu_opt - np.sin(X_test).ravel())**2))
print(f"\nRMSE with poor hyperparams: {rmse_bad:.4f}")
print(f"RMSE with optimized hyperparams: {rmse_opt:.4f}")
# Output:
# Initial hyperparameters (poor guesses):
# length_scale = 0.500, signal_var = 0.500, noise_var = 0.100
#
# Optimized hyperparameters:
# length_scale = 2.467, signal_var = 3.126, noise_var = 0.000982
#
# RMSE with poor hyperparams: 0.3749
# RMSE with optimized hyperparams: 0.2298
The optimizer found a length scale of 2.47 — sensible for sin(x) which has a wavelength of 2π ≈ 6.3 (a length scale of ~2.5 captures the broad structure). The signal variance of 3.13 accommodates the full range of sine’s oscillations. And the noise variance dropped to near zero (0.001), correctly identifying that our observations have very little noise. The RMSE improved from 0.37 to 0.23.
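You can watch the fit/complexity trade-off directly by splitting the negative log marginal likelihood into its two data-dependent terms across a few length scales. A self-contained sketch on its own small dataset (the exact numbers depend on the noise draw, but the ordering does not):

```python
import numpy as np

def rbf(X1, X2, ls):
    sq = np.sum(X1**2, 1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / ls**2)

rng = np.random.default_rng(0)
X = np.linspace(-4, 4, 15).reshape(-1, 1)
y = np.sin(X).ravel() + 0.05 * rng.standard_normal(15)

nll = {}
for ls in [0.1, 2.0, 10.0]:  # too wiggly, about right, too stiff
    L = np.linalg.cholesky(rbf(X, X, ls) + 0.01 * np.eye(15))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit = 0.5 * y @ alpha                    # data-fit term
    complexity = np.sum(np.log(np.diag(L)))  # 0.5 * log|K_y|, the complexity penalty
    nll[ls] = fit + complexity + 0.5 * 15 * np.log(2 * np.pi)
    print(f"ls={ls:5.1f}  fit={fit:9.2f}  complexity={complexity:8.2f}  nll={nll[ls]:9.2f}")
```

The tiny length scale pays a complexity price for flexibility it doesn't need; the huge one pays an enormous fit price; the middle value minimizes the total, which is exactly what L-BFGS-B is searching for above.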
GP Classification — Beyond Regression
GP regression assumes Gaussian noise, but what about binary classification? The idea: run a latent GP through a sigmoid. We have a latent function f(x) ~ GP, and the class probability is p(y=1|x) = σ(f(x)) where σ is the logistic sigmoid. The problem? The posterior is no longer Gaussian — we need an approximation.
The Laplace approximation finds the mode of the posterior (the MAP estimate) and fits a Gaussian around it. We iteratively solve for the mode using Newton’s method, then use the local curvature to estimate uncertainty:
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def gp_classification_laplace(X_train, y_train, X_test, kernel_fn,
                              n_iters=20, **kernel_params):
    """GP binary classification with Laplace approximation."""
    n = len(X_train)
    K = kernel_fn(X_train, X_train, **kernel_params)
    K += 1e-6 * np.eye(n)
    # Newton’s method to find the MAP estimate of latent f
    f = np.zeros(n)
    for _ in range(n_iters):
        pi = sigmoid(f)
        W = pi * (1 - pi)  # diagonal of the negative log-likelihood Hessian
        W_sqrt = np.sqrt(W)
        B = np.eye(n) + np.outer(W_sqrt, W_sqrt) * K
        L = np.linalg.cholesky(B)
        b = W * f + (y_train - pi)
        a = b - W_sqrt * np.linalg.solve(
            L.T, np.linalg.solve(L, W_sqrt * (K @ b)))
        f = K @ a
    # Predict at test points
    pi_final = sigmoid(f)
    K_star = kernel_fn(X_test, X_train, **kernel_params)
    f_mean = K_star @ (y_train - pi_final)
    # Predictive variance (approximate)
    W_final = pi_final * (1 - pi_final)
    W_sqrt_f = np.sqrt(W_final)
    B_f = np.eye(n) + np.outer(W_sqrt_f, W_sqrt_f) * K
    L_f = np.linalg.cholesky(B_f)
    v = np.linalg.solve(L_f, (W_sqrt_f[:, None] * K_star.T))
    K_ss_diag = np.array([kernel_fn(X_test[i:i+1], X_test[i:i+1], **kernel_params)[0, 0]
                          for i in range(len(X_test))])
    f_var = np.maximum(K_ss_diag - np.sum(v**2, axis=0), 1e-10)  # clamp for safety
    # Convert latent predictions to class probabilities
    # Probit approximation: integrate sigmoid against Gaussian
    kappa = 1.0 / np.sqrt(1.0 + np.pi * f_var / 8.0)
    prob = sigmoid(kappa * f_mean)
    return prob, f_mean, f_var

# Synthetic 1D classification dataset
np.random.seed(99)
X_cls = np.sort(np.random.uniform(-4, 4, 20)).reshape(-1, 1)
y_cls = (np.sin(X_cls.ravel()) > 0.3).astype(float)
# Add some noise near boundary
y_cls[8] = 1 - y_cls[8]
X_test_cls = np.linspace(-5, 5, 100).reshape(-1, 1)

probs, _, _ = gp_classification_laplace(
    X_cls, y_cls, X_test_cls, rbf_kernel,
    length_scale=1.0, signal_var=1.5)

# Show predictions at key points
print("   x   | GP prob | Confidence")
print("-------+---------+-----------")
for i in [5, 25, 50, 75, 95]:
    x = X_test_cls[i, 0]
    p = probs[i]
    conf = max(p, 1 - p)
    label = "class 1" if p > 0.5 else "class 0"
    print(f" {x:<4.1f}  | {p:.3f}   | {conf:.1%} ({label})")
# Output:
#    x   | GP prob | Confidence
# -------+---------+-----------
#  -4.5  | 0.653   | 65.3% (class 1)
#  -2.5  | 0.401   | 59.9% (class 0)
#  0.1   | 0.448   | 55.2% (class 0)
#  2.6   | 0.712   | 71.2% (class 1)
#  4.6   | 0.334   | 66.6% (class 0)
Notice the GP classifier’s honesty near the decision boundary. At x≈0, it gives 44.8% — essentially a coin flip, reflecting genuine uncertainty. Compare this with logistic regression, which would give overconfident probabilities at most points. Far from the boundary, the GP is more confident (65–71%), but never claims to be 99.9% sure when the data doesn’t warrant it.
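The probit approximation at the end of the classifier is what produces this honesty: the latent variance σ² shrinks the effective logit by κ = 1/√(1 + πσ²/8), pulling probabilities toward 0.5 exactly where the GP is unsure. A standalone illustration of the squashing:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

f_mean = 2.0  # latent mean: with zero latent variance this gives sigmoid(2) ~ 0.88
probs = []
for f_var in [0.0, 1.0, 5.0, 25.0]:
    kappa = 1.0 / np.sqrt(1.0 + np.pi * f_var / 8.0)  # probit squashing factor
    p = sigmoid(kappa * f_mean)
    probs.append(p)
    print(f"latent variance {f_var:5.1f} -> p(y=1) = {p:.3f}")
```

The same latent mean yields progressively more cautious probabilities as the latent variance grows, which is why the classifier's confidence degrades gracefully away from the data.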
Scaling GPs — Sparse Approximations
There’s a catch. The Cholesky decomposition of an N×N kernel matrix is O(N³). With 100 points, that’s instant. With 1,000, it’s a second. With 10,000, it’s minutes. With 100,000, forget it. This cubic bottleneck is the main reason GPs aren’t used more widely.
The solution: sparse approximations. Instead of using all N training points, we select m ≪ N inducing points and approximate the full kernel matrix. The Nyström approximation replaces K with K_nm K_mm⁻¹ K_mn, reducing the cost to O(Nm²).
import time

def sparse_gp_nystrom(X_train, y_train, X_test, kernel_fn,
                      n_inducing=50, noise_var=0.01, **kernel_params):
    """Sparse GP using Nyström approximation with inducing points."""
    n = len(X_train)
    # Select inducing points evenly spaced from training range
    indices = np.linspace(0, n - 1, n_inducing).astype(int)
    X_m = X_train[indices]
    K_mm = kernel_fn(X_m, X_m, **kernel_params) + 1e-6 * np.eye(n_inducing)
    K_nm = kernel_fn(X_train, X_m, **kernel_params)
    K_star_m = kernel_fn(X_test, X_m, **kernel_params)
    # Nyström: K ≈ K_nm @ inv(K_mm) @ K_mn
    L_mm = np.linalg.cholesky(K_mm)
    V = np.linalg.solve(L_mm, K_nm.T)  # m x n
    # Approximate posterior
    Lambda = V @ V.T + noise_var * np.eye(n_inducing)
    L_lam = np.linalg.cholesky(Lambda)
    alpha = np.linalg.solve(L_lam.T, np.linalg.solve(L_lam, V @ y_train))
    V_star = np.linalg.solve(L_mm, K_star_m.T)
    mu = V_star.T @ alpha
    w = np.linalg.solve(L_lam, V_star)
    K_ss_diag = np.array([kernel_fn(X_test[i:i+1], X_test[i:i+1], **kernel_params)[0, 0]
                          for i in range(len(X_test))])
    # Predictive variance: prior minus Nyström reduction, plus a correction
    # term that carries a noise_var factor (DTC-style predictive variance)
    var = K_ss_diag - np.sum(V_star**2, axis=0) + noise_var * np.sum(w**2, axis=0)
    var = np.maximum(var, 1e-10)
    return mu, var

# Generate a larger dataset
np.random.seed(123)
n_large = 2000
X_large = np.sort(np.random.uniform(-5, 5, n_large)).reshape(-1, 1)
y_large = np.sin(X_large).ravel() + 0.1 * np.random.randn(n_large)
X_test_sparse = np.linspace(-5, 5, 200).reshape(-1, 1)

# Sparse GP with 50 inducing points
t0 = time.time()
mu_sparse, var_sparse = sparse_gp_nystrom(
    X_large, y_large, X_test_sparse, rbf_kernel,
    n_inducing=50, noise_var=0.01, length_scale=1.0, signal_var=1.0)
t_sparse = time.time() - t0

# Exact GP on same data
t0 = time.time()
mu_exact, var_exact = gp_posterior(
    X_large, y_large, X_test_sparse, rbf_kernel,
    noise_var=0.01, length_scale=1.0, signal_var=1.0)
t_exact = time.time() - t0

rmse_sparse = np.sqrt(np.mean((mu_sparse - np.sin(X_test_sparse).ravel())**2))
rmse_exact = np.sqrt(np.mean((mu_exact - np.sin(X_test_sparse).ravel())**2))

print(f"Dataset size: {n_large} points")
print("Inducing points: 50")
print(f"\n{'Method':<20} | {'RMSE':<8} | {'Time':<8}")
print(f"{'-'*20}-+-{'-'*8}-+-{'-'*8}")
print(f"{'Exact GP':<20} | {rmse_exact:.4f} | {t_exact:.3f}s")
print(f"{'Sparse GP (m=50)':<20} | {rmse_sparse:.4f} | {t_sparse:.3f}s")
print(f"\nSpeedup: {t_exact/t_sparse:.1f}x with "
      f"{abs(rmse_sparse - rmse_exact)/rmse_exact:.1%} RMSE difference")
# Output:
# Dataset size: 2000 points
# Inducing points: 50
#
# Method               | RMSE     | Time
# ---------------------+----------+---------
# Exact GP             | 0.0085 | 8.673s
# Sparse GP (m=50)     | 0.0085 | 0.030s
#
# Speedup: 292.3x with 0.0% RMSE difference
With just 50 inducing points (2.5% of the data), the sparse GP achieves identical accuracy at nearly 300× the speed. For even larger datasets, modern approaches like variational sparse GPs (Titsias 2009), random Fourier features, and GPyTorch can handle millions of points while preserving uncertainty estimates.
Try It: Kernel Explorer
Explore what different kernels “believe” about functions. The top shows 3 random function samples from the GP prior. The bottom shows the kernel matrix heatmap — brighter means higher similarity.
Where GPs Shine (and Where They Don’t)
Gaussian Processes are the ideal choice when you need calibrated uncertainty, have small-to-medium datasets, and the input dimensionality is moderate (typically <20). They’re the backbone of Bayesian optimization (where uncertainty guides exploration), active learning (where we query the most uncertain points), and scientific modeling (where honest error bars matter more than raw accuracy).
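The Bayesian optimization connection deserves a concrete sketch: an acquisition function such as the upper confidence bound (UCB) scores candidates by μ(x) + β·σ(x), so the next query trades off high predicted value against high uncertainty. Everything below is a hypothetical toy (the sin objective, the β = 2.0 exploration weight, the unit-variance RBF posterior), not a production recipe:

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / ls**2)

def posterior(X, y, Xs, noise=1e-4):
    # same Cholesky-based math as gp_posterior above, unit signal variance
    L = np.linalg.cholesky(rbf(X, X) + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(Xs, X)
    v = np.linalg.solve(L, Ks.T)
    var = np.maximum(1.0 - np.sum(v**2, axis=0), 1e-12)
    return Ks @ alpha, np.sqrt(var)

objective = np.sin                    # stand-in for an expensive black-box function
X = np.array([[-3.0], [0.0], [3.0]])  # three evaluations made so far
y = objective(X).ravel()
Xs = np.linspace(-4, 4, 401).reshape(-1, 1)

mu, sigma = posterior(X, y, Xs)
beta = 2.0                            # hypothetical exploration weight
ucb = mu + beta * sigma               # upper confidence bound acquisition
x_next = Xs[np.argmax(ucb), 0]        # query where optimism is highest
print(f"next query: x = {x_next:.2f}")
```

Expected improvement plays the same role with a different formula; either way, the acquisition is cheap to evaluate because it needs only the GP's mean and variance, which is exactly the quantity most other models can't provide.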
Their limitations are real: O(N³) scaling means exact GPs struggle beyond ~10,000 points, they assume smoothness dictated by the kernel choice, and they can’t easily learn hierarchical features like deep neural networks. But when the problem fits — and it does surprisingly often — nothing else gives you such elegant, well-calibrated predictions with built-in uncertainty quantification.
We’ve built the full pipeline: kernel functions that encode prior beliefs, GP priors that sample random functions, exact posterior inference via Cholesky decomposition, hyperparameter optimization via marginal likelihood, classification via the Laplace approximation, and sparse approximations that break the cubic bottleneck. That’s a complete GP toolkit, from scratch.
References & Further Reading
- Carl Edward Rasmussen & Christopher K. I. Williams — “Gaussian Processes for Machine Learning” — The definitive GP textbook, freely available online (MIT Press, 2006)
- David MacKay — “Information Theory, Inference and Learning Algorithms” — Chapter 45 gives a wonderful introduction to GPs (Cambridge University Press, 2003)
- Michalis Titsias — “Variational Learning of Inducing Variables in Sparse Gaussian Processes” — The foundational paper on variational sparse GPs (AISTATS 2009)
- Andrew Gordon Wilson & Hannes Nickisch — “Kernel Interpolation for Scalable Structured Gaussian Processes” — KISS-GP for scaling to large datasets (ICML 2015)
- James Hensman, Nicoló Fusi & Neil D. Lawrence — “Gaussian Processes for Big Data” — Stochastic variational inference for massive GP models (UAI 2013)