
Bayesian Inference from Scratch

Why Bayes? The Problem with Point Estimates

Your model says there's a 73% chance of rain tomorrow. How confident is it in that 73%? A model trained on 10,000 days of weather data and a model trained on 3 days both might output 73%, but your trust in them should be wildly different. Bayesian inference doesn't just give you answers — it tells you how much to trust those answers.

Most machine learning operates in a world of point estimates. Maximum Likelihood Estimation (MLE) finds the single set of parameters that makes the observed data most probable and calls it a day. This gives you one number, one line of best fit, one set of weights. No error bars, no confidence measure, no "I'm not sure about this" signal. A model fit to three data points looks just as certain as one fit to three million.

Bayesian inference fixes this by replacing point estimates with distributions. Instead of asking "what is the best value of θ?", it asks "given the data I've seen, what values of θ are plausible, and how plausible is each one?" The answer is the posterior distribution — a complete picture of your uncertainty about the parameter.

The machinery is Bayes' theorem, published posthumously in 1763, two years after Thomas Bayes's death:

P(θ | data) = P(data | θ) × P(θ) / P(data)

Posterior = Likelihood × Prior / Evidence

Three ingredients combine into one output. The prior P(θ) encodes what you believed before seeing any data. The likelihood P(data | θ) tells you how probable the observed data would be under each possible value of θ. The evidence P(data) is a normalizing constant that ensures the posterior integrates to 1. Multiply prior by likelihood, normalize, and you have the posterior — your updated belief after seeing the data.

Let's build this from scratch. We'll start with the classic example that trips up almost everyone: the medical test.

import numpy as np

# The Medical Test Problem
# A disease affects 1% of the population
# Test sensitivity (true positive rate): 95%
# Test specificity (true negative rate): 90%
# You test positive. What's the probability you have the disease?

prevalence = 0.01       # P(disease)
sensitivity = 0.95      # P(positive | disease)
specificity = 0.90      # P(negative | no disease)

# Bayes' theorem
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = (sensitivity * prevalence) / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")
# Output: P(disease | positive test) = 0.088

# Only 8.8%! The base rate (1%) dominates. Most positives are false positives.

# Now let's do general Bayesian updating on a grid
# Problem: estimate the bias of a coin from observed flips

theta_grid = np.linspace(0, 1, 1000)   # 1000 possible bias values
prior = np.ones_like(theta_grid)        # uniform prior: all biases equally likely
prior /= prior.sum()                    # normalize

# Observe: H, H, T, H, T (3 heads, 2 tails)
flips = [1, 1, 0, 1, 0]

posterior = prior.copy()
for flip in flips:
    if flip == 1:  # heads
        likelihood = theta_grid         # P(H | theta) = theta
    else:           # tails
        likelihood = 1 - theta_grid     # P(T | theta) = 1 - theta
    posterior *= likelihood
    posterior /= posterior.sum()         # normalize after each update

map_estimate = theta_grid[np.argmax(posterior)]
mean_estimate = np.sum(theta_grid * posterior)
print(f"MAP estimate: {map_estimate:.3f}")   # ~0.600
print(f"Mean estimate: {mean_estimate:.3f}") # ~0.571

The medical test result shocks most people: a 95%-accurate test with a positive result only means an 8.8% chance of disease. The base rate (1% prevalence) dominates. This is exactly why Bayesian thinking matters — priors aren't optional, they're already baked into every inference you make, whether you acknowledge them or not.
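Sequential updating is the same arithmetic applied twice: today's posterior becomes tomorrow's prior. As a quick extension of the example above (mine, not from the original numbers — and it assumes the two tests are conditionally independent, which repeat tests in practice often are not), suppose a second positive result comes back:

```python
prevalence, sensitivity, specificity = 0.01, 0.95, 0.90

def positive_update(prior, sens, spec):
    """One application of Bayes' theorem for a positive test result."""
    p_positive = sens * prior + (1 - spec) * (1 - prior)
    return sens * prior / p_positive

p1 = positive_update(prevalence, sensitivity, specificity)
p2 = positive_update(p1, sensitivity, specificity)  # posterior becomes the prior
print(f"After one positive:  {p1:.3f}")   # 0.088
print(f"After two positives: {p2:.3f}")   # 0.477
```

Two positives flip the picture: the disease is now roughly a coin flip, because the second update starts from an 8.8% prior instead of 1%.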

The coin example shows the grid-based approach: discretize the parameter space, compute likelihood at each point, multiply by prior, normalize. After just 5 flips, the posterior already peaks near θ = 0.6 but remains broad — we're uncertain because we have little data. We used this same principle to build a Naive Bayes classifier — that post applied Bayes' theorem for classification, but here we're exploring the inference framework itself.

Conjugate Priors — When the Math Works Out

The grid approach works for one parameter, but it scales terribly. With 10 parameters, a grid of 100 points per dimension needs 100^10 = 10^20 evaluations. We need analytical solutions, and conjugate priors are the cases where the math cooperates beautifully.

A prior is conjugate to a likelihood if the posterior has the same functional form as the prior. The update just changes the prior's parameters. No integrals, no grids — just arithmetic.

Three conjugate pairs cover most of the cases you'll encounter:

import numpy as np

# === Beta-Binomial Conjugacy ===
# Prior: Beta(alpha, beta) — encodes belief about coin bias
# Likelihood: Binomial — counting heads and tails
# Posterior: Beta(alpha + heads, beta + tails)

def beta_binomial_update(alpha_prior, beta_prior, n_heads, n_tails):
    """Update Beta prior with binomial observations."""
    return alpha_prior + n_heads, beta_prior + n_tails

# Start with uniform prior Beta(1, 1)
alpha0, beta0 = 1.0, 1.0

# Simulate observing a biased coin (true bias = 0.7)
np.random.seed(42)
data_sizes = [10, 100, 1000]
for n in data_sizes:
    flips = np.random.binomial(1, 0.7, n)
    a_post, b_post = beta_binomial_update(alpha0, beta0, flips.sum(), n - flips.sum())
    mean_post = a_post / (a_post + b_post)
    std_post = np.sqrt(a_post * b_post / ((a_post + b_post)**2 * (a_post + b_post + 1)))
    print(f"n={n:4d}: posterior Beta({a_post:.0f}, {b_post:.0f}), "
          f"mean={mean_post:.3f} +/- {std_post:.3f}")
# Typical output (exact counts depend on the draws):
# n=  10: broad posterior, std ~ 0.13
# n= 100: narrower,        std ~ 0.045
# n=1000: sharp spike,     std ~ 0.015

# === Normal-Normal Conjugacy ===
# Prior: N(mu_0, sigma_0^2) on the mean
# Likelihood: N(mu, sigma^2) with known variance
# Posterior: precision-weighted average

def normal_normal_update(mu_0, sigma_0, data, sigma_known):
    """Update Normal prior with Gaussian observations."""
    n = len(data)
    precision_prior = 1 / sigma_0**2
    precision_data = n / sigma_known**2
    precision_post = precision_prior + precision_data
    mu_post = (precision_prior * mu_0 + precision_data * np.mean(data)) / precision_post
    sigma_post = np.sqrt(1 / precision_post)
    return mu_post, sigma_post

data = np.random.normal(5.0, 2.0, 50)  # true mean = 5.0
mu_post, sigma_post = normal_normal_update(0.0, 10.0, data, 2.0)
print(f"\nNormal-Normal: posterior mean={mu_post:.2f}, std={sigma_post:.2f}")
# Posterior concentrates near 5.0, much tighter than prior std of 10.0

Watch what happens as data accumulates in the Beta-Binomial case: after 10 flips the posterior is broad (std around 0.13), after 100 it's narrower (std around 0.045), after 1000 it's a sharp spike (std around 0.015). With enough data, the prior doesn't matter — the posterior concentrates on the true parameter value regardless of where you started. This is the content of the Bernstein–von Mises theorem, and it's deeply reassuring: even if your prior is wrong, the data will eventually overwhelm it.
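That washout is easy to demonstrate directly. A sketch (my own toy comparison, not from the code above) pitting two strongly opposed priors against the same 1,000 flips:

```python
import numpy as np

np.random.seed(0)
flips = np.random.binomial(1, 0.7, 1000)   # true bias = 0.7
heads, tails = flips.sum(), len(flips) - flips.sum()

# Two priors that disagree wildly before any data arrives
means = {}
for name, (a0, b0) in [("Beta(50, 1) (expects bias near 1)", (50, 1)),
                       ("Beta(1, 50) (expects bias near 0)", (1, 50))]:
    a, b = a0 + heads, b0 + tails          # conjugate update
    means[name] = a / (a + b)
    print(f"prior {name}: posterior mean = {means[name]:.3f}")
# Both posterior means land near the true 0.7 — the data washes the priors out.
```

After 1,000 observations the two posterior means differ by less than 0.05 even though the priors started at opposite ends of the interval.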

Here's the connection that unifies half this blog series: L2 regularization is a Gaussian prior on the weights. When you add a λ||w||² penalty to your loss function, you're saying "I believe the weights are small" — that's a zero-mean Gaussian prior. L1 regularization is a Laplace prior. Every regularized model in the series has been doing Bayesian inference in disguise.

MAP Estimation — The Bridge Between Two Worlds

If the full posterior feels like overkill, there's a pragmatic middle ground: Maximum A Posteriori (MAP) estimation. Instead of computing the entire posterior distribution, just find its peak:

θ_MAP = argmax_θ P(data | θ) × P(θ)

Compare with MLE: θ_MLE = argmax_θ P(data | θ)

The only difference is the prior term P(θ). Take logs, and MAP becomes: maximize log-likelihood + log-prior. If the prior is Gaussian (zero-mean, variance σ²), the log-prior is −λ||θ||² up to a constant, with λ = 1/(2σ²) — literally L2 regularization. If the prior is Laplace, it's −λ||θ||₁ — L1 regularization. MAP is the mathematical proof that regularization is a prior belief.

import numpy as np

# MAP vs MLE: Polynomial fitting with and without a prior
np.random.seed(42)

# Generate noisy data from a simple quadratic
x = np.linspace(-1, 1, 15)
y_true = 0.5 * x**2 - 0.3 * x + 0.1
y = y_true + np.random.normal(0, 0.15, len(x))

def fit_polynomial(x, y, degree, reg_lambda=0.0):
    """Fit polynomial with optional L2 regularization (= Gaussian prior)."""
    X = np.column_stack([x**d for d in range(degree + 1)])
    if reg_lambda > 0:
        # MAP with Gaussian prior: (X^T X + lambda I)^-1 X^T y
        w = np.linalg.solve(X.T @ X + reg_lambda * np.eye(degree + 1), X.T @ y)
    else:
        # MLE: (X^T X)^-1 X^T y
        w = np.linalg.lstsq(X, y, rcond=None)[0]
    return w

# Fit degree-10 polynomial — way too complex for 15 points
w_mle = fit_polynomial(x, y, degree=10, reg_lambda=0.0)
w_map = fit_polynomial(x, y, degree=10, reg_lambda=1.0)

# Evaluate on a fine grid
x_test = np.linspace(-1, 1, 200)
X_test = np.column_stack([x_test**d for d in range(11)])
y_truth = 0.5 * x_test**2 - 0.3 * x_test + 0.1

mle_mse = np.mean((X_test @ w_mle - y_truth)**2)
map_mse = np.mean((X_test @ w_map - y_truth)**2)

print(f"MLE test MSE: {mle_mse:.4f}")  # Overfits badly
print(f"MAP test MSE: {map_mse:.4f}")  # Smooth, close to truth

print(f"\nMLE max |weight|: {np.max(np.abs(w_mle)):.1f}")  # Huge weights
print(f"MAP max |weight|: {np.max(np.abs(w_map)):.1f}")   # Small weights

MLE with a degree-10 polynomial and only 15 data points gives wildly oscillating predictions — the weights explode as the model contorts to fit every noise point. MAP with a Gaussian prior (equivalent to λ = 1.0 ridge regression) keeps the weights small and the predictions smooth. The prior says "coefficients should be modest" and the data says "but this coefficient needs to be large to fit the curve" — the posterior is the compromise. This is the same tradeoff at work in regularized logistic regression and gradient boosting.

MCMC — When Conjugacy Fails

Conjugate priors and MAP estimation are elegant, but they cover a narrow slice of reality. Most interesting models — mixture models, hierarchical models, neural networks — don't have conjugate posteriors. The posterior P(θ | data) ∝ P(data | θ)P(θ) is known up to a normalizing constant, but computing that constant requires integrating over every possible θ. In high dimensions, this integral is intractable.

Markov Chain Monte Carlo (MCMC) solves this with a brilliant trick: generate samples from the posterior without ever computing the normalizing constant. The Metropolis-Hastings algorithm works like a random walk that prefers uphill moves on the posterior landscape:

  1. Start at some initial θ0
  2. Propose a new θ' by adding random noise to the current position
  3. Compute the acceptance ratio: α = P(data | θ')P(θ') / P(data | θ)P(θ)
  4. Accept the move with probability min(1, α); otherwise stay put
  5. Repeat thousands of times

The magic is in step 3: the normalizing constants P(data) cancel in the ratio, so you never need to compute them. After a burn-in period, the chain of visited positions approximates the posterior distribution. Regions of high posterior density get visited more often.

import numpy as np

def metropolis_hastings(log_posterior_fn, initial, n_samples, step_size):
    """General-purpose Metropolis-Hastings sampler."""
    current = np.array(initial, dtype=float)
    samples = np.zeros((n_samples, len(current)))
    log_p_current = log_posterior_fn(current)
    n_accepted = 0

    for i in range(n_samples):
        # Propose new position
        proposal = current + np.random.normal(0, step_size, size=len(current))
        log_p_proposal = log_posterior_fn(proposal)

        # Accept/reject
        log_ratio = log_p_proposal - log_p_current
        if np.log(np.random.random()) < log_ratio:
            current = proposal
            log_p_current = log_p_proposal
            n_accepted += 1

        samples[i] = current

    print(f"Acceptance rate: {n_accepted / n_samples:.2%}")
    return samples

# Problem: infer the mean and std of a Gaussian from data
np.random.seed(42)
true_mu, true_sigma = 3.0, 1.5
data = np.random.normal(true_mu, true_sigma, 50)

def log_posterior(params):
    """Log-posterior for Gaussian mean and log-std with weak priors."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # Likelihood: product of Gaussian PDFs (in log space: sum)
    log_lik = -0.5 * np.sum(((data - mu) / sigma)**2) - len(data) * np.log(sigma)
    # Priors: N(0, 10) on mu, N(0, 2) on log_sigma
    log_prior = -0.5 * (mu / 10)**2 - 0.5 * (log_sigma / 2)**2
    return log_lik + log_prior

samples = metropolis_hastings(log_posterior, [0.0, 0.0], n_samples=10000, step_size=0.15)
# Acceptance rate: ~45%

# Discard burn-in (first 2000 samples)
posterior_samples = samples[2000:]
mu_samples = posterior_samples[:, 0]
sigma_samples = np.exp(posterior_samples[:, 1])

print(f"Posterior mean of mu:    {mu_samples.mean():.2f} +/- {mu_samples.std():.2f}")
print(f"Posterior mean of sigma: {sigma_samples.mean():.2f} +/- {sigma_samples.std():.2f}")
# mu ~ 3.0, sigma ~ 1.5 — recovers the true parameters!

We parametrize the standard deviation as log_sigma to keep it positive — a common trick. The sampler runs at roughly 45% acceptance, which sits in the commonly cited optimal range for random-walk Metropolis (about 44% in one dimension, falling toward 23% in high dimensions). After discarding the burn-in period (where the chain wanders toward the high-density region), the samples accurately recover both the mean and standard deviation.
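Step size is the knob that controls that acceptance rate. A quick sketch (a toy 1-D target, separate from the sampler above) showing the tradeoff:

```python
import numpy as np

def acceptance_rate(step_size, n_steps=5000, seed=0):
    """Random-walk Metropolis on a standard normal target."""
    rng = np.random.default_rng(seed)
    x, log_p = 0.0, 0.0        # log N(0,1) density at 0, up to a constant
    accepted = 0
    for _ in range(n_steps):
        prop = x + rng.normal(0, step_size)
        log_p_prop = -0.5 * prop**2
        if np.log(rng.random()) < log_p_prop - log_p:
            x, log_p = prop, log_p_prop
            accepted += 1
    return accepted / n_steps

for s in [0.05, 0.5, 2.5, 10.0]:
    print(f"step={s:5.2f}: acceptance = {acceptance_rate(s):.0%}")
# Tiny steps accept almost everything but barely move;
# huge steps rarely accept; the sweet spot is in between.
```

High acceptance alone isn't the goal — a chain that accepts 99% of microscopic moves explores the posterior far more slowly than one accepting 30% of bold ones.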

There's a deep connection to diffusion models here: Langevin dynamics (used in score-based diffusion) is essentially MCMC with gradients. Both are ways of sampling from complex, unnormalized distributions — MCMC with random proposals, Langevin with gradient-guided proposals.
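To make that connection concrete, here's a minimal sketch (my own illustration, not from the sampler above) of unadjusted Langevin dynamics on a standard normal target, where the score is ∇ log p(θ) = −θ:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.3                    # Langevin step size
theta = 5.0                  # start far from the mode
samples = np.empty(20000)

for i in range(len(samples)):
    grad_log_p = -theta      # gradient of log N(0, 1)
    # gradient-guided drift + injected noise, no accept/reject step
    theta = theta + 0.5 * eps**2 * grad_log_p + eps * rng.normal()
    samples[i] = theta

kept = samples[2000:]        # discard burn-in
print(f"mean ~ {kept.mean():.2f}, std ~ {kept.std():.2f}")
# Should land close to the target's mean 0 and std 1
```

Instead of proposing blindly and rejecting, Langevin drifts toward high density and relies on the noise for exploration — which is why it pairs naturally with learned score functions in diffusion models. (Metropolis-adjusted Langevin, MALA, adds back the accept/reject step to correct the discretization bias.)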

Bayesian Linear Regression — The Full Treatment

Let's bring everything together with a complete worked example. Bayesian linear regression puts a prior on the weight vector and computes the full posterior analytically — it's Normal-Normal conjugacy in multivariate form.

Prior: w ~ N(0, σ_w² I)

Likelihood: y | X, w ~ N(Xw, σ_n² I)

Posterior: w | X, y ~ N(μ_post, Σ_post)

The posterior mean and covariance have closed-form solutions:

Σ_post = (σ_n⁻² XᵀX + σ_w⁻² I)⁻¹

μ_post = σ_n⁻² Σ_post Xᵀ y

import numpy as np

def bayesian_linear_regression(X, y, sigma_noise, sigma_prior):
    """Compute posterior mean and covariance for Bayesian linear regression."""
    n_features = X.shape[1]
    precision_noise = 1 / sigma_noise**2
    precision_prior = 1 / sigma_prior**2

    # Posterior covariance: (sigma_n^-2 * X^T X + sigma_w^-2 * I)^-1
    Sigma_post = np.linalg.inv(
        precision_noise * X.T @ X + precision_prior * np.eye(n_features)
    )
    # Posterior mean: sigma_n^-2 * Sigma_post @ X^T @ y
    mu_post = precision_noise * Sigma_post @ X.T @ y

    return mu_post, Sigma_post

def predict_with_uncertainty(X_new, mu_post, Sigma_post, sigma_noise):
    """Predict with credible intervals."""
    y_pred = X_new @ mu_post
    # Predictive variance = model uncertainty + noise
    pred_var = np.array([
        x @ Sigma_post @ x + sigma_noise**2 for x in X_new
    ])
    return y_pred, np.sqrt(pred_var)

# Generate data from y = 1.5x + 0.5 + noise
np.random.seed(42)
x_train = np.sort(np.random.uniform(-2, 2, 20))
y_train = 1.5 * x_train + 0.5 + np.random.normal(0, 0.5, 20)

# Design matrix with bias term: [1, x]
X_train = np.column_stack([np.ones_like(x_train), x_train])

mu_post, Sigma_post = bayesian_linear_regression(X_train, y_train, 0.5, 2.0)
print(f"Posterior weights: intercept={mu_post[0]:.2f}, slope={mu_post[1]:.2f}")
# Close to true values: intercept=0.5, slope=1.5

# Predict on a fine grid including extrapolation
x_test = np.linspace(-4, 4, 200)
X_test = np.column_stack([np.ones_like(x_test), x_test])
y_pred, y_std = predict_with_uncertainty(X_test, mu_post, Sigma_post, 0.5)

# Three things MLE can't give you:
# 1. Credible intervals: y_pred +/- 2*y_std covers 95% of predictions
# 2. Growing uncertainty: y_std is small near data, large far away
# 3. The posterior itself: Sigma_post tells you weight correlations
in_data = (x_test > -2) & (x_test < 2)
out_data = (x_test < -3) | (x_test > 3)
print(f"Avg uncertainty in data range:    {y_std[in_data].mean():.3f}")
print(f"Avg uncertainty in extrapolation: {y_std[out_data].mean():.3f}")
# Uncertainty is much larger when extrapolating — Bayesian knows what it doesn't know

This is the payoff of the full Bayesian treatment. MLE gives you one line. Bayesian regression gives you a distribution of lines, and from that distribution you get credible intervals that widen where data is sparse. The model literally tells you "I'm confident here, but I'm guessing there." This connects to model evaluation — Bayesian model comparison via marginal likelihood automatically penalizes overly complex models (Occam's razor built into the math).
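That "distribution of lines" can be made literal: draw individual weight vectors from the posterior and each one is a plausible regression line. A self-contained sketch (regenerating similar data and repeating the update formulas from above):

```python
import numpy as np

np.random.seed(0)
x = np.sort(np.random.uniform(-2, 2, 20))
y = 1.5 * x + 0.5 + np.random.normal(0, 0.5, 20)
X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]

sigma_n, sigma_w = 0.5, 2.0
Sigma_post = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(2) / sigma_w**2)
mu_post = Sigma_post @ X.T @ y / sigma_n**2

# Each draw from the posterior is one plausible regression line
for w in np.random.multivariate_normal(mu_post, Sigma_post, size=5):
    print(f"y = {w[0]:.2f} + {w[1]:.2f} x")
# The lines agree where the data is and fan out under extrapolation.
```

Plotting those sampled lines over the data is the classic Bayesian regression picture: a tight bundle inside the data range, a spreading fan outside it.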

Bayesian Deep Learning — Where Theory Meets Scale

Everything so far works beautifully in low dimensions. But neural networks have millions of parameters. Running MCMC on a 7-billion-parameter language model would take longer than the heat death of the universe. Does Bayesian inference have anything to offer at this scale?

Yes — through approximate methods that trade exactness for tractability:

import numpy as np

class SimpleNeuralNet:
    """2-layer neural net with dropout for MC uncertainty estimation."""
    def __init__(self, input_dim, hidden_dim, seed=42):
        rng = np.random.RandomState(seed)
        self.W1 = rng.randn(input_dim, hidden_dim) * 0.5
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.randn(hidden_dim, 1) * 0.5
        self.b2 = np.zeros(1)

    def forward(self, X, dropout_rate=0.0):
        """Forward pass with optional dropout."""
        h = np.maximum(0, X @ self.W1 + self.b1)  # ReLU
        if dropout_rate > 0:
            mask = np.random.binomial(1, 1 - dropout_rate, h.shape)
            h = h * mask / (1 - dropout_rate)      # inverted dropout
        return h @ self.W2 + self.b2

    def mc_predict(self, X, n_forward=50, dropout_rate=0.3):
        """MC Dropout: run multiple forward passes, measure variance."""
        predictions = np.array([self.forward(X, dropout_rate) for _ in range(n_forward)])
        mean_pred = predictions.mean(axis=0)
        std_pred = predictions.std(axis=0)
        return mean_pred.flatten(), std_pred.flatten()

# Demo: uncertainty should be high where there's no training data
net = SimpleNeuralNet(1, 32)
x_test = np.linspace(-5, 5, 100).reshape(-1, 1)
mean_pred, std_pred = net.mc_predict(x_test, n_forward=100, dropout_rate=0.3)

# Show uncertainty varies across the input space
for region, mask in [("center", np.abs(x_test.flatten()) < 1),
                     ("edges", np.abs(x_test.flatten()) > 3)]:
    print(f"Avg uncertainty ({region}): {std_pred[mask].mean():.3f}")

Even without training (using random weights), MC Dropout produces varying uncertainty across the input space — some regions activate more dropout-sensitive pathways than others. In a trained network, this effect is much more pronounced: regions near training data produce consistent predictions (low variance), while regions far from training data produce scattered predictions (high variance). Dropout isn't just a regularizer — it's an approximate Bayesian inference engine.

The connection to knowledge distillation runs deep: ensemble distillation compresses the uncertainty information from multiple models into a single network. And model merging — averaging the weights of multiple trained models — can be viewed as a crude approximation to averaging over posterior modes.

Try It: Bayesian Belief Updater

Set a prior belief about a coin's bias, then flip the coin to see how the posterior updates. The true bias is hidden — can you infer it?


Try It: MCMC Explorer

Watch a Metropolis-Hastings chain explore a bimodal target distribution. Adjust step size to see how it affects exploration.


References & Further Reading

Posts in this series: Naive Bayes, Regularization, Logistic Regression, Information Theory, Diffusion Models, ML Evaluation