
Monte Carlo Methods from Scratch: Random Sampling as a Superpower

The Monte Carlo Idea — Randomness as Computation

In 1946, mathematician Stanislaw Ulam was recovering from brain surgery with nothing to do but play solitaire. After dozens of games, he found himself wondering: what's the probability of a particular layout winning? He tried computing it analytically — combinatorics, conditional probabilities, the works — and gave up. Then he had an insight that would reshape computational science: why not just play a hundred games and count the wins?

This “just simulate it” philosophy became the Monte Carlo method, named after the famous casino in Monaco (where, legend has it, Ulam's uncle had a gambling habit). The core idea is beautifully simple: if you can't compute something analytically, approximate it with random sampling.

More formally: to estimate the expected value E[f(X)], draw N random samples x₁, x₂, …, x_N from the distribution of X and compute the average:

E[f(X)] ≈ (1/N) ∑ f(xᵢ)

By the law of large numbers, this estimate converges to the true value as N grows. The convergence rate is O(1/√N) — meaning each factor of 4× more samples halves your error. And here's the magical part: this rate holds regardless of dimensionality. Whether you're estimating a 1D integral or a 1000D integral, the convergence rate stays the same. This is why Monte Carlo methods dominate high-dimensional computation.

If you've read our posts on Bayesian inference or reinforcement learning, you've already encountered Monte Carlo methods in action — MCMC for posterior sampling, MC returns for policy evaluation. This post builds those foundations from scratch.

Let's start with the most famous Monte Carlo experiment: estimating π.

import numpy as np

np.random.seed(42)

for n in [100, 1_000, 10_000, 100_000]:
    x = np.random.uniform(0, 1, n)
    y = np.random.uniform(0, 1, n)
    inside = np.sum(x**2 + y**2 <= 1.0)
    pi_hat = 4.0 * inside / n
    error = abs(pi_hat - np.pi)
    print(f"N = {n:,}  π ≈ {pi_hat:.6f}  |error| = {error:.6f}")

The idea: throw random darts at a unit square. A quarter-circle of radius 1 has area π/4, so the fraction of darts landing inside the circle times 4 estimates π. Run this and you'll see the error roughly halving each time N quadruples — the O(1/√N) convergence rate at work.

Monte Carlo Integration — Taming Intractable Integrals

Estimating π is fun, but the real power of Monte Carlo is integration. Many problems in machine learning boil down to computing integrals that have no closed-form solution: posterior distributions in Bayesian inference, expected rewards in reinforcement learning, partition functions in energy-based models.

To compute ∫f(x)dx over [a, b], draw N uniform samples and average:

∫ₐᵇ f(x) dx ≈ (b − a)/N · ∑ f(xᵢ)

This extends to multiple dimensions trivially — just draw samples uniformly in the d-dimensional box. And here's where Monte Carlo truly shines: traditional quadrature methods (like Simpson's rule) use a grid of N points per dimension, requiring Nᵈ total evaluations. In 10 dimensions with 100 points per axis, that's 100¹⁰ = 10²⁰ evaluations. Monte Carlo? Still just N evaluations, regardless of d. This exponential blow-up is the famous curse of dimensionality for grid methods, and Monte Carlo is immune to it.

import numpy as np

np.random.seed(42)

def mc_integrate(f, a, b, n_samples):
    """Monte Carlo integration of f over [a, b]."""
    x = np.random.uniform(a, b, n_samples)
    return (b - a) * np.mean(f(x))

# 1D: integral of e^(-x^2) from 0 to 1
estimate_1d = mc_integrate(lambda x: np.exp(-x**2), 0, 1, 100_000)
true_1d = 0.7468  # sqrt(pi)/2 * erf(1)
print(f"1D integral: {estimate_1d:.4f} (true: {true_1d:.4f})")

# 2D: integral of sin(x)*cos(y) over [0, pi] x [0, pi]
n = 100_000
xy = np.random.uniform(0, np.pi, (n, 2))
estimate_2d = (np.pi ** 2) * np.mean(np.sin(xy[:, 0]) * np.cos(xy[:, 1]))
print(f"2D integral: {estimate_2d:.4f} (true: 0.0000)")

# 10D: product of sin(xi) over [0, pi]^10
samples = np.random.uniform(0, np.pi, (500_000, 10))
integrand = np.prod(np.sin(samples), axis=1)
estimate_10d = (np.pi ** 10) * np.mean(integrand)
true_10d = 2**10  # each integral of sin(x) over [0, pi] = 2
print(f"10D integral: {estimate_10d:.1f} (true: {true_10d})")

The 10D integral uses 500K random samples where a grid approach would need trillions of points. This dimension-independence is why Monte Carlo is the backbone of modern probabilistic ML.

Try It: Monte Carlo Playground


Importance Sampling — Smarter Sampling

Basic Monte Carlo draws samples uniformly, but this can be wildly inefficient. Imagine estimating the probability of a rare event — say, P(X > 4) where X is a standard normal variable. This probability is about 3 × 10⁻⁵. With naive sampling, you'd need millions of draws just to get a handful of hits, and your estimate would be noisy.

Importance sampling fixes this by drawing from a smarter proposal distribution q(x) that concentrates samples where they matter. The key identity:

E_p[f(X)] = E_q[f(X) · p(X) / q(X)]

Draw from q instead of p, and multiply each sample by the correction factor p(x)/q(x) to account for the different sampling distribution. If q concentrates samples in the important region (where f(x) · p(x) is large), the variance drops dramatically.

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Goal: estimate P(X > 4) where X ~ N(0, 1)
true_prob = 1 - norm.cdf(4)  # about 3.17e-5

# Naive Monte Carlo: draw from N(0, 1), count exceedances
n_naive = 100_000
samples_naive = np.random.normal(0, 1, n_naive)
naive_estimate = np.mean(samples_naive > 4)
naive_stderr = np.std(samples_naive > 4) / np.sqrt(n_naive)
print(f"True P(X > 4):       {true_prob:.2e}")
print(f"Naive MC (N=100K):    {naive_estimate:.2e} +/- {naive_stderr:.2e}")

# Importance sampling: draw from q = N(4, 1), reweight by p/q
n_is = 1_000
samples_is = np.random.normal(4, 1, n_is)
weights = (samples_is > 4) * norm.pdf(samples_is, 0, 1) / norm.pdf(samples_is, 4, 1)
is_estimate = np.mean(weights)
is_stderr = np.std(weights) / np.sqrt(n_is)
print(f"Importance Sampling (N=1K): {is_estimate:.2e} +/- {is_stderr:.2e}")

The importance sampling estimator uses 100× fewer samples than naive MC yet gives a far more precise estimate. By centering our proposal at 4 (right where the rare event lives), nearly every sample contributes meaningful information. This same technique powers off-policy evaluation in reinforcement learning, where we reweight trajectories from one policy to evaluate another.

Rejection Sampling — Sampling from Awkward Distributions

Sometimes you need actual samples from a distribution p(x), not just an expected value. What if p(x) is some exotic density you can evaluate point-by-point but can't directly sample from?

Rejection sampling is the classic solution: pick a proposal distribution q(x) that you can sample from, find a constant M such that M · q(x) ≥ p(x) everywhere, then repeatedly draw x from q and accept with probability p(x)/(M · q(x)). Accepted samples are exactly distributed according to p.

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Target: mixture of two Gaussians
def target_pdf(x):
    return 0.3 * norm.pdf(x, -2, 0.7) + 0.7 * norm.pdf(x, 3, 1.2)

# Proposal: single broad Gaussian that covers both modes
def proposal_pdf(x):
    return norm.pdf(x, 1, 3)

# Find M such that M * q(x) >= p(x) everywhere
x_grid = np.linspace(-6, 8, 10000)
M = np.max(target_pdf(x_grid) / proposal_pdf(x_grid)) * 1.01

# Rejection sampling
n_proposals = 100_000
proposals = np.random.normal(1, 3, n_proposals)
u = np.random.uniform(0, 1, n_proposals)
accept_prob = target_pdf(proposals) / (M * proposal_pdf(proposals))
accepted = proposals[u <= accept_prob]

acceptance_rate = len(accepted) / n_proposals
print(f"Envelope constant M: {M:.2f}")
print(f"Proposed: {n_proposals:,}  Accepted: {len(accepted):,}")
print(f"Acceptance rate: {acceptance_rate:.1%}")

Rejection sampling works beautifully in low dimensions. But here's the catch: in high dimensions, the acceptance rate drops exponentially. The volume of the proposal envelope grows much faster than the volume of the target distribution — the same curse of dimensionality, striking again. In 10D you might accept one sample per million proposals. If you've read about speculative decoding, you've seen rejection sampling at work in a very different context — accepting or rejecting draft tokens from a smaller model.
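The exponential collapse is easy to demonstrate with a small experiment (a sketch of my own, not part of the post's running example): rejection-sample a d-dimensional standard normal from a wider normal proposal. With proposal standard deviation 2 per axis, the density ratio p/q peaks at the origin at 2ᵈ, so the tightest envelope is M = 2ᵈ and the expected acceptance rate is 2⁻ᵈ.

```python
import numpy as np

np.random.seed(42)

# Target p = N(0, I_d); proposal q = N(0, 4 I_d) (std 2 per axis).
# The ratio p/q peaks at the origin, where it equals 2^d, so the
# tightest envelope is M = 2^d and p(x) / (M q(x)) = exp(-(3/8) ||x||^2).
# Expected acceptance rate: 1/M = 2^(-d).
for d in [1, 2, 5, 10]:
    n = 200_000
    x = np.random.normal(0, 2, (n, d))
    accept_prob = np.exp(-(3.0 / 8.0) * np.sum(x**2, axis=1))
    u = np.random.uniform(0, 1, n)
    rate = np.mean(u <= accept_prob)
    print(f"d = {d:2d}  acceptance: {rate:.4f}  (theory 2^-d = {2.0**-d:.4f})")
```

Each added dimension halves the acceptance rate — and this is with a proposal perfectly centered on the target. Mismatched proposals decay even faster.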

We need a fundamentally different approach for high-dimensional problems. Enter MCMC.

Markov Chain Monte Carlo — The Workhorse of Bayesian ML

MCMC takes a completely different approach to the dimensionality problem. Instead of generating independent samples and filtering (rejection sampling), it constructs a Markov chain — a sequence of correlated samples where each depends on the previous one — whose long-run distribution is exactly the target distribution. You trade independence for efficiency: the samples are correlated, but you can actually get them in high dimensions.

The most famous MCMC algorithm is Metropolis-Hastings (MH), published in its original form by Nicholas Metropolis and colleagues in 1953 for simulating the equation of state of interacting particle systems at Los Alamos, and generalized by W.K. Hastings in 1970. Here's how it works:

  1. Start at some initial position x₀
  2. Propose a new position x′ by making a random jump from the current position
  3. Compute the acceptance ratio α = p(x′) / p(x), where x is the current position
  4. Accept the proposal with probability min(1, α). If rejected, stay put.
  5. Repeat

The genius of step 3: you only need the ratio p(x′)/p(x), so any normalizing constant cancels out. This is exactly the situation in Bayesian inference — the posterior is proportional to likelihood × prior, but the normalizing constant (the evidence) is an intractable integral. MCMC sidesteps it entirely.

The proposal mechanism is typically a Gaussian random walk: x′ = x + ε, where x is the current position and ε ∼ N(0, σ²). The step size σ controls the explore-exploit tradeoff. Too small and the chain crawls through the space with high autocorrelation. Too large and most proposals land in low-density regions and get rejected. Roberts et al. (1997) showed that the optimal acceptance rate is roughly 23.4% for high-dimensional targets.

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Target: bimodal distribution (known only up to normalizing constant)
def log_target(x):
    return np.log(0.3 * norm.pdf(x, -2, 0.5) + 0.7 * norm.pdf(x, 3, 1.0))

# Metropolis-Hastings with Gaussian random walk
n_samples = 10_000
step_size = 1.5
chain = np.zeros(n_samples)
chain[0] = 0.0
accepted = 0

for i in range(1, n_samples):
    proposal = chain[i-1] + np.random.normal(0, step_size)
    log_alpha = log_target(proposal) - log_target(chain[i-1])

    if np.log(np.random.uniform()) < log_alpha:
        chain[i] = proposal
        accepted += 1
    else:
        chain[i] = chain[i-1]

burn_in = 1000
samples = chain[burn_in:]
print(f"Acceptance rate: {accepted / n_samples:.1%}")
print(f"Sample mean: {np.mean(samples):.3f}")
print(f"Sample std:  {np.std(samples):.3f}")

The first 1000 samples (the “burn-in” period) are discarded because the chain needs time to reach its stationary distribution. A trace plot of the remaining samples would show the chain bouncing between the two modes at x = −2 and x = 3, spending roughly 30% and 70% of its time in each — matching the target mixture weights.

The samples are correlated, so the effective sample size (ESS) is less than the actual chain length. With a step size of 1.5, you might get an ESS of ~2000 from 9000 post-burn-in samples. Tuning the step size to hit that ~23% acceptance sweet spot maximizes ESS.
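You can check an ESS figure like this empirically. Here's a rough sketch (my addition, not the post's code) that runs the same random-walk sampler on a standard normal target at three step sizes and estimates ESS with a truncated initial-positive-sequence autocorrelation sum — one common heuristic among several:

```python
import numpy as np

np.random.seed(42)

def ess(chain, max_lag=500):
    """ESS via a truncated initial-positive-sequence autocorrelation sum."""
    x = chain - np.mean(chain)
    n = len(x)
    var = np.dot(x, x) / n
    tau = 1.0  # integrated autocorrelation time
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / (n * var)
        if rho < 0:  # truncate at the first negative autocorrelation
            break
        tau += 2 * rho
    return n / tau

# Random-walk Metropolis on a standard normal target, three step sizes
for step in [0.1, 2.4, 20.0]:
    chain = np.zeros(10_000)
    for i in range(1, len(chain)):
        prop = chain[i-1] + np.random.normal(0, step)
        if np.log(np.random.uniform()) < 0.5 * (chain[i-1]**2 - prop**2):
            chain[i] = prop
        else:
            chain[i] = chain[i-1]
    print(f"step = {step:5.1f}  ESS ~ {ess(chain):,.0f} of {len(chain):,}")
```

The middle step size (close to the classic ~2.4σ rule of thumb for a Gaussian target) should dominate: the tiny step produces a slow crawl, the huge step produces a sticky chain, and both waste most of their samples.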

In the interactive demo below, you can watch a 2D Metropolis-Hastings chain explore a bimodal landscape and experience the step-size tradeoff firsthand.

Try It: MCMC Visualizer

Click Auto to start or Step to advance one sample at a time.

MCMC for Bayesian Posterior Estimation

The killer application of MCMC is Bayesian inference. Given data and a model, we want to compute the posterior distribution over parameters. The posterior is proportional to likelihood × prior, but the normalizing constant requires integrating over the entire parameter space — exactly the kind of intractable integral that MCMC was born to handle.

Let's apply Metropolis-Hastings to a concrete problem: estimating the slope and intercept of a linear regression model with Bayesian priors.

import numpy as np

np.random.seed(42)

# Generate synthetic data: y = 2.5x - 1.0 + noise
n_data = 50
x_data = np.random.uniform(-2, 2, n_data)
y_data = 2.5 * x_data - 1.0 + np.random.normal(0, 0.8, n_data)

sigma = 0.8  # known noise level

def log_posterior(a, b):
    # Priors: a ~ N(0, 10), b ~ N(0, 10)
    log_prior = -0.5 * (a**2 / 100 + b**2 / 100)
    # Likelihood: y_i ~ N(a*x_i + b, sigma^2)
    residuals = y_data - (a * x_data + b)
    log_lik = -0.5 * np.sum(residuals**2) / sigma**2
    return log_prior + log_lik

# 2D Metropolis-Hastings
n_samples = 20_000
step_size = 0.1
chain = np.zeros((n_samples, 2))
chain[0] = [0.0, 0.0]
accepted = 0

for i in range(1, n_samples):
    proposal = chain[i-1] + np.random.normal(0, step_size, 2)
    log_alpha = log_posterior(*proposal) - log_posterior(*chain[i-1])

    if np.log(np.random.uniform()) < log_alpha:
        chain[i] = proposal
        accepted += 1
    else:
        chain[i] = chain[i-1]

burn_in = 2000
posterior = chain[burn_in:]
a_samples, b_samples = posterior[:, 0], posterior[:, 1]

print(f"Acceptance rate: {accepted / n_samples:.1%}")
print(f"Slope a:     {np.mean(a_samples):.3f} +/- {np.std(a_samples):.3f}  (true: 2.500)")
print(f"Intercept b: {np.mean(b_samples):.3f} +/- {np.std(b_samples):.3f}  (true: -1.000)")

# 95% credible intervals
a_ci = np.percentile(a_samples, [2.5, 97.5])
b_ci = np.percentile(b_samples, [2.5, 97.5])
print(f"95% CI for a: [{a_ci[0]:.3f}, {a_ci[1]:.3f}]")
print(f"95% CI for b: [{b_ci[0]:.3f}, {b_ci[1]:.3f}]")

The MCMC posterior correctly recovers the true parameters (a = 2.5, b = −1.0) with credible intervals that reflect our uncertainty given the data and noise level. A scatter plot of the 2D posterior samples would show a compact ellipse centered near the true values — the shape tells you about parameter correlations.

The Metropolis-Hastings algorithm we've built here is the conceptual foundation, but modern Bayesian tools go further. Hamiltonian Monte Carlo (HMC) uses gradient information to make large, efficient jumps through the posterior — it's the engine behind Stan, PyMC, and NumPyro. The No-U-Turn Sampler (NUTS) by Hoffman & Gelman automatically tunes HMC's trajectory length, making it practical for models with thousands of parameters. These are sophisticated variations on the same core idea: construct a Markov chain that explores the posterior.
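To make HMC less mysterious, here's a minimal sketch of the core loop (my addition, using the simplest possible target): sampling a standard normal, where the potential energy is U(x) = x²/2 and its gradient is just x. Each iteration resamples a momentum, simulates Hamiltonian dynamics with leapfrog integration, and applies a Metropolis correction on the total energy.

```python
import numpy as np

np.random.seed(42)

def hmc_sample(n_samples, step=0.2, n_leapfrog=20):
    """HMC for a standard normal target: U(x) = x^2/2, grad U(x) = x."""
    chain = np.zeros(n_samples)
    x = 0.0
    accepted = 0
    for i in range(n_samples):
        p = np.random.normal()  # resample momentum each iteration
        x_new, p_new = x, p
        # Leapfrog integration: half momentum step, alternate full steps,
        # final half momentum step
        p_new -= 0.5 * step * x_new
        for j in range(n_leapfrog):
            x_new += step * p_new
            if j != n_leapfrog - 1:
                p_new -= step * x_new
        p_new -= 0.5 * step * x_new
        # Metropolis correction on the total energy H = U(x) + p^2/2
        dH = 0.5 * (x_new**2 + p_new**2) - 0.5 * (x**2 + p**2)
        if np.log(np.random.uniform()) < -dH:
            x = x_new
            accepted += 1
        chain[i] = x
    return chain, accepted / n_samples

chain, rate = hmc_sample(5_000)
print(f"Acceptance rate: {rate:.1%}")
print(f"Sample mean: {np.mean(chain):.3f}  std: {np.std(chain):.3f}")
```

Because the gradient steers each trajectory toward high-probability regions, nearly every proposal is accepted even though each jump travels far — exactly the property random-walk MH lacks. NUTS layers automatic tuning of step and n_leapfrog on top of this same loop.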

The Monte Carlo journey: We started with darts estimating π, generalized to integration, learned to focus samples with importance sampling, discovered how to sample from arbitrary distributions with rejection sampling, then broke through the dimensionality barrier with MCMC. Each technique solves a limitation of the previous one — a beautiful progression from simple to powerful.

References & Further Reading

Related DadOps posts: Bayesian Inference from Scratch (MCMC for posteriors), Reinforcement Learning from Scratch (MC returns), Diffusion Models from Scratch (Langevin dynamics connection), Speculative Decoding from Scratch (rejection sampling for tokens), Bayesian Optimization from Scratch (GP-based sequential optimization)