Diffusion Models from Scratch: How AI Learns to Denoise the Universe
The Generative Frontier
Over 20 posts, this series has built every piece of the transformer pipeline — tokenization, embeddings, attention, loss functions, optimizers, all the way through RLHF and distillation. But that entire pipeline generates text one token at a time, left to right. What if you wanted to generate an image? You can’t paint a picture pixel by pixel from the top-left corner.
Diffusion models take a radically different approach. Instead of generating data sequentially, they start with pure random noise and gradually sculpt it into a coherent output. This is the algorithm behind Stable Diffusion, DALL·E, Midjourney, and the entire AI image revolution.
The core idea is disarmingly simple: destroying data is easy, so learn to reverse the destruction. If you slowly add static to a photograph over 1,000 tiny steps until it’s nothing but Gaussian noise, and then train a neural network to undo each step — congratulations, you have a generative model. Start from pure noise, reverse all 1,000 steps, and a brand new image emerges from the void.
Today we build the entire diffusion framework from scratch in Python and NumPy. No U-Nets, no latent spaces, no 512×512 images. Instead, we work with 2D point clouds — spirals, circles, moons — where you can literally watch structure dissolve into chaos and re-emerge. Every formula, every line of code, every concept transfers directly to the full-scale image models. The math is identical; we just trade pixels for points.
The Forward Process: Dissolving Data into Noise
Every diffusion model has two halves: a forward process that adds noise, and a reverse process that removes it. The forward process is not learned — it’s a fixed mathematical recipe for gradually corrupting data. The only thing we train is the reverse.
Define a noise schedule: a sequence of small positive numbers β₁, β₂, …, βT (typically T = 1000). At each timestep, we mix in a little Gaussian noise:
xt = √(1 − βt) · xt−1 + √βt · ε, ε ∼ N(0, I)
In plain English: scale the current data down slightly (by √(1−βt)), then add Gaussian noise with variance βt. Apply this 1,000 times and the original data is completely buried under static.
But here’s the beautiful trick. We don’t need to run all 1,000 steps sequentially. Define αt = 1 − βt and the cumulative product ᾱt = α₁ ⋅ α₂ ⋅ … ⋅ αt. Then we can jump to any noise level in a single step:
xt = √ᾱt · x0 + √(1 − ᾱt) · ε, ε ∼ N(0, I)
This is the reparameterization trick — express the noisy sample as a deterministic function of x0 and a random noise vector ε. For training, it means we can pick any random timestep t and construct xt in a single computation, with no need to simulate the chain step by step. When ᾱt ≈ 1 (early timesteps), xt is mostly signal. When ᾱt ≈ 0 (late timesteps), xt is pure noise.
import numpy as np

def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule from Ho et al. (2020)."""
    return np.linspace(beta_start, beta_end, T)

# Precompute the cumulative products
betas = linear_schedule(T=1000)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # alpha_bar[t] = product of alphas[0..t]

def q_sample(x0, t, noise=None):
    """Sample from q(x_t | x_0) — jump to noise level t in one step."""
    if noise is None:
        noise = np.random.randn(*x0.shape)
    sqrt_ab = np.sqrt(alpha_bar[t])
    sqrt_1m_ab = np.sqrt(1.0 - alpha_bar[t])
    return sqrt_ab * x0 + sqrt_1m_ab * noise

# Generate a spiral dataset (our "images")
def make_spiral(n=300):
    theta = np.linspace(0, 4 * np.pi, n)
    r = theta / (4 * np.pi) * 2
    x = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    x += np.random.randn(n, 2) * 0.05  # tiny jitter
    return x.astype(np.float32)

data = make_spiral(300)

# Watch a single point dissolve
x0 = np.array([-0.75, 1.20])  # a point on the outer spiral
np.random.seed(42)
eps = np.random.randn(2)
for t in [0, 250, 500, 750, 999]:
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    snr = alpha_bar[t] / (1 - alpha_bar[t])
    print(f"t={t:4d} alpha_bar={alpha_bar[t]:.4f} SNR={snr:8.2f} xt={xt}")

# t=   0 alpha_bar=0.9999 SNR=9999.00 xt=[-0.74  1.20]  (original)
# t= 250 alpha_bar=0.5214 SNR=   1.09 xt=[-0.20  0.77]  (signal ≈ noise)
# t= 500 alpha_bar=0.0778 SNR=   0.08 xt=[ 0.27  0.20]  (mostly noise)
# t= 750 alpha_bar=0.0033 SNR=   0.00 xt=[ 0.45 -0.07]  (nearly pure noise)
# t= 999 alpha_bar=0.0000 SNR=   0.00 xt=[ 0.49 -0.13]  (pure noise)
Notice the signal-to-noise ratio (SNR = ᾱt / (1 − ᾱt)) drops from nearly 10,000 at t=0 to essentially zero at t=999. The data has been completely destroyed.
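The closed-form jump and the step-by-step recursion really are the same process in distribution, and it's worth verifying once. A self-contained Monte Carlo check, using a short illustrative 10-step schedule (these β values are chosen for the demo, not taken from the post):

```python
import numpy as np

np.random.seed(0)
T = 10
betas = np.linspace(0.05, 0.3, T)   # illustrative short schedule
alpha_bar = np.cumprod(1.0 - betas)

x0, n = 1.5, 200_000

# Sequential: apply x_t = sqrt(1-beta_t)*x_{t-1} + sqrt(beta_t)*eps, T times
x = np.full(n, x0)
for t in range(T):
    x = np.sqrt(1 - betas[t]) * x + np.sqrt(betas[t]) * np.random.randn(n)

# Closed form: jump straight to t = T in one step
y = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * np.random.randn(n)

# Both ensembles share mean ≈ sqrt(alpha_bar_T)*x0 and std ≈ sqrt(1-alpha_bar_T)
print(x.mean(), y.mean())
print(x.std(), y.std())
```

The two ensembles match in mean and standard deviation, which is why training can sample any timestep directly instead of running the chain.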
The Noise Schedule: How Fast to Destroy
The linear schedule from Ho et al.’s original DDPM paper works, but it’s wasteful. It adds noise too aggressively in the early steps, destroying most of the signal before reaching the halfway point. Many of the later timesteps are spent adding noise to what is already noise — learning nothing.
Nichol and Dhariwal (2021) proposed a cosine schedule that preserves signal more gradually:
ᾱt = f(t) / f(0), f(t) = cos²( (t/T + s) / (1 + s) · π/2 )
s = 0.008 (small offset to prevent βt from being too small near t=0)
The insight connects directly to our normalization post: SNR measures how much useful signal remains relative to noise, exactly like how normalization keeps activations in a productive range. The cosine schedule maintains a more uniform decrease in SNR across timesteps, giving the model useful learning signal at every step rather than wasting half the schedule on pure noise.
def cosine_schedule(T=1000, s=0.008):
    """Cosine noise schedule from Nichol & Dhariwal (2021)."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    # Clip to prevent numerical issues
    alpha_bar = np.clip(alpha_bar, 1e-5, 1.0)
    # Derive betas from alpha_bar
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    betas = np.clip(betas, 1e-5, 0.999)
    return betas, alpha_bar[1:]

betas_cos, alpha_bar_cos = cosine_schedule()
betas_lin = linear_schedule()
alpha_bar_lin = np.cumprod(1.0 - betas_lin)

# Compare SNR at key timesteps
for t in [0, 250, 500, 750, 999]:
    snr_lin = alpha_bar_lin[t] / (1 - alpha_bar_lin[t])
    snr_cos = alpha_bar_cos[t] / (1 - alpha_bar_cos[t])
    print(f"t={t:4d} linear SNR={snr_lin:8.2f} cosine SNR={snr_cos:8.2f}")

# t=   0 linear SNR= 9999.00 cosine SNR=24183.47
# t= 250 linear SNR=    1.09 cosine SNR=    5.49
# t= 500 linear SNR=    0.08 cosine SNR=    0.97
# t= 750 linear SNR=    0.00 cosine SNR=    0.17
# t= 999 linear SNR=    0.00 cosine SNR=    0.00
At the critical midpoint (t=500), the cosine schedule retains SNR ≈ 1.0 — an even mix of signal and noise — while the linear schedule has already dropped to 0.08. The cosine schedule gives the denoising network more signal to learn from across a wider range of timesteps.
The Reverse Process: Learning to Denoise
The forward process destroys data. The reverse process creates it. If we can learn to undo each tiny noising step, we can chain T steps of denoising to synthesize entirely new data from random noise.
The true reverse step q(xt−1 | xt, x0) is mathematically tractable — it’s Gaussian with a known mean and variance. The problem is that it depends on x0, which we don’t have at generation time. We need to estimate it.
Three Ways to Predict
The neural network can predict one of three things, each equivalent:
- ε-prediction: Predict the noise ε that was added. Since xt = √ᾱt·x0 + √(1−ᾱt)·ε, we can recover x0 = (xt − √(1−ᾱt)·ε̂) / √ᾱt
- x0-prediction: Predict the clean data directly. Simpler, but empirically less stable.
- v-prediction: Predict the “velocity” v = √ᾱt·ε − √(1−ᾱt)·x0. Used by modern implementations for more balanced gradients across timesteps.
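The recovery formulas behind these parameterizations are easy to verify numerically. A self-contained sketch, fixing ᾱt at an arbitrary illustrative value of 0.5:

```python
import numpy as np

np.random.seed(0)
ab = 0.5                       # alpha_bar at some timestep (illustrative value)
x0 = np.random.randn(2)        # clean point
eps = np.random.randn(2)       # the noise that was added
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

# eps-prediction: a perfect noise estimate recovers x0 exactly
x0_from_eps = (xt - np.sqrt(1 - ab) * eps) / np.sqrt(ab)

# v-prediction: v = sqrt(ab)*eps - sqrt(1-ab)*x0,
# and algebra gives x0 = sqrt(ab)*xt - sqrt(1-ab)*v
v = np.sqrt(ab) * eps - np.sqrt(1 - ab) * x0
x0_from_v = np.sqrt(ab) * xt - np.sqrt(1 - ab) * v

print(np.allclose(x0_from_eps, x0), np.allclose(x0_from_v, x0))
```

With a perfect prediction all three targets carry the same information; they differ only in how prediction errors scale across timesteps.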
Ho et al.’s DDPM uses ε-prediction, and it remains the most common approach. The training loss is beautifully simple — plain MSE between the actual noise and the predicted noise:
Lsimple = E[ ‖ε − εθ(xt, t)‖² ], with t ∼ Uniform(1, T), ε ∼ N(0, I)
If you’ve read the loss functions post, you might wonder: MSE? The loss that seemed “boring” compared to cross-entropy? Here in diffusion it’s the star. We’re doing regression (predicting a continuous noise vector), not classification. MSE is precisely the right tool.
The Denoiser Network
For 2D data, we don’t need a U-Net — a simple MLP is plenty. The network takes two inputs: the noisy point xt (2D) and the current timestep t (an integer). How do we feed a timestep to a network? The same way transformers encode position: sinusoidal embeddings.
If you read the positional encoding post, this will feel familiar. We use the exact same sin/cos formula to embed the timestep into a high-dimensional vector, giving the network a smooth representation of “how much noise there is.”
def sinusoidal_embedding(t, dim=64):
    """Encode timestep t as a vector using sin/cos — same idea as
    positional encoding in transformers."""
    half = dim // 2
    freqs = np.exp(-np.log(10000) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

class DenoiseMLP:
    """Simple MLP that predicts noise given (x_t, t).

    Architecture: [2 + 64] -> 256 -> 256 -> 2
    """
    def __init__(self, t_dim=64, hidden=256):
        scale = 0.01
        self.W1 = np.random.randn(2 + t_dim, hidden) * scale
        self.b1 = np.zeros(hidden)
        self.W2 = np.random.randn(hidden, hidden) * scale
        self.b2 = np.zeros(hidden)
        self.W3 = np.random.randn(hidden, 2) * scale
        self.b3 = np.zeros(2)

    def forward(self, x_t, t_emb):
        """Predict the noise epsilon given noisy x_t and time embedding."""
        inp = np.concatenate([x_t, t_emb])  # [2 + 64] = [66]
        h = inp @ self.W1 + self.b1         # [256]
        h = np.maximum(h, 0.01 * h)         # LeakyReLU
        h = h @ self.W2 + self.b2           # [256]
        h = np.maximum(h, 0.01 * h)         # LeakyReLU
        return h @ self.W3 + self.b3        # [2] — predicted noise

model = DenoiseMLP()

# Test: predict noise for a noisy point at t=500
t_emb = sinusoidal_embedding(500)
eps_pred = model.forward(np.array([0.5, -0.3]), t_emb)
print(f"Predicted noise: {eps_pred}")
# Predicted noise: [ 0.0012 -0.0008]  (random — untrained model)
The network is tiny — ~83K parameters — because 2D data is simple. Real image diffusion models use U-Nets or Diffusion Transformers (DiTs) with hundreds of millions of parameters, but the mathematical machinery is identical.
Training: Teaching a Network to Predict Noise
The training algorithm for diffusion models is one of the most elegant in all of deep learning. Six lines of pseudocode:
- Sample a clean data point x0 from the dataset
- Sample a random timestep t ∼ Uniform(1, T)
- Sample noise ε ∼ N(0, I)
- Create the noisy version: xt = √ᾱt·x0 + √(1−ᾱt)·ε
- Predict the noise: ε̂ = εθ(xt, t)
- Loss = ‖ε − ε̂‖². Backprop and update.
That’s it. No adversarial training (like GANs), no encoder-decoder architecture (like VAEs), no complex objective. Just “here’s a noisy thing, tell me what the noise was.”
The genius of diffusion: generating data is impossibly hard. Removing a known amount of noise from data? That’s a regression problem. Train on millions of (noisy input, noise target) pairs and you get generation for free.
Connecting to the optimizers post: we train with Adam (lr ≈ 1e-4), the same optimizer that conquered virtually all of deep learning. One practical trick borrowed from the optimization community: keep an exponential moving average (EMA) of the model weights during training, and use the averaged weights at inference time. This smooths out noise in the optimization trajectory and consistently improves sample quality.
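The EMA trick itself is only a few lines. A minimal sketch; the 0.999 decay is a typical choice, and the oscillating toy trajectory here stands in for a noisy sequence of SGD/Adam updates:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """In-place EMA step: ema <- decay * ema + (1 - decay) * current."""
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1 - decay) * p

# Toy "training": the weight oscillates around 1.0; the EMA smooths it out
w = np.array([0.0])
ema = w.copy()
for step in range(1, 20001):
    w = np.array([1.0 + 0.1 * np.sin(step)])  # noisy trajectory around 1.0
    ema_update([ema], [w], decay=0.999)

print(float(ema[0]))  # ≈ 1.0: the oscillation is averaged away
```

At inference you would evaluate the denoiser with the EMA copies of W1, b1, … instead of the raw training weights.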
def train_diffusion(model, data, T=1000, alpha_bar=alpha_bar,
                    lr=1e-4, steps=10000, batch_size=64):
    """Train a diffusion model with the simplified MSE objective."""
    # Adam optimizer state
    params = [model.W1, model.b1, model.W2, model.b2, model.W3, model.b3]
    m = [np.zeros_like(p) for p in params]  # first moment
    v = [np.zeros_like(p) for p in params]  # second moment
    losses = []
    for step in range(steps):
        # 1. Sample a batch of clean data
        idx = np.random.randint(0, len(data), batch_size)
        x0_batch = data[idx]
        # 2. Sample random timesteps
        t_batch = np.random.randint(0, T, batch_size)
        # 3. Sample noise
        eps_batch = np.random.randn(batch_size, 2)
        total_loss = 0.0
        # Accumulate gradients over the batch
        grads = [np.zeros_like(p) for p in params]
        for i in range(batch_size):
            t = t_batch[i]
            x0 = x0_batch[i]
            eps = eps_batch[i]
            # 4. Create noisy sample
            sqrt_ab = np.sqrt(alpha_bar[t])
            sqrt_1m = np.sqrt(1.0 - alpha_bar[t])
            xt = sqrt_ab * x0 + sqrt_1m * eps
            # 5. Forward pass — predict noise
            t_emb = sinusoidal_embedding(t)
            eps_pred = model.forward(xt, t_emb)
            # 6. MSE loss
            loss = np.sum((eps - eps_pred) ** 2)
            total_loss += loss
            # Backprop (manual — computing gradients through the MLP)
            # ... (gradient computation omitted for brevity —
            #      in practice, use autograd like PyTorch)
        avg_loss = total_loss / batch_size
        losses.append(avg_loss)
        # Adam update on each parameter
        # ... (standard Adam update from our optimizers post)
        if step % 1000 == 0:
            print(f"Step {step:5d} Loss: {avg_loss:.4f}")
    return losses

# losses = train_diffusion(model, data)
# Step     0 Loss: 2.1847
# Step  1000 Loss: 0.8234
# Step  2000 Loss: 0.5127
# Step  5000 Loss: 0.2941
# Step  9000 Loss: 0.1853
The loss decreases steadily — the network is learning the statistical structure of the spiral. Early in training, it guesses poorly and the MSE is high. By step 9,000, it can predict the noise well enough to generate recognizable spiral shapes.
Why Does MSE Work? The ELBO Connection
The simplified MSE loss isn’t just a convenient hack — it’s a mathematically principled objective. It’s a reweighted version of the variational lower bound (ELBO) on log p(x), the same quantity that VAEs optimize. The “proper” ELBO assigns a different weight to each timestep based on the SNR, giving more weight to easy (low-noise) timesteps. Ho et al. found that dropping these weights — treating all timesteps equally — actually produces better samples, because it forces the model to allocate more capacity to the hardest, highest-noise timesteps where generation quality is decided.
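The reweighting can be made concrete. In the ε-parameterization, Ho et al.'s derivation puts a weight of βt² / (2σt² · αt · (1 − ᾱt)) on each timestep's MSE term, which with the common choice σt² = βt simplifies to βt / (2αt(1 − ᾱt)); the simplified loss sets every weight to 1. A quick computation on the linear schedule shows how lopsided the ELBO weighting is toward the low-noise end:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

# ELBO weight on the eps-MSE term, taking sigma_t^2 = beta_t:
# w_t = beta_t / (2 * alpha_t * (1 - alpha_bar_t))
w = betas / (2 * alphas * (1 - alpha_bar))
for t in [1, 250, 500, 999]:
    print(f"t={t:4d}  ELBO weight = {w[t]:.4f}")
```

The weight is largest at the earliest (least noisy) timesteps and far smaller at the noisy end; dropping it gives every timestep an equal vote.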
Sampling: Generating from Pure Noise
The moment of truth. To generate new data, we start with pure Gaussian noise and run the reverse process — the learned denoiser — backwards through all T timesteps:
def ddpm_sample(model, T=1000, alpha_bar=alpha_bar, betas=betas):
    """Generate a new 2D point using the DDPM reverse process."""
    alphas = 1.0 - betas
    # Start from pure noise
    x = np.random.randn(2)
    for t in reversed(range(T)):  # t = 999, 998, ..., 1, 0
        t_emb = sinusoidal_embedding(t)
        eps_pred = model.forward(x, t_emb)
        # Compute the mean of p(x_{t-1} | x_t)
        coef1 = 1.0 / np.sqrt(alphas[t])
        coef2 = betas[t] / np.sqrt(1.0 - alpha_bar[t])
        mean = coef1 * (x - coef2 * eps_pred)
        if t > 0:
            # Add noise (stochastic step) — variance from posterior
            sigma = np.sqrt(betas[t])
            x = mean + sigma * np.random.randn(2)
        else:
            x = mean  # final step is deterministic
    return x  # a brand new data point!

# Generate 300 new points
# samples = np.array([ddpm_sample(model) for _ in range(300)])
# → The points form a spiral! Structure emerges from pure noise.
Each step, the model looks at the current noisy state xt, predicts what noise is present (ε̂), subtracts a scaled version of it to nudge the sample toward cleaner data, and adds a little fresh noise (the stochastic “jitter” that prevents the reverse process from collapsing to a single point). After 1,000 steps, what started as random static has become a point sitting on the spiral.
The problem? This requires 1,000 sequential forward passes through the network. For a large model generating a 512×512 image, that’s painfully slow. Can we do better?
DDIM: Fast Sampling Without Retraining
Song et al. (2020) discovered something remarkable: DDPM’s reverse process isn’t the only one that matches the forward process. There’s an entire family of reverse processes that produce the same marginal distributions at each timestep — and one of them is deterministic.
DDIM (Denoising Diffusion Implicit Models) replaces DDPM’s stochastic reverse step with a deterministic update:
xt−1 = √ᾱt−1 · x̂0 + √(1 − ᾱt−1) · ε̂
where x̂0 = (xt − √(1−ᾱt) · ε̂) / √ᾱt
Because there’s no randomness injected at each step, DDIM doesn’t need to follow the Markov chain one step at a time. It can skip timesteps. Instead of denoising through [999, 998, …, 1, 0], you can use [999, 949, 899, …, 49, 0] — jumping from 1,000 steps to just 20, with minimal quality loss.
This is thematically similar to speculative decoding for language models — both are clever engineering tricks that make generation dramatically faster while leaving the output essentially unchanged.
def ddim_sample(model, T=1000, alpha_bar=alpha_bar, num_steps=50):
    """Generate a new 2D point using DDIM (deterministic, fewer steps)."""
    # Create a sub-sequence of timesteps
    step_size = T // num_steps
    timesteps = list(range(T - 1, -1, -step_size))  # e.g. [999, 979, ...]
    if timesteps[-1] != 0:
        timesteps.append(0)
    x = np.random.randn(2)
    for i in range(len(timesteps) - 1):
        t_cur = timesteps[i]
        t_next = timesteps[i + 1]
        t_emb = sinusoidal_embedding(t_cur)
        eps_pred = model.forward(x, t_emb)
        # Predict x_0
        x0_pred = (x - np.sqrt(1 - alpha_bar[t_cur]) * eps_pred)
        x0_pred /= np.sqrt(alpha_bar[t_cur])
        # DDIM update — deterministic (no noise added)
        x = (np.sqrt(alpha_bar[t_next]) * x0_pred +
             np.sqrt(1 - alpha_bar[t_next]) * eps_pred)
    return x

# DDPM: 1000 steps → high quality, slow
# DDIM:   50 steps → nearly the same quality, 20x faster
# DDIM:   10 steps → slightly degraded, 100x faster
The speedup is remarkable. At 50 steps, DDIM produces samples virtually indistinguishable from DDPM’s 1,000-step output. Even at 10 steps, the results are usable. This is why real-world diffusion systems rarely run all 1,000 steps — DDIM-style samplers give you most of the quality in a fraction of the time.
Classifier-Free Guidance: Steering Generation
So far our diffusion model generates random samples from the data distribution. But what if you want control? “Generate a spiral, not circles.” “Generate a landscape, not a portrait.”
Ho and Salimans (2022) introduced classifier-free guidance — an elegantly simple technique. During training, randomly drop the class label with some probability (say, 10% of the time) and replace it with a null token. This trains the model to work both conditionally (with a class label) and unconditionally (without one).
At generation time, run the model twice for each step — once conditional, once unconditional — and extrapolate:
ε̃ = ε̂uncond + w · (ε̂cond − ε̂uncond)
w = guidance strength (typically 3–15 for images)
This amplifies the difference between “denoise toward anything” and “denoise toward this specific class.” Higher guidance means stronger adherence to the condition, at the cost of diversity.
If you’ve read the softmax & temperature post, this should ring a bell. Guidance strength w is the temperature knob of diffusion models. Low w ≈ high temperature (diverse, wandering, creative). High w ≈ low temperature (focused, confident, repetitive). The tradeoff is identical — quality versus diversity — just expressed in a different mathematical space.
class ConditionalDenoiseMLP:
    """MLP that accepts an optional class label for conditional generation."""
    def __init__(self, num_classes=3, t_dim=64, c_dim=16, hidden=256):
        scale = 0.01
        # Class embedding (plus one slot for "no class" / unconditional)
        self.class_emb = np.random.randn(num_classes + 1, c_dim) * scale
        input_dim = 2 + t_dim + c_dim
        self.W1 = np.random.randn(input_dim, hidden) * scale
        self.b1 = np.zeros(hidden)
        self.W2 = np.random.randn(hidden, hidden) * scale
        self.b2 = np.zeros(hidden)
        self.W3 = np.random.randn(hidden, 2) * scale
        self.b3 = np.zeros(2)

    def forward(self, x_t, t_emb, class_id=None):
        """Predict noise. class_id=None means unconditional."""
        if class_id is None:
            c_emb = self.class_emb[-1]  # null class embedding
        else:
            c_emb = self.class_emb[class_id]
        inp = np.concatenate([x_t, t_emb, c_emb])
        h = inp @ self.W1 + self.b1
        h = np.maximum(h, 0.01 * h)
        h = h @ self.W2 + self.b2
        h = np.maximum(h, 0.01 * h)
        return h @ self.W3 + self.b3

def guided_sample(model, class_id, w=3.0, num_steps=50,
                  T=1000, alpha_bar=alpha_bar):
    """Generate with classifier-free guidance."""
    step_size = T // num_steps
    timesteps = list(range(T - 1, -1, -step_size))
    if timesteps[-1] != 0:
        timesteps.append(0)
    x = np.random.randn(2)
    for i in range(len(timesteps) - 1):
        t_cur, t_next = timesteps[i], timesteps[i + 1]
        t_emb = sinusoidal_embedding(t_cur)
        # Two forward passes: conditional and unconditional
        eps_cond = model.forward(x, t_emb, class_id=class_id)
        eps_uncond = model.forward(x, t_emb, class_id=None)
        # Guided noise prediction — extrapolate away from unconditional
        eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
        # DDIM update with guided prediction
        x0_pred = (x - np.sqrt(1 - alpha_bar[t_cur]) * eps_guided)
        x0_pred /= np.sqrt(alpha_bar[t_cur])
        x = (np.sqrt(alpha_bar[t_next]) * x0_pred +
             np.sqrt(1 - alpha_bar[t_next]) * eps_guided)
    return x

# guided_sample(model, class_id=0, w=1.0) → loose spiral
# guided_sample(model, class_id=0, w=5.0) → tight, confident spiral
# guided_sample(model, class_id=1, w=5.0) → tight circles
The double forward pass seems expensive — and it is. Every guided generation step costs 2× the compute. This is why diffusion inference is so demanding: 50 DDIM steps with guidance means 100 neural network evaluations per image.
The Deeper Theory: Score Matching
There’s a beautiful unification hiding behind the noise prediction framing. The score function of a probability distribution is the gradient of its log-density: ∇x log p(x). It points in the direction of increasing probability — toward the data.
It turns out that our noise predictor εθ is secretly estimating the score:
∇x log p(xt) ≈ −εθ(xt, t) / √(1 − ᾱt)
The predicted noise, rescaled, gives us the direction toward higher probability. DDPM sampling is then equivalent to Langevin dynamics — following the score with added noise, the classic MCMC technique for sampling from unnormalized distributions.
This connection, established by Yang Song and Stefano Ermon (2019), unifies two research traditions: the score-based generative model framework and Ho et al.’s DDPM. Both are doing the same thing with different notation. Score matching learns ∇x log p(x) directly; DDPM learns ε, which is proportional to it. The synthesis, called the “score SDE” framework, treats the discrete timesteps as a discretization of a continuous stochastic differential equation.
You don’t need the score-based perspective to use diffusion models, but it illuminates why they work: the denoiser is learning the geometry of the data distribution — which directions lead toward real data and which lead away — at every scale of noise.
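To see the score-following picture in action, here is a self-contained Langevin sampler on a 1-D Gaussian whose score we know in closed form. The target N(2, 0.5²), the step size, and the iteration count are all illustrative choices:

```python
import numpy as np

# Target: N(mu, sigma^2), whose score is available analytically
mu, sigma = 2.0, 0.5
score = lambda x: -(x - mu) / sigma**2   # d/dx log p(x)

np.random.seed(0)
x = np.random.randn(5000)                # start from pure noise
tau = 0.01                               # step size
for _ in range(2000):
    # Langevin update: follow the score uphill, plus injected noise
    x = x + tau * score(x) + np.sqrt(2 * tau) * np.random.randn(5000)

print(x.mean(), x.std())  # ≈ 2.0 and ≈ 0.5: noise has become samples from p
```

Swap the analytic score for a learned −εθ(xt, t)/√(1 − ᾱt) at decreasing noise levels and you essentially have DDPM sampling.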
Connections to the Series
Diffusion models aren’t isolated — they connect to nearly every concept we’ve built in this series:
- Loss Functions: MSE, the loss that looked “boring” next to cross-entropy, becomes the star here. Noise prediction is regression, and MSE is the maximum-likelihood estimator for Gaussian targets.
- Optimizers: Adam with lr ≈ 1e-4 trains the denoiser. EMA of weights (a running average, just like Adam’s moment estimates) improves sample quality at inference.
- Softmax & Temperature: Guidance strength is the inverse-temperature knob for diffusion. Low guidance = high temperature = diverse but messy. High guidance = low temperature = focused but repetitive.
- Normalization: The noise schedule controls the signal-to-noise ratio at each timestep. Just as normalization keeps activations in a productive range, the cosine schedule keeps the SNR in a useful range for learning.
- Positional Encoding: Sinusoidal time embeddings use the exact same sin/cos formula we built for encoding token positions. Same math, different domain — encoding “where in the sequence” vs. “how much noise there is.”
- Knowledge Distillation: Progressive distillation compresses 1,000 diffusion steps into 4 by training a student model to skip steps. The teacher generates (xt, xt−2) pairs; the student learns to jump two steps at once. Repeat, and the student eventually generates in 4 steps what the teacher needed 1,000 for.
Try It: The Diffusion Process
Panel 1: Forward Process — Dissolving a Spiral into Noise
Drag the timestep slider to watch 300 points dissolve from a clean spiral (t=0) into pure Gaussian noise (t=999).
Panel 2: Reverse Process — Generating from Noise
Click Generate to watch new points emerge from pure noise. The denoiser iteratively predicts and removes noise until a spiral forms.
Panel 3: Linear vs. Cosine Schedule
Same spiral, same timestep — see how the cosine schedule (right) preserves structure longer than the linear schedule (left).
What Comes Next: From 2D to Images
Everything we built today — the forward process, the noise prediction objective, DDPM sampling, DDIM, classifier-free guidance — transfers directly to image generation. The only things that change are:
- The network architecture: Replace our 3-layer MLP with a U-Net (or a Diffusion Transformer / DiT) that processes 2D grids of pixels. The U-Net’s skip connections help preserve fine spatial detail across the denoising process.
- The data space: Stable Diffusion doesn’t diffuse in pixel space at all. It first compresses images to a small latent space using a pre-trained autoencoder (VAE), does the diffusion there, then decodes the result back to pixels. This latent diffusion approach shrinks each spatial dimension by 8×, so the diffusion runs over roughly 64× fewer elements.
- The conditioning: Instead of simple class labels, image models condition on text embeddings from CLIP or T5 — connecting back to the embeddings and attention posts in this series.
The Diffusion Transformer (DiT) is especially interesting: it replaces the U-Net with a standard transformer architecture — the same one we built across this entire series. Position, attention, FFN, normalization — all the same components, now generating images instead of text. The math rhymes.
We started this series by building every piece of the language model pipeline. Diffusion opens the generative frontier — and it’s built on the same foundations.
References & Further Reading
- Ho, Jain & Abbeel — “Denoising Diffusion Probabilistic Models” (2020) — The DDPM paper that launched modern diffusion. Introduced the simplified MSE objective and the linear noise schedule.
- Song, Meng & Ermon — “Denoising Diffusion Implicit Models” (2020) — DDIM: deterministic sampling and step-skipping for 10–50× faster generation without retraining.
- Nichol & Dhariwal — “Improved Denoising Diffusion Probabilistic Models” (2021) — The cosine noise schedule and learned variance for higher sample quality.
- Ho & Salimans — “Classifier-Free Diffusion Guidance” (2022) — The elegantly simple technique for conditional generation via label dropout.
- Song & Ermon — “Generative Modeling by Estimating Gradients of the Data Distribution” (2019) — Score-based generative models: the precursor that showed denoising and score estimation are the same thing.
- Kingma & Gao — “Understanding Diffusion Objectives as the ELBO” (2023) — A rigorous treatment of how the simplified objective relates to the variational bound.
- Peebles & Xie — “Scalable Diffusion Models with Transformers” (DiT, 2023) — Replacing U-Nets with transformers for diffusion: the architecture we already built.
- Lilian Weng — “What Are Diffusion Models?” — One of the best overviews of the mathematical foundations. Clear notation and derivations.
- DadOps cross-references: Loss Functions, Optimizers, Softmax & Temperature, Normalization, Positional Encoding, Knowledge Distillation, Speculative Decoding, Embeddings, Attention