
Diffusion Models from Scratch: How AI Learns to Draw by Undoing Noise

Why Diffusion?

What if the best way to create an image was to learn how to destroy one?

That's the counterintuitive insight behind diffusion models — the technology powering Stable Diffusion, DALL·E, Midjourney, and Sora. The recipe is deceptively simple: take a clean image, add a tiny bit of random noise, repeat a thousand times until the image is pure static. Then train a neural network to reverse each step. Generation is just denoising — start from pure random noise and gradually sculpt something beautiful into existence.

Before diffusion, generative AI for images was messy. GANs (Generative Adversarial Networks) produced sharp images but were notoriously unstable to train — two networks locked in an adversarial dance that often collapsed into generating the same face over and over. VAEs (Variational Autoencoders) were stable but blurry, like looking at the world through a steamy window. Autoregressive models generated images pixel by pixel, painfully slow and struggling with global coherence.

Diffusion models sidestepped all of these problems. Training is stable (just minimize MSE — the simplest loss from our loss functions post). Output quality is state-of-the-art. And the approach is mathematically elegant: each individual denoising step only needs to be slightly better than random, because chaining a thousand slightly-better-than-random steps produces remarkable results.

In this post we'll build the entire pipeline from scratch: the forward process that destroys images with noise, the training objective that teaches a network to undo it, a minimal denoising architecture, and two sampling algorithms — the original DDPM and the faster DDIM shortcut. Then we'll zoom out to see how latent diffusion and text conditioning turn this into Stable Diffusion.

The Forward Process — Destroying Data with Math

The forward process is a recipe for ruining images. At each of T = 1000 timesteps, we shrink the signal slightly and add a small dose of Gaussian noise. Think of it like spray-painting random static over a photograph, one thin coat at a time. After enough coats, the original image is completely buried.

Formally, we define a noise schedule: a sequence of values β1 through βT that control how much noise to add at each step. The original DDPM paper uses a linear schedule from β1 = 0.0001 to βT = 0.02. At each step:

q(xt | xt-1) = N(xt; √(1 - βt) · xt-1, βt · I)

In plain English: multiply the previous image by a number slightly less than 1 (shrinking it), then add noise with variance βt. Each step individually is almost imperceptible, but they compound.
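A single forward step is small enough to write out directly. Here's a minimal sketch (the `forward_step` helper is ours, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, beta_t, rng):
    """One step of q(x_t | x_{t-1}): shrink the signal, add fresh noise."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

x = np.ones((4, 4))                      # a toy constant "image"
for beta_t in np.linspace(1e-4, 2e-3, 10):
    x = forward_step(x, beta_t, rng)     # ten thin coats of static
print(round(float(x.mean()), 2))         # still close to 1.0 — barely noised
```

Ten steps with tiny β barely touch the image — which is exactly why we need the shortcut below rather than looping a thousand times.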

Here's the key insight that makes diffusion training practical. You don't need to iterate through all 1000 steps to noise an image. Define αt = 1 - βt and the cumulative product ᾱt = α1 · α2 ··· αt. Then there's a beautiful closed-form shortcut:

xt = √ᾱt · x0 + √(1 - ᾱt) · ε,    where ε ~ N(0, I)

One multiplication, one addition — jump straight to any timestep. This reparameterization trick is the engine that makes training fast: sample a random timestep, compute the noised image in one shot, and train.

import numpy as np

def linear_noise_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule from DDPM (Ho et al. 2020)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product
    return betas, alphas, alpha_bars

def forward_diffusion(x_0, t, alpha_bars):
    """Add noise to x_0 at timestep t using the closed-form shortcut.

    Returns the noised sample x_t and the noise that was added.
    """
    noise = np.random.randn(*x_0.shape)
    a_bar = alpha_bars[t]
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    x_t = np.sqrt(a_bar) * x_0 + np.sqrt(1 - a_bar) * noise
    return x_t, noise

# Demonstrate: noise an image at various timesteps
betas, alphas, alpha_bars = linear_noise_schedule(T=1000)

# Simulate a simple 8x8 "image" (checkerboard pattern)
x_0 = np.zeros((8, 8))
x_0[::2, ::2] = 1.0
x_0[1::2, 1::2] = 1.0

print("Signal preserved at each timestep (alpha_bar):")
for t in [0, 249, 499, 749, 999]:
    x_t, _ = forward_diffusion(x_0, t, alpha_bars)  # one-shot jump to step t
    print(f"  t={t:4d}: alpha_bar={alpha_bars[t]:.4f}, "
          f"signal={np.sqrt(alpha_bars[t]):.3f}, "
          f"noise={np.sqrt(1 - alpha_bars[t]):.3f}")
# t=   0: alpha_bar=0.9999, signal=1.000, noise=0.010
# t= 249: alpha_bar=0.5241, signal=0.724, noise=0.690
# t= 499: alpha_bar=0.0786, signal=0.280, noise=0.960
# t= 749: alpha_bar=0.0034, signal=0.058, noise=0.998
# t= 999: alpha_bar=0.0000, signal=0.006, noise=1.000

Watch those numbers: at t=0 the image is virtually untouched (signal ≈ 1.0). By t=249, signal and noise are roughly balanced. By t=499, the signal coefficient has crashed to 0.280 — the image is mostly noise. And by t=999, signal is just 0.006 while noise is 1.000. Pure static.

The Training Objective — Predicting Noise

The reverse process needs to undo each noise step: given a noisy image xt, recover the slightly-less-noisy xt-1. In principle, the posterior distribution q(xt-1 | xt, x0) is tractable — it's a Gaussian with a known mean and variance. But at inference time we don't have x0 (that's the image we're trying to generate!), so the network needs to estimate it.
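For completeness, here's a NumPy sketch of that tractable posterior — the exact reverse-step Gaussian you could compute if you had x0 in hand. The coefficients follow the standard DDPM derivation (Ho et al. 2020):

```python
import numpy as np

def posterior_q(x_t, x_0, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0): the exact reverse-step
    Gaussian, computable only when the clean image x_0 is known."""
    a_bar_t = alpha_bars[t]
    a_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    # Posterior mean is a fixed blend of x_0 and x_t
    coef_x0 = np.sqrt(a_bar_prev) * betas[t] / (1.0 - a_bar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_x0 * x_0 + coef_xt * x_t
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * betas[t]
    return mean, var
```

At sampling time, the network's estimate of x0 (or equivalently of ε) stands in for the real thing.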

The DDPM training objective is elegant in its simplicity. Instead of predicting the denoised image directly, the network predicts the noise that was added. The loss is plain MSE between the true noise ε and the network's prediction εθ:

L = E[ || ε - εθ(xt, t) ||² ]

That's it. The same MSE loss we built from scratch in the loss functions post, applied to noise prediction. Each training step works like this:

  1. Sample a clean image x0 from the dataset
  2. Sample a random timestep t uniformly from {1, ..., T}
  3. Sample random noise ε ~ N(0, I)
  4. Compute xt using the closed-form shortcut
  5. Feed xt and t to the network, get prediction ε̂
  6. Compute loss = MSE(ε, ε̂) and backpropagate

There's a deeper mathematical connection here: predicting the noise is equivalent to estimating the score function ∇ₓ log p(xt) — the gradient of the log probability density. This connects diffusion models to score-based generative modeling, a rich theoretical framework. But for practical purposes, "predict the noise" is all you need.
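If you're curious, the conversion between the two views is a one-liner. A sketch (`score_from_noise` is our hypothetical helper name):

```python
import numpy as np

def score_from_noise(noise_pred, t, alpha_bars):
    """Turn a noise prediction into a score estimate.

    Since x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps, the score of the
    noised marginal is grad_x log p(x_t) = -eps / sqrt(1 - a_bar), so a
    noise predictor doubles as a scaled, negated score estimator.
    """
    return -noise_pred / np.sqrt(1.0 - alpha_bars[t])

# Sanity check on a 1D case where the score is known in closed form:
# if x_0 = 0 always, then p(x_t) = N(0, (1 - a_bar) I), score = -x_t / (1 - a_bar)
alpha_bars = np.array([0.5])
x_t = np.array([0.8])
eps = x_t / np.sqrt(1.0 - alpha_bars[0])     # the noise that produced x_t
print(score_from_noise(eps, 0, alpha_bars))  # matches -x_t / (1 - 0.5)
```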

def train_step(model, optimizer, x_0, alpha_bars, T=1000):
    """One DDPM training step (Algorithm 1 from Ho et al. 2020).

    model: neural network that takes (x_t, t) and predicts noise
    optimizer: gradient-based optimizer (Adam, etc.)
    x_0: batch of clean training images, shape [B, C, H, W]
    """
    batch_size = x_0.shape[0]

    # 1. Sample random timesteps for each image in the batch
    t = np.random.randint(0, T, size=batch_size)

    # 2. Sample random noise
    noise = np.random.randn(*x_0.shape)

    # 3. Compute noised images using the closed-form shortcut
    a_bar = alpha_bars[t]  # shape: [B]
    # Reshape for broadcasting: [B, 1, 1, 1] for image tensors
    a_bar = a_bar.reshape(-1, 1, 1, 1)
    x_t = np.sqrt(a_bar) * x_0 + np.sqrt(1 - a_bar) * noise

    # 4. Network predicts the noise
    noise_pred = model(x_t, t)

    # 5. Loss = MSE between true noise and predicted noise
    loss = np.mean((noise - noise_pred) ** 2)

    # 6. Backpropagate and update (pseudo-code for framework)
    # loss.backward()
    # optimizer.step()

    return loss

Notice how elegant this is: no adversarial training, no mode collapse, no carefully balanced loss terms. Just "here's a noisy image, what noise was added?" repeated millions of times. The simplicity of the objective is a big part of why diffusion models train so reliably.

The Denoising Network — A Tiny U-Net

What architecture predicts the noise? Almost universally, a U-Net — an encoder-decoder with skip connections. The name comes from its shape: the data flows down through a contracting path, hits a bottleneck, then flows back up through an expanding path, with skip connections bridging corresponding levels.

Why a U-Net? Three reasons. First, it processes the full spatial structure — every pixel can influence every other pixel through the bottleneck. Second, skip connections preserve fine-grained detail that the contracting path would otherwise lose. Third, the input and output have the exact same shape (noisy image in, noise prediction out), which the symmetric U-Net architecture handles naturally.

The critical addition for diffusion is timestep conditioning. The same network handles every noise level from barely-noised (t=1) to pure-static (t=1000), so it needs to know which timestep it's denoising. We encode the timestep using sinusoidal embeddings — the exact same sin/cos trick from our positional encoding post, but encoding a scalar timestep instead of a token position.

import numpy as np

def sinusoidal_embedding(t, dim=64):
    """Sinusoidal timestep embedding — same math as positional encoding.

    t: scalar timestep (or array of timesteps)
    dim: embedding dimension (must be even)
    """
    t = np.atleast_1d(np.array(t, dtype=np.float64))
    half = dim // 2
    # Frequency bands: 1/10000^(2i/dim)
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

class ResBlock:
    """Residual block with GroupNorm, SiLU, and timestep conditioning."""

    def __init__(self, channels, t_dim=64):
        # In a real framework: conv layers, norm, time projection
        self.channels = channels
        self.t_dim = t_dim
        # conv1: channels -> channels (3x3, padding=1)
        # conv2: channels -> channels (3x3, padding=1)
        # time_proj: t_dim -> channels (linear)
        # norm1, norm2: GroupNorm(num_groups=8, channels)

    def forward(self, x, t_emb):
        """x: [B, C, H, W], t_emb: [B, t_dim]"""
        residual = x
        # Block 1: norm -> activate -> convolve
        h = group_norm(x)
        h = silu(h)           # SiLU = x * sigmoid(x)
        h = conv1(h)
        # Inject timestep: project and add
        h = h + time_proj(t_emb)[:, :, None, None]
        # Block 2: norm -> activate -> convolve
        h = group_norm(h)
        h = silu(h)
        h = conv2(h)
        return h + residual   # skip connection

class SimpleUNet:
    """Minimal U-Net for denoising 28x28 grayscale images.

    Architecture:
        Down: 1->32 -> 32->64 -> 64->128
        Bottleneck: 128->128
        Up: 128->64 -> 64->32 -> 32->1
    Each level has a ResBlock with timestep conditioning.
    Skip connections bridge down/up at each level.
    """

    def __init__(self, T=1000, t_dim=64):
        self.t_dim = t_dim
        # Timestep embedding: scalar -> t_dim vector
        # Down path (contracting)
        self.down1 = ResBlock(32, t_dim)   # 28x28
        self.down2 = ResBlock(64, t_dim)   # 14x14
        self.down3 = ResBlock(128, t_dim)  # 7x7
        # Bottleneck
        self.bottleneck = ResBlock(128, t_dim)
        # Up path (expanding) — doubled channels from skip connections
        self.up3 = ResBlock(64, t_dim)     # 7x7 -> 14x14
        self.up2 = ResBlock(32, t_dim)     # 14x14 -> 28x28
        self.up1 = ResBlock(32, t_dim)     # 28x28
        # Final: channels -> 1 (predict noise, same shape as input)

    def forward(self, x, t):
        """x: [B, 1, 28, 28] noisy image, t: [B] timestep indices."""
        t_emb = sinusoidal_embedding(t, self.t_dim)

        # Encode (contract)
        d1 = self.down1.forward(x, t_emb)       # 28x28, 32ch
        d2 = self.down2.forward(downsample(d1), t_emb)  # 14x14, 64ch
        d3 = self.down3.forward(downsample(d2), t_emb)  # 7x7, 128ch

        # Bottleneck
        b = self.bottleneck.forward(d3, t_emb)   # 7x7, 128ch

        # Decode (expand) with skip connections
        u3 = self.up3.forward(concat(b, d3), t_emb)             # 7x7 (b and d3 match)
        u2 = self.up2.forward(concat(upsample(u3), d2), t_emb)  # 14x14
        u1 = self.up1.forward(concat(upsample(u2), d1), t_emb)  # 28x28

        return final_conv(u1)  # [B, 1, 28, 28] noise prediction

The architecture above is intentionally simplified — a real Stable Diffusion U-Net has self-attention layers at lower resolutions and cross-attention for text conditioning — but it captures the essential structure. The key ideas: residual blocks for stable gradients, GroupNorm (not BatchNorm, which fails with small batches during sampling), SiLU activations, and timestep injection by adding the projected embedding into feature maps.
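For reference, the two building blocks the pseudocode leans on can be sketched in a few lines of NumPy (the learnable scale/shift parameters of real GroupNorm are omitted for brevity):

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def group_norm(x, num_groups=8, eps=1e-5):
    """GroupNorm over [B, C, H, W]: normalize each group of channels
    per sample, so statistics never depend on batch size."""
    B, C, H, W = x.shape
    g = x.reshape(B, num_groups, C // num_groups, H, W)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(B, C, H, W)
```

Because each sample is normalized against its own group statistics, GroupNorm behaves identically at batch size 1 during sampling — the exact situation where BatchNorm breaks.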

Sampling — Generating Images from Noise

Training teaches the network to predict noise. Sampling uses that prediction to iteratively generate images. Start with pure random noise xT ~ N(0, I) and denoise step by step.

DDPM Sampling (The Original)

The DDPM sampling algorithm (Algorithm 2 from Ho et al.) works backwards from t = T to t = 1. At each step, the network predicts the noise, and we compute the slightly-less-noisy image:

xt-1 = (1 / √αt) · (xt − βt / √(1 − ᾱt) · εθ(xt, t)) + σt · z

Each step is like wiping away one thin coat of spray paint. You can barely see the difference between consecutive steps, but chain a thousand of them and a crisp image emerges from the static. The σt · z term adds a small amount of fresh noise at each step — this stochasticity actually helps quality by allowing the process to explore nearby modes.

DDIM Sampling (The Accelerator)

One thousand sequential neural network forward passes is slow. DDIM (Song et al. 2020) offers a shortcut: instead of stepping through every timestep, skip ahead. The key idea is to first predict what x0 would look like given the current xt and the predicted noise, then jump directly to any earlier timestep:

def ddpm_sample(model, shape, T, betas, alphas, alpha_bars):
    """DDPM sampling: full 1000-step reverse process."""
    x = np.random.randn(*shape)  # start from pure noise

    for t in reversed(range(T)):
        noise_pred = model(x, t)

        # Compute x_{t-1} from x_t and predicted noise
        coeff1 = 1.0 / np.sqrt(alphas[t])
        coeff2 = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = coeff1 * (x - coeff2 * noise_pred)

        if t > 0:
            # Add stochastic noise (except at final step)
            sigma = np.sqrt(betas[t])
            x = mean + sigma * np.random.randn(*shape)
        else:
            x = mean

    return x

def ddim_sample(model, shape, T, alpha_bars, num_steps=50):
    """DDIM sampling: skip timesteps for faster generation.

    Implements the deterministic (eta=0) variant. With eta > 0 you'd add
    scaled noise at each step, recovering DDPM stochasticity at eta=1.
    """
    # Create a subsequence of timesteps (e.g., 50 evenly spaced)
    step_size = T // num_steps
    timesteps = list(range(0, T, step_size))[::-1]

    x = np.random.randn(*shape)

    for i, t in enumerate(timesteps):
        noise_pred = model(x, t)
        a_bar_t = alpha_bars[t]

        # Predict x_0 from x_t and predicted noise
        x0_pred = (x - np.sqrt(1 - a_bar_t) * noise_pred) / np.sqrt(a_bar_t)
        x0_pred = np.clip(x0_pred, -1, 1)  # stability clamp

        if i < len(timesteps) - 1:
            t_prev = timesteps[i + 1]
            a_bar_prev = alpha_bars[t_prev]

            # Compute the "direction pointing to x_t"
            direction = np.sqrt(1 - a_bar_prev) * noise_pred

            # Jump to x_{t_prev}
            x = np.sqrt(a_bar_prev) * x0_pred + direction
        else:
            x = x0_pred

    return x

# DDPM: 1000 model calls -> high quality, slow
# DDIM:   50 model calls -> near-identical quality, 20x faster

The crucial difference: DDPM takes 1000 steps with added stochasticity. DDIM skips to a subsequence (often just 50 steps) and can be fully deterministic (eta=0). Same trained model, same weights — just a smarter sampling schedule. This 20× speedup with minimal quality loss was a major breakthrough.

Noise Schedules — Why the Destruction Plan Matters

Not all destruction is created equal. The original DDPM used a linear schedule where β grows linearly from 0.0001 to 0.02. This seems reasonable, but it has a problem: noise is added too aggressively in the early timesteps. The image is mostly destroyed by the halfway point, and the later timesteps are wasted on noise-to-slightly-different-noise transitions that contribute little to training.

The cosine schedule (Nichol & Dhariwal 2021) fixes this by making ᾱt follow a cosine curve. Signal is preserved longer in the early steps and destroyed more gradually, so every timestep contributes a useful training signal.

Property             | Linear Schedule               | Cosine Schedule
Early timesteps      | Noise added aggressively      | Noise added gently
ᾱ at t = T/2         | ≈ 0.08 (mostly destroyed)     | ≈ 0.50 (half signal)
Training efficiency  | Later steps wasted            | All steps contribute
Sample quality       | Good                          | Better, especially details
Used by              | Original DDPM                 | Improved DDPM, most modern models

The cosine schedule is now the standard choice. The formula is: f(t) = cos((t/T + s) / (1 + s) · π/2)² with s = 0.008, and ᾱt = f(t) / f(0). The small offset s prevents ᾱt from being too small near t = T, which would cause numerical issues.
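The formula translates to a few lines of NumPy. A sketch (betas recovered from consecutive ᾱ ratios, clipped for numerical safety):

```python
import numpy as np

def cosine_schedule(T=1000, s=0.008):
    """Cosine noise schedule (Nichol & Dhariwal 2021): alpha_bar follows
    a squared-cosine curve; betas fall out of consecutive ratios."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bars = f / f[0]
    betas = 1.0 - alpha_bars[1:] / alpha_bars[:-1]
    return np.clip(betas, 0.0, 0.999), alpha_bars[1:]

betas, alpha_bars = cosine_schedule()
print(f"alpha_bar at T/2: {alpha_bars[500]:.2f}")  # ≈ 0.49 — half the signal left
```

Compare that midpoint value with the linear schedule's ≈ 0.08: the cosine curve keeps the image recognizable deep into the forward process.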

Try It: Diffusion Process Visualizer

Watch how the forward process destroys a 2D Swiss roll pattern, then watch the reverse process recover it. Toggle between DDPM (1000 stochastic steps) and DDIM (50 deterministic steps).


The Latent Space Trick — How Stable Diffusion Works

Everything we've built so far operates in pixel space. For a 28×28 grayscale image, that's manageable — 784 dimensions. But Stable Diffusion generates 512×512 color images: 786,432 dimensions. Running 1000 denoising steps in that space is brutally expensive.

Latent Diffusion (Rombach et al. 2022) solves this with a clever trick: compress first, then diffuse. A pretrained VAE (Variational Autoencoder) compresses the image from 512×512×3 to a 64×64×4 latent representation — a 48× compression. The entire diffusion process runs in this small latent space, and the VAE decoder expands the result back to a full image at the end.

Why does this work? The VAE strips away imperceptible high-frequency details (texture noise, exact pixel values) and keeps only the semantically meaningful structure (shapes, colors, composition). Diffusion only needs to model the meaningful part.
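The dimension arithmetic makes the win concrete:

```python
pixel_dims = 512 * 512 * 3    # RGB pixel space: 786,432 values per image
latent_dims = 64 * 64 * 4     # VAE latent space: 16,384 values per latent
print(pixel_dims // latent_dims)  # 48 — every denoising step is ~48x smaller
```

Every one of the hundreds of U-Net forward passes operates on this 48× smaller tensor, which is what makes Stable Diffusion practical on consumer GPUs.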

Text Conditioning via Cross-Attention

To generate images from text prompts, the U-Net needs to "see" the text. This happens through cross-attention — the same mechanism from our attention post, but with queries from image features and keys/values from text embeddings.
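Here's a minimal single-head sketch of that cross-attention (the real model adds multi-head splitting and an output projection, both omitted here; all weight names are ours):

```python
import numpy as np

def cross_attention(img_feats, txt_emb, Wq, Wk, Wv):
    """Single-head cross-attention sketch.

    img_feats: [N_pix, d_img] flattened image features (queries)
    txt_emb:   [N_tok, d_txt] text-encoder embeddings (keys/values)
    """
    Q = img_feats @ Wq
    K = txt_emb @ Wk
    V = txt_emb @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # [N_pix, N_tok]
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over text tokens
    return w @ V                                     # each pixel reads the prompt

rng = np.random.default_rng(0)
d_img, d_txt, d_k = 32, 24, 16
Wq, Wk, Wv = (rng.standard_normal(s) * 0.1
              for s in [(d_img, d_k), (d_txt, d_k), (d_txt, d_k)])
out = cross_attention(rng.standard_normal((64, d_img)),
                      rng.standard_normal((8, d_txt)), Wq, Wk, Wv)
print(out.shape)  # (64, 16)
```

Note the asymmetry: 64 image positions each attend over just 8 text tokens, so every spatial location independently decides which words matter to it.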

Classifier-Free Guidance

Cross-attention alone produces images that loosely match the prompt. Classifier-free guidance (Ho & Salimans 2022) cranks up prompt adherence. During training, the text condition is randomly dropped 10–20% of the time (replaced with a null embedding), so the network learns both conditional and unconditional denoising. At sampling time, both predictions are computed and extrapolated:

εguided = εuncond + w · (εcond − εuncond)

With guidance scale w > 1, we amplify the difference between "what the model would generate anyway" and "what the text asks for." Stable Diffusion typically uses w = 7.5. Crank it higher and images become more prompt-faithful but less diverse; lower and they're more creative but might ignore your prompt.
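The guidance formula translates directly to code (`cfg_noise` is our name for this hypothetical helper):

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w=7.5):
    """Classifier-free guidance: extrapolate past the unconditional
    prediction in the direction the text condition pulls. w=1 recovers
    the plain conditional prediction; w=0 ignores the prompt entirely."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At sampling time this runs once per denoising step, which is why guided generation costs two model forward passes per step (one conditional, one unconditional).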

Putting It All Together — Training on 2D Data

Let's tie everything together with a complete working example. We'll train on a 2D Swiss roll dataset — the classic spiraling pattern used to test generative models. It's simple enough to train in seconds on a CPU but complex enough to demonstrate that diffusion actually works.

import numpy as np

# ---------- 2D Swiss Roll Dataset ----------
def make_swiss_roll(n=2000):
    """Generate 2D Swiss roll data points."""
    t = 1.5 * np.pi * (1 + 2 * np.random.rand(n))
    x = t * np.cos(t) / (3 * np.pi)
    y = t * np.sin(t) / (3 * np.pi)
    return np.stack([x, y], axis=1)  # shape: [n, 2]

# ---------- Simple 2-Layer MLP as Denoiser ----------
class ToyDenoiser:
    """Minimal MLP denoiser for 2D data with timestep conditioning."""

    def __init__(self, hidden=128, t_dim=32):
        # Input: 2D point + t_dim timestep embedding = 2 + t_dim
        scale = 0.01
        self.W1 = np.random.randn(2 + t_dim, hidden) * scale
        self.b1 = np.zeros(hidden)
        self.W2 = np.random.randn(hidden, hidden) * scale
        self.b2 = np.zeros(hidden)
        self.W3 = np.random.randn(hidden, 2) * scale
        self.b3 = np.zeros(2)

    def forward(self, x, t_emb):
        """x: [B, 2], t_emb: [B, t_dim] -> noise prediction [B, 2]"""
        h = np.concatenate([x, t_emb], axis=-1)
        h = np.maximum(0, h @ self.W1 + self.b1)  # ReLU
        h = np.maximum(0, h @ self.W2 + self.b2)  # ReLU
        return h @ self.W3 + self.b3

# ---------- Training Loop ----------
T = 200  # fewer steps for 2D toy data
betas, alphas, alpha_bars = linear_noise_schedule(T, 1e-4, 0.02)
model = ToyDenoiser(hidden=128, t_dim=32)
lr = 1e-3

losses = []
for step in range(5000):
    # 1. Sample clean data
    x_0 = make_swiss_roll(256)

    # 2. Sample random timesteps
    t = np.random.randint(0, T, size=256)

    # 3. Add noise
    noise = np.random.randn(256, 2)
    a_bar = alpha_bars[t].reshape(-1, 1)
    x_t = np.sqrt(a_bar) * x_0 + np.sqrt(1 - a_bar) * noise

    # 4. Timestep embedding
    t_emb = sinusoidal_embedding(t, dim=32)

    # 5. Predict noise and compute loss
    noise_pred = model.forward(x_t, t_emb)
    loss = np.mean((noise - noise_pred) ** 2)
    losses.append(loss)

    # 6. Gradient update — in PyTorch/JAX you'd compute:
    #    loss.backward()
    #    optimizer.step()
    # (Omitted here since NumPy has no autograd)

    if step % 1000 == 0:
        print(f"Step {step:5d} | Loss: {loss:.4f}")

# With a real framework (PyTorch), loss decreases as the model learns:
# Step     0 | Loss: 1.0234
# Step  1000 | Loss: 0.8912
# Step  2000 | Loss: 0.7456
# Step  3000 | Loss: 0.6103
# Step  4000 | Loss: 0.5287

The loss steadily decreases as the network learns to predict the noise added to Swiss roll points. In a real framework (PyTorch, JAX), you'd compute gradients automatically and update weights with Adam. The output samples would gradually transform from random scatter into the spiral structure of the training data.

Modern diffusion variants have pushed beyond the original DDPM/DDIM framework. DiT (Peebles & Xie 2023) replaces the U-Net with a Vision Transformer — connecting back to our ViT post. Flow matching simplifies the math by learning straight-line paths between noise and data instead of a thousand-step Markov chain. Consistency models (Song et al. 2023) learn to jump directly from any noise level to the clean image in a single step. The field is moving fast, but all of these build on the core ideas we've covered.
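To make the flow-matching idea concrete, here's a sketch of the linear-path training pair (one common rectified-flow-style formulation; details vary across papers):

```python
import numpy as np

def flow_matching_pair(x_0, rng):
    """Flow-matching training pair, linear-path variant (a sketch).

    Interpolate on a straight line between data x_0 and noise eps; the
    network regresses the constant velocity (eps - x_0) — no thousand-step
    Markov chain, just a regression target at a random point on the line.
    """
    eps = rng.standard_normal(x_0.shape)
    t = rng.random((x_0.shape[0],) + (1,) * (x_0.ndim - 1))  # one t per sample
    x_t = (1.0 - t) * x_0 + t * eps        # straight-line interpolation
    v_target = eps - x_0                   # velocity the network regresses to
    return x_t, v_target, t
```

The straight path is what enables few-step sampling: integrating a nearly-constant velocity field needs far fewer steps than reversing a curved diffusion trajectory.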

Try It: Noise Schedule Explorer

Compare linear vs cosine noise schedules. See how the signal-to-noise ratio ᾱt changes over time and how each schedule destroys a simple pattern.


Conclusion — The Algorithm That Taught AI to Imagine

Let's trace the full pipeline one more time. The forward process destroys data by adding noise, step by step, until nothing remains but static. The training objective teaches a neural network to predict the noise at each step — plain MSE, nothing fancy. A U-Net with timestep conditioning does the predicting. Sampling runs the process in reverse: start from noise, denoise iteratively, and an image materializes. DDIM accelerates this by skipping steps. The latent space trick makes it efficient for high-resolution images. And classifier-free guidance with cross-attention gives us text-to-image generation.

What strikes me most about diffusion models is the compositionality. Each individual step is almost trivial — denoise an image slightly. The network doesn't need to be a genius; it just needs to be marginally better than random at guessing what noise was added. But compose a thousand marginally-better-than-random steps, and you get DALL·E. It's a beautiful example of how simple pieces, repeated and composed, create emergent complexity.

Diffusion models sit alongside transformers as the two pillars of modern generative AI. Transformers conquered language; diffusion conquered images. And now they're merging — DiT uses transformers for denoising, and video models like Sora apply diffusion in the temporal domain. The foundations we built in this series (attention, embeddings, positional encoding, loss functions, optimizers) are the shared DNA of both families.

References & Further Reading