
Normalizing Flows from Scratch: Invertible Neural Networks That Generate by Transforming

The Change-of-Variables Formula — Exact Likelihood from Calculus

If you've read our posts on VAEs, GANs, and diffusion models, you know the generative modeling landscape. VAEs give you a lower bound on the data likelihood (the ELBO). GANs give you no likelihood at all — just an implicit model trained adversarially. Diffusion models give you a variational bound that happens to be tight. But what if you could compute the exact probability of every data point, no approximations needed?

Normalizing flows do exactly this, using a theorem from calculus you probably learned and forgot: the change of variables formula.

The idea is simple. Start with a random variable z drawn from a simple base distribution (say, a standard Gaussian). Apply an invertible function f to get x = f(z). Since f is invertible, every x maps to exactly one z, and we can write:

log p(x) = log p_base(f⁻¹(x)) + log |det J_{f⁻¹}(x)|

where J_{f−¹} is the Jacobian matrix of the inverse function. In words: the density of x equals the base density of the corresponding z, adjusted by how much the transformation stretches or squeezes space. The Jacobian determinant measures exactly this volume change.

The name explains the method. We "flow" a normal distribution through a chain of transformations, progressively warping the simple Gaussian into whatever complex distribution we need. Each step is invertible, so we can always trace any data point back to its latent code and compute its exact probability.

Let's see this in 1D where the Jacobian is just a derivative:

import numpy as np

def change_of_variables_1d():
    """Demonstrate exact density computation via change of variables in 1D."""
    # Invertible transformation: x = f(z) = z + 0.4*sin(2z)
    # f'(z) = 1 + 0.8*cos(2z) >= 0.2 > 0 for all z -- guaranteed invertible
    def f(z):
        return z + 0.4 * np.sin(2 * z)

    def f_inv_newton(x, n_iter=20):
        """Invert f using Newton's method: find z such that f(z) = x."""
        z = x.copy()
        for _ in range(n_iter):
            z = z - (f(z) - x) / (1 + 0.8 * np.cos(2 * z))
        return z

    def df_dz(z):
        """Derivative of f: df/dz = 1 + 0.8*cos(2z)."""
        return 1 + 0.8 * np.cos(2 * z)

    # Exact density at a grid of points using change of variables
    x_grid = np.linspace(-4, 4, 200)
    z_grid = f_inv_newton(x_grid)
    base_density = np.exp(-0.5 * z_grid**2) / np.sqrt(2 * np.pi)
    # p(x) = p_base(z) * |dz/dx| = p_base(z) / |dx/dz|
    exact_density = base_density / np.abs(df_dz(z_grid))

    # Evaluate at specific points
    for test_x in [0.0, 1.5]:
        test_z = f_inv_newton(np.array([test_x]))[0]
        base_d = np.exp(-0.5 * test_z**2) / np.sqrt(2 * np.pi)
        exact_d = base_d / np.abs(df_dz(test_z))
        print(f"At x={test_x}: z={test_z:.3f}, exact p(x)={exact_d:.4f}")
    # At x=0.0: z=0.000, exact p(x)=0.2216
    # At x=1.5: z=1.282, exact p(x)=0.5313

change_of_variables_1d()

Notice the density at x=1.5 is higher than at x=0.0, even though the base Gaussian peaks at zero. That's because the transformation compresses space near x=1.5 (the derivative |df/dz| is small there), packing more probability density into a smaller region. This is the Jacobian determinant at work — it's the density correction factor that accounts for how the transformation warps space.

Stacking Transformations — Building the Flow

A single transformation can only do so much. The power of normalizing flows comes from stacking K invertible transformations into a chain:

z₀ → f₁ → z₁ → f₂ → z₂ → ... → fK → x

The log-likelihood decomposes beautifully into a sum:

log p(x) = log p_base(z₀) + Σᵢ log |det J_{fᵢ⁻¹}|

Equivalently, subtract the forward log-determinants: each log |det J_{fᵢ⁻¹}| = −log |det J_{fᵢ}|.

Each layer contributes one log-determinant term. The full flow's Jacobian is the product of per-layer Jacobians, but we never need to multiply them — we just sum the log-determinants. This additive structure makes deep flows computationally tractable.

But here's the engineering challenge: each transformation must be (1) invertible, (2) have a tractable Jacobian determinant, and (3) be expressive enough to model complex distributions. Computing the determinant of a generic d×d Jacobian matrix costs O(d³) — prohibitive for high-dimensional data. We need architectures with special structure.
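The triangular shortcut behind those special structures is easy to check numerically. A quick sketch (the matrix below is random and purely illustrative) comparing NumPy's general-purpose slogdet against the O(d) diagonal sum:

```python
import numpy as np

# For a triangular Jacobian, the log-determinant is just the sum of
# log|diagonal entries| -- no O(d^3) factorization needed.
rng = np.random.default_rng(0)
d = 6
J = np.tril(rng.normal(size=(d, d)))            # lower-triangular Jacobian
np.fill_diagonal(J, np.abs(np.diag(J)) + 0.5)   # keep the diagonal nonzero

_, logdet_full = np.linalg.slogdet(J)                # generic O(d^3) route
logdet_diag = np.sum(np.log(np.abs(np.diag(J))))     # O(d): diagonal only

print(f"slogdet (generic):   {logdet_full:.6f}")
print(f"diag sum (shortcut): {logdet_diag:.6f}")
```

The two numbers agree to machine precision; the flow architectures below are engineered so that only the cheap path is ever needed.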

One of the earliest designs is the planar flow, which applies simple shear transformations:

import numpy as np

def planar_flow(z, w, u, b):
    """Single planar flow layer: f(z) = z + u * tanh(w^T z + b)."""
    activation = np.tanh(z @ w + b)  # (N,)
    return z + np.outer(activation, u)  # (N, d)

def planar_log_det(z, w, u, b):
    """Log |det Jacobian| for planar flow."""
    h_prime = 1 - np.tanh(z @ w + b)**2  # derivative of tanh
    psi = np.outer(h_prime, w)            # (N, d)
    det_term = np.abs(1 + psi @ u)        # (N,)
    return np.log(det_term + 1e-10)

# Stack 8 planar flow layers to transform a 2D Gaussian
np.random.seed(42)
d = 2
z = np.random.normal(0, 1, (2000, d))
log_prob = -0.5 * np.sum(z**2, axis=1) - d/2 * np.log(2 * np.pi)

for k in range(8):
    rng = np.random.RandomState(k + 10)
    w = rng.randn(d) * 1.5
    u = rng.randn(d) * 0.8
    b = rng.randn() * 0.5
    # Forward direction: the density divides by |det J|, so subtract the
    # log-det. (Invertibility of a tanh planar layer requires wᵀu ≥ −1;
    # this random-parameter demo does not enforce it.)
    log_prob -= planar_log_det(z, w, u, b)
    z = planar_flow(z, w, u, b)

print(f"Output range: x=[{z[:,0].min():.1f}, {z[:,0].max():.1f}], "
      f"y=[{z[:,1].min():.1f}, {z[:,1].max():.1f}]")
print(f"Mean log p(x): {log_prob.mean():.2f}")

Planar flows are elegant but limited — each layer only applies a rank-1 transformation (a shear along one direction). To model truly complex distributions, we need something more expressive. Enter coupling layers.

Coupling Layers — RealNVP's Clever Split

The real breakthrough in normalizing flows came from a beautifully simple idea: split the input in half.

Here's how an affine coupling layer works. Split the d-dimensional input into two halves, z₁ and z₂. Leave the first half unchanged, and apply an elementwise affine transform to the second half whose parameters depend on the first:

x₁ = z₁
x₂ = z₂ ⊙ exp(s(z₁)) + t(z₁)

Here s and t are arbitrary neural networks — they can be as complex as you want. The magic: x₁ is just a copy of z₁, and each component of x₂ touches the matching component of z₂ only through an elementwise affine map, so the Jacobian matrix is block lower triangular with diagonal entries exp(sᵢ(z₁)). A triangular matrix's determinant is just the product of its diagonal elements:

log |det J| = Σᵢ sᵢ(z₁)

That's it — just sum the scale outputs. No O(d³) determinant computation. And inversion is equally trivial: z₁ = x₁, z₂ = (x₂ − t(x₁)) ⊙ exp(−s(x₁)). This is the core idea of RealNVP (Dinh et al., 2017), and it's what made normalizing flows practical.

The key insight: by alternating which half gets transformed across layers (odd layers transform the first half, even layers transform the second), every dimension eventually gets transformed by complex nonlinear functions. The network can be arbitrarily expressive while maintaining exact invertibility and tractable Jacobians.

import numpy as np

class AffineCouplingLayer:
    """Single RealNVP-style affine coupling layer."""
    def __init__(self, d, hidden=32, seed=0):
        rng = np.random.RandomState(seed)
        self.d_half = d // 2
        # Simple MLP: input(d_half) -> hidden -> output(2 * d_half) for s and t
        self.W1 = rng.randn(self.d_half, hidden) * 0.5
        self.b1 = np.zeros(hidden)
        self.W2 = rng.randn(hidden, self.d_half * 2) * 0.1
        self.b2 = np.zeros(self.d_half * 2)

    def _net(self, x1):
        """MLP that outputs scale (s) and translation (t)."""
        h = np.maximum(0, x1 @ self.W1 + self.b1)  # ReLU
        out = h @ self.W2 + self.b2
        s = out[:, :self.d_half]   # scale (log-scale)
        t = out[:, self.d_half:]   # translation
        return s, t

    def forward(self, z):
        z1, z2 = z[:, :self.d_half], z[:, self.d_half:]
        s, t = self._net(z1)
        x1 = z1
        x2 = z2 * np.exp(s) + t
        log_det = np.sum(s, axis=1)
        return np.hstack([x1, x2]), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.d_half], x[:, self.d_half:]
        s, t = self._net(x1)
        z1 = x1
        z2 = (x2 - t) * np.exp(-s)
        return np.hstack([z1, z2])

# Verify invertibility: f_inv(f(z)) should equal z
np.random.seed(42)
layer = AffineCouplingLayer(d=4, hidden=16, seed=7)
z_test = np.random.randn(100, 4)
x_test, log_det = layer.forward(z_test)
z_recovered = layer.inverse(x_test)

reconstruction_error = np.max(np.abs(z_test - z_recovered))
print(f"Max reconstruction error: {reconstruction_error:.2e}")
print(f"Mean log|det J|: {log_det.mean():.4f}")
# Max reconstruction error: 1.78e-15
# Mean log|det J|: 0.0021

Machine-precision reconstruction — the inverse is exact, not approximate. This is what sets normalizing flows apart from every other generative model. The transformation is perfectly invertible by construction.

Training a Flow — Pure Maximum Likelihood

Training a normalizing flow is refreshingly simple compared to other generative models. No adversarial game (GANs), no reconstruction-vs-KL balancing (VAEs), no noise schedule tuning (diffusion). Just maximize the exact log-likelihood of the data:

Loss = −Σᵢ [log p_base(f⁻¹(xᵢ)) + Σₖ log |det J_{fₖ⁻¹}|]

The training procedure:

  1. Take a data point x
  2. Run it backward through the flow: z = f⁻¹(x) (data → latent)
  3. Compute the base log-probability: log p_base(z)
  4. Sum up the log-det-Jacobian terms from each layer
  5. Minimize the negative sum (maximize likelihood)

Generation works in the opposite direction: sample z from the base distribution, run it forward through the flow to get x = f(z). This duality — training uses the inverse, sampling uses the forward — is characteristic of normalizing flows.

As noted in our information theory post, maximizing log-likelihood is equivalent to minimizing D_KL(p_data || p_model). Normalizing flows do this exactly, with no variational bound or approximation.
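This equivalence can be checked on a toy discrete distribution (the probabilities below are made up): the expected negative log-likelihood equals the data entropy plus the KL divergence, and only the KL term depends on the model.

```python
import numpy as np

# E[NLL] = H(p_data) + D_KL(p_data || p_model); the entropy term is a fixed
# property of the data, so minimizing NLL minimizes the KL divergence.
p_data = np.array([0.5, 0.3, 0.2])
p_model = np.array([0.4, 0.4, 0.2])

cross_entropy = -np.sum(p_data * np.log(p_model))  # expected NLL under the model
entropy = -np.sum(p_data * np.log(p_data))         # fixed: no model dependence
kl = np.sum(p_data * np.log(p_data / p_model))     # D_KL(p_data || p_model)

print(f"E[NLL] = {cross_entropy:.4f}")
print(f"H + KL = {entropy + kl:.4f}")
```

Both prints show the same number: driving the NLL down can only come from shrinking the KL term.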

import numpy as np

def train_realnvp_2d(X, n_layers=6, hidden=32, lr=0.005, epochs=600):
    """Train a RealNVP flow on 2D data."""
    np.random.seed(42)
    d = 2
    layers = []

    # Build alternating coupling layers with permutations
    for k in range(n_layers):
        layer = {'W1': np.random.randn(1, hidden) * 0.5,
                 'b1': np.zeros(hidden),
                 'W2': np.random.randn(hidden, 2) * 0.1,
                 'b2': np.zeros(2), 'flip': k % 2 == 1}
        layers.append(layer)

    def flow_inverse(x):
        log_det_total = np.zeros(len(x))
        z = x.copy()
        for layer in reversed(layers):
            if layer['flip']:
                z = z[:, ::-1]
            z1, z2 = z[:, 0:1], z[:, 1:2]
            h = np.maximum(0, z1 @ layer['W1'] + layer['b1'])
            out = h @ layer['W2'] + layer['b2']
            s, t = out[:, 0:1], out[:, 1:2]
            s = np.clip(s, -3, 3)
            z2 = (z2 - t) * np.exp(-s)
            z = np.hstack([z1, z2])
            if layer['flip']:
                z = z[:, ::-1]
            log_det_total -= np.sum(s, axis=1)
        return z, log_det_total

    # Training loop
    losses = []
    for epoch in range(epochs):
        z, log_det = flow_inverse(X)
        log_pz = -0.5 * np.sum(z**2, axis=1) - np.log(2 * np.pi)
        log_px = log_pz + log_det
        loss = -np.mean(log_px)
        losses.append(loss)

        # Gradient step (finite differences for simplicity)
        for layer in layers:
            for key in ['W1', 'b1', 'W2', 'b2']:
                grad = np.zeros_like(layer[key])
                it = np.nditer(layer[key], flags=['multi_index'])
                while not it.finished:
                    idx = it.multi_index
                    old_val = layer[key][idx]
                    layer[key][idx] = old_val + 1e-4
                    z2, ld2 = flow_inverse(X)
                    lp2 = -0.5*np.sum(z2**2,axis=1) - np.log(2*np.pi) + ld2
                    layer[key][idx] = old_val
                    grad[idx] = -(np.mean(lp2) - (-loss)) / 1e-4
                    it.iternext()
                layer[key] -= lr * grad

        if epoch % 200 == 0:
            print(f"Epoch {epoch}: loss = {loss:.3f}")

    return layers, losses

# Generate moons data
from numpy import pi
np.random.seed(42)
n = 300
t = np.linspace(0, pi, n//2)
moon1 = np.column_stack([np.cos(t), np.sin(t)]) + np.random.randn(n//2, 2)*0.08
moon2 = np.column_stack([1 - np.cos(t), -np.sin(t) + 0.5]) + np.random.randn(n//2, 2)*0.08
X = np.vstack([moon1, moon2])

layers, losses = train_realnvp_2d(X, n_layers=4, hidden=8, lr=0.003, epochs=600)
print(f"Final loss: {losses[-1]:.3f} (started at {losses[0]:.3f})")
# Epoch 0: loss = 3.241
# Epoch 200: loss = 1.897
# Epoch 400: loss = 1.652
# Final loss: 1.521 (started at 3.241)

The loss drops steadily as the flow learns to map the two-moon distribution back to a Gaussian. In a production implementation you'd use autograd (PyTorch, JAX) for efficient gradient computation instead of finite differences, but the algorithm is identical: maximize exact log-likelihood, layer by layer.

Try It: Flow Transformer

Scrub through flow layers to watch a Gaussian progressively transform into the target shape. Each layer warps the distribution further.


Glow — Scaling Flows with 1×1 Convolutions

RealNVP showed that coupling layers work. Glow (Kingma & Dhariwal, 2018) scaled flows to high-resolution images with three key innovations.

First, 1×1 invertible convolutions. In RealNVP, we alternated which half of the dimensions gets transformed. Glow replaces this fixed alternation with a learned permutation: a d×d invertible matrix W applied channel-wise at each spatial location. The log-determinant is simply h·w·log|det W|, where h and w are the spatial dimensions. To make this efficient, Glow parametrizes W using its LU decomposition, reducing the determinant computation from O(d³) to O(d):

import numpy as np
from scipy.linalg import lu

def invertible_1x1_conv(channels=4, seed=42):
    """Implement a 1x1 invertible convolution with LU decomposition."""
    rng = np.random.RandomState(seed)

    # Glow initializes W as a random orthogonal (rotation) matrix, so
    # |det W| = 1 and the log-det contribution starts at exactly zero
    W, _ = np.linalg.qr(rng.randn(channels, channels))

    # LU decompose: W = P @ L @ U, with det(P) = ±1 and det(L) = 1, so
    # log|det W| = sum of log|diag(U)| -- O(d) instead of O(d^3)
    _, _, U = lu(W)
    log_det = np.sum(np.log(np.abs(np.diag(U))))

    # Forward: multiply channels by W at each spatial location.
    # For a feature map of shape (H, W, C): output = input @ W.T
    H, W_spatial = 8, 8
    x = rng.randn(H, W_spatial, channels)
    z = x @ W.T  # forward pass

    # Inverse: multiply by W^(-1)
    W_inv = np.linalg.inv(W)
    x_recovered = z @ W_inv.T

    print(f"Weight matrix W shape: {W.shape}")
    print(f"Log|det W| per spatial location: {log_det:.4f}")  # ~0 at init
    print(f"Total log-det for {H}x{W_spatial} feature map: {H * W_spatial * log_det:.4f}")
    print(f"Reconstruction error: {np.max(np.abs(x - x_recovered)):.2e}")

invertible_1x1_conv()
# The orthogonal initialization makes log|det W| zero up to floating point;
# as training updates W, it drifts away from orthogonality and the log-det
# starts contributing to the likelihood. Reconstruction is exact to machine
# precision (~1e-16), confirming the layer is a true bijection.

Second, actnorm (activation normalization): a data-dependent initialization that replaces batch normalization. Each channel gets a learned scale and bias, initialized so that the first batch has zero mean and unit variance. Unlike batch norm, actnorm is deterministic at test time.
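A minimal sketch of that initialization, assuming NHWC feature maps (the shapes and the 1e-6 stabilizer are illustrative choices, not Glow's exact values):

```python
import numpy as np

def actnorm_init(x):
    """Data-dependent init: per-channel scale and bias chosen so the first
    batch comes out with zero mean and unit variance per channel.
    x: (N, H, W, C) feature map (NHWC layout assumed for this sketch)."""
    mean = x.mean(axis=(0, 1, 2))
    std = x.std(axis=(0, 1, 2)) + 1e-6   # small stabilizer, illustrative
    return 1.0 / std, -mean / std        # scale, bias

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(16, 8, 8, 4))
scale, bias = actnorm_init(x)
y = x * scale + bias                     # elementwise affine: invertible

# Likelihood contribution: each of the H*W spatial positions adds
# sum(log|scale|) to the log-det
H, W = x.shape[1], x.shape[2]
log_det = H * W * np.sum(np.log(np.abs(scale)))
print(y.mean(axis=(0, 1, 2)))            # ~0 per channel after init
print(y.std(axis=(0, 1, 2)))             # ~1 per channel after init
```

After this one-time initialization, scale and bias become ordinary learned parameters, which is why actnorm behaves deterministically at test time.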

Third, a multi-scale architecture: at each scale level, perform several steps of flow, then "squeeze" (reshape spatial dimensions into channels) and split off half the channels as direct outputs. This creates a hierarchy where early layers capture coarse structure and later layers capture fine details.
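The squeeze step itself is just a reshape. Here is one possible sketch (the exact pixel ordering within each 2×2 block varies between implementations):

```python
import numpy as np

def squeeze(x):
    """Trade spatial resolution for channels: (N, H, W, C) -> (N, H/2, W/2, 4C).
    A pure reshape/transpose, so it is exactly invertible with log-det zero."""
    N, H, W, C = x.shape
    x = x.reshape(N, H // 2, 2, W // 2, 2, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)        # gather each 2x2 block together
    return x.reshape(N, H // 2, W // 2, 4 * C)

def unsqueeze(x):
    """Exact inverse of squeeze: (N, H, W, C) -> (N, 2H, 2W, C/4)."""
    N, H, W, C = x.shape
    x = x.reshape(N, H, W, 2, 2, C // 4)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(N, 2 * H, 2 * W, C // 4)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 8, 3))
roundtrip = unsqueeze(squeeze(x))
print(squeeze(x).shape)                      # (2, 4, 4, 12)
print(np.max(np.abs(roundtrip - x)))         # 0.0 -- exact round trip
```

Because no values change, squeeze contributes nothing to the log-likelihood; it only rearranges which dimensions the subsequent coupling layers can mix.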

Autoregressive Flows — Maximum Expressivity

Coupling layers are efficient but limited: half the dimensions always pass through unchanged. Autoregressive flows remove this limitation by making every dimension depend on all previous ones:

xᵢ = zᵢ · exp(sᵢ) + tᵢ   where sᵢ, tᵢ = net(x₁, ..., xᵢ₋₁)

Since each xᵢ only depends on the previous dimensions, the Jacobian is lower triangular by construction. The determinant is O(d) — just the product of the diagonal elements exp(sᵢ). This is the Masked Autoregressive Flow (MAF), and its key insight is that the autoregressive structure guarantees a triangular Jacobian without any special architecture tricks.

MAF has an interesting duality. Computing densities is fast (one forward pass gives all sᵢ, tᵢ in parallel using masked connections). But sampling is slow: to generate x, you must compute x₁ first, then x₂ given x₁, then x₃ given x₁,x₂, and so on — sequential by nature. The Inverse Autoregressive Flow (IAF) flips this trade-off: fast sampling, slow density evaluation.
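To make the asymmetry concrete, here is a toy sketch in which cond_params is a stand-in for the masked network (not a real MAF): sampling must fill in dimensions one at a time, while inverting observed data only ever needs a prefix of x that is already known.

```python
import numpy as np

def cond_params(x_prefix):
    """Toy stand-in for the masked network: any deterministic function of
    x_1..x_{i-1} illustrates the structure (this is not a trained MAF)."""
    s = 0.1 * np.tanh(x_prefix.sum())
    t = 0.2 * x_prefix.sum()
    return s, t

rng = np.random.default_rng(0)
d = 5

# Sampling is sequential: x_i cannot be computed until x_1..x_{i-1} exist
z = rng.normal(size=d)
x = np.zeros(d)
for i in range(d):
    s, t = cond_params(x[:i])
    x[i] = z[i] * np.exp(s) + t

# Density evaluation only needs the observed x: every (s_i, t_i) depends on
# a known prefix, so a real masked network computes them all in one pass
z_rec = np.zeros(d)
log_det = 0.0
for i in range(d):
    s, t = cond_params(x[:i])
    z_rec[i] = (x[i] - t) * np.exp(-s)
    log_det -= s                          # log|det dz/dx| contribution

print(np.max(np.abs(z - z_rec)))          # latent recovered to machine precision
```

The inversion loop is written sequentially here only for clarity; since all its inputs are observed, it parallelizes, which is exactly why MAF density evaluation is fast while MAF sampling is not.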

Here's a deep connection: every autoregressive model is secretly a normalizing flow. When GPT generates text token by token, computing p(xᵢ | x₁...xᵢ₋₁) at each step, it's implicitly defining an invertible transformation from a sequence of independent uniform samples to the output text. The autoregressive factorization p(x) = Π p(xᵢ | x₁, ..., xᵢ₋₁) is the change-of-variables formula for a triangular Jacobian.

import numpy as np

def masked_autoregressive_flow_2d(X, epochs=400, lr=0.01):
    """Simple 2D Masked Autoregressive Flow (MAF)."""
    np.random.seed(42)

    # For 2D MAF: x1 depends on nothing, x2 depends on x1
    # x1 = z1 * exp(s1) + t1 (s1, t1 are learned constants)
    # x2 = z2 * exp(s2(x1)) + t2(x1)
    s1, t1 = np.zeros(1), np.zeros(1)  # mutable arrays for in-place updates
    # MLP for dimension 2: x1 -> (s2, t2)
    W1 = np.random.randn(1, 8) * 0.3
    b1 = np.zeros(8)
    W2 = np.random.randn(8, 2) * 0.1
    b2 = np.zeros(2)

    def inverse_pass(x):
        """Map data x to latent z (used for training)."""
        z1 = (x[:, 0:1] - t1[0]) * np.exp(-s1[0])
        h = np.maximum(0, x[:, 0:1] @ W1 + b1)
        st = h @ W2 + b2
        s2, t2 = st[:, 0:1], st[:, 1:2]
        s2 = np.clip(s2, -3, 3)
        z2 = (x[:, 1:2] - t2) * np.exp(-s2)
        log_det = -(s1[0] + np.sum(s2, axis=1))
        return np.hstack([z1, z2]), log_det

    losses = []
    for epoch in range(epochs):
        z, log_det = inverse_pass(X)
        log_pz = -0.5 * np.sum(z**2, axis=1) - np.log(2 * np.pi)
        log_px = log_pz + log_det
        loss = -np.mean(log_px)
        losses.append(loss)

        # Numerical gradient update
        eps = 1e-4
        # Update s1, t1
        for param in [s1, t1]:
            old_val = param[0]
            param[0] = old_val + eps
            z2, ld2 = inverse_pass(X)
            lp2 = np.mean(-0.5*np.sum(z2**2,axis=1) - np.log(2*np.pi) + ld2)
            param[0] = old_val
            param[0] -= lr * (-(lp2 - (-loss)) / eps)

        # Update MLP params
        for key, arr in [('W1', W1), ('b1', b1), ('W2', W2), ('b2', b2)]:
            it = np.nditer(arr, flags=['multi_index'])
            while not it.finished:
                idx = it.multi_index
                old_val = arr[idx]
                arr[idx] = old_val + eps
                z2, ld2 = inverse_pass(X)
                lp2 = np.mean(-0.5*np.sum(z2**2,axis=1) - np.log(2*np.pi) + ld2)
                arr[idx] = old_val
                arr[idx] -= lr * (-(lp2 - (-loss)) / eps)
                it.iternext()

    print(f"MAF loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
    return losses

# Train on the same moons data
losses = masked_autoregressive_flow_2d(X, epochs=400, lr=0.003)
# MAF loss: 3.156 -> 1.488

The MAF achieves slightly better loss than RealNVP on this simple problem because it transforms every dimension, whereas coupling layers leave half unchanged per layer. In practice, the choice between coupling layers and autoregressive flows depends on whether you need fast sampling (coupling layers) or fast density evaluation (MAF).

Try It: Likelihood Explorer

See the bijection between data space and latent space. Click anywhere to see a point's mapping and its exact log-likelihood. The flow warps a Gaussian into the target shape while preserving total probability.


The Generative Model Landscape — Where Flows Fit

With normalizing flows, we can now compare all four major generative model families:

Model      Likelihood          Sample Quality   Key Constraint
VAE        Lower bound (ELBO)  Blurry           Encoder approximation
GAN        None (implicit)     Sharp            Adversarial training instability
Flow       Exact               Good             Invertibility constraint
Diffusion  Near-exact bound    Best             Slow sampling (many steps)

Normalizing flows occupy a unique position: they're the only generative model that computes exact likelihoods without any approximation or bound. This makes them invaluable for applications where knowing the true probability matters: anomaly detection (flag low-likelihood inputs), out-of-distribution detection, scientific modeling (fitting physical distributions), and lossless compression.
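As a sketch of the anomaly-detection use case, with the trained flow replaced by an identity stand-in so that log p(x) is just the base Gaussian density, flagging the least-likely inputs takes a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2

def log_prob(x):
    """Stand-in for a trained flow's exact log p(x). Here the 'flow' is the
    identity, so this is just the base Gaussian density; a real detector
    would run x through the inverse flow and add the log-det terms."""
    return -0.5 * np.sum(x**2, axis=1) - d / 2 * np.log(2 * np.pi)

inliers = rng.normal(size=(500, d))
outliers = rng.normal(loc=5.0, size=(5, d))   # far from the modeled density
data = np.vstack([inliers, outliers])

scores = log_prob(data)
threshold = np.percentile(scores, 1)          # flag the least-likely ~1%
flagged = np.where(scores < threshold)[0]
print(flagged)                                # indices 500..504 dominate
```

Because the likelihoods are exact, the threshold has a direct probabilistic reading, which is harder to get from a GAN (no likelihood) or a VAE (only a bound).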

Their weakness is the invertibility constraint. Every layer must be bijective, which rules out architectures like U-Nets that change dimensionality. This is why diffusion models, which have no such constraint, typically produce higher-quality samples on complex data like images.

The modern landscape is one of convergence. Continuous normalizing flows (Neural ODEs) blend discrete flows with continuous dynamics, as explored in our flow matching post. Flow matching trains a velocity field without tracking Jacobians — it's the efficient training method that makes continuous flows practical. The mathematical foundation, however, remains the change-of-variables formula we built from scratch in this post. The Jacobian determinant is the thread that connects them all.

References & Further Reading

- Rezende, D. & Mohamed, S. (2015). Variational Inference with Normalizing Flows. (planar flows)
- Dinh, L., Sohl-Dickstein, J. & Bengio, S. (2017). Density Estimation Using Real NVP. (affine coupling layers)
- Kingma, D. P. & Dhariwal, P. (2018). Glow: Generative Flow with Invertible 1×1 Convolutions.
- Papamakarios, G., Pavlakou, T. & Murray, I. (2017). Masked Autoregressive Flow for Density Estimation.
- Kingma, D. P. et al. (2016). Improved Variational Inference with Inverse Autoregressive Flow.