Flow Matching from Scratch: The Simpler Path from Noise to Data
Why Flow Matching?
What if you could skip the thousand-step Markov chain entirely and just draw a straight line from noise to data?
In our diffusion models post, we built the full DDPM pipeline: a carefully designed noise schedule, a 1000-step forward process that destroys images, and a neural network that learns to reverse each step. It works beautifully — but it carries a lot of baggage. You need a noise schedule (linear? cosine? learned?). Training requires computing cumulative products of schedule parameters. Sampling means running a reverse Markov chain that inherits the curved geometry of the forward process, requiring dozens or hundreds of steps even with the DDIM shortcut.
Flow matching, introduced by Lipman et al. (2022) and independently by Albergo & Vanden-Eijnden (2022), strips all of that away. The core idea is breathtakingly simple: learn a velocity field that transports samples along straight lines from noise to data. No noise schedule. No Markov chain. No variance explosion. Just a neural network that predicts the direction and speed to move at every point along a continuous path from pure Gaussian noise (t=0) to clean data (t=1).
This isn't just an academic simplification. Flow matching is the algorithm behind Stable Diffusion 3, Flux, and the latest generation of image generators. The industry switched because straight paths mean fewer sampling steps, simpler training, and better results. In this post we'll build it from scratch, implement the training loop, derive the sampling algorithms, and see how the "reflow" trick makes one-step generation possible.
From Diffusion to Flows — The Continuous Perspective
Let's reframe generative modeling as a transportation problem. We have a source distribution p0 (standard Gaussian noise — easy to sample from) and a target distribution p1 (our data — images, audio, whatever we want to generate). We need to build a highway between them.
Diffusion models build this highway as a Markov chain: a sequence of stochastic steps, each one adding or removing a tiny bit of noise. The path from noise to data winds through a thousand waypoints, following the curved geometry dictated by the noise schedule.
Flow matching takes a different approach. Instead of discrete steps, we define a continuous velocity field v(x, t) that tells every particle where to go at every moment. If you drop a particle into the noise distribution at t=0 and let it ride the velocity field to t=1, it should arrive in the data distribution. Mathematically, this is an ordinary differential equation (ODE):
dx/dt = vθ(x, t)
Start at x0 ~ N(0, I), integrate forward to t=1, and arrive at x1 ≈ data. Think of it like rivers flowing from a lake (noise) to an ocean (data). Every drop of water has a velocity at every point along the way. The velocity field is the complete description of the flow.
The problem: learning this velocity field directly is intractable. We'd need to know the probability density pt at every intermediate time — the entire evolving distribution of particles as they flow from noise to data. We don't have that. This is where the breakthrough comes in.
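Before getting to that breakthrough, the transport idea itself is easy to see in code. Below is a minimal sketch in 1D where we hand-pick the coupling x1 = x0 + 3 (an illustrative choice, not part of the method), so every particle's velocity is the constant 3 and riding the field simply shifts the whole noise distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D transport: move N(0, 1) to N(3, 1).
# With the hand-picked coupling x_1 = x_0 + 3, every particle's
# velocity is the constant 3 -- the simplest possible flow.
def velocity(x, t):
    return np.full_like(x, 3.0)

# Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps
x = rng.standard_normal(10_000)   # samples from the source N(0, 1)
num_steps = 10
dt = 1.0 / num_steps
for i in range(num_steps):
    x = x + dt * velocity(x, i * dt)

print(round(x.mean(), 2), round(x.std(), 2))  # close to 3.0 and 1.0
```

Because the velocity is constant, Euler integration is exact here: the samples land on N(3, 1) no matter how many steps we use. The hard part, which conditional flow matching solves, is learning a velocity field like this when we only have samples, not a formula.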
Conditional Flow Matching — The Training Trick
The key insight from Lipman et al. is that we don't need to learn the global velocity field directly. Instead, we can decompose it into simple conditional flows and train on those.
Given a data sample x1, define the simplest possible path from a noise sample x0 to x1: a straight line.
xt = (1 - t) · x0 + t · x1
At t=0 we're at pure noise. At t=1 we're at pure data. In between, we're at a weighted mixture. The velocity along this path is dead simple — just the constant difference between the endpoints:
ut(x | x1) = x1 - x0
No schedule. No cumulative products. The conditional velocity is a single subtraction. And here's the remarkable theorem: training a neural network to match these conditional velocities produces a network that, at inference time, approximates the correct marginal velocity field — the one that transports the entire noise distribution to the entire data distribution. The conditional and marginal flow matching losses have identical gradients.
The training algorithm is almost suspiciously simple:
import numpy as np

def flow_matching_training_step(model, data_batch):
    """One training step of Conditional Flow Matching.

    Compare with DDPM, which requires:
      - A noise schedule (beta_1, ..., beta_T)
      - Cumulative products (alpha_bar_t)
      - The reparameterization x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps

    Flow matching needs none of that.
    """
    batch_size = data_batch.shape[0]

    # Sample noise and random timestep
    x_0 = np.random.randn(*data_batch.shape)      # noise
    x_1 = data_batch                              # data
    t = np.random.uniform(0, 1, (batch_size, 1))  # time ~ U[0, 1]

    # Straight-line interpolation (the "path")
    x_t = (1 - t) * x_0 + t * x_1

    # The target velocity: constant along the straight line
    velocity_target = x_1 - x_0

    # Network predicts the velocity at (x_t, t)
    velocity_pred = model(x_t, t)

    # MSE loss — that's it
    loss = np.mean((velocity_pred - velocity_target) ** 2)
    return loss
Four essential operations: sample noise, interpolate, subtract, minimize MSE. Compare this with the DDPM training loop from our diffusion post, which required a noise schedule, cumulative products ᾱt, and the reparameterization formula. The flow matching version is conceptually cleaner: we're just teaching the network to point from where a particle is to where it should go.
One subtlety worth noting: both diffusion and flow matching ultimately minimize an MSE loss between a network prediction and a target. In diffusion, the target is the noise ε that was added. In flow matching, the target is the velocity v = x1 - x0. These are related by a linear transformation (the noise schedule parameters), but the flow matching version doesn't need those parameters at all.
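A quick numeric check makes the correspondence concrete. This sketch uses this post's straight-line convention, where the noise x0 plays the role of ε: from x_t = (1 - t)·x0 + t·x1 it follows that x_t - x0 = t·(x1 - x0), so the velocity target can be recovered from the noise target as v = (x_t - ε) / t for any t > 0 (DDPM's own schedule gives a different but analogous linear relation):

```python
import numpy as np

rng = np.random.default_rng(1)

x_0 = rng.standard_normal((4, 2))   # noise (plays the role of eps)
x_1 = rng.standard_normal((4, 2))   # stand-in for data
t = rng.uniform(0.1, 1.0, (4, 1))   # avoid t = 0 for the division

x_t = (1 - t) * x_0 + t * x_1       # straight-line interpolation

v_target = x_1 - x_0                # flow matching target
v_from_eps = (x_t - x_0) / t        # recovered from the noise target

# The two targets agree: linear re-parameterizations of each other
print(np.allclose(v_target, v_from_eps))  # True
```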
Sampling — Solving the ODE
Training teaches the network to predict velocities. Sampling means following those velocities from noise to data. We start at x0 ~ N(0, I) and integrate dx/dt = vθ(x, t) from t=0 to t=1. This is a standard ODE initial-value problem, and we can throw any numerical integrator at it.
Here are three options in increasing order of accuracy:
def euler_sample(model, x_0, num_steps=20):
    """Euler method: simplest ODE solver.

    1 network evaluation per step. Error: O(h) total.
    For straight paths, this is nearly exact even with few steps.
    """
    x = x_0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = model(x, t)
        x = x + dt * v
    return x

def midpoint_sample(model, x_0, num_steps=20):
    """Midpoint method: 2nd-order ODE solver.

    2 network evaluations per step. Error: O(h^2) total.
    Much more accurate than Euler at the same budget of network evaluations.
    """
    x = x_0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        # Evaluate at current point
        v1 = model(x, t)
        # Take a half-step to the midpoint
        x_mid = x + 0.5 * dt * v1
        # Evaluate at midpoint
        v2 = model(x_mid, t + 0.5 * dt)
        # Full step using midpoint velocity
        x = x + dt * v2
    return x

def rk4_sample(model, x_0, num_steps=20):
    """Runge-Kutta 4th order: classic high-accuracy solver.

    4 network evaluations per step. Error: O(h^4) total.
    Diminishing returns for straight paths — overkill if flow is well-trained.
    """
    x = x_0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        k1 = model(x, t)
        k2 = model(x + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = model(x + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = model(x + dt * k3, t + dt)
        x = x + (dt / 6.0) * (k1 + 2*k2 + 2*k3 + k4)
    return x
The crucial insight: the choice of solver depends on how straight the learned paths are. If the velocity field produces perfectly straight trajectories, then Euler with just one step is mathematically exact — one straight line needs only one line segment. For nearly-straight paths, Euler with 10–20 steps is plenty. The higher-order solvers (midpoint, RK4) only help when paths have significant curvature.
This is a stark contrast with diffusion models, where the reverse process inherently follows curved paths through the noise schedule. DDIM needs at least 20–50 steps for decent quality. Flow matching, by training on straight-line targets, biases the learned velocity field toward straight paths from the start — and we can straighten them further.
In production, Stable Diffusion 3 and Flux use Euler with 20–50 steps. Flux-schnell (the distilled variant) generates images in 1–4 steps. How? That's where rectified flow comes in.
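The straightness/solver trade-off is easy to verify numerically. Here is a sketch with two hand-built velocity fields (illustrative choices, not learned models): a constant field, where paths are perfectly straight and one Euler step is exact, and a rotation field, where paths curve and a higher-order solver wins decisively:

```python
import numpy as np

def euler(v, x, steps):
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)
    return x

def rk4(v, x, steps):
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        k1 = v(x, t)
        k2 = v(x + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = v(x + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = v(x + dt * k3, t + dt)
        x = x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

# Case 1: constant field -> straight paths. One Euler step is exact.
c = np.array([2.0, -1.0])
def straight(x, t):
    return c

x0 = np.array([0.0, 0.0])
exact = x0 + c                              # x(1) = x(0) + c
err_one_step = np.abs(euler(straight, x0, 1) - exact).max()

# Case 2: rotation field -> curved paths. Exact solution rotates by pi/2.
omega = np.pi / 2
def curved(x, t):
    return omega * np.array([-x[1], x[0]])

x0 = np.array([1.0, 0.0])
exact = np.array([np.cos(omega), np.sin(omega)])  # = [0, 1]
err_euler = np.abs(euler(curved, x0, 10) - exact).max()
err_rk4 = np.abs(rk4(curved, x0, 10) - exact).max()

print(err_one_step)         # 0.0 -- straight paths need one step
print(err_euler > err_rk4)  # True -- curvature rewards higher order
```

The lesson carries over directly: the straighter the learned flow, the less solver machinery you need at sampling time.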
Rectified Flow — Straightening the Paths
Even though we train on straight-line targets, the learned velocity field doesn't produce perfectly straight paths. Why not? Because independent random coupling between noise and data creates crossing trajectories.
Imagine two noise samples x0^A and x0^B mapped to two data samples x1^A and x1^B. If the straight line from A's noise to A's data and the line from B's noise to B's data happen to cross in the middle, the learned velocity field has to compromise: at the crossing point, it can't point in two directions at once. This compromise introduces curvature, bending paths away from perfect straightness.
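A 1D sketch with hand-picked pairs makes the crossing concrete: pair noise -1 with data +1, and noise +1 with data -1. Both straight lines pass through x = 0 at t = 0.5 with opposite conditional velocities, so the best any single velocity field can do there is their average, zero:

```python
# Two hand-picked (noise, data) pairs whose straight lines cross
x0_a, x1_a = -1.0, 1.0   # pair A: conditional velocity +2
x0_b, x1_b = 1.0, -1.0   # pair B: conditional velocity -2

t = 0.5
xt_a = (1 - t) * x0_a + t * x1_a  # position of A's path at t = 0.5
xt_b = (1 - t) * x0_b + t * x1_b  # position of B's path at t = 0.5
print(xt_a, xt_b)                 # 0.0 0.0 -- the paths cross here

v_a = x1_a - x0_a                 # +2
v_b = x1_b - x0_b                 # -2

# The marginal field must average the conflicting conditionals:
v_marginal = 0.5 * (v_a + v_b)
print(v_marginal)                 # 0.0 -- neither path gets its way
```

Note that swapping the pairing (-1 to -1, +1 to +1) transports the same distributions with no crossing at all, and that re-pairing is exactly what rectified flow works toward.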
Rectified Flow, introduced by Liu et al. (2022), fixes this with an elegant iterative procedure called reflow:
- Train an initial flow model v1 on independent (noise, data) pairs
- Generate new couplings: for each noise sample x0, run the ODE forward to get z1 = ODE(x0; v1). Now (x0, z1) is a deterministic coupling — not random anymore
- Retrain a new flow v2 on these deterministic pairs. Since x0 and z1 are already connected by a smooth flow, the new straight-line targets cross less
- Repeat — each iteration reduces crossing and straightens paths
def reflow(model_old, data_samples, num_ode_steps=50):
    """Reflow: generate straighter (noise, data) couplings.

    Instead of pairing random noise with random data (lots of crossing),
    we pair each noise sample with the data point that the current model
    *actually transports it to*. This deterministic coupling has fewer
    crossings, so the retrained model learns straighter paths.
    """
    n = data_samples.shape[0]
    # Sample fresh noise
    x_0 = np.random.randn(n, data_samples.shape[1])
    # Transport noise to data using the current model
    z_1 = euler_sample(model_old, x_0, num_steps=num_ode_steps)
    # The new training pairs: (x_0, z_1) instead of (x_0, x_1_random)
    # These pairs have less crossing => straighter retraining targets
    return x_0, z_1

def measure_straightness(model, x_0, x_1, num_eval_points=20):
    """Measure how straight the learned paths are.

    A perfectly straight path has x_t = (1-t)*x_0 + t*x_1 and velocity
    dx/dt = x_1 - x_0 everywhere. Straightness measures deviation from this.
    S = 0 means perfectly straight. Larger S means more curvature.
    """
    dt = 1.0 / num_eval_points
    total_deviation = 0.0
    x = x_0.copy()
    ideal_velocity = x_1 - x_0  # constant velocity for a straight line
    for i in range(num_eval_points):
        t = i * dt
        predicted_velocity = model(x, t)
        deviation = np.mean((predicted_velocity - ideal_velocity) ** 2)
        total_deviation += deviation
        x = x + dt * predicted_velocity
    return total_deviation / num_eval_points
We measure straightness as the average deviation between the predicted velocity and the constant straight-line velocity. A perfectly straight path has S = 0: the velocity is the same everywhere along the trajectory, just as it would be for a constant-velocity straight line.
There's also a connection to optimal transport. Straight paths minimize the total distance particles travel (transport cost). In the limit of infinite reflow iterations, the flow converges toward the optimal transport map — the minimum-cost way to rearrange the noise distribution into the data distribution. You don't need to know optimal transport theory to use flow matching, but it's satisfying to know the math is well-grounded.
The practical payoff: after 1–2 reflow iterations, paths become nearly straight enough for single-step generation. This is exactly how Flux-schnell works — train the flow, reflow to straighten, and generate in 1–4 Euler steps.
Flow Matching vs Diffusion — Side by Side
Let's put the two approaches head to head:
| Aspect | DDPM / DDIM | Flow Matching |
|---|---|---|
| Forward process | Fixed noise schedule, Markov chain | Linear interpolation, no schedule |
| Network predicts | Noise ε | Velocity v = x1 - x0 |
| Training loss | ||ε - εθ(xt, t)||² | ||v - vθ(xt, t)||² |
| Sampling method | SDE (DDPM) or ODE (DDIM) | ODE only |
| Typical step count | 20 – 1000 | 1 – 50 |
| Path geometry | Curved (schedule-dependent) | Straight (or near-straight) |
| Noise schedule | Required (linear, cosine, etc.) | Not needed |
| Few-step generation | Requires distillation | Built in (reflow) |
There's a deep mathematical connection between the two. DDIM with a specific noise schedule is equivalent to Euler integration of a particular ODE — and that ODE is related to the flow matching ODE by a linear transformation of the prediction target. The noise prediction ε and the velocity prediction v = x1 - x0 are different parameterizations of the same underlying quantity, connected through the schedule parameters ᾱt.
So what actually differs? Three things matter in practice:
- Loss weighting: How much each timestep contributes to training. Diffusion's schedule parameters implicitly weight some timesteps more than others. Flow matching with uniform t ~ U[0,1] weights all timesteps equally, though modern implementations often use logit-normal sampling to emphasize the middle of the trajectory.
- Path geometry: Straight-line interpolation produces straighter learned paths than the curved schedule-dependent interpolation, enabling fewer sampling steps.
- Simplicity: No schedule to tune, no cumulative products to compute, no reparameterization gymnastics. The training loop is four lines of essential logic.
The industry voted with their codebases: Stability AI chose flow matching for Stable Diffusion 3. Black Forest Labs chose it for Flux. The trend is clear — when you can get the same results with a simpler formulation, the simpler formulation wins.
Guidance and Conditioning
A generative model that can't be steered isn't very useful. Classifier-free guidance (CFG) — the trick that makes text-to-image generation follow prompts — works in flow matching too. Instead of interpolating noise predictions, we interpolate velocity predictions:
vguided = vθ(x, t, ∅) + w · (vθ(x, t, c) - vθ(x, t, ∅))
Here c is the conditioning (e.g., a text prompt encoded by CLIP, as we explored in a previous post), ∅ is the null/unconditional embedding, and w is the guidance scale. Higher w means stronger adherence to the prompt at the cost of reduced diversity. Training is the same as in diffusion: randomly drop the conditioning 10% of the time to train the unconditional path.
There's one practical wrinkle that diffusion doesn't have. At very early timesteps (t ≈ 0), the input is almost pure noise, so the network's velocity estimates are unreliable. Standard CFG amplifies these errors, producing artifacts. The fix, sometimes called CFG-Zero*, is simple: zero out or heavily dampen the guidance signal for the first few steps, letting the model find its footing before steering.
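Putting the two pieces together, a guided velocity step might look like the following sketch. The function name guided_velocity and the warmup cutoff t_warmup are illustrative choices, not from any particular codebase, and the warmup rule is one simple variant of the early-step fix described above:

```python
import numpy as np

def guided_velocity(v_uncond, v_cond, w, t, t_warmup=0.05):
    """Classifier-free guidance on velocities, with early-step dampening.

    For t < t_warmup the guidance term is zeroed out: near t = 0 the
    input is almost pure noise, so amplifying the difference between
    conditional and unconditional predictions mostly amplifies error.
    """
    if t < t_warmup:
        return v_uncond  # let the model find its footing before steering
    return v_uncond + w * (v_cond - v_uncond)

# Sanity checks on the interpolation
v_u = np.array([1.0, 0.0])
v_c = np.array([0.0, 1.0])

print(guided_velocity(v_u, v_c, w=1.0, t=0.5))   # w = 1 recovers v_cond
print(guided_velocity(v_u, v_c, w=7.5, t=0.01))  # in warmup: returns v_uncond
```

In a sampler loop, this function would replace the plain `model(x, t)` call, with `v_cond` and `v_uncond` coming from two forward passes (or one batched pass) of the conditioned network.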
Stable Diffusion 3 uses an MMDiT (Multimodal Diffusion Transformer) architecture with joint text-image attention, dual text encoders (CLIP + T5), and logit-normal timestep sampling to emphasize the middle of the trajectory where most of the structural generation happens. Flux refines this further. But under the hood, the training objective is the same conditional flow matching loss we just built.
Putting It All Together — Training on Toy Data
Let's train a flow matching model end to end on a 2D dataset. We'll use a Swiss roll — the same dataset from our diffusion post — so you can directly compare the two approaches.
import numpy as np

# --- Generate Swiss Roll data ---
def make_swiss_roll(n=2000):
    t = np.random.uniform(1.5 * np.pi, 4.5 * np.pi, n)
    x = t * np.cos(t) / 10.0
    y = t * np.sin(t) / 10.0
    return np.stack([x, y], axis=1)

# --- Simple MLP velocity network ---
class VelocityNet:
    """Small MLP (two hidden layers) that predicts velocity given (x, t).

    Input: [x_dim + 1] (position + time)
    Output: [x_dim] (velocity)
    """
    def __init__(self, x_dim=2, hidden=128):
        scale = 0.01
        self.W1 = np.random.randn(x_dim + 1, hidden) * scale
        self.b1 = np.zeros(hidden)
        self.W2 = np.random.randn(hidden, hidden) * scale
        self.b2 = np.zeros(hidden)
        self.W3 = np.random.randn(hidden, x_dim) * scale
        self.b3 = np.zeros(x_dim)

    def forward(self, x, t):
        # Concatenate position and time
        t_col = np.full((x.shape[0], 1), t) if np.isscalar(t) else t.reshape(-1, 1)
        inp = np.concatenate([x, t_col], axis=1)
        # Two hidden layers with SiLU activation
        h = inp @ self.W1 + self.b1
        h = h * (1 / (1 + np.exp(-h)))  # SiLU = x * sigmoid(x)
        h = h @ self.W2 + self.b2
        h = h * (1 / (1 + np.exp(-h)))  # SiLU
        return h @ self.W3 + self.b3

    __call__ = forward  # make instances callable: model(x, t)

# --- Training loop ---
data = make_swiss_roll(5000)
model = VelocityNet(x_dim=2, hidden=128)
lr = 1e-3

for step in range(5000):
    # Sample a batch
    idx = np.random.choice(len(data), 256)
    x_1 = data[idx]                        # data
    x_0 = np.random.randn(256, 2)          # noise
    t = np.random.uniform(0, 1, (256, 1))  # time

    # Flow matching: interpolate, compute target, predict
    x_t = (1 - t) * x_0 + t * x_1          # path
    target = x_1 - x_0                     # velocity
    pred = model(x_t, t)                   # prediction
    loss = np.mean((pred - target) ** 2)   # MSE loss
    # (In practice, use PyTorch autograd for backprop here)
    # This shows the forward pass logic — the training objective is just MSE

    if step % 1000 == 0:
        print(f"Step {step}: loss = {loss:.4f}")

# --- Generate samples with Euler ---
def generate(model, n=500, steps=20):
    x = np.random.randn(n, 2)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = model(x, t)
        x = x + dt * v
    return x

samples_50 = generate(model, steps=50)
samples_20 = generate(model, steps=20)
samples_5 = generate(model, steps=5)
The training loop is the essence of flow matching: interpolate along a straight line, compute the constant velocity target, predict, minimize MSE. The complete forward pass logic fits in four lines. At inference, Euler integration walks particles from noise to data by repeatedly asking "which way should I move?" and taking a step in that direction.
Now let's apply reflow to straighten the paths:
# --- Reflow: straighten the learned paths ---
def generate_reflow_pairs(model, n=5000, steps=50):
    """Create (noise, data) pairs by running the trained model.

    These pairs have deterministic coupling — less crossing than random pairs.
    """
    x_0 = np.random.randn(n, 2)
    z_1 = x_0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = model(z_1, t)
        z_1 = z_1 + dt * v
    return x_0, z_1

# Generate reflowed training pairs
noise_paired, data_paired = generate_reflow_pairs(model, n=5000)

# Retrain on the reflowed pairs (same training loop, different data source)
model_v2 = VelocityNet(x_dim=2, hidden=128)
for step in range(5000):
    idx = np.random.choice(len(noise_paired), 256)
    x_0 = noise_paired[idx]
    x_1 = data_paired[idx]
    t = np.random.uniform(0, 1, (256, 1))

    x_t = (1 - t) * x_0 + t * x_1
    target = x_1 - x_0
    pred = model_v2(x_t, t)
    loss = np.mean((pred - target) ** 2)
    # (backprop and optimizer step in practice)

# After reflow: fewer steps needed for the same quality
samples_reflow_5 = generate(model_v2, steps=5)
samples_reflow_1 = generate(model_v2, steps=1)
# Compare: 5-step reflow samples should match 50-step original
The reflow procedure doesn't change the training algorithm — it changes the data. Instead of pairing random noise with random data points (which creates crossing paths), we pair each noise sample with the specific data point the current model transports it to. Since these pairs are already connected by a smooth flow, the straight-line targets between them have fewer crossings. The retrained model learns straighter paths, enabling generation in fewer steps.
Try It: Flow Field Visualizer
Watch particles flow from Gaussian noise (blue) to a Swiss roll (red) along the learned velocity field. Toggle "Compare Diffusion" to see the curved diffusion paths versus the straight flow matching paths. Adjust step count to see how fewer steps affect sample quality — flow matching stays sharp with fewer steps because paths are straighter.
Try It: Reflow Explorer
The reflow procedure straightens paths by replacing random (noise, data) couplings with deterministic ones. Toggle between "Before Reflow" (curved, crossing paths) and "After Reflow" (straighter paths). Adjust the step count and observe that reflowed paths produce better samples with fewer steps.
Drawing Straight Lines
Flow matching reframes generative modeling as a transportation problem: learn a velocity field that moves samples from noise to data. Train by regressing onto straight-line velocities. Sample by integrating an ODE. Straighten paths via reflow for fewer-step generation.
The progression from DDPM to flow matching is a story of removing unnecessary complexity:
- DDPM: 1000 stochastic steps, fixed noise schedule, curved Markov chain paths
- DDIM: 50 deterministic steps, same schedule, slightly less curved ODE paths
- Flow Matching: 20 ODE steps, no schedule, straight-line training targets
- Rectified Flow: 1–4 Euler steps, reflowed couplings, nearly straight paths
Each step in this progression removes a piece of scaffolding that turned out to be an artifact of the formulation rather than a fundamental requirement. The Markov chain was scaffolding. The noise schedule was scaffolding. Even the curved paths were scaffolding. What remains is the core: a neural network that learns the shortest route from noise to data.
Sometimes the best improvement is simplification. Diffusion models worked — brilliantly — but the thousand-step reverse process was always a computational compromise, not a mathematical necessity. Flow matching found the straight line hiding underneath.
References & Further Reading
- Lipman et al. — Flow Matching for Generative Modeling (2022) — The paper that introduced conditional flow matching, the key training trick that makes this practical
- Liu et al. — Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (2022) — Rectified flow and the reflow procedure for straightening paths
- Esser et al. — Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (2024) — The Stable Diffusion 3 paper, applying flow matching at scale with MMDiT
- Albergo & Vanden-Eijnden — Building Normalizing Flows with Stochastic Interpolants (2022) — Independent derivation of similar ideas through stochastic interpolants
- Ho et al. — Denoising Diffusion Probabilistic Models (2020) — The original DDPM paper, the predecessor that flow matching simplifies
- Song et al. — Denoising Diffusion Implicit Models (2020) — DDIM, the bridge between diffusion's Markov chain and the ODE perspective
- Salimans & Ho — Progressive Distillation for Fast Sampling of Diffusion Models (2022) — v-prediction parameterization, a conceptual bridge to flow matching velocity
- Tong et al. — Improving and Generalizing Flow-Based Generative Models with Mini-Batch Optimal Transport (2023) — OT-CFM: using optimal transport within minibatches to further straighten paths
- Lilian Weng — Flow-based Deep Generative Models (Lil'Log) — Excellent tutorial on the broader family of flow-based models