Flow Matching and Diffusion Models Learning Notes


This post derives Flow Matching and Diffusion Models from first principles. We start from the fundamental question of generative modeling, build the necessary mathematical tools, and arrive at practical training objectives.

1. The Generative Modeling Problem

The central goal: given samples \(X \sim p_{\text{data}}(X)\) from an unknown data distribution \(p_{\text{data}}(X)\), learn a model that can generate new samples from \(p_{\text{data}}(X)\).

A powerful idea is to learn a transport map \(T\) that transforms a simple, known distribution \(p_{\text{init}}\) (e.g., Gaussian noise \(\mathcal{N}(0, I)\)) into the data distribution \(p_{\text{data}}\):

\[X_0 \sim p_{\text{init}} = \mathcal{N}(0, I) \quad \xrightarrow{T} \quad X_1 \sim p_{\text{data}}(X)\]

The question is: how do we construct such a map?


2. Continuous Normalizing Flows (CNFs)

2.1 From Discrete to Continuous

Instead of constructing \(T\) in one step, we define a continuous path from noise to data. Consider an ordinary differential equation (ODE) parameterized by a velocity field \(u_t: \mathbb{R}^d \to \mathbb{R}^d\):

\[\frac{\mathrm{d}X_t}{\mathrm{d}t} = u_t(X_t), \quad t \in [0, 1]\]

Starting from \(X_0 \sim p_{\text{init}}\), this ODE defines a flow \(\phi_t\) that maps \(X_0\) to \(X_t = \phi_t(X_0)\). The flow generates a time-dependent probability path \(p_t\), where:

\[p_0 = p_{\text{init}}, \quad p_1 \approx p_{\text{data}}(X)\]

3. From ODE to SDE: Why Add Randomness?

3.1 The Limitation of Deterministic Flows

The CNF defines a deterministic ODE \(\frac{\mathrm{d}X_t}{\mathrm{d}t} = u_t(X_t)\). Given an initial point \(X_0\), the trajectory \(X_t\) is uniquely determined. This is elegant but has a fundamental limitation:

  • Trajectories cannot cross in an ODE (uniqueness theorem). In high dimensions, this forces the velocity field to learn complex, winding paths to rearrange probability mass without crossing.
  • The model must transport every point deterministically — there is no room to “spread out” or “concentrate” probability through randomness.

3.2 Adding Noise: The Langevin Intuition

Consider the simplest example of noise-driven dynamics: Langevin dynamics for sampling from a known distribution \(\pi(X) \propto e^{-U(X)}\):

\[\mathrm{d}X_t = -\nabla U(X_t) \, \mathrm{d}t + \sigma_t \, \mathrm{d}W_t\]

Here \(\mathrm{d}W_t\) represents infinitesimal Gaussian noise (a Wiener process). The drift \(-\nabla U(X)\) pushes \(X_t\) toward regions of high probability, while the noise \(\mathrm{d}W_t\) enables exploration. Under mild conditions, this process converges to the stationary distribution \(\pi(X)\).

Key insight: the noise term transforms a deterministic gradient flow into a stochastic process that can explore the target distribution rather than merely flowing toward it.
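As a minimal numerical sketch of this intuition (not from the post): simulating Langevin dynamics with a simple discretization, for the toy potential \(U(x) = x^2/2\) and constant \(\sigma = \sqrt{2}\), for which the stationary law \(\pi \propto e^{-U}\) is the standard normal.

```python
import numpy as np

# Sketch: discretized Langevin dynamics dX = -U'(X) dt + sqrt(2) dW.
# With U(x) = x^2 / 2, the stationary distribution pi(x) ∝ e^{-U(x)} is N(0, 1).
rng = np.random.default_rng(0)

def grad_U(x):
    return x  # gradient of U(x) = x^2 / 2

dt, n_steps, n_chains = 1e-2, 5000, 10_000
x = rng.normal(size=n_chains) * 5.0  # start far from equilibrium
for _ in range(n_steps):
    # drift toward high-probability regions + exploratory Gaussian noise
    x += -grad_U(x) * dt + np.sqrt(2 * dt) * rng.normal(size=n_chains)

print(x.mean(), x.std())  # ≈ 0 and ≈ 1 once the chains have mixed
```

Despite the random start, the ensemble settles into the target distribution rather than collapsing to the mode of \(e^{-U}\).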

3.3 The General SDE

An SDE generalizes the ODE by adding a noise term:

\[\mathrm{d}X_t = \underbrace{f_t(X_t) \, \mathrm{d}t}_{\text{deterministic drift (ODE part)}} + \underbrace{g_t \, \mathrm{d}W_t}_{\text{stochastic diffusion}}\]

If we parameterize the drift using a neural network representing the velocity field, we get an SDE like:

\[\mathrm{d}X_t = u_t^\theta(X_t) \, \mathrm{d}t + \sigma_t \, \mathrm{d}W_t\]

When \(\sigma_t = 0\), we recover the deterministic ODE. When \(\sigma_t > 0\), the trajectories become stochastic: the same initial \(X_0\) gives rise to different \(X_t\) on each realization. Instead of transporting individual points, the SDE transports distributions.

3.4 The Fokker-Planck Equation

Just as the ODE has the continuity equation, the SDE has its own evolution equation for densities — the Fokker-Planck equation (also called the Kolmogorov forward equation):

\[\frac{\partial p_t(X)}{\partial t} = -\nabla \cdot \left[ f_t(X) \, p_t(X) \right] + \frac{g_t^2}{2} \nabla^2 p_t(X)\]

Compare this with the continuity equation for the ODE:

\[\frac{\partial p_t(X)}{\partial t} = -\nabla \cdot \left[ u_t(X) \, p_t(X) \right]\]

The extra term \(\frac{g_t^2}{2} \nabla^2 p_t\) is a diffusion term — it spreads out the density, acting like heat diffusion. This is exactly why the noise makes SDEs more flexible: the diffusion term provides an additional mechanism for shaping the probability density beyond what the drift alone can achieve.

3.5 SDE as a Generalization of ODE

We can now see the hierarchy clearly:

|  | ODE | SDE |
| --- | --- | --- |
| Equation | \(\mathrm{d}X_t = u_t(X_t) \, \mathrm{d}t\) | \(\mathrm{d}X_t = u_t(X_t) \, \mathrm{d}t + \sigma_t \, \mathrm{d}W_t\) |
| Trajectories | Deterministic | Stochastic |
| Density evolution | Continuity equation | Fokker-Planck equation |
| Transport mechanism | Drift only | Drift + diffusion |
| Path crossings | Forbidden | Allowed (in distribution) |

3.6 Simulating ODEs and SDEs

Once we have trained our model to predict the velocity field \(u_t^\theta\) or score, we must simulate the ODE or SDE to generate data. Since these differential equations rarely have analytical solutions, we rely on numerical integration.

3.6.1 Euler Method for ODEs

The simplest approach to solve an ODE \(\mathrm{d}X_t = u_t(X_t) \, \mathrm{d}t\) is the Euler method. We discretize time into small steps \(\Delta t\):

\[X_{t + \Delta t} = X_t + u_t^\theta(X_t) \Delta t\]

While computationally cheap, the Euler method assumes the velocity field is constant over the step \(\Delta t\), which can lead to significant discretization errors if the trajectory curves sharply.
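A concrete sketch of the Euler update, with a toy velocity field \(u_t(x) = x\) standing in for a trained network (exact solution \(X_1 = e \, X_0\), so the error is easy to read off):

```python
import numpy as np

# Sketch of the Euler method for dX/dt = u_t(X). The velocity field here is
# a toy stand-in (u_t(x) = x, exact solution X_t = X_0 * e^t), not a model.
def euler_integrate(u, x0, n_steps=1000, t0=0.0, t1=1.0):
    dt = (t1 - t0) / n_steps
    x, t = np.asarray(x0, dtype=float), t0
    for _ in range(n_steps):
        x = x + u(t, x) * dt  # assumes u is constant over [t, t + dt]
        t += dt
    return x

x1 = euler_integrate(lambda t, x: x, x0=1.0)
print(x1)  # ≈ e ≈ 2.718 (Euler undershoots slightly: ~2.7169 at 1000 steps)
```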

3.6.2 Heun’s Method for ODEs

To improve accuracy without requiring excessively small steps, we can use higher-order solvers like Heun’s method (a 2nd-order Runge-Kutta method). It computes a “predictor” step using Euler, and then applies a “corrector” step by averaging the velocity at the start and the predicted end:

  1. Predictor: \(\tilde{X}_{t + \Delta t} = X_t + u_t^\theta(X_t) \Delta t\)
  2. Corrector: \(X_{t + \Delta t} = X_t + \frac{\Delta t}{2} \left[ u_t^\theta(X_t) + u_{t + \Delta t}^\theta(\tilde{X}_{t + \Delta t}) \right]\)

This provides much more accurate trajectories, allowing for fewer total network evaluations, which is highly beneficial for generation speed.
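The predictor-corrector steps above translate directly into code; the same toy field \(u_t(x) = x\) is used as a stand-in for a trained network:

```python
import numpy as np

# Sketch of Heun's method (2nd-order Runge-Kutta) for dX/dt = u_t(X).
def heun_integrate(u, x0, n_steps=100, t0=0.0, t1=1.0):
    dt = (t1 - t0) / n_steps
    x, t = np.asarray(x0, dtype=float), t0
    for _ in range(n_steps):
        x_pred = x + u(t, x) * dt                          # predictor (Euler)
        x = x + 0.5 * dt * (u(t, x) + u(t + dt, x_pred))   # corrector (average)
        t += dt
    return x

x1 = heun_integrate(lambda t, x: x, x0=1.0)
print(x1)  # ≈ e with only 100 steps, far closer than Euler at the same cost
```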

3.6.3 Euler-Maruyama Method for SDEs

When simulating an SDE \(\mathrm{d}X_t = u_t(X_t) \, \mathrm{d}t + \sigma_t \, \mathrm{d}W_t\) (such as the reverse-time SDE in diffusion models), we must account for the stochastic term. The natural extension of the Euler method for SDEs is the Euler-Maruyama method:

\[X_{t + \Delta t} = X_t + u_t(X_t) \Delta t + \sigma_t \sqrt{\Delta t} \, z\]

where \(z \sim \mathcal{N}(0, I)\) is standard Gaussian noise, representing the increment of the Wiener process (\(\Delta W_t \sim \mathcal{N}(0, \Delta t \, I)\)).

This injected noise gives diffusion-model sampling its characteristic stochasticity; it is also why SDE samplers typically need many small steps for the simulated distribution to match the target accurately.
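A minimal sketch of an Euler-Maruyama simulator. As a sanity check it uses zero drift and \(\sigma_t = 1\), so \(X_1 = X_0 + W_1\) and the final ensemble started at \(X_0 = 0\) should be \(\mathcal{N}(0, 1)\):

```python
import numpy as np

# Sketch of Euler-Maruyama for dX = u_t(X) dt + sigma_t dW.
rng = np.random.default_rng(0)

def euler_maruyama(u, sigma, x0, n_steps=200, t0=0.0, t1=1.0):
    dt = (t1 - t0) / n_steps
    x, t = np.array(x0, dtype=float), t0
    for _ in range(n_steps):
        z = rng.normal(size=x.shape)  # increment Delta W ~ N(0, dt * I)
        x = x + u(t, x) * dt + sigma(t) * np.sqrt(dt) * z
        t += dt
    return x

# zero drift, unit diffusion: X_1 should be standard Brownian motion at t = 1
x1 = euler_maruyama(lambda t, x: 0.0 * x, lambda t: 1.0, np.zeros(100_000))
print(x1.mean(), x1.std())  # ≈ 0 and ≈ 1
```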

3.7 A Unified View of Generation

Throughout this post, both flow models and diffusion models can be viewed through this unified lens: a neural network \(u_t^\theta\) with parameters \(\theta\) parameterizing a drift vector field, together with a fixed diffusion coefficient \(\sigma_t\):

  • Neural network: \(u^\theta: \mathbb{R}^d \times [0,1] \to \mathbb{R}^d, \quad (X,t) \mapsto u_t^\theta(X)\)
  • Fixed noise schedule: \(\sigma_t: [0,1] \to [0,\infty), \quad t \mapsto \sigma_t\)

To obtain samples from our model (i.e., generate objects), the procedure is as follows:

  1. Initialization: \(X_0 \sim p_{\text{init}}\) ▶ Initialize with a simple known distribution, e.g., standard Gaussian.
  2. Simulation: \(\mathrm{d}X_t = u_t^\theta(X_t) \, \mathrm{d}t + \sigma_t \, \mathrm{d}W_t\) ▶ Simulate the dynamics forward in time from \(t=0\) to \(t=1\).
  3. Goal: \(X_1 \sim p_{\text{data}}\) ▶ The goal is that the final state \(X_1\) follows the complex data distribution.

A generative model with \(\sigma_t = 0\) is a Flow Model (solving an ODE), while a model with \(\sigma_t > 0\) is a Diffusion Model (solving an SDE).

The Training Methodology: The challenge is training the network \(u_t^\theta\). The overarching mathematical roadmap is:

  1. First, mathematically define a forward process (SDE or ODE flow) that connects \(p_{\text{data}}\) to \(p_{\text{init}}\), and derive what the true “target” vector field \(u_t^{\text{target}}(X)\) must be to reverse or construct the generative process.
  2. Formulate a training objective using Mean Squared Error (MSE) to match the neural network to this target: \(\mathcal{L}(\theta) = \mathbb{E}_{t, X} \left[ \left\| u_t^\theta(X) - u_t^{\text{target}}(X) \right\|^2 \right]\)
  3. Because evaluating \(u_t^{\text{target}}(X)\) exactly requires marginalizing over the entire dataset, it is generally intractable.
  4. Construct a tractable proxy (a computable training target) using techniques like Denoising Score Matching or Conditional Flow Matching to make the loss optimizable.

The remainder of this post explores how Diffusion Models and Flow Matching construct their respective computable targets.

4. Constructing the Training Target

As discussed, training our generative neural network \(u_t^\theta\) requires matching it against a target vector field \(u_t^{\text{target}}\). The first step in constructing this target is to specify a probability path.

4.1 Probability Path

The goal of generative modeling is to construct a continuous family of distributions

\[p_t(x), t \in [0,1]\]

such that

\[\begin{aligned} p_0 &= p_{\text{init}} \\ p_1 &= p_{\text{data}} \end{aligned}\]

Conceptually:

noise distribution → \(p_t\) → data distribution

However, directly specifying \(p_t(x)\) is usually difficult.


4.2 Conditional Probability Path

To make the construction tractable, we introduce a conditional probability path:

\[p_t(x | z)\]

where

\[z \sim p_{\text{data}}\]

The conditional path satisfies

\[\begin{aligned} p_0(\cdot|z) &= p_{\text{init}} \\ p_1(\cdot|z) &= \delta_z \end{aligned}\]

Interpretation:

For each data point \(z\), we construct a path from noise to that point.

Conceptually:

noise → z₁
noise → z₂
noise → z₃

Each data point defines one trajectory.


4.3 Marginal Probability Path

Sampling from the conditional path induces the marginal path.

Sampling procedure:

  1. Sample \(z \sim p_{\text{data}}\)
  2. Sample \(X_t \sim p_t(X_t \mid z)\)

Then

\[X_t \sim p_t(X_t)\]

and

\[p_t(X_t) = \int p_t(X_t|z) \, p_{\text{data}}(z) \, \mathrm{d}z\]

Thus the marginal path interpolates between noise and data:

\[\begin{aligned} p_0 &= p_{\text{init}} \\ p_1 &= p_{\text{data}} \end{aligned}\]
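The two-stage sampling procedure is easy to demonstrate in 1-D. The "dataset" below is two points \(\{-2, +2\}\), and the conditional path is one concrete choice, \(p_t(x \mid z) = \mathcal{N}(x; t z, (1-t)^2)\), which starts at \(\mathcal{N}(0, 1)\) and collapses onto \(z\) as \(t \to 1\):

```python
import numpy as np

# Sketch of marginal-path sampling: first draw a data point z, then draw X_t
# from the (Gaussian) conditional path p_t(. | z).
rng = np.random.default_rng(0)
data = np.array([-2.0, 2.0])  # toy two-point "dataset", p_data uniform

def sample_marginal(t, n):
    z = rng.choice(data, size=n)                 # 1. z ~ p_data
    return t * z + (1 - t) * rng.normal(size=n)  # 2. X_t ~ p_t(. | z)

print(sample_marginal(0.0, 5))   # pure noise: N(0, 1) samples
print(sample_marginal(0.99, 5))  # nearly all mass concentrated near -2 or +2
```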

4.4 From Probability Path to Vector Fields

Now that we have a probability path \(p_t\), we need a dynamical system that produces it.

Assume particles evolve according to an ODE:

\[\frac{\mathrm{d}X_t}{\mathrm{d}t} = u_t(X_t)\]

where \(u_t: \mathbb{R}^d \to \mathbb{R}^d\) is a time-dependent vector field.

The goal is to find a vector field whose induced dynamics generate the desired distribution path.


4.5 Conditional Vector Field

For each data point \(z\), define a conditional vector field

\[u_t^{\text{target}}(X_t | z)\]

such that the ODE

\[\frac{\mathrm{d}X_t}{\mathrm{d}t} = u_t^{\text{target}}(X_t | z)\]

with initialization

\[X_0 \sim p_{\text{init}}\]

produces

\[X_t \sim p_t(\cdot | z)\]

Thus:

conditional vector field

generates

conditional probability path


4.6 Marginal Vector Field

The model does not know \(z\) during generation.

Therefore the neural network must approximate the marginal vector field

\[u_t^{\text{target}}(X_t)\]

The marginal field is defined as

\[u_t^{\text{target}}(X_t) = \int u_t^{\text{target}}(X_t | z) \frac{p_t(X_t | z) \, p_{\text{data}}(z)}{p_t(X_t)} \, \mathrm{d}z\]

Using Bayes rule:

\[p(z | X_t, t) = \frac{p_t(X_t | z) \, p_{\text{data}}(z)}{p_t(X_t)}\]

we obtain

\[u_t^{\text{target}}(X_t) = \mathbb{E}_{z \sim p(z | X_t, t)} \left[ u_t^{\text{target}}(X_t | z) \right]\]

Interpretation:

The marginal vector field is the conditional expectation of the conditional vector fields.
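This posterior-weighted average can be computed exactly in a tiny 1-D example. The choices below are illustrative: a two-point dataset \(\{-2, +2\}\), the Gaussian conditional path \(p_t(x \mid z) = \mathcal{N}(t z, (1-t)^2)\), and its straight-line conditional field \(u_t(x \mid z) = (z - x)/(1 - t)\):

```python
import numpy as np

# Sketch: marginal vector field as the posterior-weighted average of
# conditional vector fields, u_t(x) = E_{z ~ p(z|x,t)}[u_t(x|z)].
data = np.array([-2.0, 2.0])  # toy dataset, p_data uniform

def marginal_field(x, t):
    # posterior weights p(z | x, t) ∝ p_t(x|z) p_data(z)  (Bayes' rule)
    w = np.exp(-0.5 * ((x - t * data) / (1 - t)) ** 2)
    w /= w.sum()
    u_cond = (data - x) / (1 - t)  # conditional fields, one per data point
    return np.sum(w * u_cond)      # conditional expectation

print(marginal_field(0.0, 0.5))  # = 0 by symmetry between the two points
print(marginal_field(0.5, 0.5))  # > 0: posterior favors z = +2
```

At \(x = 0\) the two conditional fields cancel exactly; off-center, the posterior tilts toward the nearer data point and the marginal field points toward it.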

Derivation of the Marginal Vector Field Definition

We start from a conditional probability path \(p_t(x\mid z)\) with an associated conditional vector field \(u_t(x\mid z)\) satisfying the continuity equation:

\[\partial_t p_t(x\mid z) + \nabla \cdot \big(p_t(x\mid z)\,u_t(x\mid z)\big) = 0.\]

The marginal probability path is defined by

\[p_t(x) = \int p_t(x\mid z)\,p_{\text{data}}(z)\,dz.\]

Our goal is to find a marginal vector field \(u_t(x)\) such that the marginal density satisfies the continuity equation:

\[\partial_t p_t(x) + \nabla \cdot \big(p_t(x)\,u_t(x)\big) = 0.\]

Step 1: Differentiate the marginal density

Differentiating the marginal density with respect to time:

\[\partial_t p_t(x) = \int \partial_t p_t(x\mid z)\,p_{\text{data}}(z)\,dz.\]

Substituting the conditional continuity equation:

\[\partial_t p_t(x\mid z) = -\nabla\cdot\big(p_t(x\mid z)\,u_t(x\mid z)\big),\]

we obtain

\[\partial_t p_t(x) = -\int \nabla\cdot\big(p_t(x\mid z)\,u_t(x\mid z)\big) \,p_{\text{data}}(z)\,dz.\]

Step 2: Move the divergence outside the integral

Since the divergence operator acts only on \(x\), we can exchange integration and divergence:

\[\partial_t p_t(x) = -\nabla\cdot \left( \int p_t(x\mid z)\,u_t(x\mid z)\,p_{\text{data}}(z)\,dz \right).\]

Step 3: Match with the marginal continuity equation

The marginal continuity equation requires

\[\partial_t p_t(x) + \nabla\cdot\big(p_t(x)u_t(x)\big)=0.\]

Comparing with the previous result, we must have

\[p_t(x)\,u_t(x) = \int p_t(x\mid z)\,u_t(x\mid z)\,p_{\text{data}}(z)\,dz.\]

Step 4: Solve for the marginal vector field

Dividing both sides by \(p_t(x)\) gives

\[u_t(x) = \int u_t(x\mid z)\, \frac{p_t(x\mid z)\,p_{\text{data}}(z)}{p_t(x)}\,dz.\]

Step 5: Interpretation via Bayes' rule

Note that

\[\frac{p_t(x\mid z)\,p_{\text{data}}(z)}{p_t(x)} = p(z\mid x,t).\]

Therefore the marginal vector field can be written as

\[u_t(x) = \int u_t(x\mid z)\,p(z\mid x,t)\,dz = \mathbb{E}[u_t(x\mid z)\mid x].\]

Final Result

The marginal vector field is defined as

\[u_t(x):=\int u_t(x\mid z)\,\frac{p_t(x\mid z)p_{\text{data}}(z)}{p_t(x)}\,dz\]

because this is the unique definition that ensures the marginal distribution \(p_t(x)\) satisfies the continuity equation and is therefore generated by the vector field \(u_t(x)\).


5. Diffusion Models from First Principles

In Diffusion Models, we approach the problem using the SDE framework.

5.1 Defining the Probability Path

Instead of learning the path directly, we manually define a simple forward process that destroys data into noise (in this post's convention, this noising runs from the data end $t = 1$ back toward the noise end $t = 0$). A common Variance Preserving (VP) SDE is:

\[\mathrm{d}X_t = -\frac{1}{2}\beta_t \, X_t \, \mathrm{d}t + \sqrt{\beta_t} \, \mathrm{d}W_t\]

This forward process explicitly defines a Gaussian conditional probability path, with coefficients $\alpha_t$ and $\sigma_t$ determined by the noise schedule $\beta_t$:

\[p_{t|1}(X_t | X_1) = \mathcal{N}(X_t; \, \alpha_t \, X_1, \, \sigma_t^2 \, I)\]

5.2 Denoising Score Matching (DSM)

The training target here involves the marginal score $\nabla_X \log p_t(X)$, and we parameterize a network $s_\theta(X_t, t)$ to predict it.

However, computing the true marginal score requires integrating over the entire dataset, which is intractable.

The DSM Identity: We replace the intractable marginal score with the tractable conditional score:

\[\nabla_{X_t} \log p_t(X_t) = \mathbb{E}_{X_1 \sim p_{1|t}(X_1|X_t)}\left[\nabla_{X_t} \log p_{t|1}(X_t|X_1)\right]\]

For our Gaussian conditional path, the conditional score is exactly computable analytically. Writing $X_t = \alpha_t X_1 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$:

\[\nabla_{X_t} \log p_{t|1}(X_t | X_1) = -\frac{X_t - \alpha_t X_1}{\sigma_t^2} = -\frac{\epsilon}{\sigma_t}\]

Thus, the intractable target becomes computable, leading to the famous DDPM training objective:

\[\boxed{\mathcal{L}_{\text{DDPM}}(\theta) = \mathbb{E}_{t, X_1, \epsilon \sim \mathcal{N}(0,I)} \left[ \left\| \epsilon_\theta(\alpha_t X_1 + \sigma_t \epsilon, t) - \epsilon \right\|^2 \right]}\]
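A minimal NumPy sketch of one Monte Carlo estimate of the boxed objective. `eps_net` is a placeholder for the neural network, and the schedule $\alpha_t = \sin(\pi t / 2)$, $\sigma_t = \cos(\pi t / 2)$ is one common variance-preserving choice ($\alpha_t^2 + \sigma_t^2 = 1$) consistent with this post's convention of $t = 1$ at the data end:

```python
import numpy as np

# Sketch of a single Monte Carlo estimate of the DDPM loss.
rng = np.random.default_rng(0)

def ddpm_loss(eps_net, x1):
    b, d = x1.shape
    t = rng.uniform(size=(b, 1))                            # t ~ U[0, 1]
    alpha, sigma = np.sin(np.pi * t / 2), np.cos(np.pi * t / 2)
    eps = rng.normal(size=(b, d))                           # eps ~ N(0, I)
    xt = alpha * x1 + sigma * eps                           # X_t ~ p_{t|1}(.|x1)
    return np.mean(np.sum((eps_net(xt, t) - eps) ** 2, axis=-1))

# placeholder "network" that always predicts zero noise:
# the loss then estimates E||eps||^2 = d (here d = 2)
loss = ddpm_loss(lambda xt, t: np.zeros_like(xt), rng.normal(size=(4096, 2)))
print(loss)  # ≈ 2
```

In practice `eps_net` would be a trained model and the loss would be minimized over its parameters; the zero predictor just makes the expected value easy to verify.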

6. Flow Matching from First Principles

Flow Matching takes the continuity equation approach directly, completely bypassing SDEs and the score function.

6.1 Defining the Probability Path

Instead of defining a noisy forward SDE, Flow Matching defines the conditional probability path $p_{t|1}(X_t | X_1)$ by simply drawing a straight line from noise to data:

\[X_t = (1 - t) X_0 + t \, X_1 \quad \text{where} \quad X_0 \sim p_{\text{init}}, \, X_1 \sim p_{\text{data}}\]

This yields a Gaussian conditional probability path:

\[p_{t|1}(X_t | X_1) = \mathcal{N}(X_t; \, t \, X_1, \, (1 - t)^2 I)\]

6.2 The Target Vector Field

For this linear interpolant, the conditional vector field $u_{t|1}^*(X_t | X_1)$ that satisfies the continuity equation is simple. Taking the time derivative of $X_t$:

\[u_{t|1}^*(X_t | X_1) = \frac{\mathrm{d}X_t}{\mathrm{d}t} = X_1 - X_0\]

The target velocity is simply the straight-line direction from the noise $X_0$ to the data $X_1$; written in terms of $X_t$, it equals $(X_1 - X_t)/(1 - t)$.

6.3 The CFM Loss

The true marginal target vector field $u_t^{\text{target}}(X)$ is computationally intractable. Lipman et al. (2023) proved a result analogous to Denoising Score Matching: the Conditional Flow Matching (CFM) loss has the same gradients with respect to $\theta$ as the intractable Flow Matching loss, so minimizing one minimizes the other.

This yields an extraordinarily simple and tractable CFM training objective:

\[\boxed{\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, X_0, X_1} \left[ \left\| u_t^\theta\big((1-t)X_0 + t \, X_1\big) - (X_1 - X_0) \right\|^2 \right]}\]
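A minimal NumPy sketch of one Monte Carlo estimate of the boxed objective. `u_net` stands in for the neural network $u_t^\theta$; everything else follows directly from the straight-line construction above:

```python
import numpy as np

# Sketch of a single Monte Carlo estimate of the CFM loss.
rng = np.random.default_rng(0)

def cfm_loss(u_net, x1):
    b, d = x1.shape
    t = rng.uniform(size=(b, 1))       # t ~ U[0, 1]
    x0 = rng.normal(size=(b, d))       # x0 ~ p_init = N(0, I)
    xt = (1 - t) * x0 + t * x1         # point on the straight line
    target = x1 - x0                   # conditional target velocity
    return np.mean(np.sum((u_net(xt, t) - target) ** 2, axis=-1))

# For one-point "data" x1 = 0, the interpolant is x_t = (1 - t) x0, so the
# exact velocity at (x, t) is -x / (1 - t); the loss for this field is ~0.
x1 = np.zeros((4096, 1))
loss = cfm_loss(lambda x, t: -x / (1 - t), x1)
print(loss)  # ≈ 0
```

With a real dataset the minimizer is the marginal field of Section 4.6 rather than any single conditional field, but the degenerate one-point case makes the objective easy to sanity-check.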

7. Summary

Starting from the goal of generative modeling:

  1. A Unified Lens: Generation simulates $\mathrm{d}X_t = u_t^\theta \mathrm{d}t + \sigma_t \mathrm{d}W_t$. We train $u_t^\theta$ using MSE to match a theoretical $u_t^{\text{target}}$.
  2. Conditional and Marginal Paths: The first step to finding a target is defining a conditional probability path $p_t(X \mid z)$ that connects noise to single data points, which induces a marginal path $p_t(X)$.
  3. Deriving the Target: The target vector field can be derived directly from the chosen marginal path by appealing to the Continuity equation (ODE) or Fokker-Planck equation (SDE).
  4. Diffusion Models: Construct a conditional path via a noising SDE, so the theoretical target vector field depends on the score $\nabla \log p_t$. Denoising Score Matching makes computing this tractable.
  5. Flow Matching: Construct conditional paths using straight-line interpolations. The theoretical target velocity is just $X_1 - X_0$. Conditional Flow Matching makes predicting this tractable.

Both represent elegant mathematical ways to decompose the seemingly impossible problem of mapping pure noise into complex data distributions.


References

  • Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for generative modeling. ICLR.
  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. NeurIPS.
  • Song, Y., Sohl-Dickstein, J., Kingma, D., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. ICLR.
  • Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., Wolf, G., & Bengio, Y. (2024). Improving and generalizing flow-based generative models with minibatch optimal transport. TMLR.