Generative Models
It is a natural proposition that the data we observe, $x$, belongs to an underlying distribution, $p(x)$, that underlies all possible observations. Learning this distribution is a fundamental problem in generative modelling. In certain data landscapes, we may be able to derive exact expressions for $p(x)$ from physics or prior knowledge; take for example the position of a pendulum during oscillation, an arbitrary Gaussian distribution, or the half-life of a radioactive isotope. In other cases, data distributions are far too complex to derive analytically. The goal of generative models is to learn the distribution of the data, $p(x)$, from a dataset, $X = \{x_1, \ldots, x_N\}$, such that we can sample new data points, $x \sim p(x)$, that are indistinguishable from the original dataset. This is a non-trivial task, as the space of possible data distributions is vast and the data we observe is finite. This is especially so in higher-dimensional spaces, such as images, audio, or text.
Latent Variable Models
Perhaps we imagine that the data we observe is generated by some underlying process that is not directly observable. Such a process may be described by a "latent" variable, $z$, that is not directly observed but is related to the data, perhaps in a more compressed, abstract, or semantically meaningful form. Following this idea, latent variable models seek lower-dimensional representations of the data on which prior knowledge or structure can be imposed, making the task of learning an underlying distribution more tractable.
VAEs and the ELBO
Kingma et al. showed that a data distribution can be learned by maximizing the evidence lower bound (ELBO), a lower bound on the log-likelihood of the data. The joint distribution of the data and latent variables is given by $p(\bm{x}, \bm{z})$, where $\bm{x}$ is the data and $\bm{z}$ is the latent variable. We can express the marginal likelihood of the data, $p(\bm{x})$, in two forms:
\(\begin{equation} p(\bm{x}) = \int p(\bm{x}, \bm{z}) \, d\bm{z} \end{equation}\) We wish to maximize the log-likelihood of the data, $\log p(\bm{x})$. In this form it is not readily computable, as all the latent variables must be integrated out, which is intractable. \(\begin{equation} p(\bm{x}) = \frac{p(\bm{x},\bm{z})}{p(\bm{z}|\bm{x})} \end{equation}\) In this formulation, we require the posterior $p(\bm{z}|\bm{x})$, which is also intractable. We therefore turn to the evidence lower bound (ELBO), a lower bound on the log-likelihood of the data, $\log p(\bm{x})$: \(\begin{aligned} \log p(\bm{x}) &= \int q_\phi(\bm{z}|\bm{x}) \log p(\bm{x}) d\bm{z} \quad \left( \int q_\phi(\bm{z}|\bm{x}) d\bm{z} = 1 \right) \\ &= \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log p(\bm{x}) \right] \quad \text{(Definition of Expectation)} \\ &= \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log \frac{p(\bm{x},\bm{z})}{p(\bm{z}|\bm{x})} \right] \quad \text{(Chain Rule of Probability)} \\ &= \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log \frac{p(\bm{x},\bm{z})}{q_\phi(\bm{z}|\bm{x})} \frac{q_\phi(\bm{z}|\bm{x})}{p(\bm{z}|\bm{x})} \right] \quad \text{(Multiplying and dividing by $q_\phi(\bm{z}|\bm{x})$)} \\ &= \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log \frac{p(\bm{x},\bm{z})}{q_\phi(\bm{z}|\bm{x})} \right] + \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log \frac{q_\phi(\bm{z}|\bm{x})}{p(\bm{z}|\bm{x})} \right] \quad \text{(Linearity of Expectation)} \\ &= \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log \frac{p(\bm{x},\bm{z})}{q_\phi(\bm{z}|\bm{x})} \right] + D_{KL} \left( q_\phi(\bm{z}|\bm{x}) || p(\bm{z}|\bm{x}) \right) \quad \text{(Definition of KL Divergence)} \\ &\geq \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log \frac{p(\bm{x},\bm{z})}{q_\phi(\bm{z}|\bm{x})} \right] \quad \text{($D_{KL} \geq 0$)} \end{aligned}\) We do not have access to the true posterior $p(\bm{z}|\bm{x})$, so we instead optimize the ELBO to obtain a lower bound on the log-likelihood of the data. Since the log-likelihood of the data is constant with respect to $\phi$, maximizing the ELBO simultaneously minimizes the KL divergence between the approximate posterior $q_\phi(\bm{z}|\bm{x})$ and the true posterior $p(\bm{z}|\bm{x})$.
We can dissect the ELBO term further to understand exactly what we are optimizing: \(\begin{aligned} \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log \frac{p(\bm{x},\bm{z})}{q_\phi(\bm{z}|\bm{x})} \right] &= \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log \frac{p_\theta(\bm{x}|\bm{z}) p(\bm{z})}{q_\phi(\bm{z}|\bm{x})} \right] && \text{(Chain Rule of Probability)} \\ &= \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log p_\theta(\bm{x}|\bm{z}) + \log \frac{p(\bm{z})}{q_\phi(\bm{z}|\bm{x})} \right] && \text{(Logarithm Rule)} \\ &= \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log p_\theta(\bm{x}|\bm{z}) \right] + \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log \frac{p(\bm{z})}{q_\phi(\bm{z}|\bm{x})} \right] && \text{(Linearity of Expectation)} \\ &= \underbrace{\mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log p_\theta(\bm{x}|\bm{z}) \right]}_{\text{reconstruction term}} - \underbrace{D_{KL} \left( q_\phi(\bm{z}|\bm{x}) || p(\bm{z}) \right)}_{\text{prior matching term}} && \text{(Definition of KL)} \end{aligned}\) We wish to maximize the reconstruction term while keeping the prior matching term small, though not driving it all the way to zero, as the latent variables must retain information about the data. The decoder network, $p_\theta(\bm{x}|\bm{z})$, is responsible for the reconstruction term, while the encoder network, $q_\phi(\bm{z}|\bm{x})$, is responsible for the prior matching term. The prior matching term acts as a regularizer, preventing the approximate posterior from drifting arbitrarily far from the prior. The prior is often chosen to be a standard normal distribution, while the approximate posterior is parameterized by a neural network that estimates a Gaussian distribution with diagonal covariance. \(\begin{aligned} q_\phi(\bm{z}|\bm{x}) &= \mathcal{N}(\bm{z}; \bm{\mu}_\phi(\bm{x}), \bm{\sigma}^2_\phi(\bm{x}) \bm{\text{I}}) \\ p(\bm{z}) &= \mathcal{N}(\bm{z}; \bm{0}, \bm{\text{I}}) \end{aligned}\) The KL divergence between two Gaussians, in this case $q_\phi(\bm{z}|\bm{x})$ and $p(\bm{z})$, can be computed in closed form: \(\begin{aligned} D_{KL} \left( q_\phi(\bm{z}|\bm{x}) || p(\bm{z}) \right) &= \frac{1}{2} \left[ \log \frac{|\bm{\Sigma}_p|}{|\bm{\Sigma}_q|} - d + \text{tr}(\bm{\Sigma}_p^{-1}\bm{\Sigma}_q) + (\bm{\mu}_p - \bm{\mu}_q)^\top \bm{\Sigma}_p^{-1} (\bm{\mu}_p - \bm{\mu}_q) \right] \end{aligned}\) While the reconstruction term must be estimated via sampling: \(\begin{aligned} \mathbb{E}_{q_\phi(\bm{z}|\bm{x})} \left[ \log p_\theta(\bm{x}|\bm{z}) \right] &\approx \frac{1}{L} \sum_{l=1}^L \log p_\theta(\bm{x}|\bm{z}^{(l)}) \\ \bm{z}^{(l)} &\sim q_\phi(\bm{z}|\bm{x}) \end{aligned}\) However, sampling $\bm{z}$ directly from $q_\phi(\bm{z}|\bm{x})$ is a stochastic operation through which gradients cannot be backpropagated. We can use the reparameterization trick to make the sampling operation differentiable: \(\begin{aligned} \bm{z} &= \bm{\mu}_\phi(\bm{x}) + \bm{\sigma}_\phi(\bm{x}) \odot \bm{\epsilon} \\ \bm{\epsilon} &\sim \mathcal{N}(\bm{\epsilon}; \bm{0}, \bm{\text{I}}) \end{aligned}\)
Under this formulation, gradients can be backpropagated through the encoder network, which is essential for training. Together, the encoder and decoder networks form a variational autoencoder (VAE). Note that the derivation seen here is an adaptation of the original derivation by Kingma et al. and a simplification of its representation by Luo et al.
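To make the pieces concrete, below is a minimal sketch of a VAE training objective in PyTorch. It assumes an MLP encoder/decoder, a Bernoulli likelihood for the reconstruction term, and a single-sample ($L = 1$) Monte Carlo estimate; all names, dimensions, and architecture choices are illustrative rather than prescribed by the derivation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    # Illustrative architecture; dimensions are arbitrary assumptions.
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # mu_phi(x)
        self.logvar = nn.Linear(h_dim, z_dim)   # log sigma^2_phi(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow through mu and sigma into the encoder.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def negative_elbo(x, logits, mu, logvar):
    # Reconstruction term: log p_theta(x|z), one-sample MC estimate (L = 1),
    # under an assumed Bernoulli likelihood over x in [0, 1].
    recon = -F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Prior matching term: closed-form KL between N(mu, sigma^2 I) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return -(recon - kl)  # negative ELBO, to be minimized
```

Note how the closed-form KL and the reparameterized sample correspond directly to the prior matching and reconstruction terms above.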
Extending the ELBO for an HVAE
A hierarchical VAE (HVAE) stacks multiple VAEs on top of one another, such that latent variables are themselves generated from higher-level latent variables. The ELBO extends naturally to such models. The joint distribution of the data and latent variables is given by: \(\begin{aligned} p(\bm{x}, \bm{z}_{1:T}) &= p(\bm{x}, \bm{z}_1, \bm{z}_2, \ldots, \bm{z}_T) && &&\\ &= p(\bm{z}_T) p_\theta(\bm{x}|\bm{z}_1) \prod_{t=2}^T p_\theta(\bm{z}_{t-1}|\bm{z}_t) && \text{(Chain rule of probability)} \end{aligned}\) The posterior is given by: \(\begin{aligned} q_\phi(\bm{z}_{1:T}|\bm{x}) &= q_\phi(\bm{z}_1|\bm{x}) \prod_{t=2}^T q_\phi(\bm{z}_t|\bm{z}_{t-1}) \end{aligned}\) We can then derive the ELBO for an HVAE: \(\begin{aligned} \log p(\bm{x}) &= \log \int p(\bm{x}, \bm{z}_{1:T}) \, d\bm{z}_{1:T} && \text{(Marginalization)} && \\ &= \log \int p(\bm{x}, \bm{z}_{1:T}) \frac{q_\phi(\bm{z}_{1:T}|\bm{x})}{q_\phi(\bm{z}_{1:T}|\bm{x})} \, d\bm{z}_{1:T} && \text{$\left(\frac{q_\phi(\bm{z}_{1:T}|\bm{x})}{q_\phi(\bm{z}_{1:T}|\bm{x})}=1 \right)$} && \\ &= \log \mathbb{E}_{q_\phi(\bm{z}_{1:T}|\bm{x})} \left[ \frac{p(\bm{x}, \bm{z}_{1:T})}{q_\phi(\bm{z}_{1:T}|\bm{x})} \right] && \text{(Expectation)} && \\ &\geq \mathbb{E}_{q_\phi(\bm{z}_{1:T}|\bm{x})} \left[ \log \frac{p(\bm{x}, \bm{z}_{1:T})}{q_\phi(\bm{z}_{1:T}|\bm{x})} \right] && \text{(Jensen's Inequality)} \end{aligned}\) The single-latent ELBO can be derived in this same manner using Jensen's Inequality. In fact, we already have: the KL divergence term in the earlier derivation is exactly the variational gap that Jensen's Inequality discards.
Applying the ELBO to a Markovian HVAE
Consider a restricted class of HVAEs subject to three conditions. For notation, all variables are denoted by $\bm{x}_{0:T}$, where $\bm{x}_0$ is the original data and $\bm{x}_{1:T}$ are the latent variables.
- All latent spaces are of the same dimensionality.
- The forward posterior is a pre-defined Markovian Gaussian process, where $q(\bm{x}_t | \bm{x}_{t-1}, \bm{x}_0) = q(\bm{x}_t | \bm{x}_{t-1})$.
- The prior, $p(\bm{x}_T)$, is a standard normal distribution.
These restrictions allow us to re-formulate the joint and posterior distributions: \(\begin{aligned} p(\bm{x}_{0:T}) &= p(\bm{x}_T) \prod_{t=1}^T p_\theta(\bm{x}_{t-1}|\bm{x}_t) \\ q(\bm{x}_{1:T}|\bm{x}_0) &= \prod_{t=1}^T q(\bm{x}_t|\bm{x}_{t-1}) \end{aligned}\) Given a properly formulated forward Gaussian process, each $\bm{x}_t \sim q(\bm{x}_t|\bm{x}_{t-1})$ is a Gaussian centered at (a scaled) $\bm{x}_{t-1}$. This carries through from the original data point, $\bm{x}_0$, as a linear combination of independent Gaussian variables is also Gaussian (the means add, and the variances add).
Given that the forward process has the Markov property, we can express the forward posterior using Bayes' Rule: \(\begin{aligned} q(\bm{x}_t|\bm{x}_{t-1}) = q(\bm{x}_t|\bm{x}_{t-1}, \bm{x}_0) = \frac{q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)}{q(\bm{x}_{t-1}|\bm{x}_0)} q(\bm{x}_t|\bm{x}_0) \end{aligned}\) Armed with these relations, we can move to define the ELBO as proposed by Sohl-Dickstein et al. \(\begin{aligned} \log p(\bm{x}_0) &= \log \int p(\bm{x}_{0:T}) \, d\bm{x}_{1:T} && \text{(Marginalization)} && \\ &= \log \int p(\bm{x}_{0:T}) \frac{q(\bm{x}_{1:T}|\bm{x}_0)}{q(\bm{x}_{1:T}|\bm{x}_0)} \, d\bm{x}_{1:T} && \text{$\left(\frac{q(\bm{x}_{1:T}|\bm{x}_0)}{q(\bm{x}_{1:T}|\bm{x}_0)}=1 \right)$} && \\ &= \log \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \frac{p(\bm{x}_{0:T})}{q(\bm{x}_{1:T}|\bm{x}_0)} \right] && \text{(Expectation)} && \\ &\geq \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_{0:T})}{q(\bm{x}_{1:T}|\bm{x}_0)} \right] && \text{(Jensen's Inequality)} && \\ &= \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_T) \prod_{t=1}^T p_\theta(\bm{x}_{t-1}|\bm{x}_{t})}{\prod_{t=1}^T q(\bm{x}_{t}|\bm{x}_{t-1})} \right] && \text{(Joint and posterior)} && \\ &= \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_T) p_\theta(\bm{x}_0|\bm{x}_1) \prod_{t=2}^T p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{q(\bm{x}_1|\bm{x}_0) \prod_{t=2}^T q(\bm{x}_t|\bm{x}_{t-1})} \right] && \text{(Change of bounds)} && \\ &= \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_T) p_\theta(\bm{x}_0|\bm{x}_1) \prod_{t=2}^T p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{q(\bm{x}_1|\bm{x}_0) \prod_{t=2}^T q(\bm{x}_t|\bm{x}_{t-1}, \bm{x}_0)} \right] && \text{(Markov property)} && \\ &= \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_T) p_\theta(\bm{x}_0|\bm{x}_1)}{q(\bm{x}_1|\bm{x}_0)} + \sum_{t=2}^T \log \frac{p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{q(\bm{x}_t|\bm{x}_{t-1}, \bm{x}_0)} \right] && \text{(Logarithm Rule)} && \\ &= \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_T) p_\theta(\bm{x}_0|\bm{x}_1)}{q(\bm{x}_1|\bm{x}_0)} + \sum_{t=2}^T \log \frac{p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{\frac{q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)q(\bm{x}_t|\bm{x}_0)}{q(\bm{x}_{t-1}|\bm{x}_0)}} \right] && \text{(Bayes' Rule)} && \\ &= \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_T) p_\theta(\bm{x}_0|\bm{x}_1)}{q(\bm{x}_1|\bm{x}_0)} + \log \frac{q(\bm{x}_1|\bm{x}_0)}{q(\bm{x}_T|\bm{x}_0)} + \sum_{t=2}^T \log \frac{p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)} \right] && \text{(Telescoping product)} && \\ &= \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_T) p_\theta(\bm{x}_0|\bm{x}_1) q(\bm{x}_1|\bm{x}_0)}{q(\bm{x}_1|\bm{x}_0) q(\bm{x}_T|\bm{x}_0)} + \sum_{t=2}^T \log \frac{p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)} \right] && \text{(Logarithm Rule)} && \\ &= \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_T) p_\theta(\bm{x}_0|\bm{x}_1)}{q(\bm{x}_T|\bm{x}_0)} + \sum_{t=2}^T \log \frac{p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)} \right] && \text{(Cancel terms)} && \\ &= \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_0)} \left[ \log p_\theta(\bm{x}_0|\bm{x}_1) + \log \frac{p(\bm{x}_T)}{q(\bm{x}_T|\bm{x}_0)} + \sum_{t=2}^T \log \frac{p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)} \right] && \text{(Logarithm Rule)} && \\ &= \underbrace{\mathbb{E}_{q(\bm{x}_{1}|\bm{x}_0)} \left[ \log p_\theta(\bm{x}_0|\bm{x}_1) \right]}_{\mathcal{L}_0} - \underbrace{D_{KL} \left( q(\bm{x}_T | \bm{x}_0) \,||\, p(\bm{x}_T) \right)}_{\mathcal{L}_{\text{prior}}} - \sum_{t=2}^T \underbrace{\mathbb{E}_{q(\bm{x}_t|\bm{x}_0)} \left[ D_{KL} \left( q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0) \,||\, p_\theta(\bm{x}_{t-1}|\bm{x}_t) \right) \right]}_{\mathcal{L}_T} && \text{(Definition of KL)} && \end{aligned}\)
The ELBO for such an HVAE has three interpretable terms:
- The first term is the reconstruction term: \(\begin{aligned} \mathcal{L}_0 = \mathbb{E}_{q(\bm{x}_{1}|\bm{x}_0)} \left[ \log p_\theta(\bm{x}_0|\bm{x}_1) \right] \end{aligned}\) This is analogous to the reconstruction term in a standard VAE.
- The second term is the prior matching term: \(\begin{aligned} \mathcal{L}_{\text{prior}} = \mathbb{E}_{q(\bm{x}_{T}|\bm{x}_0)} \left[ \log \frac{p(\bm{x}_T)}{q(\bm{x}_T|\bm{x}_0)} \right] \end{aligned}\) Since we have full control of the forward process, the noise schedule can be chosen such that $q(\bm{x}_T|\bm{x}_0)$ is (approximately) a standard normal distribution. This term is then (approximately) zero, and it contains no trainable parameters.
- The third term arises from the Markovian formulation: \(\begin{aligned} \mathcal{L}_{T} &= \sum_{t=2}^T \mathbb{E}_{q(\bm{x}_{t-1}, \bm{x}_{t}|\bm{x}_0)} \left[ \log \frac{p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)} \right] \\ &= \sum_{t=2}^T \mathbb{E}_{q(\bm{x}_{t}|\bm{x}_0)} \mathbb{E}_{q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)} \left[ \log \frac{p_\theta(\bm{x}_{t-1}|\bm{x}_t)}{q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)} \right] \\ &= -\sum_{t=2}^T \mathbb{E}_{q(\bm{x}_{t}|\bm{x}_0)} \left[ D_{KL} \left( q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0) || p_\theta(\bm{x}_{t-1}|\bm{x}_t) \right) \right] && \end{aligned}\) This is a denoising matching term. The desired denoising step, $p_\theta(\bm{x}_{t-1}|\bm{x}_t)$, is learned as an approximation of the tractable forward-process posterior, $q(\bm{x}_{t-1}|\bm{x}_t, \bm{x}_0)$. Minimizing these KL divergence terms maximizes the ELBO, and with it the lower bound on the log-likelihood of the data.
Variational Diffusion Models
A variational diffusion model (VDM) is a variant of a Markovian HVAE whose forward process is a diffusion process. Following Kingma et al., we change notation: the original data is denoted by $\bm{x}$ and the latent variables by $\bm{z}_t$. The forward process is defined as: \(\begin{aligned} \bm{z}_t = \alpha_t \bm{x} + \sigma_t \bm{\epsilon}; \quad \bm{\epsilon} \sim \mathcal{N}(\bm{\epsilon}; \bm{0}, \bm{\text{I}}) \end{aligned}\) It becomes clear why the forward process is chosen to be a Gaussian process once we recognize that we want $\mathcal{L}_{T}$ to be tractable. We know the resulting latent spaces will be Gaussians stemming from $\bm{x}$, and we can use Bayes' Rule to derive the quantities that make $\mathcal{L}_T$ a tractable objective via the closed-form KL divergence. A minimal sketch of this forward process is given below.
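The sketch below samples $\bm{z}_t$ directly from $\bm{x}$. It assumes the variance-preserving constraint introduced in the next subsection ($\alpha_t^2 + \sigma_t^2 = 1$), under which $\alpha_t^2 = \text{sigmoid}(\lambda_t)$ and $\sigma_t^2 = \text{sigmoid}(-\lambda_t)$ follow from the definition of the log-SNR; `logsnr_fn` is a placeholder for a chosen schedule.

```python
import torch

def forward_process(x, t, logsnr_fn):
    # lambda_t = log(alpha_t^2 / sigma_t^2); with alpha_t^2 + sigma_t^2 = 1,
    # alpha_t^2 = sigmoid(lambda_t) and sigma_t^2 = sigmoid(-lambda_t).
    lam = logsnr_fn(t)
    dims = (-1,) + (1,) * (x.dim() - 1)  # broadcast per-example scalars
    alpha = torch.sigmoid(lam).sqrt().view(dims)
    sigma = torch.sigmoid(-lam).sqrt().view(dims)
    eps = torch.randn_like(x)
    return alpha * x + sigma * eps, eps  # z_t = alpha_t * x + sigma_t * eps
```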
Noise Schedule and SNR
In the above Gaussian noising process, $\alpha_t$ and $\sigma_t$ are strictly positive scalar-valued functions of $t \in [0,1]$. The log-SNR is additionally defined as $\lambda_t = \log \frac{\alpha_t^2}{\sigma_t^2}$, and is strictly monotonically decreasing in time, such that $\lambda_{\max}$ occurs at $t=0$ and $\lambda_{\min}$ occurs at $t=1$, where $\bm{z}_{t=1} \approx \mathcal{N}(\bm{0}, \bm{\text{I}})$.
The forward process (or the destruction of $\bm{x}$) is commonly defined to be variance-preserving, imposing the constraint $\alpha_t^2 + \sigma_t^2 = 1$. This is a common choice, but not strictly necessary. Kingma et al. show that for any two timesteps $s$ and $t$, where $0 \leq s < t \leq 1$, the forward process $q(\bm{z}_t|\bm{z}_s)$ is a Gaussian distribution: \(\begin{aligned} q(\bm{z}_t|\bm{z}_s) = \mathcal{N}(\bm{z}_t; \alpha_{ts} \bm{z}_s, \sigma_{ts}^2 \bm{\text{I}}) \end{aligned}\) Where: \(\begin{aligned} \alpha_{ts} &= \frac{\alpha_t}{\alpha_s} \\ \sigma_{ts}^2 &= \sigma_t^2 - \alpha_{ts}^2 \sigma_s^2 \end{aligned}\) Due to the Markov property, $q(\bm{z}_t|\bm{z}_s) = q(\bm{z}_t|\bm{z}_s, \bm{x})$. We can then use Bayes' Rule to derive the posterior $q(\bm{z}_s|\bm{z}_t, \bm{x})$: \(\begin{aligned} q(\bm{z}_s|\bm{z}_t, \bm{x}) &\propto q(\bm{z}_s|\bm{x}) \, q(\bm{z}_t|\bm{z}_s) \end{aligned}\) Given a prior, $p(x) = \mathcal{N}(\mu_A, \sigma_A^2)$, and a forward process, $p(y|x) = \mathcal{N}(ax, \sigma_B^2)$, the general form of the posterior, $p(x|y)$, is: \(\begin{aligned} p(x|y) &= \mathcal{N} (\tilde{\mu}, \tilde{\sigma}^2) \\ \tilde{\sigma}^{-2} &= \sigma_A^{-2} + a^2 \sigma_B^{-2} \\ \tilde{\mu} &= \tilde{\sigma}^{2} (\sigma_A^{-2} \mu_A + a \sigma_B^{-2} y) \end{aligned}\) Using $q(\bm{z}_s|\bm{x})$ as the prior and $q(\bm{z}_t|\bm{z}_s)$ as the forward process, we can derive the posterior $q(\bm{z}_s|\bm{z}_t, \bm{x})$: \(\begin{aligned} q(\bm{z}_s|\bm{z}_t, \bm{x}) &= \mathcal{N} (\mu_Q(\bm{z}_t, \bm{x}; s, t), \sigma_Q^2(s, t) \bm{\text{I}}) \end{aligned}\) Where: \(\begin{aligned} \sigma_Q^2(s, t) &= \sigma_{ts}^2 \sigma_s^2 / \sigma_t^2 \\ \mu_Q(\bm{z}_t, \bm{x}; s, t) &= \frac{\alpha_{ts} \sigma_s^2}{\sigma_t^2} \bm{z}_t + \frac{\alpha_s \sigma_{ts}^2}{\sigma_t^2} \bm{x} \end{aligned}\)
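These closed-form posterior parameters transcribe directly to code. The sketch below mirrors the expressions above; the argument names are ours.

```python
def posterior_params(z_t, x, alpha_s, sigma_s, alpha_t, sigma_t):
    """Mean and variance of q(z_s | z_t, x) for 0 <= s < t <= 1.

    Direct transcription of the closed-form expressions in the text; the
    arguments mirror alpha_s, sigma_s, alpha_t, sigma_t.
    """
    alpha_ts = alpha_t / alpha_s
    sigma2_ts = sigma_t**2 - alpha_ts**2 * sigma_s**2   # sigma_{ts}^2
    var_q = sigma2_ts * sigma_s**2 / sigma_t**2          # sigma_Q^2(s, t)
    mu_q = (alpha_ts * sigma_s**2 / sigma_t**2) * z_t \
         + (alpha_s * sigma2_ts / sigma_t**2) * x        # mu_Q(z_t, x; s, t)
    return mu_q, var_q
```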
The Denoising Matching Term
We revise the denoising matching term with the notation adopted by Kingma et al. We define the denoising matching term as: \(\begin{aligned} \mathcal{L}_{T} &= -\sum_{i=1}^T \mathbb{E}_{q(\bm{z}_{t}|\bm{x})} \left[ D_{KL} \left( q(\bm{z}_{s}|\bm{z}_t, \bm{x}) || p_\theta(\bm{z}_{s}|\bm{z}_t) \right) \right] \end{aligned}\) Given that $p_\theta(\bm{z}_{s}|\bm{z}_t)$ and $q(\bm{z}_{s}|\bm{z}_t, \bm{x})$ are Gaussian distributions with equal variances, the KL divergence can be computed in closed form: \(\begin{aligned} D_{KL}(q(\bm{z}_s|\bm{z}_t, \bm{x}) || p_\theta(\bm{z}_s|\bm{z}_t)) &= \frac{1}{2} \left[ \log \frac{|\bm{\Sigma}_p|}{|\bm{\Sigma}_q|} - d + \text{tr}(\bm{\Sigma}_q^{-1}\bm{\Sigma}_p) + (\bm{\mu}_p - \bm{\mu}_q)^\top \bm{\Sigma}_q^{-1} (\bm{\mu}_p - \bm{\mu}_q) \right] \\ &= \frac{1}{2} \left[ \log 1 - d + d + (\bm{\mu}_p - \bm{\mu}_q)^\top \bm{\Sigma}_q^{-1} (\bm{\mu}_p - \bm{\mu}_q) \right] \\ &= \frac{1}{2} \left[(\bm{\mu}_p - \bm{\mu}_q)^\top \left(\sigma_Q^{2}(s,t) \bm{I}\right)^{-1} (\bm{\mu}_p - \bm{\mu}_q) \right] \\ &= \frac{1}{2 \sigma_Q^{2}(s,t)} \left[ ||\bm{\mu}_Q - \bm{\mu}_p||_2^2 \right] \end{aligned}\) We parameterize $\bm{\mu}_p$ as the output of a neural network: $\bm{\mu}_\theta(\bm{z}_t; t)$ takes the same functional form as $\bm{\mu}_Q$, with a learned estimate $\bm{x}_\theta(\bm{z}_t; t)$ in place of $\bm{x}$. Using our derived expressions for $\sigma_Q^{2}(s,t)$ and $\mu_Q(\bm{z}_t, \bm{x}; s, t)$, we can progress with the derivation: \(\begin{aligned} D_{KL}(q(\bm{z}_s|\bm{z}_t, \bm{x}) || p_\theta(\bm{z}_s|\bm{z}_t)) &= \frac{1}{2 \sigma_Q^{2}(s,t)} \left[ ||\bm{\mu}_Q - \bm{\mu}_{\theta}(\bm{z}_t; t)||_2^2 \right] \\ &= \frac{\sigma_t^2}{2 \sigma_{ts}^2 \sigma_s^2} \frac{\alpha_s^2 \sigma_{ts}^4}{\sigma_t^4} ||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 \\ &= \frac{1}{2 \sigma_{s}^2} \frac{\alpha_s^2 \sigma_{ts}^2}{\sigma_t^2} ||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 \\ &= \frac{1}{2 \sigma_{s}^2} \frac{\alpha_s^2 (\sigma_t^2 - \alpha_{ts}^2 \sigma_s^2)}{\sigma_t^2} ||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 \\ &= \frac{1}{2} \left(\frac{\alpha_s^2}{\sigma_s^2} - \frac{\alpha_t^2}{\sigma_t^2} \right) ||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 \\ &= \frac{1}{2} ( \text{SNR}(s) - \text{SNR}(t) ) ||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 \end{aligned}\) Kingma et al. showed that the KL divergence term is thus a function of the signal-to-noise ratio (SNR) of the forward process and the output of a denoising network. Reparameterizing $\bm{z}_t \sim q(\bm{z}_t|\bm{x})$ as $\bm{z}_t = \alpha_t \bm{x} + \sigma_t \bm{\epsilon}$, the denoising matching term can be re-written as an expectation over $\bm{\epsilon}$: \(\begin{aligned} \mathcal{L}_{T} = - \frac{1}{2} \sum_{i=1}^T \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I})} ( \text{SNR}(s) - \text{SNR}(t) ) ||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 \end{aligned}\) To avoid computing the loss for all $T$ timesteps, we can use a single uniformly sampled timestep to approximate the full sum: \(\begin{aligned} \mathcal{L}_{T} = - \frac{T}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I}), i \sim U\{1,T\}} ( \text{SNR}(s) - \text{SNR}(t) ) ||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 \end{aligned}\) Where $U\{1,T\}$ is a uniform distribution over the integers from 1 to $T$, $s = (i-1)/T$, and $t = i/T$. This is an unbiased Monte Carlo estimator of the full loss, sketched in code below.
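A one-sample sketch of this estimator follows, in the $\bm{x}$-prediction form. It again assumes the variance-preserving schedule, and `x_pred_fn` is a hypothetical denoising network $\bm{x}_\theta(\bm{z}_t; t)$; averaging over data dimensions (rather than summing) is a common scaling choice, not part of the derivation.

```python
import torch

def vdm_loss_discrete(x, x_pred_fn, logsnr_fn, T=1000):
    # Sample a timestep index i ~ U{1, ..., T}; s = (i-1)/T, t = i/T.
    i = torch.randint(1, T + 1, (x.shape[0],), device=x.device)
    s, t = (i - 1) / T, i / T
    lam_s, lam_t = logsnr_fn(s), logsnr_fn(t)
    # Variance-preserving: alpha^2 = sigmoid(lambda), sigma^2 = sigmoid(-lambda).
    dims = (-1,) + (1,) * (x.dim() - 1)
    alpha_t = torch.sigmoid(lam_t).sqrt().view(dims)
    sigma_t = torch.sigmoid(-lam_t).sqrt().view(dims)
    eps = torch.randn_like(x)
    z_t = alpha_t * x + sigma_t * eps
    # Per-example weight SNR(s) - SNR(t); the estimator scales by T/2.
    weight = (lam_s.exp() - lam_t.exp()).view(dims)
    sq_err = (x - x_pred_fn(z_t, t)) ** 2
    return 0.5 * T * (weight * sq_err).mean()  # negated L_T, to be minimized
```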
$\mathcal{L}_\infty$: The Infinite Step Limit
As $T \to \infty$, the diffusion steps become infinitesimally small, and the forward process becomes a continuous-time diffusion process. Expressing $\mathcal{L}_T$ as a function of $\tau = 1/T$: \(\begin{aligned} \mathcal{L}_{T} = - \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I}), t \sim U(0,1)} \left( \frac{\text{SNR}(t-\tau) - \text{SNR}(t)}{\tau} \right) ||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 \end{aligned}\) The infinite step limit becomes: \(\begin{aligned} \mathcal{L}_{\infty} = - \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I}), t \sim U(0,1)} \left( -\frac{d \text{SNR}(t)}{dt} \right) ||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 \end{aligned}\) We can easily interchange this denoising loss with that of a noise-prediction network, which is common practice in diffusion models. It is readily derived that $\text{SNR}(t)||\bm{x} - \bm{x}_\theta(\bm{z}_t; t)||_2^2 = ||\bm{\epsilon} - \bm{\epsilon}_\theta(\bm{z}_t; t)||_2^2$, so the denoising loss can be written as: \(\begin{aligned} \mathcal{L}_{\infty} = - \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I}), t \sim U(0,1)} \left( -\frac{d \text{SNR}(t)}{dt} \right) \frac{||\bm{\epsilon} - \bm{\epsilon}_\theta(\bm{z}_t; t)||_2^2}{\text{SNR}(t)} \end{aligned}\) Recalling that $\lambda$ is the log-SNR, so $\text{SNR} = \exp(\lambda)$ and $\frac{d \exp(\lambda)}{d t} = \exp(\lambda) \, d\lambda / dt$, we can simplify the denoising matching term to: \(\begin{aligned} \mathcal{L}_{\infty} = - \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I}), t \sim U(0,1)} \left[ -\frac{d \lambda}{dt} ||\bm{\epsilon} - \bm{\epsilon}_\theta(\bm{z}_t; t)||_2^2 \right] \end{aligned}\) This term is from here on referred to as the VLB objective, and is the primary objective of a VDM.
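A one-sample sketch of the (negated) continuous-time VLB objective in its $\bm{\epsilon}$-prediction form is given below. Here `eps_pred_fn` and `dlogsnr_dt_fn` are hypothetical placeholders: a noise-prediction network $\bm{\epsilon}_\theta(\bm{z}_t; t)$ and the analytic derivative $d\lambda/dt$ of the chosen schedule.

```python
import torch

def vdm_loss_continuous(x, eps_pred_fn, logsnr_fn, dlogsnr_dt_fn):
    t = torch.rand(x.shape[0], device=x.device)      # t ~ U(0, 1)
    lam = logsnr_fn(t)
    dims = (-1,) + (1,) * (x.dim() - 1)
    alpha = torch.sigmoid(lam).sqrt().view(dims)     # variance-preserving
    sigma = torch.sigmoid(-lam).sqrt().view(dims)
    eps = torch.randn_like(x)
    z_t = alpha * x + sigma * eps
    weight = (-dlogsnr_dt_fn(t)).view(dims)          # -d lambda / dt > 0
    sq_err = (eps - eps_pred_fn(z_t, t)) ** 2
    return 0.5 * (weight * sq_err).mean()            # negated VLB, minimized
```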
The log-SNR and its Implications
To this point we have largely left unaddressed the choice of the noise schedule, which dictates the values of both $\alpha_t$ and $\sigma_t$. However, as shown by Kingma et al., the denoising matching term is a function of the log-SNR, $\lambda$. But what exactly is $\lambda$?
The term noise schedule is used loosely in the literature to refer to the choice of $\alpha_t$ and $\sigma_t$. Under the continuous-time VLB objective, however, the noise schedule is more precisely the choice of $\lambda$ itself. We assume $\lambda$ is an invertible function of $t$, such that $\lambda = f_\lambda (t)$. Kingma et al. showed that through a change of variables we can find a more meaningful interpretation of $d \lambda / dt$ as it stands in the VLB.
During model training, $t$ is uniformly sampled from $[0,1]$, which allows us to compute $\lambda = f_\lambda (t)$. This induces a distribution over noise levels, $p(\lambda)$, whose cumulative distribution function is $1 - f_\lambda^{-1}(\lambda)$ (since $\lambda$ decreases in $t$). The probability density function of $\lambda$ is the derivative of the CDF: $p(\lambda) = -\frac{d}{d\lambda} f_\lambda^{-1}(\lambda) = -dt/d\lambda$.
We can re-write the VLB as: \(\begin{aligned} \mathcal{L}_{\infty} = -\frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I}), \lambda \sim p(\lambda)} \left[ \frac{1}{p(\lambda)} ||\bm{\epsilon} - \bm{\epsilon}_\theta(\bm{z}_\lambda; \lambda)||_2^2 \right] \end{aligned}\) This form makes clear that the VLB objective is at the mercy of the chosen noise schedule, and it clarifies the role of $p(\lambda)$ as an importance sampling distribution.
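As a small numerical illustration of the change of variables (the linear log-SNR schedule here is our own assumption, chosen for simplicity), the density $p(\lambda) = -dt/d\lambda$ can be read off directly from $f_\lambda^{-1}$:

```python
import torch

# Assumed linear log-SNR schedule: lambda = f(t) = lam_max + (lam_min - lam_max) * t.
lam_max, lam_min = 10.0, -10.0

def f(t):            # lambda = f(t), strictly decreasing in t
    return lam_max + (lam_min - lam_max) * t

def f_inv(lam):      # t = f^{-1}(lambda)
    return (lam - lam_max) / (lam_min - lam_max)

# p(lambda) = -d f^{-1}(lambda) / dlambda = 1 / (lam_max - lam_min): a linear
# log-SNR schedule spreads training effort uniformly over noise levels.
lam = f(torch.rand(100_000))
print(lam.min().item(), lam.max().item())  # empirical support ~ [lam_min, lam_max]
```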
The Weighted Loss
Kingma et al. showed that the ELBO-derived VLB can optionally be weighted by a function $w(\lambda)$ of the log-SNR: \(\begin{aligned} \mathcal{L}_{w} = \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I}), \lambda \sim p(\lambda)} \left[ \frac{w(\lambda)}{p(\lambda)} ||\bm{\epsilon} - \bm{\epsilon}_\theta(\bm{z}_\lambda; \lambda)||_2^2 \right] \end{aligned}\) This representation unified a VDM field that had previously been divided: between discrete and continuous derivations, between choices of loss function, and between choices of noise schedule. The weighted loss shows that all previous progress in the VDM literature can be reduced to a choice of noise schedule $p(\lambda)$ and weighting function $w(\lambda)$.
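As a worked example (our illustration, consistent with the formula above): the widely used simple $\bm{\epsilon}$-prediction objective, which samples $t \sim U(0,1)$ and applies no explicit weighting, is recovered by choosing $w(\lambda) = p(\lambda)$, since the importance weight then cancels: \(\begin{aligned} \mathcal{L}_{w} \big|_{w(\lambda) = p(\lambda)} = \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I}), \lambda \sim p(\lambda)} \left[ ||\bm{\epsilon} - \bm{\epsilon}_\theta(\bm{z}_\lambda; \lambda)||_2^2 \right] = \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \text{I}), t \sim U(0,1)} \left[ ||\bm{\epsilon} - \bm{\epsilon}_\theta(\bm{z}_t; t)||_2^2 \right] \end{aligned}\) In this view, a heavier-tailed $p(\lambda)$ silently re-weights which noise levels the unweighted loss emphasizes.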
Sampling and Connections with SDEs
The forward Gaussian process has a time evolution described by an SDE: \(\begin{aligned} d \bm{z} = \underbrace{f(\bm{z}, t) dt}_{\text{drift}} + \underbrace{g(t) dw}_{\text{diffusion}} \end{aligned}\) Where $f(\bm{z}, t)$ is the drift term, $g(t)$ is a scalar diffusion term, and $dw$ is a Wiener process. Anderson showed that this SDE is reversible, with the reverse-time SDE given by: \(\begin{aligned} d \bm{z} = \left[ f(\bm{z}, t) - g^2(t) \bm{s}_\theta(\bm{z}; \lambda) \right] dt + g(t) d\bar{w} \end{aligned}\) Where $d\bar{w}$ is a reverse-time Wiener process. Song et al. additionally showed that the reverse SDE has a deterministic ODE counterpart: \(\begin{aligned} d \bm{z} = \left[ f(\bm{z}, t) - \frac{1}{2} g^2(t) \bm{s}_\theta(\bm{z}; \lambda) \right] dt \end{aligned}\) To this point, we have not yet seen $\bm{s}_\theta(\bm{z}; \lambda)$, which is an approximation of the score function of the data distribution, $\nabla_{\bm{z}_\lambda} \log p (\bm{z}_\lambda)$. There is a simple connection between the score function and the denoising network, $\bm{\epsilon}_\theta(\bm{z}; \lambda)$: they are proportional to each other: \(\begin{aligned} \bm{s}_\theta(\bm{z}; \lambda) = -\frac{\bm{\epsilon}_\theta(\bm{z}; \lambda)}{\sigma_\lambda} \end{aligned}\) A more complete derivation of the score function is given in Appendix E.2. Additionally, $f(\bm{z}, t)$ and $g(t)$ are derivable given a variance-preserving forward process ($\alpha_t^2 + \sigma_t^2 = 1$). These terms are given by (as derived by Ho and Salimans): \(\begin{aligned} f(\bm{z}, t) &= \frac{1}{2} \sigma_t^2 \frac{d \lambda}{dt} \bm{z}_t \\ g^2(t) &= - \sigma_t^2 \frac{d \lambda}{dt} \end{aligned}\) Where the initial $\bm{z} \sim \mathcal{N} (\bm{0}, \bm{\text{I}})$. If we substitute these terms into the reverse ODE, we arrive at a reverse process that depends only on the denoising network output, $\bm{\epsilon}_\theta(\bm{z}; \lambda)$, and the noise schedule, $\lambda$: \(\begin{aligned} d \bm{z} &= \left[ \frac{1}{2} \sigma_t^2 \frac{d \lambda}{dt} \bm{z}_t + \frac{1}{2} \sigma_t^2 \frac{d \lambda}{dt} \bm{s}_\theta(\bm{z}; \lambda) \right] dt \\ &= \frac{1}{2} \sigma_\lambda^2 \left[ \bm{z}_\lambda + \bm{s}_\theta(\bm{z}; \lambda) \right] d\lambda \\ &= \frac{1}{2} \sigma_\lambda \left[ \sigma_\lambda \bm{z}_\lambda - \bm{\epsilon}_\theta(\bm{z}; \lambda) \right] d\lambda \end{aligned}\) At this point, off-the-shelf solvers can be used for sampling; a minimal Euler sketch is given below. A common and effective choice is the Heun sampler proposed by Karras et al., or stochastic variations of it. Kingma et al. observed that discretization errors are a result of the noise schedule, and thus in some cases using a different noise schedule for sampling than for training can improve the efficiency and quality of samples. Alternative drift and diffusion functions are used throughout the literature to varying degrees of success, though the variance-preserving forward process is the most common choice.
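The sketch below integrates the final ODE above with plain Euler steps over a grid of log-SNR values, from $\lambda_{\min}$ (pure noise) up to $\lambda_{\max}$ (data). It is an illustrative sketch rather than a tuned sampler; `eps_pred_fn` is the trained noise-prediction network, and the variance-preserving schedule is assumed.

```python
import torch

def ode_sample(eps_pred_fn, shape, lam_grid):
    # Euler integration of dz = 0.5 * sigma_lam * (sigma_lam * z - eps_theta) dlam,
    # where lam_grid is a 1-D tensor increasing from lambda_min to lambda_max.
    z = torch.randn(shape)  # z at lambda_min is approximately N(0, I)
    for i in range(len(lam_grid) - 1):
        lam = lam_grid[i]
        d_lam = lam_grid[i + 1] - lam
        sigma = torch.sigmoid(-lam).sqrt()  # variance-preserving schedule
        dz = 0.5 * sigma * (sigma * z - eps_pred_fn(z, lam)) * d_lam
        z = z + dz
    return z
```

A higher-order solver (e.g. Heun's method) replaces the single Euler slope with an average of slopes at both ends of each step, which is the refinement Karras et al. exploit.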
Elsewhere in the literature, the discrete-time DDPM sampler proposed by Ho et al. is frequently used. This sampler draws $\bm{z}_s \sim p_\theta(\bm{z}_s | \bm{z}_t)$, whose mean and variance come from the Bayes-rule posterior $q(\bm{z}_s|\bm{z}_t, \bm{x})$ with $\bm{x}$ replaced by the model's estimate, starting from $\bm{z}_T \sim \mathcal{N}(\bm{0}, \bm{\text{I}})$; a sketch of a single step is given below. Nichol and Dhariwal showed that strided reverse steps can be taken to shorten the sampling chain, with near-optimal results at 256 steps, a drastic reduction from the 1000 steps used in the original DDPM sampler.
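Below is a sketch of one ancestral (DDPM-style) reverse step, reusing the posterior parameters derived earlier; it assumes the variance-preserving schedule, and `eps_hat` denotes the network output $\bm{\epsilon}_\theta(\bm{z}_t; t)$.

```python
import torch

def ddpm_step(z_t, eps_hat, lam_s, lam_t):
    # Sample z_s from q(z_s | z_t, x) with x replaced by its estimate
    # x_hat = (z_t - sigma_t * eps_hat) / alpha_t.
    alpha_s, sigma_s = torch.sigmoid(lam_s).sqrt(), torch.sigmoid(-lam_s).sqrt()
    alpha_t, sigma_t = torch.sigmoid(lam_t).sqrt(), torch.sigmoid(-lam_t).sqrt()
    x_hat = (z_t - sigma_t * eps_hat) / alpha_t
    alpha_ts = alpha_t / alpha_s
    sigma2_ts = sigma_t**2 - alpha_ts**2 * sigma_s**2
    var_q = sigma2_ts * sigma_s**2 / sigma_t**2           # sigma_Q^2(s, t)
    mu_q = (alpha_ts * sigma_s**2 / sigma_t**2) * z_t \
         + (alpha_s * sigma2_ts / sigma_t**2) * x_hat      # mu_Q(z_t, x_hat; s, t)
    return mu_q + var_q.sqrt() * torch.randn_like(z_t)
```

Strided sampling then amounts to choosing a shorter sequence of $(s, t)$ pairs and applying this step repeatedly.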