Evidence lower bound

From Wikipedia, the free encyclopedia

In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.

Throughout this article, let $X$ and $Z$ be multivariate random variables, jointly-distributed with distribution $p$. So, for example, $p(X)$ is the marginal distribution of $X$, and $p(Z \mid X)$ is the conditional distribution of $Z$ given $X$. Then, for any distribution $q$, we have[1]

$$\ln p(X) \;\ge\; \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right].$$

The right-hand side of this inequality is called the evidence lower bound, or ELBO. We refer to the above inequality as the ELBO inequality.
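As a quick numerical check (a minimal sketch; the joint table and the choice of $q$ below are arbitrary assumptions made for illustration, not taken from the reference), the following Python snippet evaluates both sides of the ELBO inequality exactly for a small discrete model:

```python
import numpy as np

# A small discrete joint distribution p(x, z), with x in {0, 1} and z in {0, 1, 2}.
# Rows index x, columns index z; all entries sum to 1.
p_joint = np.array([[0.10, 0.25, 0.15],
                    [0.20, 0.05, 0.25]])

x = 0                                      # observed value of X
log_evidence = np.log(p_joint[x].sum())    # ln p(x), by direct marginalization over z

q = np.array([0.5, 0.3, 0.2])              # an arbitrary distribution over z

# ELBO = E_{Z~q}[ln p(x, Z) - ln q(Z)], here computed as an exact finite sum.
elbo = np.sum(q * (np.log(p_joint[x]) - np.log(q)))

print(f"ln p(x) = {log_evidence:.4f}")     # -0.6931
print(f"ELBO    = {elbo:.4f}")             # -0.9170, which is <= ln p(x)
```

Equality holds exactly when $q$ is the true posterior $p(Z \mid X = 0)$, which for this table is $(0.2, 0.5, 0.3)$.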

The ELBO comes up frequently in variational Bayesian methods. In that context, the random variable $X$ conceptually represents observable data, the variable $Z$ represents latent, unobservable data, and $p$ represents the true joint distribution of $X$ and $Z$. We often wish to find an approximation of the true posterior distribution $p(Z \mid X)$ via a simpler, usually parametric, distribution, and this is what $q$ conceptually represents. Finding $q$ can be framed as an optimization problem, and in that context the ELBO inequality is often used to obtain an optimization objective.

Terminology and notation

In the terminology of variational Bayesian methods, the distribution $p(X)$ is called the evidence. Because the log function is monotonic, the ELBO inequality can be rewritten to give a lower bound for the evidence, as

$$p(X) \;\ge\; \exp\!\left(\mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right]\right),$$

hence the name evidence lower bound. Some authors use the term evidence to mean $\ln p(X)$, in which case the original form of the inequality already gives a lower bound for the evidence. Some authors call $\ln p(X)$ the log-evidence, and some use the terms evidence and log-evidence interchangeably to refer either to $p(X)$ or to $\ln p(X)$.

The ELBO is sometimes denoted $\operatorname{ELBO}(q)$, $\mathcal{L}(q)$, or simply $\mathcal{L}$, as in

$$\mathcal{L}(q) \;=\; \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right].$$

Strictly speaking, the quantity $\mathcal{L}(q)$ defined this way is itself a random variable, jointly-distributed with $X$.

Relationship to entropy

The ELBO is closely related to the concepts of Shannon entropy, differential entropy, and cross-entropy. Abusing notation somewhat, we may write $\mathcal{L}(q) = H(q) - H(q, p)$, where $H(q)$ represents Shannon or differential entropy depending on whether our random variables are discrete or continuous, and $H(q, p)$ represents the cross-entropy between $q$ and $p$ as a function of $X$.
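Written out with the conventions above, this identity follows directly from the definition of the ELBO and of the two entropies:

$$\mathcal{L}(q) \;=\; \mathbb{E}_{Z \sim q}\!\left[\ln p(X, Z)\right] - \mathbb{E}_{Z \sim q}\!\left[\ln q(Z)\right] \;=\; -\,H(q, p) + H(q).$$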

Relationship to Kullback–Leibler divergence

The ELBO inequality can be derived as a consequence of the fact that KL-divergence is always non-negative. Observe that

$$\ln p(X) \;=\; \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right] + D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr),$$

where $D_{\mathrm{KL}}$ is Kullback–Leibler divergence. The desired inequality follows trivially from the above equation because $D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr) \ge 0$.
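The decomposition itself can be checked by expanding the joint distribution as $p(X, Z) = p(Z \mid X)\, p(X)$ inside the expectation:

$$\begin{aligned}
\mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right]
&= \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(Z \mid X)\, p(X)}{q(Z)}\right] \\
&= \ln p(X) - \mathbb{E}_{Z \sim q}\!\left[\ln \frac{q(Z)}{p(Z \mid X)}\right] \\
&= \ln p(X) - D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr).
\end{aligned}$$

Rearranging gives the equation above.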

Motivation and use in optimization

Returning to the context of variational Bayesian methods, the task of finding a distribution $q$ approximating $p(Z \mid X)$ can be framed as an optimization problem where we seek to minimize some measurement of divergence between the two distributions. One such measurement, very commonly used, is KL-divergence, which is also sometimes called relative entropy. (For other quantities that measure the dissimilarity between probability distributions, see the article on divergence.) Because

$$\ln p(X) \;=\; \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right] + D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr)$$

and the left-hand side does not depend on $q$, it follows that minimizing $D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr)$ is equivalent to maximizing the ELBO $\mathcal{L}(q)$. The quantity $\mathcal{L}(q)$ can be taken as a learning objective in ML tasks and architectures involving distribution approximation, for example in variational autoencoders. The reason $\mathcal{L}(q)$ is commonly used as an optimization target is that it can often be computed in cases where the true posterior $p(Z \mid X)$ cannot.
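To illustrate this use of the ELBO as an optimization target, here is a minimal sketch in Python (the toy model, the Gaussian variational family, and the learning rate are assumptions made for the example, not taken from the article or the reference). It maximizes the closed-form ELBO of a one-dimensional conjugate Gaussian model by gradient ascent; because the exact posterior lies inside the variational family, $q$ converges to it and the ELBO converges to $\ln p(x)$:

```python
import numpy as np

# Toy model (an assumption for this example, not from the article):
#   prior       Z ~ N(0, 1)
#   likelihood  X | Z ~ N(Z, 1)
# For an observed x the exact posterior is N(x/2, 1/2), so the result can be checked.
x = 1.7

def elbo(m, s):
    """Closed-form ELBO for the variational family q = N(m, s^2)."""
    expected_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)         # E_q[ln p(z)]
    expected_log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m)**2 + s**2)     # E_q[ln p(x|z)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)                             # H(q)
    return expected_log_prior + expected_log_lik + entropy

# Gradient ascent on the ELBO, using its analytic gradients:
#   d/dm ELBO = x - 2m,   d/ds ELBO = 1/s - 2s
m, s, lr = 0.0, 1.0, 0.05
for _ in range(2000):
    m += lr * (x - 2 * m)
    s += lr * (1 / s - 2 * s)

log_evidence = -0.5 * np.log(2 * np.pi * 2) - x**2 / 4    # marginally, X ~ N(0, 2)
print(f"fitted q:        N({m:.3f}, {s**2:.3f})")         # ~ N(0.850, 0.500)
print(f"exact posterior: N({x / 2:.3f}, 0.500)")
print(f"ELBO = {elbo(m, s):.4f},  ln p(x) = {log_evidence:.4f}")  # equal at the optimum
```

In models where $p(Z \mid X)$ has no closed form, the same objective is typically estimated by Monte Carlo sampling from $q$ and maximized with stochastic gradients, as in variational autoencoders.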

References

  1. Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].