Evidence lower bound

From Wikipedia, the free encyclopedia

In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.

Throughout this article, let $X$ and $Z$ be multivariate random variables, jointly-distributed with distribution $p$. So, for example, $p(X)$ is the marginal distribution of $X$, and $p(Z \mid X)$ is the conditional distribution of $Z$ given $X$. Then, for any distribution $q$, we have[1]

$$\ln p(X) \;\ge\; \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right].$$

The right-hand side of this inequality is called the evidence lower bound, or ELBO. We refer to the above inequality as the ELBO inequality.
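As a quick numerical check (a minimal sketch; the joint table and the choice of $q$ below are arbitrary assumptions made for illustration, not taken from the reference), the following Python snippet evaluates both sides of the ELBO inequality exactly for a small discrete model:

```python
import numpy as np

# A small discrete joint distribution p(x, z), with x in {0, 1} and z in {0, 1, 2}.
# Rows index x, columns index z; all entries sum to 1.
p_joint = np.array([[0.10, 0.25, 0.15],
                    [0.20, 0.05, 0.25]])

x = 0                                      # observed value of X
log_evidence = np.log(p_joint[x].sum())    # ln p(x), by direct marginalization over z

q = np.array([0.5, 0.3, 0.2])              # an arbitrary distribution over z

# ELBO = E_{Z~q}[ln p(x, Z) - ln q(Z)], here computed as an exact finite sum.
elbo = np.sum(q * (np.log(p_joint[x]) - np.log(q)))

print(f"ln p(x) = {log_evidence:.4f}")     # -0.6931
print(f"ELBO    = {elbo:.4f}")             # -0.9170, which is <= ln p(x)
```

Equality holds exactly when $q$ is the true posterior $p(Z \mid X = 0)$, which for this table is $(0.2, 0.5, 0.3)$.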

The ELBO comes up frequently in variational Bayesian methods. In that context, the random variable $X$ conceptually represents observable data, the variable $Z$ represents latent, unobservable data, and $p$ represents the true joint distribution of $X$ and $Z$. We often wish to find an approximation of the true posterior distribution $p(Z \mid X)$ via a simpler, usually parametric, distribution, and this is what $q$ conceptually represents. Finding $q$ can be framed as an optimization problem, and in that context the ELBO inequality is often used to obtain an optimization objective.

Terminology and notation

In the terminology of variational Bayesian methods, the distribution $p(X)$ is called the evidence. Because the log function is monotonic, the ELBO inequality can be rewritten to give a lower bound for the evidence, as

$$p(X) \;\ge\; \exp\!\left(\mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right]\right),$$

hence the name evidence lower bound. Some authors use the term evidence to mean $\ln p(X)$, in which case the original form of the inequality already gives a lower bound for the evidence. Some authors call $\ln p(X)$ the log-evidence, and some use the terms evidence and log-evidence interchangeably to refer either to $p(X)$ or to $\ln p(X)$.

The ELBO is sometimes denoted $\operatorname{ELBO}(q)$, $\mathcal{L}(q)$, or simply $\mathcal{L}$, as in

$$\mathcal{L}(q) \;=\; \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right].$$

Strictly speaking, the quantity $\mathcal{L}(q)$ defined this way is itself a random variable, jointly-distributed with $X$.

Relationship to entropy

The ELBO is closely related to the concepts of Shannon entropy, differential entropy, and cross-entropy. Abusing notation somewhat, we may write $\mathcal{L}(q) = H(q) - H(q, p)$, where $H(q)$ represents Shannon or differential entropy depending on whether our random variables are discrete or continuous, and $H(q, p)$ represents the cross-entropy between $q$ and $p$ as a function of $X$.
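Written out with the conventions above, this identity follows directly from the definition of the ELBO and of the two entropies:

$$\mathcal{L}(q) \;=\; \mathbb{E}_{Z \sim q}\!\left[\ln p(X, Z)\right] - \mathbb{E}_{Z \sim q}\!\left[\ln q(Z)\right] \;=\; -\,H(q, p) + H(q).$$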

Relationship to Kullback–Leibler divergence

The ELBO inequality can be derived as a consequence of the fact that KL-divergence is always non-negative. Observe that

$$\ln p(X) \;=\; \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right] + D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr),$$

where $D_{\mathrm{KL}}$ is Kullback–Leibler divergence. The desired inequality follows trivially from the above equation because $D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr) \ge 0$.
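The decomposition itself can be checked by expanding the joint distribution as $p(X, Z) = p(Z \mid X)\, p(X)$ inside the expectation:

$$\begin{aligned}
\mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right]
&= \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(Z \mid X)\, p(X)}{q(Z)}\right] \\
&= \ln p(X) - \mathbb{E}_{Z \sim q}\!\left[\ln \frac{q(Z)}{p(Z \mid X)}\right] \\
&= \ln p(X) - D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr).
\end{aligned}$$

Rearranging gives the equation above.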

Motivation and use in optimization

Returning to the context of variational Bayesian methods, the task of finding a distribution $q$ approximating $p(Z \mid X)$ can be framed as an optimization problem where we seek to minimize some measurement of divergence between the two distributions. One such measurement, very commonly used, is KL-divergence, which is also sometimes called relative entropy. (For other quantities that measure the dissimilarity between probability distributions, see the article on divergence.) Because

$$\ln p(X) \;=\; \mathbb{E}_{Z \sim q}\!\left[\ln \frac{p(X, Z)}{q(Z)}\right] + D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr)$$

and the left-hand side does not depend on $q$, it follows that minimizing $D_{\mathrm{KL}}\!\bigl(q(Z) \,\|\, p(Z \mid X)\bigr)$ is equivalent to maximizing the ELBO $\mathcal{L}(q)$. The quantity $\mathcal{L}(q)$ can be taken as a learning objective in ML tasks and architectures involving distribution approximation, for example in variational autoencoders. The reason $\mathcal{L}(q)$ is commonly used as an optimization target is that it can often be computed in cases where the true posterior $p(Z \mid X)$ cannot.
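To illustrate this use of the ELBO as an optimization target, here is a minimal sketch in Python (the toy model, the Gaussian variational family, and the learning rate are assumptions made for the example, not taken from the article or the reference). It maximizes the closed-form ELBO of a one-dimensional conjugate Gaussian model by gradient ascent; because the exact posterior lies inside the variational family, $q$ converges to it and the ELBO converges to $\ln p(x)$:

```python
import numpy as np

# Toy model (an assumption for this example, not from the article):
#   prior       Z ~ N(0, 1)
#   likelihood  X | Z ~ N(Z, 1)
# For an observed x the exact posterior is N(x/2, 1/2), so the result can be checked.
x = 1.7

def elbo(m, s):
    """Closed-form ELBO for the variational family q = N(m, s^2)."""
    expected_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)         # E_q[ln p(z)]
    expected_log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m)**2 + s**2)     # E_q[ln p(x|z)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)                             # H(q)
    return expected_log_prior + expected_log_lik + entropy

# Gradient ascent on the ELBO, using its analytic gradients:
#   d/dm ELBO = x - 2m,   d/ds ELBO = 1/s - 2s
m, s, lr = 0.0, 1.0, 0.05
for _ in range(2000):
    m += lr * (x - 2 * m)
    s += lr * (1 / s - 2 * s)

log_evidence = -0.5 * np.log(2 * np.pi * 2) - x**2 / 4    # marginally, X ~ N(0, 2)
print(f"fitted q:        N({m:.3f}, {s**2:.3f})")         # ~ N(0.850, 0.500)
print(f"exact posterior: N({x / 2:.3f}, 0.500)")
print(f"ELBO = {elbo(m, s):.4f},  ln p(x) = {log_evidence:.4f}")  # equal at the optimum
```

In models where $p(Z \mid X)$ has no closed form, the same objective is typically estimated by Monte Carlo sampling from $q$ and maximized with stochastic gradients, as in variational autoencoders.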

References

  1. Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].