Variational autoencoder

In machine learning, a variational autoencoder,^[1] also known as VAE, is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling, belonging to the families of probabilistic graphical models and variational Bayesian methods.

It is often associated with the autoencoder^[2]^[3] model because of its architectural affinity, but there are significant differences both in the goal and in the mathematical formulation. Variational autoencoders are meant to compress the input information into a constrained multivariate latent distribution (encoding) and then reconstruct it as accurately as possible (decoding). Although this type of model was initially designed for unsupervised learning,^[4]^[5] it has also been shown to be effective in other domains of machine learning such as semi-supervised learning^[6]^[7] and supervised learning.^[8]

Architecture

In a variational autoencoder, the latent variables are assumed to follow a parametrized distribution (the prior, in Bayesian inference terms). The encoder and decoder are trained jointly so that the output minimizes a reconstruction error while the Kullback–Leibler divergence between the parametric (approximate) posterior and the true posterior is kept small.^[9]^[10]^[11]

Formulation

Figure: The basic scheme of a variational autoencoder. The model receives $\mathbf{x}$ as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces $\mathbf{x'}$ as similar as possible to $\mathbf{x}$.

From a formal perspective, given an input dataset $\mathbf {x}$ characterized by an unknown probability function $P(\mathbf {x} )$ and a multivariate latent encoding vector $\mathbf {z}$ , the objective is to model the data as a distribution $p_{\theta }(\mathbf {x} )$ , with $\theta$ defined as the set of the network parameters.

It is possible to formalize this distribution as

$$p_{\theta}(\mathbf{x}) = \int_{\mathbf{z}} p_{\theta}(\mathbf{x}, \mathbf{z})\,d\mathbf{z}$$

where $p_{\theta }(\mathbf {x} )$ is the evidence of the model's data, obtained by marginalizing over the unobserved variables, and $p_{\theta }(\mathbf {x,z} )$ is the joint distribution of the input data and its latent representation according to the network parameters $\theta$ .

According to the chain rule, the equation can be rewritten as

$$p_{\theta}(\mathbf{x}) = \int_{\mathbf{z}} p_{\theta}(\mathbf{x} \mid \mathbf{z})\, p_{\theta}(\mathbf{z})\,d\mathbf{z}$$

In the vanilla variational autoencoder, $\mathbf {z}$ is assumed to have finite dimension and $p_{\theta }(\mathbf {x} \mid \mathbf {z} )$ is assumed to be a Gaussian distribution. Then $p_{\theta }(\mathbf {x} )$ is a mixture of Gaussian distributions, with one component for every value of $\mathbf {z}$ .
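The marginalization can be checked numerically on a toy model. The sketch below assumes a one-dimensional latent with prior $p(z)={\mathcal {N}}(0,1)$ and likelihood $p(x\mid z)={\mathcal {N}}(z,1)$; these choices are made here for illustration only and are not taken from the source. For this model the marginal is known in closed form, $p(x)={\mathcal {N}}(0,2)$, so averaging the likelihood over samples from the prior should approach it.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def marginal_mc(x, n_samples=20000, seed=0):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)].

    Toy assumptions: p(z) = N(0, 1) and p(x | z) = N(z, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)          # sample from the prior
        total += normal_pdf(x, z, 1.0)   # likelihood of x given this z
    return total / n_samples


def marginal_exact(x):
    """Closed-form marginal for the toy model: p(x) = N(0, 2)."""
    return normal_pdf(x, 0.0, 2.0)


estimate = marginal_mc(0.5)
exact = marginal_exact(0.5)
print(f"Monte Carlo: {estimate:.4f}  exact: {exact:.4f}")
```

With a neural-network decoder and a high-dimensional latent space, no closed form is available and this naive estimator needs far too many samples to be practical.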

It is now possible to define the set of the relationships between the input data and its latent representation as

Prior $p_{\theta }(\mathbf {z} )$
Likelihood $p_{\theta }(\mathbf {x} \mid \mathbf {z} )$
Posterior $p_{\theta }(\mathbf {z} \mid \mathbf {x} )$
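These three quantities are tied together by Bayes' theorem, which also shows why the evidence matters: the posterior cannot be evaluated without the normalizing constant $p_{\theta }(\mathbf {x} )$ .

```latex
p_{\theta}(\mathbf{z} \mid \mathbf{x})
  = \frac{p_{\theta}(\mathbf{x} \mid \mathbf{z})\, p_{\theta}(\mathbf{z})}{p_{\theta}(\mathbf{x})}
```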

Unfortunately, the computation of $p_{\theta }(\mathbf {x} )$ is very expensive and in most cases intractable. To make the calculation feasible, it is necessary to introduce a further function that approximates the posterior distribution:

$$q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \approx p_{\theta}(\mathbf{z} \mid \mathbf{x})$$

with $\Phi$ defined as the set of real values that parametrize $q$ .

In this way, the overall problem can be easily translated into the autoencoder domain, in which the conditional likelihood distribution $p_{\theta }(\mathbf {x} \mid \mathbf {z} )$ is carried by the probabilistic decoder, while the approximated posterior distribution $q_{\Phi }(\mathbf {z} \mid \mathbf {x} )$ is computed by the probabilistic encoder.
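The division of labour between the two networks can be sketched as a single forward pass. This is only an illustration: single linear layers stand in for the neural networks, the approximate posterior is assumed to be a diagonal Gaussian, and all names, sizes and weights (`params`, `x_dim`, `z_dim`) are hypothetical.

```python
import math
import random


def linear(weights, bias, vec):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def encoder(x, params):
    """Probabilistic encoder q(z | x): returns the mean and log-variance of z."""
    mu = linear(params["w_mu"], params["b_mu"], x)
    log_var = linear(params["w_lv"], params["b_lv"], x)
    return mu, log_var


def decoder(z, params):
    """Probabilistic decoder p(x | z): returns the mean of a Gaussian over x."""
    return linear(params["w_dec"], params["b_dec"], z)


def random_matrix(rng, rows, cols):
    """Small random weights, used only to make the example runnable."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x_dim, z_dim = 4, 2  # hypothetical sizes
params = {
    "w_mu": random_matrix(rng, z_dim, x_dim), "b_mu": [0.0] * z_dim,
    "w_lv": random_matrix(rng, z_dim, x_dim), "b_lv": [0.0] * z_dim,
    "w_dec": random_matrix(rng, x_dim, z_dim), "b_dec": [0.0] * x_dim,
}

x = [0.2, -1.0, 0.5, 0.7]
mu, log_var = encoder(x, params)
# Draw one latent sample z ~ q(z | x) = N(mu, diag(exp(log_var))).
z = [rng.gauss(m, math.exp(0.5 * lv)) for m, lv in zip(mu, log_var)]
x_reconstructed = decoder(z, params)
print(mu, log_var, x_reconstructed)
```

The encoder outputs the parameters of a distribution rather than a single code, which is the main structural difference from a plain autoencoder.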

ELBO loss function

As in every deep learning problem, it is necessary to define a differentiable loss function in order to update the network weights through backpropagation.

For variational autoencoders the idea is to jointly optimize the generative model parameters $\theta$ , to reduce the reconstruction error between the input and the output of the network, and the parameters $\Phi$ , to bring $q_{\Phi }(\mathbf {z\mid x} )$ as close as possible to $p_{\theta }(\mathbf {z} \mid \mathbf {x} )$ .

As reconstruction loss, mean squared error and cross entropy represent good alternatives.
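Both choices can be read as negative log-likelihoods of the decoder: up to constants, mean squared error corresponds to a Gaussian $p_{\theta }(\mathbf {x} \mid \mathbf {z} )$ with fixed variance, and binary cross entropy to a Bernoulli one. A minimal sketch of the two losses, with hypothetical function names and example values:

```python
import math


def mean_squared_error(x, x_hat):
    """Average squared difference between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Cross entropy for targets in [0, 1] and predictions in (0, 1), summed over elements."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return total


x = [0.0, 1.0, 1.0, 0.0]
x_hat = [0.1, 0.8, 0.9, 0.2]
print(mean_squared_error(x, x_hat))
print(binary_cross_entropy(x, x_hat))
```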

As distance loss between the two distributions the reverse Kullback–Leibler divergence $D_{KL}(q_{\Phi }(\mathbf {z\mid x} )\parallel p_{\theta }(\mathbf {z\mid x} ))$ is a good choice to squeeze $q_{\Phi }(\mathbf {z\mid x} )$ under $p_{\theta }(\mathbf {z} \mid \mathbf {x} )$ .^[1]^[12]
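The word "reverse" matters because the divergence is not symmetric: the expectation is taken under $q_{\Phi }$ , so $q_{\Phi }$ is penalized heavily for placing mass where the target has little. The sketch below uses the standard closed form for two univariate Gaussians; the specific means and standard deviations are arbitrary illustrative values.

```python
import math


def kl_gaussian(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians (closed form)."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5


# q is narrow, p is wide: the two directions give different values.
reverse_kl = kl_gaussian(0.0, 0.5, 0.0, 2.0)  # D_KL(q || p)
forward_kl = kl_gaussian(0.0, 2.0, 0.0, 0.5)  # D_KL(p || q)
print(reverse_kl, forward_kl)
```

A narrow $q$ sitting inside a wide $p$ gives a small reverse divergence, while the opposite direction is much larger.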

The distance loss just defined is expanded as

$$
\begin{aligned}
D_{KL}(q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \parallel p_{\theta}(\mathbf{z} \mid \mathbf{x}))
&= \int q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \log \frac{q_{\Phi}(\mathbf{z} \mid \mathbf{x})}{p_{\theta}(\mathbf{z} \mid \mathbf{x})}\,d\mathbf{z} \\
&= \int q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \log \frac{q_{\Phi}(\mathbf{z} \mid \mathbf{x})\, p_{\theta}(\mathbf{x})}{p_{\theta}(\mathbf{z}, \mathbf{x})}\,d\mathbf{z} \\
&= \int q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \left( \log(p_{\theta}(\mathbf{x})) + \log \frac{q_{\Phi}(\mathbf{z} \mid \mathbf{x})}{p_{\theta}(\mathbf{z}, \mathbf{x})} \right) d\mathbf{z} \\
&= \log(p_{\theta}(\mathbf{x})) + \int q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \log \frac{q_{\Phi}(\mathbf{z} \mid \mathbf{x})}{p_{\theta}(\mathbf{z}, \mathbf{x})}\,d\mathbf{z} \\
&= \log(p_{\theta}(\mathbf{x})) + \int q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \log \frac{q_{\Phi}(\mathbf{z} \mid \mathbf{x})}{p_{\theta}(\mathbf{x} \mid \mathbf{z})\, p_{\theta}(\mathbf{z})}\,d\mathbf{z} \\
&= \log(p_{\theta}(\mathbf{x})) + E_{\mathbf{z} \sim q_{\Phi}(\mathbf{z} \mid \mathbf{x})}\left( \log \frac{q_{\Phi}(\mathbf{z} \mid \mathbf{x})}{p_{\theta}(\mathbf{z})} - \log(p_{\theta}(\mathbf{x} \mid \mathbf{z})) \right) \\
&= \log(p_{\theta}(\mathbf{x})) + D_{KL}(q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \parallel p_{\theta}(\mathbf{z})) - E_{\mathbf{z} \sim q_{\Phi}(\mathbf{z} \mid \mathbf{x})}\left( \log(p_{\theta}(\mathbf{x} \mid \mathbf{z})) \right)
\end{aligned}
$$

At this point, it is possible to rewrite the equation as

$$\log(p_{\theta}(\mathbf{x})) - D_{KL}(q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \parallel p_{\theta}(\mathbf{z} \mid \mathbf{x})) = E_{\mathbf{z} \sim q_{\Phi}(\mathbf{z} \mid \mathbf{x})}\left( \log(p_{\theta}(\mathbf{x} \mid \mathbf{z})) \right) - D_{KL}(q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \parallel p_{\theta}(\mathbf{z}))$$

The goal is to maximize the left-hand side of the equation: increasing the log-likelihood $\log(p_{\theta }(\mathbf {x} ))$ improves the quality of the generated data, while decreasing the divergence term brings the estimated posterior closer to the real one.

This is equivalent to minimizing the negative of that quantity, which is common practice in optimization problems.

The loss function so obtained is the negative of the evidence lower bound (ELBO), and is usually called the ELBO loss function. It can be written as

$$L_{\theta,\Phi} = -\log(p_{\theta}(\mathbf{x})) + D_{KL}(q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \parallel p_{\theta}(\mathbf{z} \mid \mathbf{x})) = -E_{\mathbf{z} \sim q_{\Phi}(\mathbf{z} \mid \mathbf{x})}\left( \log(p_{\theta}(\mathbf{x} \mid \mathbf{z})) \right) + D_{KL}(q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \parallel p_{\theta}(\mathbf{z}))$$
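The right-hand side is the form used in practice, because it no longer involves the intractable posterior. The sketch below estimates it under two assumptions that are not stated in the text above: the prior is a standard normal ${\mathcal {N}}(0,{\boldsymbol {I}})$ and $q_{\Phi }$ is a diagonal Gaussian, in which case the divergence term has a closed form. The decoder `toy_decode` and all numeric values are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_mean):
    """log p(x | z) for a unit-variance Gaussian decoder with mean x_mean."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (a - b) ** 2
               for a, b in zip(x, x_mean))


def negative_elbo(x, mu, log_var, decode, n_samples=1, rng=random):
    """Monte Carlo estimate of L = -E_q[log p(x | z)] + D_KL(q(z | x) || p(z))."""
    expected_ll = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(m, math.exp(0.5 * lv)) for m, lv in zip(mu, log_var)]
        expected_ll += gaussian_log_likelihood(x, decode(z))
    expected_ll /= n_samples
    return -expected_ll + kl_to_standard_normal(mu, log_var)


def toy_decode(z):
    """Hypothetical decoder for illustration: maps a 2-D latent to a 3-D mean."""
    return [z[0], z[1], z[0] + z[1]]


loss = negative_elbo([0.1, -0.3, 0.0], mu=[0.0, 0.0], log_var=[-1.0, -1.0],
                     decode=toy_decode, n_samples=100, rng=random.Random(0))
print(loss)
```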

Since the Kullback–Leibler divergence is non-negative, $-L_{\theta ,\Phi }$ is a lower bound on the log-evidence:

$$-L_{\theta,\Phi} = \log(p_{\theta}(\mathbf{x})) - D_{KL}(q_{\Phi}(\mathbf{z} \mid \mathbf{x}) \parallel p_{\theta}(\mathbf{z} \mid \mathbf{x})) \leq \log(p_{\theta}(\mathbf{x}))$$
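The bound can be verified exactly on a toy model where every term has a closed form. The sketch assumes a scalar latent with $p(z)={\mathcal {N}}(0,1)$ and $p(x\mid z)={\mathcal {N}}(z,1)$ , so that $p(x)={\mathcal {N}}(0,2)$ and the true posterior is ${\mathcal {N}}(x/2,1/2)$ ; the model and the candidate values of `m` and `s2` are illustrative assumptions, not part of the source.

```python
import math


def log_evidence(x):
    """log p(x) for the toy model z ~ N(0, 1), x | z ~ N(z, 1), i.e. x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0


def elbo(x, m, s2):
    """E_q[log p(x | z)] - D_KL(q || p(z)) for q(z | x) = N(m, s2), in closed form."""
    expected_ll = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_prior = 0.5 * (m * m + s2 - 1.0 - math.log(s2))
    return expected_ll - kl_prior


def kl_to_posterior(x, m, s2):
    """D_KL(q || p(z | x)), using the exact posterior p(z | x) = N(x / 2, 1 / 2)."""
    pm, pv = x / 2.0, 0.5
    return 0.5 * (math.log(pv / s2) + (s2 + (m - pm) ** 2) / pv - 1.0)


x = 1.0
for m, s2 in [(0.0, 1.0), (0.3, 0.8), (0.5, 0.5)]:
    gap = log_evidence(x) - elbo(x, m, s2)
    print(f"m={m}, s2={s2}: ELBO={elbo(x, m, s2):.4f}, gap={gap:.4f}, "
          f"KL={kl_to_posterior(x, m, s2):.4f}")
```

The gap between the log-evidence and the bound equals the divergence to the true posterior, and it closes only when $q$ matches that posterior (the last pair of values).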

The optimal parameters are the ones that minimize this loss function. The problem can be summarized as

$\theta ^{*},\Phi ^{*}={\underset {\theta ,\Phi }{\operatorname {argmin} }}\,L_{\theta ,\Phi }$
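A minimal sketch of this minimization, assuming a toy model with $p(z)={\mathcal {N}}(0,1)$ , $p(x\mid z)={\mathcal {N}}(z,1)$ and $q(z\mid x)={\mathcal {N}}(m,e^{\rho })$ . The decoder has no free parameters here, so only the variational parameters $(m,\rho )$ are updated; the gradients are derived by hand, and the step size and iteration count are arbitrary.

```python
import math


def grad_negative_elbo(x, m, rho):
    """Gradient of the toy negative ELBO w.r.t. m and rho = log(s^2).

    Toy model: p(z) = N(0, 1), p(x | z) = N(z, 1), q(z | x) = N(m, exp(rho)).
    L(m, rho) = 0.5 * ((x - m)^2 + exp(rho))
                + 0.5 * (m^2 + exp(rho) - 1 - rho) + const.
    """
    d_m = 2.0 * m - x
    d_rho = math.exp(rho) - 0.5
    return d_m, d_rho


def fit_posterior(x, steps=500, lr=0.1):
    """Plain gradient descent on (m, rho), starting from the prior N(0, 1)."""
    m, rho = 0.0, 0.0
    for _ in range(steps):
        d_m, d_rho = grad_negative_elbo(x, m, rho)
        m -= lr * d_m
        rho -= lr * d_rho
    return m, math.exp(rho)


m_star, s2_star = fit_posterior(x=1.0)
print(m_star, s2_star)  # approaches the exact posterior N(0.5, 0.5)
```

In this toy case the minimizer coincides with the true posterior; with neural networks the same loop is run by an automatic-differentiation framework over both $\theta$ and $\Phi$ .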

The main advantage of this formulation lies in the possibility of jointly optimizing with respect to the parameters $\theta$ and $\Phi$ .

Before the gradient of the ELBO loss function can be backpropagated, the loss must be made differentiable with respect to the encoder parameters. This is done with the so-called reparameterization trick, which removes the stochastic sampling from the formulation.

Reparameterization trick

Figure: The scheme of the reparameterization trick. The random variable $\mathbf{\varepsilon}$ is injected into the latent space $\mathbf{z}$ as an external input. In this way, it is possible to backpropagate the gradient without involving a stochastic variable during the update.

To make the ELBO formulation suitable for training purposes, it is necessary to introduce a further minor modification to the formulation of the problem as well as to the structure of the variational autoencoder.^[1]^[13]^[14]

Stochastic sampling is the non-differentiable operation through which it is possible to sample from the latent space and feed the probabilistic decoder.

To make backpropagation-based training, such as stochastic gradient descent, feasible, the reparameterization trick is introduced.

The main assumption about the latent space is that it can be considered as a set of multivariate Gaussian distributions, and thus can be described as

$$\mathbf{z} \sim q_{\Phi}(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^{2})$$

Figure: The scheme of a variational autoencoder after the reparameterization trick.

Given ${\boldsymbol {\varepsilon }}\sim {\mathcal {N}}(0,{\boldsymbol {I}})$ and $\odot$ defined as the element-wise product, the reparameterization trick modifies the above equation as

$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}.$$

Thanks to this transformation, which can also be extended to distributions other than the Gaussian, the variational autoencoder becomes trainable. The probabilistic encoder learns to map a compressed representation of the input to the two latent vectors ${\boldsymbol {\mu }}$ and ${\boldsymbol {\sigma }}$ , while the stochasticity is excluded from the updating process and is injected into the latent space as an external input through the random vector ${\boldsymbol {\varepsilon }}$ .
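The effect of the trick can be illustrated without a neural network. In the sketch below, the quantity being differentiated is $E[z^{2}]$ for a scalar $z=\mu +\sigma \varepsilon$ , chosen only because its exact gradients are known; the function names and sample count are hypothetical. Once $\varepsilon$ is treated as an external input, the chain rule can be applied sample by sample.

```python
import random


def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps (element-wise), with eps drawn from N(0, I)."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def pathwise_gradients(mu, sigma, n_samples=50000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z^2] for scalar z = mu + sigma * eps.

    Because z is a deterministic function of (mu, sigma) once eps is fixed,
    the chain rule applies inside the expectation:
        d(z^2)/dmu = 2 * z,    d(z^2)/dsigma = 2 * z * eps.
    """
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        (z,) = reparameterize([mu], [sigma], [eps])
        g_mu += 2.0 * z
        g_sigma += 2.0 * z * eps
    return g_mu / n_samples, g_sigma / n_samples


g_mu, g_sigma = pathwise_gradients(mu=1.0, sigma=0.5)
# E[z^2] = mu^2 + sigma^2, so the exact gradients are 2 * mu = 2.0 and 2 * sigma = 1.0.
print(g_mu, g_sigma)
```

In a full model the same idea lets an automatic-differentiation framework propagate the decoder's gradient through $\mathbf {z}$ into the encoder outputs.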

Variations

Many applications and extensions of variational autoencoders adapt the architecture to different domains and improve its performance.

$\beta$ -VAE is an implementation with a weighted Kullback–Leibler divergence term that aims to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for $\beta$ values greater than one. The authors demonstrate this architecture's ability to generate high-quality synthetic samples.^[15]^[16]
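In terms of the loss, the change is a single multiplicative factor on the divergence term. A minimal sketch, assuming a standard normal prior and a diagonal Gaussian $q_{\Phi }$ so that the divergence has a closed form; the default `beta=4.0` and the example values are arbitrary.

```python
import math


def beta_vae_loss(reconstruction_loss, mu, log_var, beta=4.0):
    """Reconstruction term plus beta times D_KL(q(z | x) || N(0, I))."""
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return reconstruction_loss + beta * kl


mu, log_var = [0.5, -0.2], [0.0, -1.0]
print(beta_vae_loss(1.0, mu, log_var, beta=1.0))  # standard VAE objective
print(beta_vae_loss(1.0, mu, log_var, beta=4.0))  # stronger pressure toward the prior
```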

Another implementation, the conditional variational autoencoder (CVAE), is designed to insert label information into the latent space so as to force a deterministic, constrained representation of the learned data.^[17]
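One common way to inject the label, assumed here for illustration rather than taken from the cited paper, is to concatenate a one-hot encoding of it to the vector fed to the decoder (and often to the encoder input as well). The helper names below are hypothetical.

```python
def one_hot(label, n_classes):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def condition(vec, label, n_classes):
    """Append the one-hot label to a vector (encoder input x or latent sample z)."""
    return list(vec) + one_hot(label, n_classes)


z = [0.3, -1.2]                 # latent sample from the encoder
decoder_input = condition(z, label=2, n_classes=3)
print(decoder_input)            # [0.3, -1.2, 0.0, 0.0, 1.0]
```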

Some structures directly deal with the quality of the generated samples^[18]^[19] or implement more than one latent space to further improve the representation learning.^[20]^[21]

Some architectures mix the structures of variational autoencoders and generative adversarial networks to obtain hybrid models with high generative capabilities.^[22]^[23]^[24]

References

[1] Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
[2] Kramer, Mark A. (1991). "Nonlinear principal component analysis using autoassociative neural networks". AIChE Journal. 37 (2): 233–243. doi:10.1002/aic.690370209.
[3] Hinton, G. E.; Salakhutdinov, R. R. (2006-07-28). "Reducing the Dimensionality of Data with Neural Networks". Science. 313 (5786): 504–507. Bibcode:2006Sci...313..504H. doi:10.1126/science.1127647. PMID 16873662. S2CID 1658773.
[4] Dilokthanakul, Nat; Mediano, Pedro A. M.; Garnelo, Marta; Lee, Matthew C. H.; Salimbeni, Hugh; Arulkumaran, Kai; Shanahan, Murray (2017-01-13). "Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders". arXiv:1611.02648 [cs.LG].
[5] Hsu, Wei-Ning; Zhang, Yu; Glass, James (December 2017). "Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation". 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). pp. 16–23. arXiv:1707.06265. doi:10.1109/ASRU.2017.8268911. ISBN 978-1-5090-4788-8. S2CID 22681625.
[6] Ehsan Abbasnejad, M.; Dick, Anthony; van den Hengel, Anton (2017). "Infinite Variational Autoencoder for Semi-Supervised Learning". pp. 5888–5897.
[7] Xu, Weidi; Sun, Haoze; Deng, Chao; Tan, Ying (2017-02-12). "Variational Autoencoder for Semi-Supervised Text Classification". Proceedings of the AAAI Conference on Artificial Intelligence. 31 (1).
[8] Kameoka, Hirokazu; Li, Li; Inoue, Shota; Makino, Shoji (2019-09-01). "Supervised Determined Source Separation with Multichannel Variational Autoencoder". Neural Computation. 31 (9): 1891–1914. doi:10.1162/neco_a_01217. PMID 31335290. S2CID 198168155.
[9] An, J.; Cho, S. (2015). "Variational autoencoder based anomaly detection using reconstruction probability". Special Lecture on IE. 2 (1).
[10] Khobahi, S.; Soltanalian, M. (2019). "Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding". arXiv:1911.12410 [eess.SP].
[11] Kingma, Diederik P.; Welling, Max (2019). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4): 307–392. arXiv:1906.02691. doi:10.1561/2200000056. ISSN 1935-8237. S2CID 174802445.
[12] "From Autoencoder to Beta-VAE". Lil'Log. 2018-08-12.
[13] Bengio, Yoshua; Courville, Aaron; Vincent, Pascal (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (8): 1798–1828. arXiv:1206.5538. doi:10.1109/TPAMI.2013.50. ISSN 1939-3539. PMID 23787338. S2CID 393948.
[14] Kingma, Diederik P.; Rezende, Danilo J.; Mohamed, Shakir; Welling, Max (2014-10-31). "Semi-Supervised Learning with Deep Generative Models". arXiv:1406.5298 [cs.LG].
[15] Higgins, Irina; Matthey, Loic; Pal, Arka; Burgess, Christopher; Glorot, Xavier; Botvinick, Matthew; Mohamed, Shakir; Lerchner, Alexander (2016-11-04). "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework".
[16] Burgess, Christopher P.; Higgins, Irina; Pal, Arka; Matthey, Loic; Watters, Nick; Desjardins, Guillaume; Lerchner, Alexander (2018-04-10). "Understanding disentangling in β-VAE". arXiv:1804.03599 [stat.ML].
[17] Sohn, Kihyuk; Lee, Honglak; Yan, Xinchen (2015-01-01). "Learning Structured Output Representation using Deep Conditional Generative Models" (PDF).
[18] Dai, Bin; Wipf, David (2019-10-30). "Diagnosing and Enhancing VAE Models". arXiv:1903.05789 [cs.LG].
[19] Dorta, Garoe; Vicente, Sara; Agapito, Lourdes; Campbell, Neill D. F.; Simpson, Ivor (2018-07-31). "Training VAEs Under Structured Residuals". arXiv:1804.01050 [stat.ML].
[20] Tomczak, Jakub; Welling, Max (2018-03-31). "VAE with a VampPrior". International Conference on Artificial Intelligence and Statistics. PMLR: 1214–1223. arXiv:1705.07120.
[21] Razavi, Ali; Oord, Aaron van den; Vinyals, Oriol (2019-06-02). "Generating Diverse High-Fidelity Images with VQ-VAE-2". arXiv:1906.00446 [cs.LG].
[22] Larsen, Anders Boesen Lindbo; Sønderby, Søren Kaae; Larochelle, Hugo; Winther, Ole (2016-06-11). "Autoencoding beyond pixels using a learned similarity metric". International Conference on Machine Learning. PMLR: 1558–1566. arXiv:1512.09300.
[23] Bao, Jianmin; Chen, Dong; Wen, Fang; Li, Houqiang; Hua, Gang (2017). "CVAE-GAN: Fine-Grained Image Generation Through Asymmetric Training". pp. 2745–2754. arXiv:1703.10155 [cs.CV].
[24] Gao, Rui; Hou, Xingsong; Qin, Jie; Chen, Jiaxin; Liu, Li; Zhu, Fan; Zhang, Zhao; Shao, Ling (2020). "Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning". IEEE Transactions on Image Processing. 29: 3665–3680. Bibcode:2020ITIP...29.3665G. doi:10.1109/TIP.2020.2964429. ISSN 1941-0042. PMID 31940538. S2CID 210334032.
