This tutorial discusses MMD variational autoencoders (MMD-VAE), a member of the InfoVAE family. It is an alternative to the traditional variational autoencoder that is fast to train, stable, and easy to implement. Traditional variational autoencoders also suffer from several problems, such as uninformative latent features [1, 2, 3] and variance over-estimation in feature space; MMD-VAE (like all members of the InfoVAE family) suffers from neither problem.

# Warm-up: Variational Autoencoding

We begin with an exposition of Variational Autoencoders (VAE in short) [4, 5] from the perspective of unsupervised representation learning.

Suppose we have a collection of images, and would like to learn useful features such as object categories and other semantically meaningful attributes. However, without explicit labels or the specification of a relevant task to solve, it is unclear what constitutes a “good feature”.

One possibility is to define “good features” as features that are useful for compressing the data. Consider the following communication game. Alice has some distribution over images $x \sim p_{\mathrm{data}}(x)$, where each image has 1 million pixels. Each time she would like to sample an image and send it to Bob, but she can only send 100 bits of information per image. She has to agree on a communication protocol with Bob beforehand, so that with this very limited information Bob can reconstruct the image as “accurately” as possible.

Let $z$ be the 100-bit message Alice sends (which we term the “latent feature”) and $x$ the image in pixel space. In the most general case, where we allow some randomness, Alice needs to specify an encoding distribution $q_\phi(z \vert x)$ and Bob a decoding distribution $p_\theta(x \vert z)$. Alice produces a distribution over messages, samples one, and passes it to Bob, while Bob produces a distribution over possible reconstructions based on the message. In practice, Alice and Bob are often modeled by deep neural networks, where $\phi$ and $\theta$ are the parameters of the networks.

What remains to be formalized is what we mean by “accurate” reconstruction. This is actually an important open question [6, 7], but here we assume that “accurate” means that the original image Alice wants to send is assigned high probability $p_\theta(x \vert z)$ by Bob, conditioned on the received message (latent feature $z$).

How should we choose the encoding and decoding distributions $q_\phi$ and $p_\theta$? One possibility is to jointly maximize the following reconstruction objective for each image $x \sim p_{\mathrm{data}}(x)$:

$$\mathcal{L}_{\mathrm{AE}} = \mathbb{E}_{q_\phi(z \vert x)}[\log p_\theta(x \vert z)]$$

This objective encourages Alice to generate good messages, so that based on the message Bob can assign very high probability to the original image. The hope is that if Alice and Bob can accomplish this, the message (latent feature) will contain the most salient features, such as the object category and the main factors of variation in the data distribution.

In addition, we may want to manipulate the message space. For example, Alice can observe an image, generate the latent features, but change only a part of them, such as certain attributes, and Bob should still be able to generate sensible output reflecting the change. Alice can even simply come up with a random valid message for Bob to generate from. (So we have a generative model.) This is also important because Alice can almost always show Bob only a finite collection of images (a dataset that is tiny compared to the set of all possible images). For example, if Alice and Bob have only been trained to communicate red cars but no cars of any other color, it should still be natural for Alice to produce meaningful messages for blue cars, and for Bob to produce blue cars given those messages. (Generalization.)

To do this we must define the space of valid messages. This can be achieved by defining a “prior” distribution over valid messages $p(z)$, usually a very simple distribution such as a uniform or a Gaussian. When observing all possible images from the true underlying distribution $p_{data}(x)$, the messages Alice generates form a distribution $q_\phi(z) = \mathbb{E}_{p_{data}(x)} [q_\phi(z\vert x)]$. We would like the message distribution Alice generates to match the distribution of valid messages $p(z)$, and we would like Bob, when observing any message sampled from $p(z)$, to generate the corresponding correct reconstruction. The number of training examples is still finite, but at least we have now explicitly defined our generalization goal, whereas previously it was unclear what we meant by messages Alice and Bob have not trained on but that are still “valid”. Whether this generalization happens is a major performance metric for VAEs, one that is often verified visually by human judgment.

Therefore we get a generic family of variational auto-encoding criteria that generalizes the original VAE objective [4]:

$$\mathcal{L} = \mathbb{E}_{p_{data}(x)} \mathbb{E}_{q_\phi(z \vert x)}[\log p_\theta(x \vert z)] - \lambda D(q_\phi(z) \Vert p(z))$$

where $D$ is any strict divergence, meaning that $D(q \Vert p) \geq 0$ and $D(q \Vert p) = 0$ if and only if $q = p$, and $\lambda>0$ is a scaling coefficient.

Which divergence should one choose? Traditionally people have used $\mathbb{E}_{p_{data}(x)}[ -\mathrm{KL}(q_\phi(z \vert x) \Vert p(z)) ]$, which is optimized to $0$ if the message distribution $q_\phi(z\vert x)$ Alice generates for each input $x$ matches the prior $p(z)$. Then, of course, the aggregate over all possible inputs $q_\phi(z)$ must also match $p(z)$. It should be immediately obvious that this can be problematic in certain scenarios: it is not clear why we would prefer Alice to generate the same message distribution for every possible input, as this works against our goal of learning good features. We will return to this later. For now we define an alternative divergence, Maximum Mean Discrepancy [8], which we will show to have many advantages over the traditional approach.

# Maximum Mean Discrepancy

Maximum mean discrepancy (MMD) [8] is based on the idea that two distributions are identical if and only if all of their moments are identical. Therefore, a divergence can be defined by measuring how “different” the moments of two distributions $p(z)$ and $q(z)$ are. MMD does this efficiently via the kernel trick:

$$\mathrm{MMD}(p \Vert q) = \mathbb{E}_{p(z), p(z')}[k(z, z')] + \mathbb{E}_{q(z), q(z')}[k(z, z')] - 2 \, \mathbb{E}_{p(z), q(z')}[k(z, z')]$$

where $k(z, z')$ is any universal kernel, such as the Gaussian kernel $k(z, z') = e^{-\frac{\lVert z - z' \rVert^2}{2\sigma^2}}$. Readers unfamiliar with kernels can intuitively interpret a kernel as a function that measures the “similarity” of two samples: it has a large value when two samples are similar and a small value when they are not. For example, the Gaussian kernel considers points close by in Euclidean space “similar”. A rough intuition for MMD, then, is that if two distributions are identical, the average “similarity” between samples from each distribution should equal the average “similarity” between mixed samples from the two distributions.
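This intuition is easy to check numerically. The sketch below (plain NumPy, with a fixed bandwidth $\sigma^2 = 1$; all names here are illustrative, not from the released code) estimates MMD from samples: it comes out near zero for two sets of samples from the same Gaussian, and clearly positive once one of the distributions is shifted.

```python
import numpy as np

def gaussian_kernel(x, y, sigma_sqr=1.0):
    # (n, dim), (m, dim) -> (n, m) matrix of kernel values k(x_i, y_j)
    sq_dist = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma_sqr))

def mmd(p_samples, q_samples):
    # E[k(z, z')] under p, plus E[k(z, z')] under q, minus twice the mixed term
    return (gaussian_kernel(p_samples, p_samples).mean()
            + gaussian_kernel(q_samples, q_samples).mean()
            - 2.0 * gaussian_kernel(p_samples, q_samples).mean())

rng = np.random.default_rng(0)
mmd_same = mmd(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
mmd_diff = mmd(rng.normal(size=(500, 2)), rng.normal(loc=3.0, size=(500, 2)))
print(mmd_same, mmd_diff)  # the first is near 0, the second clearly larger
```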

# Why Use an MMD Variational Autoencoder

In this section we argue in detail why we might prefer MMD-VAE over the traditional evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{p_{data}(x)} \left[ -\mathrm{KL}(q_\phi(z \vert x) \Vert p(z)) + \mathbb{E}_{q_\phi(z \vert x)}[\log p_\theta(x \vert z)] \right]$$

We show that the ELBO suffers from two problems that MMD-VAE does not.

### Uninformative Latent Code

People have noticed that the KL divergence term $\mathrm{KL}(q_\phi(z \vert x) \Vert p(z))$ is too restrictive [1, 2, 3]. Intuitively, it encourages the message $q_\phi(z \vert x)$ Alice generates for each $x$ to look like a random sample from $p(z)$, which makes it completely uninformative about the input. If Bob can also generate a very complex distribution given any message, then Alice and Bob can globally maximize the ELBO with a trivial strategy: Alice produces $p(z)$ regardless of the input, and Bob produces $p_{data}(x)$ regardless of Alice’s message. This means we have failed to learn any meaningful latent representation, which is what we designed our model to do in the first place. Even when Bob cannot generate a complex distribution (e.g. $p_\theta(x \vert z)$ is a Gaussian), this term still encourages the model to under-use the latent code. Several methods have been proposed [1, 2, 3] to alleviate this problem, but they involve additional overhead and do not completely solve it.

MMD variational autoencoders do not suffer from this problem at all. As shown in our paper, the objective always prefers to maximize the mutual information between $x$ and $z$, regardless of the capabilities of Alice and Bob.

### Variance Over-estimation in Feature Space

Another problem with the ELBO-VAE is that it tends to overfit the data and, at the same time, learn a $q_\phi(z)$ whose variance tends to infinity. As a simple example, consider training with the ELBO on a dataset with two datapoints $\lbrace -1, 1 \rbrace$. We prove in our paper that $\mathcal{L}_{ELBO}$ can be optimized to infinity by having

1. $\mathbb{E}[q_\phi(z \vert x=-1)] \to -\infty$ and $\mathbb{E}[q_\phi(z \vert x=1)] \to +\infty$
2. $p_\theta(x \vert z)$ is infinitely concentrated around $x=-1$ when $z < 0$, and infinitely concentrated around $x=1$ when $z > 0$.

The problem here is that, for the ELBO, the latent feature regularization (which is supposed to encourage $q_\phi(z)$ to match $p(z)$) is not strong enough. As a result, instead of matching $p(z)$, $q_\phi(z)$ is trained to have infinite variance. This can be verified and visualized experimentally: observe how the mass of $q_\phi(z \vert x)$ is pushed away from $0$ (top animation) as the model overfits this small dataset (bottom animation).

We cannot simply fix this by adding a large coefficient $\beta$ to $\mathrm{KL}(q_\phi(z \vert x) \Vert p(z))$, even though a coefficient $\beta$ larger than 1 has been shown to improve generalization [9]. This is exactly the term that encourages the model not to use the latent code, so scaling it up makes the uninformative-latent-code problem worse. MMD-VAE, on the other hand, can freely scale up the regularization term $\lambda D(q_\phi(z) \Vert p(z))$ with few negative consequences. In this example we get much better regularized behavior with $\lambda = 500$.

This matters in practice when the dataset is small compared to the difficulty of the task. For example, when trained on MNIST with only 500 examples, the ELBO overfits badly and generates poor samples (top), while InfoVAE generates reasonable (albeit fuzzy) samples (bottom).

# Implementing an MMD Variational Autoencoder

The code for this tutorial can be downloaded here, with both Python and IPython versions available. The code is fairly simple, and we will only explain the main parts below.

To efficiently compute the MMD statistic and exploit GPU parallelism, we use the following code.
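The released code is written in TensorFlow; as a self-contained illustration, the two functions described below can be sketched in NumPy as follows (the function names follow the text; `sigma_sqr` defaults to the $2/dim$ heuristic discussed below, and this should be read as an illustration rather than the exact released code).

```python
import numpy as np

def compute_kernel(x, y, sigma_sqr=None):
    """Pairwise Gaussian kernel between the rows of x and y.

    x: (x_size, dim), y: (y_size, dim) -> (x_size, y_size) matrix whose
    (i, j) entry is exp(-||x_i - y_j||^2 / (2 * sigma_sqr)).
    """
    dim = x.shape[1]
    if sigma_sqr is None:
        sigma_sqr = 2.0 / dim  # bandwidth heuristic; see the text
    sq_dist = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma_sqr))

def compute_mmd(x, y, sigma_sqr=None):
    """Sample estimate of MMD between the rows of x and the rows of y."""
    return (compute_kernel(x, x, sigma_sqr).mean()
            + compute_kernel(y, y, sigma_sqr).mean()
            - 2.0 * compute_kernel(x, y, sigma_sqr).mean())
```

In the TensorFlow version the same pairwise expansion is typically written with `tf.expand_dims`/`tf.tile`, and `.mean()` becomes `tf.reduce_mean`, so the entire statistic is computed in parallel on the GPU.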

The first function `compute_kernel(x, y)` takes as input two matrices of shape $(x\_size, dim)$ and $(y\_size, dim)$, and returns a matrix of shape $(x\_size, y\_size)$ whose element $(i, j)$ is the result of applying the kernel to the $i$-th row of $x$ and the $j$-th row of $y$. Given this matrix, we can compute the MMD statistic according to its definition. The kernel has one hyper-parameter $\sigma^2$ (`sigma_sqr` in the code), which controls the smoothness of the kernel function. We find MMD to be fairly robust to its selection, and $2/dim$ a good choice in every scenario we experimented with.

To match the latent codes generated by the encoder to the prior $p(z)$, we can simply generate samples from $p(z)$ and compute the MMD distance between these samples and the latent codes.

We assume that the distribution Bob produces, $p_\theta(x \vert z)$, is a factorized Gaussian. The mean of the Gaussian, $\mu(z)$, is produced by a neural network, and the variance is simply assumed to be a fixed constant. The negative log-likelihood is thus proportional to the squared distance $\lVert x - \mu(z) \rVert^2$. The total loss is the sum of this negative log-likelihood and the MMD distance.
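Concretely, writing $p_\theta(x \vert z) = \mathcal{N}(x; \mu(z), \sigma_0^2 I)$ for a fixed $\sigma_0$, the negative log-likelihood is

$$-\log p_\theta(x \vert z) = \frac{1}{2\sigma_0^2} \lVert x - \mu(z) \rVert^2 + \mathrm{const},$$

so, absorbing the constant $\frac{1}{2\sigma_0^2}$ into the scaling coefficient $\lambda$, the quantity we minimize per batch is, up to constants,

$$\lVert x - \mu(z) \rVert^2 + \lambda \, \mathrm{MMD}(q_\phi(z) \Vert p(z)).$$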

Training on a Titan X for approximately one minute already gives very sensible samples.

Furthermore, if the latent code has only two dimensions, we can visualize the latent codes produced for different digit labels. We observe good disentangling.

# References

1. Chen, Xi, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. “Variational Lossy Autoencoder.” arXiv preprint arXiv:1611.02731 (2016).

2. Bowman, Samuel R., Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. “Generating sentences from a continuous space.” arXiv preprint arXiv:1511.06349 (2015).

3. Sønderby, Casper Kaae, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. “Ladder variational autoencoders.” In Advances in Neural Information Processing Systems, pp. 3738-3746. 2016.

4. Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

5. Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. “Stochastic backpropagation and approximate inference in deep generative models.” arXiv preprint arXiv:1401.4082 (2014).

6. Larsen, Anders Boesen Lindbo, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. “Autoencoding beyond pixels using a learned similarity metric.” arXiv preprint arXiv:1512.09300 (2015).

7. Dosovitskiy, Alexey, and Thomas Brox. “Generating images with perceptual similarity metrics based on deep networks.” In Advances in Neural Information Processing Systems, pp. 658-666. 2016.

8. Gretton, Arthur, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J. Smola. “A kernel method for the two-sample-problem.” In Advances in neural information processing systems, pp. 513-520. 2007.

9. Higgins, Irina, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. “beta-vae: Learning basic visual concepts with a constrained variational framework.” (2016).