====== Lab 5: VAE ======

Gaussian Variational Autoencoders

==== Introduction ====

In this lab we will consider vanilla Gaussian VAEs (see lecture 11) and train them to generate MNIST images. The goal is to analyse whether the generative ability of VAEs increases with the complexity of the networks used for encoding and decoding. The baseline VAE will have both the encoder and decoder implemented by fully connected networks with one layer only (i.e. without hidden layers). The extended variant will have the encoder and decoder implemented as multilayer CNNs. The latent noise space will be the same for both variants.

We recommend the paper “Tutorial on Variational Autoencoders” by C. Doersch [[https://arxiv.org/abs/1606.05908|arXiv:1606.05908]] for additional reading.

==== Model ====

1. The space of MNIST images is $\mathcal{X} = \mathbb{R}^{28\times 28}$. The latent space is denoted by $\mathcal{Z} = \mathbb{R}^m$.

2. The decoder $d_\theta(z)$ maps $z \mapsto \mu_\theta(z) \in \mathcal{X}$ and the related probability distribution is $p_\theta(x | z) \colon\: \mathcal{N}(\mu_\theta(z), \sigma^2\mathbb{I})$, where we assume that the scalar $\sigma$ is fixed.

3. The encoder $e_\varphi(x)$ maps $x \mapsto \bigl(\mu_\varphi(x), \sigma_\varphi(x)\bigr) \in \mathcal{Z} \times \mathcal{Z}$ and the related probability distribution is $q_\varphi(z | x) \colon\: \mathcal{N}\bigl(\mu_\varphi(x), \mathrm{diag}(\sigma_\varphi^2(x))\bigr)$.

==== Assignment 1 (5p) ====

1. Implement the baseline and CNN encoders and decoders as PyTorch ''nn.Module'' containers, e.g. the baseline encoder like so

<code python>
class Encoder(nn.Module):
    def __init__(self, z_channels):
        super(Encoder, self).__init__()
        self.z_channels = z_channels
        # construct the body
        body_list = []
        cvl = nn.Conv2d(1, self.z_channels * 2, kernel_size=(28, 28))
        body_list.append(cvl)
        self.body = nn.Sequential(*body_list)

    def forward(self, x):
        scores = self.body(x)
        # split the output channels into means and standard deviations
        mu, sigma = torch.split(scores, self.z_channels, dim=1)
        # keep the standard deviations strictly positive
        sigma = torch.exp(sigma) + 0.001
        return mu, sigma
</code>

Similarly, the baseline decoder like so

<code python>
class Decoder(nn.Module):
    def __init__(self, z_channels):
        super(Decoder, self).__init__()
        # construct the body
        body_list = []
        tcvl = nn.ConvTranspose2d(z_channels, 1, kernel_size=(28, 28))
        body_list.append(tcvl)
        self.body = nn.Sequential(*body_list)

    def forward(self, x):
        mu = self.body(x)
        return mu
</code>

2. Implement the learning step for the VAE. Thanks to the PyTorch developer community, this is pretty easy if you use ''torch.distributions''. Below we show useful code snippets

<code python>
# initialise a tensor of normal distributions from tensors z_mu, z_sigma
qz = torch.distributions.Normal(z_mu, z_sigma)
# compute log-probabilities for a tensor z
logz = qz.log_prob(z)
# sample from the distributions with re-parametrisation
zs = qz.rsample()
# compute KL divergence for two tensors with probability distributions
kl_div = torch.distributions.kl_divergence(qz, pz)
</code>

You may use the Adam optimiser for the gradient descent like so

<code python>
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=stepsize)
</code>
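For orientation, here is a minimal sketch of how one training step can combine these snippets to maximise the ELBO $\mathbb{E}_{q_\varphi(z|x)}\log p_\theta(x|z) - D_{KL}\bigl(q_\varphi(z|x)\,\|\,p(z)\bigr)$ by minimising its negative. It is not a required template; the names ''x'' (a batch of images of shape ''(B, 1, 28, 28)'') and ''sigma_x'' (the fixed decoder scale $\sigma$) are our assumptions.

<code python>
# A minimal sketch of one VAE training step; variable names and shapes are assumptions.
def training_step(x, sigma_x=0.1):
    # encode: q(z|x) = N(mu(x), diag(sigma^2(x)))
    z_mu, z_sigma = encoder(x)
    qz = torch.distributions.Normal(z_mu, z_sigma)
    # standard normal prior p(z), broadcast to the same shape as q(z|x)
    pz = torch.distributions.Normal(torch.zeros_like(z_mu), torch.ones_like(z_mu))
    # re-parametrised sample z ~ q(z|x)
    zs = qz.rsample()
    # decode: p(x|z) = N(mu(z), sigma^2 I) with the fixed scalar sigma
    px = torch.distributions.Normal(decoder(zs), sigma_x)

    # per-sample reconstruction log-likelihood and KL divergence
    log_px = px.log_prob(x).sum(dim=(1, 2, 3))
    kl = torch.distributions.kl_divergence(qz, pz)
    # note: averaging kl over the batch (dim 0) instead of summing over components
    # gives the componentwise KL values needed for the histogram in Assignment 2
    elbo = (log_px - kl.sum(dim=(1, 2, 3))).mean()

    optimizer.zero_grad()
    (-elbo).backward()
    optimizer.step()
    return elbo.item()
</code>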
==== Assignment 2 (3p) ====

Train the baseline VAE and the CNN-VAE on MNIST data. For each of the models report the following:

  * the number of its parameters. You can get it like so ''sum(p.numel() for p in encoder.parameters() if p.requires_grad)'',
  * the learning curves for the ELBO and for the expected KL divergence $\mathbb{E}_{\mathcal{T}} D_{KL}\bigl(q_{\varphi}(z|x) \,\|\, p(z)\bigr)$,
  * the histogram of the componentwise KL divergences $\mathbb{E}_{\mathcal{T}} D_{KL}\bigl(q_{\varphi}(z_i|x) \,\|\, p(z_i)\bigr)$, $i=1,\ldots,m$, for the last training epoch. This allows you to see how many latent components collapsed to the prior distribution $p(z)$,
  * a tableau of images generated by the trained VAE decoder.

==== Assignment 3 (2p) ====

The goal of this assignment is to compare the performance of the two models. Unfortunately, it is not possible to quantify the performance of generative models like VAEs in terms of training data log-likelihood, because its estimation is not tractable. The paper ''arXiv:1802.03446'' lists and discusses 24 different surrogate metrics. This in fact shows that we do not know how to measure the performance of deep generative models in a meaningful and tractable way.

One of the commonly adopted measures is the Fréchet inception distance (FID). An implementation is available and can be easily installed by

<code bash>
pip install pytorch-fid
</code>

See the ''pytorch-fid'' project page for more details.

Compute and report the FID between MNIST test data and images generated by the decoders of your trained VAEs. For convenience, we provide a tarball with the MNIST test set as an image folder {{:courses:bev033dle:labs:lab5_vae:mnist_test.tgz| mnist-test}}.
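A possible workflow, sketched below under our own assumptions: dump a few thousand decoder samples as PNG files into a folder and run the ''pytorch-fid'' command-line tool on that folder and the unpacked test set. The folder names ''vae_samples'' and ''mnist_test'' and the sample count of 2000 are arbitrary choices, not part of the assignment.

<code python>
# A sketch for dumping decoder samples as PNG files; folder names and sample count are assumptions.
import os
import torch
import torchvision

os.makedirs('vae_samples', exist_ok=True)
decoder.eval()
with torch.no_grad():
    # sample latent codes from the prior p(z) = N(0, I) and decode them
    z = torch.randn(2000, z_channels, 1, 1)
    for i, img in enumerate(decoder(z)):
        torchvision.utils.save_image(img, f'vae_samples/{i:05d}.png')

# then compare the two image folders from the command line, e.g.
#   python -m pytorch_fid mnist_test vae_samples
</code>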