Quick links: [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B232/public/html/predmety/61/70/p6170206.html | Schedule]] | [[https://cw.felk.cvut.cz/forum/forum-1874.html|Forum]] | [[https://cw.felk.cvut.cz/brute/teacher/course/1595| BRUTE]] | [[https://cw.fel.cvut.cz/b232/courses/bev033dle/lectures | Lectures]] | [[https://cw.fel.cvut.cz/b232/courses/bev033dle/labs/start | Labs]]

====== Lab 7: Gaussian Variational Autoencoders ======

==== Introduction ====

In this lab we will consider vanilla Gaussian VAEs (see lecture 12) and train them to generate MNIST images. The goal is to analyse whether the generative ability of VAEs increases with the complexity of the networks used for encoding and decoding. The baseline VAE will have both the encoder and the decoder implemented by networks with a single fully connected layer only (i.e. without hidden layers). The extended variant will have the encoder and the decoder implemented as multilayer FFNs. The latent representation space will be the same for both variants.

We recommend the paper "Tutorial on Variational Autoencoders" by C. Doersch, [[https://arxiv.org/abs/1606.05908|arXiv:1606.05908]], for additional reading.

==== Model ====

  - The space of MNIST images is $\mathcal{X} = \mathbb{R}^{28\times 28}$. The latent space is denoted as $\mathcal{Z} = \mathbb{R}^m$.
  - The decoder $d_\theta(z)$ maps $z \mapsto \mu_\theta(z) \in \mathcal{X}$ and the related probability distribution $p_\theta(x | z)$ is $\mathcal{N}(\mu_\theta(z), \sigma^2\mathbb{I})$, where we assume that the scalar $\sigma$ is either fixed or a trainable parameter.
  - The encoder $e_\varphi(x)$ maps $x \mapsto (\mu_\varphi(x), \sigma_\varphi(x)) \in (\mathcal{Z}, \mathcal{Z})$ and the related probability distribution $q_\varphi(z | x)$ is $\mathcal{N}\bigl(\mu_\varphi(x), \mathrm{diag}(\sigma_\varphi^2(x))\bigr)$.

==== Assignment 1 (4p) ====

1. Implement the FFN encoder and decoder as PyTorch ''nn.Module'' containers, e.g. the baseline encoder like so:

<code python>
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, zdim):
        super().__init__()
        self.zdim = zdim
        # a single fully connected layer producing 2*zdim scores:
        # zdim means and zdim log-sigmas
        self.net = nn.Sequential()
        self.net.append(nn.Linear(784, self.zdim * 2))

    def forward(self, x):
        scores = self.net(x)
        mu, sigma = torch.split(scores, self.zdim, dim=1)
        # the network predicts log-sigma; exponentiate to keep sigma positive
        sigma = torch.exp(sigma)
        return mu, sigma
</code>

Similarly, the baseline decoder like so:

<code python>
class Decoder(nn.Module):
    def __init__(self, zdim):
        super().__init__()
        self.zdim = zdim
        self.net = nn.Sequential()
        self.net.append(nn.Linear(self.zdim, 784))
        # if you learn the sigma of the decoder
        self.logsigma = torch.nn.Parameter(torch.ones(1))

    def forward(self, z):
        mu = self.net(z)
        return mu
</code>

2. Implement the learning step for the VAE. Thanks to the PyTorch developer community, this is pretty easy if you use ''torch.distributions''.
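If ''torch.distributions'' is new to you, here is a minimal sketch (with illustrative toy shapes, not part of the assignment code) of the two primitives the learning step relies on: reparametrised sampling via ''rsample()'' and the analytic KL-divergence between two Gaussians.

<code python>
import torch
import torch.distributions as dstr

# q(z|x) for a toy batch of 4 images and a 2-dimensional latent space
z_mu = torch.zeros(4, 2)
z_sigma = torch.ones(4, 2)
qz = dstr.Normal(z_mu, z_sigma)

# rsample() draws z = mu + sigma * eps with eps ~ N(0, I), so the sample
# stays differentiable w.r.t. mu and sigma (re-parametrisation trick)
z = qz.rsample()

# standard normal prior p(z) and the analytic KL-divergence,
# computed per image and per latent component, shape (4, 2)
pz = dstr.Normal(torch.zeros_like(z_mu), torch.ones_like(z_mu))
kl = dstr.kl_divergence(qz, pz)
</code>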
Below we show the code for a VAE module (you may use it if you like):

<code python>
import torch
import torch.nn as nn
import torch.distributions as dstr

class VAE(nn.Module):
    def __init__(self, zdim, stepsize):
        super().__init__()
        self.decoder = Decoder(zdim)
        self.encoder = Encoder(zdim)
        self.optimizer = torch.optim.Adam(self.parameters(), lr=stepsize)

    def learn_step(self, x):
        self.optimizer.zero_grad()
        # apply encoder q(z|x)
        z_mu, z_sigma = self.encoder(x)
        qz = dstr.Normal(z_mu, z_sigma)
        # sample with re-parametrisation
        z = qz.rsample()
        # apply decoder p(x|z)
        x_mu = self.decoder(z)
        px = dstr.Normal(x_mu, torch.exp(self.decoder.logsigma))
        # prior p(z)
        pz = dstr.Normal(torch.zeros_like(z_mu), torch.ones_like(z_mu))
        # expected log-likelihood term, averaged over the batch
        logx = px.log_prob(x)
        logx = logx.mean(0).sum()
        # KL-divergence term
        kl_div = dstr.kl_divergence(qz, pz).mean(0).sum()
        # negative ELBO
        nelbo = kl_div - logx
        nelbo.backward()
        self.optimizer.step()
        return nelbo.detach()
</code>

==== Assignment 2 (2p) ====

Choose a reasonable dimension $m$ of the latent space $\mathcal{Z} = \mathbb{R}^m$. Train the baseline VAE and the deeper VAE on MNIST data. Recall that the dimension of the latent space should be the same for both models. For each of the models report the following:

  * the number of its parameters. You can get it like so:
<code python>
sum(p.numel() for p in vae.parameters() if p.requires_grad)
</code>
  * the learning curves for the ELBO,
  * a tableau of reconstructed images obtained as follows. Let $x$ be a small batch of MNIST images (say 16 images). Apply the encoder $(\mu_\varphi(x), \sigma_\varphi(x)) = e_\varphi(x)$, sample $z\sim \mathcal{N}\bigl(\mu_\varphi(x), \mathrm{diag}(\sigma_\varphi^2(x))\bigr)$ and decode $\mu = d_\theta(z)$. Show the original images $x$ along with the reconstructed images $\mu$.

==== Assignment 3 (4p) ====

The goal of this assignment is to compare the performance of the two models. Unfortunately, it is not possible to quantify the performance of generative models like VAEs in terms of training data log-likelihood, because its estimation is not tractable. The paper [[https://arxiv.org/abs/1802.03446|arXiv:1802.03446]] lists and discusses 24 different surrogate metrics. Here, instead, we will analyse the trained VAEs quantitatively and qualitatively.

  * **ELBO:** Compare the achieved ELBO values for the two models.
  * **Posterior collapse:** Consider a batch of MNIST images (e.g. 256 images). Apply the encoder $(\mu(x), \sigma(x)) = e_\varphi(x)$. Compute the KL-divergence between $\mathcal{N}\bigl(\mu_i(x), \sigma_i^2(x)\bigr)$ and the prior distribution $\mathcal{N}(0, 1)$ for each latent component and each image in the batch. Average them over the batch. Report the histogram of these averaged KL-divergences. How many latent components are collapsed? Compare this for both models (see the first sketch after this list).
  * **Evaluating the decoder:** Consider a small batch (e.g. 64) of randomly generated latent codes $z\sim\mathcal{N}(0,\mathbb{I})$. Apply the decoder to them, i.e. $\mu = d_\theta(z)$, and report the images $\mu$ in a tableau. Compare them for both models.
  * **Limiting distribution:** Start from a small batch (e.g. 64) of latent codes $z\sim\mathcal{N}(0,\mathbb{I})$ as above. Now apply the chain ''decode'' $\rightarrow$ ''sample'' $\rightarrow$ ''encode'' $\rightarrow$ ''sample'' to this batch, say, 100 times. Record the intermediate $\mu_t$ from the decoder and produce a video that shows the evolution in an 8x8 array of images (see the second sketch after this list).
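For the posterior-collapse analysis, a minimal sketch follows. It assumes a trained model ''vae'' (as defined above) and a batch ''x'' of 256 MNIST images flattened to shape ''(256, 784)''; the variable names are illustrative, not prescribed.

<code python>
import torch
import torch.distributions as dstr
import matplotlib.pyplot as plt

# x: a batch of 256 MNIST images flattened to shape (256, 784)
with torch.no_grad():
    z_mu, z_sigma = vae.encoder(x)
    qz = dstr.Normal(z_mu, z_sigma)
    pz = dstr.Normal(torch.zeros_like(z_mu), torch.ones_like(z_mu))
    # KL per image and per latent component, shape (256, zdim),
    # averaged over the batch -> one value per latent component
    kl_per_dim = dstr.kl_divergence(qz, pz).mean(0)

# components whose averaged KL is near zero carry no information
# about x, i.e. they have collapsed to the prior
plt.hist(kl_per_dim.numpy(), bins=50)
plt.xlabel('batch-averaged KL(q(z_i|x) || p(z_i))')
plt.ylabel('number of latent components')
plt.show()
</code>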
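And for the limiting-distribution experiment, one possible sketch of the ''decode'' $\rightarrow$ ''sample'' $\rightarrow$ ''encode'' $\rightarrow$ ''sample'' chain. It assumes ''torchvision'' and ''imageio'' are available and writes the 8x8 tableaux into an animated GIF; any other way of producing the video is equally fine.

<code python>
import torch
import torch.distributions as dstr
import torchvision
import imageio

frames = []
with torch.no_grad():
    # start from latent codes z ~ N(0, I)
    z = torch.randn(64, vae.encoder.zdim)
    for t in range(100):
        # decode and record the means mu_t as an 8x8 tableau
        x_mu = vae.decoder(z)
        grid = torchvision.utils.make_grid(
            x_mu.view(64, 1, 28, 28).clamp(0, 1), nrow=8)
        frames.append((grid.permute(1, 2, 0).numpy() * 255).astype('uint8'))
        # sample x ~ p(x|z), then encode and sample z ~ q(z|x)
        x = dstr.Normal(x_mu, torch.exp(vae.decoder.logsigma)).sample()
        z_mu, z_sigma = vae.encoder(x)
        z = dstr.Normal(z_mu, z_sigma).sample()

imageio.mimsave('limiting_distribution.gif', frames)
</code>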