Lab 2: Backpropagation, Computational Graph

In this lab we will get familiar with Pytorch basics:

operations with tensors
computation graph and backpropagation

We will also implement and train a simple neural network using Pytorch tensors. In subsequent labs we will use higher level Pytorch classes and methods, which essentially encapsulate these tensor operations. For simplicity, we continue using the Gaussian mixture model from Lab 1.

Use the provided template.

Part 1. Tensor basics (2p)

Tensors in Pytorch are like multidimensional numpy arrays with all standard operations available and following a very similar syntax. They also support many additional functions needed in NNs. Most importantly, whenever an operation on tensors is performed the resulting tensor also remembers from which operation it was created and what the operands were – this allows to dynamically track the computation graph and perform backpropagation. Also, the data and operations can be carried in CPU or GPU depending on the device attribute of the tensor.

We propose the following simple exercise to get acquainted with tensors. To start with,

Install Pytorch locally on your computer pytorch.org/get-started
In this lab we will need “only” the Tensor class documentation Tensor

Now a small task:

Let's start from the following code snippet defining several scalar tensors

import torch
import numpy as np
 
w = torch.tensor(1)
x = torch.tensor(2.0)
t = torch.tensor(np.float32(3))
b = torch.tensor(4, dtype = torch.float32)

Other ways of constructing tensors are detailed in creation ops.

Check which data type your tensors have by inspecting their dtype attribute. Modify the code so that all tensors would be of the type torch.float32. As a rule, for backpropagation and parameter optimization you would want this data type.

Set w.requires_grad = True (works only for a floating point tensor)
Compute $a= x+ b$, $y = {\rm max}(a w,0)$ and $l = (y-t)^2 + w^2$ using operations on tensors. Result of every operation tracks the history – computation graph but only as long as some tensors which require a gradient computation are involved. Inspect 'grad_fn' attribute of $y$, $l$ and $a$. Draw the DAG of this computation on a paper.
Compute and print derivative of $l$ w.r.t. $y$ by using torch.autograd.grad.
Compute derivative w.r.t all leaf variables using l.backward() and inspect w.grad. Note that the default behaviour is to accumulate gradients rather than populate them with “fresh” ones. They therefore need to be reset manually by e.g. w.grad=None when needed.
With the gradient computed as above, make a gradient descent step: $w = w - 0.1\nabla_w l$ using the .data attribute of the weight tensor $w$ to perform the assignment or using tensor $w$ in the no_grad() context (see torch.no_grad). What would happen if the step was implemented like w = w - 0.1*w.grad and $w$ would be subsequently used in further computations and differentiated?
In the following code
```
w = torch.tensor(1.0, requires_grad=True)
def loss(w):
    x = torch.tensor(2.0)
    b = torch.tensor(3.0)    
    a = x + b
    y = torch.exp(w)
    l = (y-a)**2
    # y/=2
    del y,a,x,b,w
    return l
loss(w).backward()
```
will the gradient in $w$ be computed correctly? Can you explain what is stored in the computation graph and why the 'del' statement is actually redundant? Can you explain why the commented out in-place operation 'y/=2' that modifies $y$ (in contrast to deleting it), leads to an error if uncommented?

Part 2. Gradient and Network Training (5p)

Using pytorch tensors only (not Modules yet) implement a neural network with input_size inputs, one hidden layer with hidden_size units and ReLU activations and, finally, the logistic regression model in the last layer, i.e. a linear transform and the logistic sigmoid function $S$. Formally, the network is specified as $$ p(y{=}1|x; \theta) = S(W_2 {\rm max} (W_1 x + b_1, 0) + b_2), $$ where $\theta$ denotes all parameters, e.g. $\theta = (W_1,b_1,W_2,b_2)$. Use only the Tensor class and its methods. As in the previous lab, $x$ represents a matrix of all data points and has size $[N \times d]$, where $N$ is the number of data points and $d$ is the network input size. Therefore hidden layer output should be of size $[N \times \rm{hidden\_size}]$.

The training loss is the negative log-likelihood (NLL) of the data: $$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i | x_i; \theta), $$ where $N$ is the size of the training set.

Note, because exponents can quickly overflow, in real applications a numerically stable implementation of logarithm of sigmoid function is needed. There are logsigmoid, log_softmax, logsumexp, nll_loss functions available in pytorch. In this lab we will not be concerned with this issue.

Step 1

Initialize $W_1$ and $b_1$ as in the first lab and $W_2$, $b_2$ randomly, e.g. uniformly in $[-1,1]$. Run the forward pass evaluating the loss for the whole training set and run the backward pass to accumulate gradients.

Step 2

We will now check that gradients indeed well approximate the function behaviour when the weights are changed slightly from their original values. Consider varying a parameter vector $w \in \mathbb{R}^n$. The model has several parameter vectors, considering one at a time will help to isolate errors. The following method is explained in detail in the solved Assignment 1 (Gradient checking) in examples.pdf). We will compare the analytically computed derivative and a numerically computed derivative. For this purpose we create a randomized test, checking that directional derivatives in some random direction match. Let $u \in \mathbb{R}^n$, $\|u\| = 1$ be a random direction. It can be generated by generating a vector from ${\rm Uniform}[-1,1]^n$ and normalizing it. Then the directional derivative along $u$ can be computed numerically using the symmetric finite difference: $$ g(\varepsilon) = \frac{\mathcal{L}(w + \varepsilon u) - \mathcal{L}(w - \varepsilon u)}{2 \varepsilon}, $$ where $\varepsilon \ll 1$. The true directional derivative is the limit $$ g = \lim_{\varepsilon \rightarrow 0+} g(\varepsilon). $$ If we have correctly computed the gradient $\nabla_w \mathcal{L}$, there should hold $g = \langle \nabla_w \mathcal{L}, u \rangle$. We therefore expect $$ |g(\varepsilon) - \langle \nabla_w \mathcal{L}, u \rangle| = O(\varepsilon) \tag{1}. $$ The $O(\varepsilon)$ notation means that as $\varepsilon$ approaches zero, the absolute difference on the left hand side is asymptotically $c \varepsilon$ for some constant $c$. The step size $\varepsilon$ should be chosen much smaller than $1$ (order of coordinates of $w$ at initialization) but large enough so that the numerical precision in the finite difference is still sufficient. Try to verify that this limit holds numerically by testing with $\varepsilon=10^{-2}, 10^{-3}, 10^{-4}$.

Report the results of your gradient verification (1) for all model parameters. Example for $\varepsilon=10^{-4}$, 'torch.float32', hdim=500:

Grad error in w1: 0.0007544
Grad error in b1: 0.000235
Grad error in w2: 0.0004883
Grad error in b2: 9.517e-05

Same with $\varepsilon=10^{-5}$ and 'torch.float64' data type:

Grad error in w1: 6.669e-11
Grad error in b1: 5.313e-11
Grad error in w2: 4.548e-11
Grad error in b2: 2.979e-11

Step 3. Network training

Implement gradient descent step with constant step length $\varepsilon$ in all network parameters $\theta$: $$ \theta^{t+1} = \theta^{t} - \varepsilon \nabla_\theta\mathcal{L}(\theta^t). $$

Train the network, by performing several gradient descent steps, following the template. Verify that the training loss improves during the training.

Train a network with hidden_size = 5 on a training set of $200$ points with learning rate 0.1 for 100 epochs (set in the template).
Plot the resulting classifier decision boundary with the help of G2Model.plot_predictor the “interface” Predictor class in the template.
Repeat the experiment increasing the number of hidden units to 10, 100, 500. What we want to observe is that the classifier decision boundary fits the data better but stays smooth despite the abundant amount of parameters.

Draw a test set from the ground truth model. Compute and report the empirical estimate of the test error and the generalization gap (difference between training and test error). Use the Hoeffding inequality (see lecture 2: Example 1) to select the test set size so that the probability of being off in the estimate of the test accuracy by more than 1% is less than 0.01.

Report: test set size, test error rate and classification boundary plots for hidden_size = 5, 100, 500. Example classification boundary for 500 hidden units:

Part 3. Test Error (3p)

In this part we will inspect in more detail the randomness of the test error and use Chebyshev and Hoeffding inequalities to construct confidence intervals. Let us fix the test set size to $m=1000$.

The result of evaluation on a randomly drawn test set of this size is random. Repeat generating the test set and evaluation accuracy of your model 10000 times and record all evaluation results. Plot the histogram of the empirical test error. This gives an idea, how the measured test performance may fluctuate due to the randomness of the test set selection, although it is from the true distribution.

Now suppose we have only a single fixed test set $T$ of size $m$. We want to estimate the confidence intervals on the empirical test error $R_T$ without knowing the true distribution as above. In the Hoeffding inequality, set the confidence level $\alpha=0.95$, i.e. the desired probability that the empirical risk is within $\varepsilon$ from the true risk. Solve for $\varepsilon$ given $\alpha$ and $m$. Include the formula into the report. Print the test error in the format $R_T \pm \varepsilon$. Optionally, graphically display the symmetric confidence interval $[R_T-\varepsilon, R_T+\varepsilon]$ on top of the above histogram.

Derive the confidence interval (find $\varepsilon$) from Chebyshev inequality similarly to the above. Include the formula into the report. Estimate the variance $v$ of test errors as follows. Let $e_i = 1$ if the network makes an error in classifying test point $x_i$ and zero otherwise. The variance $v$ can be estimated as the sample variance of all $e_i$. Print (and optionally display graphically) the confidence interval.

Reference, including more methods:

Table of Contents