In this lab we will get familiar with PyTorch basics:
We will also implement and train a simple neural network using PyTorch tensors. In subsequent labs we will use higher-level PyTorch classes and methods, which essentially encapsulate these tensor operations. For simplicity, we continue using the Gaussian mixture model from Lab 1.
Use the provided template.
Tensors in PyTorch are like multidimensional NumPy arrays: all standard operations are available and follow a very similar syntax. They also support many additional functions needed in neural networks. Most importantly, whenever an operation on tensors is performed, the resulting tensor remembers which operation created it and what the operands were. This makes it possible to track the computation graph dynamically and to perform backpropagation. In addition, the data and operations can reside on the CPU or the GPU, depending on the device attribute of the tensor.
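As a minimal sketch of the graph tracking and the device attribute described above (the variable names and values are illustrative, not part of the lab template), the following can be run in an interactive session:

```python
import torch

# Leaf tensor that should receive a gradient.
w = torch.tensor(1.5, requires_grad=True)
x = torch.tensor(2.0)

# Each operation records how its result was produced.
y = w * x          # result of a multiplication
z = torch.exp(y)   # result of an exponential

print(w.grad_fn)   # None: w is a leaf, not the result of an operation
print(y.grad_fn)   # a node describing the multiplication
print(z.grad_fn)   # a node describing the exponential
print(z.device)    # cpu, unless tensors were created on or moved to a GPU

# Moving to the GPU is optional and only done if one is available.
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_dev = x.to(dev)
```

Tensors that do not depend on any tensor with requires_grad=True have no grad_fn, so nothing is recorded for them.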
We propose the following simple exercise to get acquainted with tensors. To start with,
Now a small task:
import torch
import numpy as np

w = torch.tensor(1)
x = torch.tensor(2.0)
t = torch.tensor(np.float32(3))
b = torch.tensor(4, dtype=torch.float32)
Check which data type your tensors have by inspecting their dtype attribute. Modify the code so that all tensors are of type torch.float32. As a rule, this is the data type you want for backpropagation and parameter optimization.
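The following sketch shows how the dtype is inferred from the value passed to torch.tensor, and two common ways of obtaining torch.float32 from an integer value. The names w_a and w_b are illustrative only.

```python
import numpy as np
import torch

# dtypes inferred from Python / NumPy values.
print(torch.tensor(1).dtype)               # torch.int64
print(torch.tensor(2.0).dtype)             # torch.float32
print(torch.tensor(np.float32(3)).dtype)   # torch.float32

# Two ways to obtain float32 from an integer value.
w_a = torch.tensor(1, dtype=torch.float32)   # request the dtype at creation
w_b = torch.tensor(1).to(torch.float32)      # convert an existing tensor
```

Writing the literal as 1.0 instead of 1 has the same effect under the default settings.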
w.requires_grad = True
l.backward()
w.grad
w.grad=None
.data
w = w - 0.1*w.grad
w = torch.tensor(1.0, requires_grad=True)

def loss(w):
    x = torch.tensor(2.0)
    b = torch.tensor(3.0)
    a = x + b
    y = torch.exp(w)
    l = (y-a)**2
    # y/=2
    del y,a,x,b,w
    return l

loss(w).backward()
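One way to convince yourself that backward() computes the right value for this small function is to compare it with the derivative obtained by hand. The sketch below repeats the function without the del statement and checks the result against the chain rule; the variable expected is introduced here for illustration.

```python
import math
import torch

w = torch.tensor(1.0, requires_grad=True)

def loss(w):
    x = torch.tensor(2.0)
    b = torch.tensor(3.0)
    a = x + b
    y = torch.exp(w)
    return (y - a) ** 2

loss(w).backward()

# Chain rule by hand: dl/dw = 2 * (exp(w) - a) * exp(w), with a = 5 and w = 1.
expected = 2.0 * (math.exp(1.0) - 5.0) * math.exp(1.0)
print(w.grad.item(), expected)
```

The two printed numbers should agree up to float32 rounding.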
Using PyTorch tensors, implement a neural network with input_size inputs, one hidden layer with hidden_size units and ReLU activations, and finally the logistic regression model in the last layer, i.e. a linear transform followed by the logistic sigmoid function $S$. Formally, the network is specified as $$ p(y{=}1|x; \theta) = S(W_2 \max (W_1 x + b_1, 0) + b_2), $$ where $\theta$ denotes all parameters, i.e. $\theta = (W_1,b_1,W_2,b_2)$. Use only the Tensor class and its methods. As in the previous lab, $x$ represents a matrix of all data points and has size $[N \times d]$, where $N$ is the number of data points and $d$ is the network input size. Therefore the hidden layer output should be of size $[N \times \mathrm{hidden\_size}]$.
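A minimal sketch of such a forward pass is given below. It is not the template's code: the sizes, the random stand-in data, the random initialization of $W_1$ and the choice to store $W_1$ with shape [input_size, hidden_size] (so that rows of $x$ are data points) are all assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
N, input_size, hidden_size = 8, 2, 5    # illustrative sizes

x = torch.randn(N, input_size)           # stand-in for real data

# Parameters laid out so that x @ W1 has shape [N, hidden_size].
W1 = torch.randn(input_size, hidden_size, requires_grad=True)
b1 = torch.zeros(hidden_size, requires_grad=True)
W2 = (2 * torch.rand(hidden_size) - 1).requires_grad_()   # uniform in [-1, 1]
b2 = (2 * torch.rand(1) - 1).requires_grad_()

def forward(x):
    h = torch.clamp(x @ W1 + b1, min=0)   # ReLU, shape [N, hidden_size]
    s = h @ W2 + b2                       # scores, shape [N]
    return torch.sigmoid(s)               # p(y=1 | x) for each data point

p = forward(x)
print(p.shape)
```

The bias b1 is broadcast over the N rows, which is why no explicit loop over data points is needed.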
The training loss is the negative log-likelihood (NLL) of the data: $$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i | x_i; \theta), $$ where $N$ is the size of the training set.
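The formula above can be sketched as follows, assuming that labels are encoded as 0 and 1 and that p holds the predicted probabilities $p(y{=}1|x_i;\theta)$. The label encoding used in the template may differ, and the function name nll is chosen here for illustration.

```python
import math
import torch

def nll(p, y):
    """Mean negative log-likelihood; p and y both have shape [N], y in {0, 1}."""
    # Probability the model assigns to the observed label of each point.
    p_obs = torch.where(y == 1, p, 1 - p)
    return -torch.log(p_obs).mean()

p = torch.tensor([0.9, 0.2, 0.5])
y = torch.tensor([1, 0, 1])
value = nll(p, y)
print(value.item())
```

For the three points above, the observed labels have probabilities 0.9, 0.8 and 0.5, so the result is the mean of their negative logarithms.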
Note that because exponentials can quickly overflow, real applications need a numerically stable implementation of the logarithm of the sigmoid function, as provided by e.g. the log_softmax, log_sum_exp and nll_loss functions. In this lab we will not be concerned with this issue.
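To illustrate the issue (this is not required for the lab), the sketch below compares a naive logarithm of the sigmoid with torch.nn.functional.logsigmoid for scores of large magnitude. The score values are chosen arbitrarily.

```python
import torch
import torch.nn.functional as F

s = torch.tensor([-200.0, 0.0, 200.0])

# Naive version: sigmoid(-200) underflows to 0 in float32, so its log is -inf.
naive = torch.log(torch.sigmoid(s))

# Stable version: evaluates log(sigmoid(s)) without forming sigmoid(s) first.
stable = F.logsigmoid(s)

print(naive)
print(stable)
```

An infinite loss value would make the gradients unusable, which is why library functions work in the log domain directly.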
Initialize $W_1$ and $b_1$ as in the first lab and $W_2$, $b_2$ randomly, e.g. uniformly in $[-1,1]$. Run the forward pass evaluating the loss for the whole training set and run the backward pass to accumulate gradients.
We will now check that the gradients indeed approximate the behaviour of the function well when the weights are changed slightly from their original values. Consider varying a parameter vector $w \in \mathbb{R}^n$. The model has several parameter vectors; considering one at a time helps to isolate errors. The following method is explained in detail in the solved Assignment 1 (Gradient checking) in examples.pdf. Keeping all other parameters fixed, compute the symmetrized finite difference: $$ \Delta \mathcal{L} = \frac{\mathcal{L}(w + \Delta w) - \mathcal{L}(w - \Delta w)}{2} $$ for some small $\Delta w \in \mathbb{R}^n$, which may be chosen at random for the purpose of the test. More specifically, let $u$ be uniform in $[-1,1]^n$ and $\Delta w = \varepsilon\, u/\|u\|$, where $\varepsilon$ is the step length we choose for this test (see below). If the gradient approximates the function well in a small neighborhood, we expect the function increment $\Delta \mathcal{L}$ to match the scalar product $\langle \nabla_w \mathcal{L}, \Delta w \rangle$. More precisely, if $\mathcal{L}$ is differentiable in $w$, then locally we expect $$ \Delta \mathcal{L} - \langle \nabla_w \mathcal{L}, \Delta w \rangle = o(\|\Delta w\|) = o(\varepsilon). $$ The step size $\varepsilon$ should be chosen much smaller than $1$ (the order of the coordinates of $w$ at initialization) but still within the numerical precision of the (double) floating point format. The $o$ notation means that the ratio $$ (\Delta \mathcal{L} - \langle \nabla_w \mathcal{L}, \Delta w \rangle) / \varepsilon $$ should approach zero as $\varepsilon \rightarrow 0$. With a very small $\varepsilon$ we may run into numerical problems. Nevertheless, try to verify numerically that this limit holds.
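The procedure above can be sketched on a toy problem as follows. The function loss_fn is a hypothetical smooth stand-in for the network loss, and grad_check is a name introduced here; in the lab the same steps would be applied to each of the network's parameter tensors in turn. Double precision is used so that small step sizes remain meaningful.

```python
import torch

torch.manual_seed(0)

def loss_fn(w):
    # Hypothetical smooth scalar loss standing in for the network's NLL.
    return torch.log(1 + torch.exp(w)).sum() + (w ** 2).sum()

w = torch.randn(10, dtype=torch.float64, requires_grad=True)
loss_fn(w).backward()
grad = w.grad.clone()

def grad_check(eps):
    u = 2 * torch.rand(10, dtype=torch.float64) - 1   # uniform in [-1, 1]^n
    dw = eps * u / u.norm()                            # step of length eps
    with torch.no_grad():
        # Symmetrized finite difference of the loss.
        delta = (loss_fn(w + dw) - loss_fn(w - dw)) / 2
    # Ratio that should tend to zero as eps decreases.
    return ((delta - grad @ dw) / eps).item()

for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    print(eps, grad_check(eps))
```

If the gradient were wrong, the printed ratio would stay at a roughly constant nonzero value instead of shrinking with eps.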
Report the results of your gradient verification for all model parameters. Example for $\varepsilon=10^{-5}$:
Grad in W1 error: -1.1514488413817054e-12
Grad in b1 error: 5.97456285573448e-12
Grad in W2 error: 3.6022765019577574e-10
Grad in b2 error: -6.396881400889768e-11
Implement a gradient descent step with constant step length $\varepsilon$: $\theta = \theta - \varepsilon \nabla_\theta\mathcal{L}$ (in all network parameters).
Train the network by performing several gradient descent steps, following the template. Verify that the training loss improves during training.
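One possible way to write such a step with plain tensors is sketched below. The parameter list, the quadratic stand-in loss and the names gd_step and lr are assumptions for illustration; the template may organize the parameters differently.

```python
import torch

torch.manual_seed(0)

# Hypothetical parameter list standing in for (W1, b1, W2, b2).
params = [torch.randn(3, requires_grad=True), torch.randn(1, requires_grad=True)]

def loss_fn():
    # Simple quadratic stand-in for the training loss.
    return sum((p ** 2).sum() for p in params)

def gd_step(params, lr):
    with torch.no_grad():        # the update itself must not be tracked
        for p in params:
            p -= lr * p.grad     # in-place, so p remains a leaf tensor
            p.grad = None        # clear so the next backward starts fresh

losses = []
for _ in range(20):
    l = loss_fn()
    l.backward()
    gd_step(params, lr=0.1)
    losses.append(l.item())

print(losses[0], losses[-1])
```

The in-place update under torch.no_grad() is a design choice: an out-of-place assignment would create a new non-leaf tensor that carries the history of the previous step.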
G2Model.plot_predictor
Draw a test set from the ground truth model. Compute and report the empirical estimate of the test error and the generalization gap (difference between training and test error). Use the Hoeffding inequality (see lecture 2: Example 1) to select the test set size so that the probability of being off in the estimate of the test accuracy by more than 1% is less than 0.01.
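As a sketch of the test set size calculation, assume the two-sided Hoeffding bound for a 0/1 loss, $P(|\hat{R} - R| > \epsilon) \le 2\exp(-2N\epsilon^2)$. The exact form used in the lecture may differ, so check it against the lecture notes. The function name hoeffding_test_size is introduced here for illustration.

```python
import math

def hoeffding_test_size(eps, delta):
    """Smallest N such that 2 * exp(-2 * N * eps**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Deviation of at most 1% with failure probability at most 0.01.
n_test = hoeffding_test_size(eps=0.01, delta=0.01)
print(n_test)
```

Solving the bound for $N$ gives $N \ge \ln(2/\delta) / (2\epsilon^2)$, which is what the function evaluates.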
Report: test set size, test error rate and classification boundary plots for hidden_size = 5, 100, 500. Example classification boundary for 500 hidden units: