Warning

# Lab 1: Backpropagation and Overfitting

In this lab we will implement back-propagation with numpy and learn a small neural network for binary classification of 2D features (for clarity of visualization) drawn from a known Gaussian mixture model. Material of lectures 2,3.

#### Templates

Use the provided template. The package contains:

• tools.py — tools for working with the ground truth model: loading, generating, plotting the decision boundary of a classifier versus the optimal classifier.
• template.py — An example that defines and trains some simple network. This is to provide you an implementation template for modular back-propagation and to illustrate the tools, in particular visualization of a classifier. You are free to follow the pattern or to propose you own.

#### Classification Problem

Let $x$ denote observations in $\mathbb{R}^2$ and $y\in\{0,1\}$ denote the class label. In this lab we have chosen an artificial ground truth model $p(x,y)$, accessible to you through functions of the class G2Model in tools.py. You can draw a training set and a test set (function generate_sample) from this ground truth model and use them for training and validating a neural network. You can also access the optimal Bayesian classification according to the true model (function classify) and compare it to the classification by you neural network.

### Part 1. Modular Back-Propagation (4p)

Following the template pattern, implement in a modular way a neural network with input_size inputs, one hidden layer with hidden_size units and ReLU activations and, finally, the logistic regression model in the last layer, i.e. a linear transform and the logistic sigmoid function $S$. Formally, the network is specified as $$p(y{=}1|x; \theta) = S(W_2 {\rm ReLU} (W_1 x + b_1) + b_2),$$ where $\theta$ denotes all parameters, e.g. $\theta = (W_1,b_1,W_2,b_2)$. Notice that $x$ represents a matrix of all data points and has size $[N \times d]$, where $N$ is the number of data points and $d$ is the network input size. Therefore hidden layer output should be of size $[N \times \rm{hidden\_size}]$. Hint: np.einsum is very intuitive function for handling different multiplications of high-dimensional tensors.

The training loss is the negative log-likelihood (NLL) of the data: $$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i | x_i; \theta),$$ where $N$ is the size of the training set.

When using log conditional likelihood as the learning objective, it is numerically more stable to implement the logarithm of the sigmoid function together. If $a \in \mathbb{R}$ denotes the score, i.e. the output of the last linear layer, we get $$-\log p(y=1 \mid a) = -\log S(a) = \log(1 + e^{-a}) \\ -\log p(y=0 \mid a) = -\log(1 - S(a)) = -\log S(-a) = \log(1 + e^a).$$ Furthermore, it is convenient that the output of the network is considered to be the scores $a$ (which correspond to log odds or, more precisely, in case of the logistic sigmoid model called logits). The classification can be performed by comparing $a$ with $0$ and the training can be performed by evaluating log sigmoid probability for the target class, which is convenient to implement as a single module.

Initialize all network parameters randomly, e.g.~uniformly in $[-1,1]$.

Following the template, implement the computation of the gradient of the loss in all parameters in a modularized form by backpropagation.

Test correctness of your implementation by the following procedure. Consider varying a parameter vector $w \in \mathbb{R}^n$ (the model has several parameter vectors, considering one at a time will help to isolate errors). Keeping all other parameters fixed, compute $$\delta = \frac{\mathcal{L}(w + \varepsilon) - \mathcal{L}(w - \varepsilon)}{2}$$ for some small random $\varepsilon \in \mathbb{R}^n$. This should match the scalar product $\langle \nabla_w L, \varepsilon \rangle$. More precisely, if $\mathcal{L}$ is differentiable in $w$ in some neighborhood of $w$, we expect $$\delta - \langle \nabla_w \mathcal{L}, \varepsilon \rangle = o(\|\varepsilon\|).$$ The step size $\varepsilon$ should be chosen much smaller than $1$ (order of coordinates of $w$ at initialization) but within the numerical precision of the (double) floating point format. Decreasing $\varepsilon$ should result in a lower relative error $$(\delta - \langle \nabla_w \mathcal{L}, \varepsilon \rangle) / \|\varepsilon\|.$$

Report the results of gradient verification. Example for $\varepsilon=10^{-5}$:

Grad in W1 error: -1.1514488413817054e-12
Grad in b2 error: -6.396881400889768e-11 

### Part 2. Network Training (4p)

Implement gradient descent with constant step size to train the network following the template. Verify that the training loss improves during the training. Train a network with hidden_size = 5 on a training set of $200$ points with learning rate 0.1 for 100 epochs (set in the template). Plot the resulting classifier decision boundary with the help of G2Model.plot_predictor. Repeat the experiment increasing the number of hidden units to 10, 100, 500. What we want to observe is that the classifier decision boundary fits the data better but stays smooth despite the abundant amount of parameters.

Draw a test set from the ground truth model. Compute and report the empirical estimate of the test error and the generalization gap. Use the Hoeffding inequality (see lecture 2: Example 1) to select the test set size so that the probability of being off in the estimate of the test accuracy by more than 1% is less than 0.01.

Report: test set size, test error rate and classification boundary plots for hidden_size = 5, 100, 500. Example classification boundary for 500 hidden units:

### Part 3. Double Descent (2pt)

This assignment is independent from the back-propagation assignment. The goal is to observe the double descent phenomenon discussed in lecture 2 on our simulated data. For this will closely follow the experimental setup described by (Belkin 2018, Appendix C.2).

The neural network has the same structure with one hidden layer. However the first layer weights and biases are drawn at random and not learned. Use the following code snippet for their initialization:

W1 = (np.random.rand(hidden_size, input_size) * 2 - 1)
W1 /= np.linalg.norm(W1, axis=1).reshape(hidden_size, 1)
b1 = (np.random.rand(hidden_size) * 2 - 1) * 2
This ensures that hidden units can partition the input domain into random half-spaces that can pass more or less anywhere around the data. In this experiment, we will use the MSE loss as loss function, i.e. $$\mathcal{L}(\theta) = \frac{1}{n}\sum_i \|a_i - y_i\|^2,$$ where $a_i$ is the network score for the data point $x_i$, $y_i$ is the class label encoded as $\{-1,1\}$ (important) and $\theta = (W_2, b_2)$. In our case $W_2$ is a matrix of size $d = {\rm hidden\_size} \times 1$ and thus $\theta$ is a vector of size $d+1$. Optimizing this loss is equivalent to linear regression. Thus, we can avoid time consuming gradient descent optimization by using the closed form solution for linear regression. Recall that the solution can be computed as $$\theta = (X^{\sf T} X + \lambda I)^{-1} X^{\sf T} y,$$ where $X$ is the data matrix and $\lambda=10^{-5}$ is introduced for numerical stability. In our case the rows of the data matrix $X$ are formed by the random first-layer features ${\rm ReLU}(W_1 x_i + b_1)$, augmented with $1$ (so that the score of the model is given by $X \theta$).

Use the training set size of 40 points. Vary the number of hidden units from 1 to 100 (in steps of 2 to 10) and for each $\rm{hidden\_size}$ solve the linear regression problem as discussed above, compute its train and test MSE losses and error rates. Plot the losses and error rates versus $\rm{hidden\_size}$. The result will depend on the random initialization of the first layer. Repeat the experiment for $T=100$ trials and average the loss and error graphs over all trials. Report the averaged plots. Here is an example of averaged plots (1000 trials, training set is kept the same):