Lab 1: Double Descent + MLP

This lab has an introduction part, explaining how the study process is organized and what tools you need to work with; and a light homework part.

Preparations:

Labs and seminars format, grading, deadlines
Upload system BRUTE
FORUM for asking Teachers anything related to the course and assignments (so that everyone can see).
How To Page

Lab: toy 2D problem with random lifting and linear regression

useful inductive bias of GD for linear regression
observe double descend phenomenon
train an MLP on the same data, to see if it overfits
Measure the test performance

Part 1. Double Descent

We will work with a simulated data for binary classification of 2D features drawn from a known Gaussian mixture model. Such toy data allows to visualize all the classifiers and is already sufficient to observe several interesting phenomena. Furthermore, the synthetic setup allows to evaluate the test error and its distribution for a small test set size.

Templates

Use the provided template.

Classification Problem

Let $x$ denote observations in $\mathbb{R}^2$ and $y\in\{-1,1\}$ denote the class label. In this lab we have chosen an artificial ground truth model $p(x,y)$, accessible to you through functions of the class G2Model. You can generate a training set and a test set yourself using the function generate_sample, which samples from this ground truth model. You can also access the optimal Bayesian classification according to the true model (function classify) and compare it to the classification by you neural network.

The next plot show an example of 40 data points, the GT decision boundary and a linear classifier boundary, drawn with plot_boundary(train_data, net):

Task

The goal of this lab is to observe a phenomenon which is somewhat counterintuitive and contradicting to the classical learning theory. The phenomenon is called double descent and is described in the paper: Belkin et al. 2018. Reconciling modern machine-learning practice and the classical bias–variance trade-off. We will replicate the experiment in a simple scenario described in Appendix C.2. of the paper. The classical learning theory (in particular structural risk minimization) suggests that when increasing the model capacity, we will be eventually overfitting to the data and generalize poorly. Therefore there should be a tradeoff between the training error and the model complexity. This experiment gives one example when it is not exactly so…

We will consider a neural network with one hidden layer: \begin{align} &\phi = {\rm tanh}(W_0 x + b_0) \tag{lifting}\\ & s = w^{\sf T} \phi + b \tag{linear}, %& p(y=1|x) = {\rm sigmoid}(s) \end{align} The first layer, computing $\phi$ will be initialized randomly and will not be trained. It will represent a fixed randomized lifting of the initial 2D features into a higher-dimensional space. Let us denote input dimension as $d=2$ and hidden dimension as $D$. This lifting is provided in the `Lifting` class. Basically, it is initialized randomly but in a range relevant for the range of the dataset (a lifted feature which is constant for all data points is not useful). Due to the use of $\tanh$, each feature $\phi_k$ can be imagined as a prediction of a random linear classifier.

We want to train the second layer, i.e. the linear (affine) transform defined by $w, b$.

Let $(x_i, y_i)_{i=1}^N$ be our training data, where the class label is encoded as $\{-1,1\}$. Let $\theta = (w, b)$ be the vector of trainable parameters, of the size $D + 1$. We will use the MSE loss as the training loss function, i.e. \begin{align} \mathcal{L}(\theta) = \frac{1}{n}\sum_i \|s_i - y_i\|^2, \end{align}

This is a simplification, treating the problem as a regression in the lifted space. The choice of fixed random features and MSE loss allows us to solve for the optimal parameters in a closed form instead of performing gradient descent.

Let $\Phi$ be the data matrix of size $N \times (D + 1)$ with columns $(\phi(x_i), 1)$, i.e. our features augmented with constant $1$. Let $y$ be the vector of labels. With this notation we can write the loss as $$ \mathcal{L}(\theta) = \frac{1}{n} \| \Phi\theta - y\|^2 = \frac{1}{n} (\Phi\theta - y)^{\sf T}(\Phi\theta - y) = \frac{1}{n}\Big( \theta^{\sf T}\Phi^{\sf T} \Phi \theta - 2\theta^{\sf T}\Phi^{\sf T} y + y^{\sf T} y\Big). $$ Note that $\Phi^{\sf T} \Phi$ is a $(D+1) \times (D+1)$ matrix. The condicions for the critical point are: $$ \Phi^{\sf T} \Phi \theta = \Phi^{\sf T} y. $$ If we have at least $D$ training data points in a general position, the matrix $\Phi^{\sf T} \Phi$ is invertible and we can find the unique minimum as $$ \theta = (\Phi^{\sf T} \Phi)^{-1} \Phi^{\sf T} y. $$ If we have less data points, there is a linear subspace of optimal solutions. The solution with the smallest norm is given by \begin{align} \theta = (\Phi^{\sf T} \Phi)^{\dagger} \Phi^{\sf T} y, \end{align} where $A^\dagger$ is the pseudo-inverse of $A$. Important for us is that gradient descent started at $\theta=0$ converges exactly to this pseudo-inverse solution (see the analysis in the recommended reading). This is an implicit property of GD dynamics, which turns out to be a good inductive bias.

If we add a regularization to the linear regression problem in the form of additional penalty $\lambda \| \theta \|^2$, the optimal solution will be always unique and given by \begin{align} \theta = (\Phi^{\sf T} \Phi + \lambda I)^{-1} \Phi^{\sf T} y. \end{align} This is explicit regularization towards a small norm solution. If $\lambda$ is chosen small, e.g., $10^{-6}$, we expect the explicitly and implicitly regularized solutions to be similar.

Task 1 (2p)

Use the training set size of 40 points and hidden size $D=10$. Find the optimal setting of parameters $w, b$ by linear regression. Plot the classifier decision boundary. Report its training and test error. Try several values of $D$ observing how the boundary complexity increases towards 40 and then gets smoother and smoother in the highly overparameterized mode (try $D=1000$).

The expected result for D=1000 with either pseudo-inverse or regularized with $\lambda=10^{-6}$ should be a smooth decision boundary around the Bayesian optimal solution:

Test error 4.25%. You may obtain a different decision boundary and different test error depending on the random initialization of the lifting. The displayed result correspond to the random draw according to the following order of generation:

np.random.seed(seed=1)
train_data = model.generate_sample(40)
test_data = model.generate_sample(50000)
net = MyNet(2, 1000)

The next task is to reproduce the following experiment. We fix the hidden dimension size $D=40$ and vary the training data size as $N=2, 4, \dots 100$ (we use stratified sampling, so it must be even). The observed test error (measured on independently drawn large enough test set) behaves as this:

There is a pathological behaviour from 10 to 40 samples: the amount of training data increases, but it leads to worse results. Only after the threshold of 40 more data starts to help.

In this experiment, we averaged over 100 trials for each $N$. In each trial, the lifting of size $D$ is constructed at random and the training data is drawn at random. The plot therefore shows an overall statistical tendency.

Task 2 (3p)

In the next experiment to reproduce, we fix the training data size $N=40$ and vary the hidden dimension $D = 1,10,20, \dots 200$. The observed dependence is as follows:

Again, with hidden dimensions below 40, we observe rather a classical picture of generalization: if the model capacity is too high we start to overfit and performance worsens. However, after the threshold of about 50, the test error starts decreasing again and eventually achieves even a lower value in the highly-overparameterized mode.

In this experiment, we again average over 100 trials, where in each trial the training set is sampled anew and the model of each studied hidden size is initialized anew.

Part 2. MLP

In this part we will train MLP on the same data.

Task 1. Model

Using PyTorch tensors, implement a neural network with input_size inputs, one hidden layer with hidden_size units and tanh/ReLU activations and, finally, the logistic regression model in the last layer, i.e. a linear transform and the logistic sigmoid function. Formally, the network is specified as \begin{align*} &\phi(x) = \tanh(W_0 x + b_0)\\ & s(x) = w^{\sf T} \phi(x) + b\\ & p(y{=}1|x; \theta) = \sigma( s(x) ), \end{align*} where $\sigma$ is the logistic sigmoid function and $\theta$ denotes all parameters, i.e. $\theta = (W_0,b_0,w,b)$. Assume that $x$ represents a matrix of all data points and has size $[N \times \rm{input\_size}]$, where $N$ is the number of data points. The hidden layer output should be of size $[N \times \rm{hidden\_size}]$.

Given the training set $(x_i,y_i)_{i=1}^{N}$, where $y_i\in\{-1,1\}$, the training loss is the negative log-likelihood (NLL) of the data: $$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i | x_i; \theta) = -\frac{1}{N}\sum_{i=1}^{N} \log \sigma (y_i s(x_i)). $$

Note, because exponents can quickly overflow, in real applications a numerically stable implementation of logarithm of sigmoid function is needed. There are logsigmoid, log_softmax, logsumexp, nll_loss functions available in PyTorch. In this lab it should not be an issues, but we encourage you to use them.

Task 2. Network training (3p)

Implement gradient descent step with constant step length $\varepsilon$ in all network parameters $\theta$: $$ \theta^{t+1} = \theta^{t} - \varepsilon \nabla_\theta\mathcal{L}(\theta^t). $$

Train the network, by performing several gradient descent steps, following the template. Verify that the training loss improves during the training.

Train a network with hidden_size = 5 on a training set of $200$ points with learning rate 0.1 for 1000 epochs.
Plot the resulting classifier decision boundary with the help of G2Model.plot_predictor the “interface” Predictor class in the template.
Repeat the experiment increasing the number of hidden units to 10, 100, 500. What we want to observe is that the classifier decision boundary fits the data better but stays smooth despite the abundant amount of parameters.

Draw a test set from the ground truth model. Compute and report the empirical estimate of the test error and the generalization gap (difference between training and test error).

Report and discuss: test set size, test error rate and classification boundary plots for hidden_size = 5, 100, 500. Example classification boundary for 500 hidden units:

500 tanh units	500 ReLU units

Task 3. Test Error (2p)

After we trained our probabilistic predictive model $p(y|x)$, we want to use it for decision making. Let $l(y,y') = [[y \neq y']]$ be the 0-1 loss, penalizing the prediction $y'$ if the true class was $y$. Recall that the optimal classification decision strategy according to this loss function and the predictive model $p(y|x)$ is \begin{align*} h(x) = \arg\max_y p(y|x) = \arg\max_y \sigma (y s(x)) = {\rm sign}(s(x)). \end{align*} The risk is then the error rate of the model. Assume we have a test set $T = (x_i,y_i)_{i=1}^m$ of a fixed size $m=1000$. The empirical risk (empirical test error) is \begin{align*} R_T = \frac{1}{m} \sum_i [[y_i \neq h(x_i)]]. \end{align*} As a function of a randomly drawn test set, $R_T$ is itself random. It is important to keep that in mind, when reporting the test performance of a model or deciding which of several proposed models is better based on their test performance.

In this part we will inspect in more detail the randomness of the test error and use bootstrap method and concentration inequalities to construct confidence intervals.

True error distribution

The result of evaluation on a randomly drawn test set of this size is random. Repeat generating the test set and evaluate accuracy of your model 10000 times and record all evaluation results. Plot the histogram of the empirical test error from these 10000 trials. It gives us a clear picture, how the measured test performance may fluctuate due to the randomness of the test set selection. The mean of this distribution is the true test error rate and our empirical test error rate from one test set can be off.

Bootstrap

In real applications we don't have the possibility to draw independent test sets multiple times. Now suppose we have only a single fixed test set $T$. Let $e_i = 1$ if the network makes an error in classifying test point $x_i$ and zero otherwise, i.e. $$e_i = l(y_i, h(x_i)) = [[ y_i \neq h(x_i) ]]$$ The empirical test error rate is then the mean statistics of $e_i$: $R_T = \frac{1}{m} \sum_{i=1}^m e_i$. Bootstrap method draws the data from our empirical population multiple times with replacement. This mimics drawing test sets multiple times. Evaluating the mean statistics on such multiple bootstrap samples, we can estimate the distribution of the statistic.

Use the bootstrap function from SciPy (version > 1.10.0) on the array of errors $e$:

   res = scipy.stats.bootstrap((err.numpy(),), np.mean, confidence_level=alpha, method='BCa')

It will return the bootstrap distribution and a confidence interval. Plot the histogram of this bootstrap distribution and the confidence interval for confidence level $\alpha = 0.9$.

Because the error rate $R_T$ takes on a discrete set of values $\frac{1}{m}, \frac{2}{m},\dots 1$, you want each bin to contain the same number of discrete levels.

Unlike the true distribution, it is centered on the empirical test error rate, not on the true test error rate. So from run to run, it may be shifted to the left or to the right, but its shape should be more or less stable, which allows one to estimate the uncertainty, i.e. the confidence interval that bootstrap function outputs.

Concentration Inequalities

As above, we have only a single fixed test set $T$ of size $m$. We want to estimate the confidence intervals on the empirical test error $R_T$. In the Hoeffding inequality, set the confidence level $\alpha=0.9$, i.e. the desired probability that the empirical risk is within $\varepsilon$ from the true risk, $\mathbb{P}( |R(h) - R_T(h)| < \varepsilon) = \alpha$.
Solve for $\varepsilon$ given $\alpha$ and $m$. Include the formula into the report. Print the test error in the format $R_T \pm \varepsilon$. Graphically display the symmetric confidence interval $[R_T-\varepsilon, R_T+\varepsilon]$ on top of the above histogram (centered on empirical error rate).
The template includes examples of CIs using Chebyshev and Bernstein's concentration inequalities.

Discussion

In Part 1 we have investigated a rather special example and linear regression. Does it apply to deep networks? Read about Deep Double Descent on open.ai.
In Part 2 we have seen that under favorable conditions NNs overfit gracefully. What if the data distributions were overlapping?

Submit

Submit to BRUTE:

Report.md according to the template (upgrade to pdf is accepted too).
All of your source code but no data. Results computed by the code but not present in the report will not be evaluated.

Table of Contents