Search
In this lab we will implement back-propagation with numpy and learn a small neural network for binary classification of 2D features (for clarity of visualization) drawn from a known Gaussian mixture model. Material of lectures 2,3.
Use the provided template. The package contains:
tools.py
template.py
Let $x$ denote observations in $\mathbb{R}^2$ and $y\in\{0,1\}$ denote the class label. In this lab we have chosen an artificial ground truth model $p(x,y)$, accessible to you through functions of the class G2Model in tools.py. You can draw a training set and a test set (function generate_sample) from this ground truth model and use them for training and validating a neural network. You can also access the optimal Bayesian classification according to the true model (function classify) and compare it to the classification by you neural network.
G2Model
generate_sample
classify
Following the template pattern, implement in a modular way a neural network with input_size inputs, one hidden layer with hidden_size units and ReLU activations and, finally, the logistic regression model in the last layer, i.e. a linear transform and the logistic sigmoid function $S$. Formally, the network is specified as $$ p(y{=}1|x; \theta) = S(W_2 {\rm ReLU} (W_1 x + b_1) + b_2), $$ where $\theta$ denotes all parameters, e.g. $\theta = (W_1,b_1,W_2,b_2)$. Notice that $x$ represents a matrix of all data points and has size $[N \times d]$, where $N$ is the number of data points and $d$ is the network input size. Therefore hidden layer output should be of size $[N \times \rm{hidden\_size}]$. Hint: np.einsum is very intuitive function for handling different multiplications of high-dimensional tensors.
input_size
hidden_size
np.einsum
The training loss is the negative log-likelihood (NLL) of the data: $$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i | x_i; \theta), $$ where $N$ is the size of the training set.
When using log conditional likelihood as the learning objective, it is numerically more stable to implement the logarithm of the sigmoid function together. If $a \in \mathbb{R}$ denotes the score, i.e. the output of the last linear layer, we get $$ -\log p(y=1 \mid a) = -\log S(a) = \log(1 + e^{-a}) \\ -\log p(y=0 \mid a) = -\log(1 - S(a)) = -\log S(-a) = \log(1 + e^a). $$ Furthermore, it is convenient that the output of the network is considered to be the scores $a$ (which correspond to log odds or, more precisely, in case of the logistic sigmoid model called logits). The classification can be performed by comparing $a$ with $0$ and the training can be performed by evaluating log sigmoid probability for the target class, which is convenient to implement as a single module.
Initialize all network parameters randomly, e.g.~uniformly in $[-1,1]$.
Following the template, implement the computation of the gradient of the loss in all parameters in a modularized form by backpropagation.
Test correctness of your implementation by the following procedure. Consider varying a parameter vector $w \in \mathbb{R}^n$ (the model has several parameter vectors, considering one at a time will help to isolate errors). Keeping all other parameters fixed, compute $$ \delta = \frac{\mathcal{L}(w + \varepsilon) - \mathcal{L}(w - \varepsilon)}{2} $$ for some small random $\varepsilon \in \mathbb{R}^n$. This should match the scalar product $\langle \nabla_w L, \varepsilon \rangle$. More precisely, if $\mathcal{L}$ is differentiable in $w$ in some neighborhood of $w$, we expect $$ \delta - \langle \nabla_w \mathcal{L}, \varepsilon \rangle = o(\|\varepsilon\|). $$ The step size $\varepsilon$ should be chosen much smaller than $1$ (order of coordinates of $w$ at initialization) but within the numerical precision of the (double) floating point format. Decreasing $\varepsilon$ should result in a lower relative error $$ (\delta - \langle \nabla_w \mathcal{L}, \varepsilon \rangle) / \|\varepsilon\|. $$
Report the results of gradient verification. Example for $\varepsilon=10^{-5}$:
Grad in W1 error: -1.1514488413817054e-12 Grad in b1 error: 5.97456285573448e-12 Grad in W2 error: 3.6022765019577574e-10 Grad in b2 error: -6.396881400889768e-11
Implement gradient descent with constant step size to train the network following the template. Verify that the training loss improves during the training. Train a network with hidden_size = 5 on a training set of $200$ points with learning rate 0.1 for 100 epochs (set in the template). Plot the resulting classifier decision boundary with the help of G2Model.plot_predictor. Repeat the experiment increasing the number of hidden units to 10, 100, 500. What we want to observe is that the classifier decision boundary fits the data better but stays smooth despite the abundant amount of parameters.
G2Model.plot_predictor
Draw a test set from the ground truth model. Compute and report the empirical estimate of the test error and the generalization gap. Use the Hoeffding inequality (see lecture 2: Example 1) to select the test set size so that the probability of being off in the estimate of the test accuracy by more than 1% is less than 0.01.
Report: test set size, test error rate and classification boundary plots for hidden_size = 5, 100, 500. Example classification boundary for 500 hidden units:
This assignment is independent from the back-propagation assignment. The goal is to observe the double descent phenomenon discussed in lecture 2 on our simulated data. For this will closely follow the experimental setup described by (Belkin 2018, Appendix C.2).
Belkin et al. 2018. Reconciling modern machine-learning practice and the classical bias–variance trade-off
The neural network has the same structure with one hidden layer. However the first layer weights and biases are drawn at random and not learned. Use the following code snippet for their initialization:
W1 = (np.random.rand(hidden_size, input_size) * 2 - 1) W1 /= np.linalg.norm(W1, axis=1).reshape(hidden_size, 1) b1 = (np.random.rand(hidden_size) * 2 - 1) * 2
Use the training set size of 40 points. Vary the number of hidden units from 1 to 100 (in steps of 2 to 10) and for each $\rm{hidden\_size}$ solve the linear regression problem as discussed above, compute its train and test MSE losses and error rates. Plot the losses and error rates versus $\rm{hidden\_size}$. The result will depend on the random initialization of the first layer. Repeat the experiment for $T=100$ trials and average the loss and error graphs over all trials. Report the averaged plots. Here is an example of averaged plots (1000 trials, training set is kept the same):