
Lab 4: Training from Scratch: Initialization & Regularization

In this lab we will train several neural networks from scratch. We keep the problems small so that one training iteration stays relatively short; in practice, you can expect to meet problems that require days or even weeks of training. We will also not focus on performance as much as on the stability of training with respect to initialization and regularization. The take-away message of this lab could be: training of NNs will very rarely crash the way a badly written program does. Instead, it will usually still train and produce some, typically inferior(!), results.

Download the template.py file with a few useful pieces of code (see below for their use).

Practical Advice

Whenever training deep neural networks, one has to be very careful when interpreting the results. In contrast to code that crashes when there is a bug, deep networks tend to train without complaint even when the data are wrong (this happened even while creating this lab!), the labels are incorrect, the network is not connected correctly, or the optimization is badly initialized. You may even observe reasonable progress, but the results will just not be as good as you wanted. Or the training progress will look fine, but the model will not generalize to the test data. Over the years, DL practitioners have collected many practical hints on how to avoid falling into these traps. We strongly recommend that you read and follow the recipe for training neural networks by Andrej Karpathy. We can confirm that most of the mistakes mentioned there really do happen!

And one more tip: very useful packages to try are lovely_tensors and lovely_numpy for debugging and einops for tensor manipulations. Give them a try ;)
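For example (a tiny illustration only; lt.monkey_patch and einops.rearrange are the packages' standard entry points, the tensor shapes below are arbitrary):

    import torch
    import lovely_tensors as lt          # nicer tensor summaries when printing
    from einops import rearrange

    lt.monkey_patch()                    # print(tensor) now shows shape, dtype and basic statistics
    x = torch.randn(32, 1, 28, 28)
    print(x)                             # a one-line summary instead of a wall of numbers
    flat = rearrange(x, 'b c h w -> b (c h w)')   # flatten images to vectors, shape (32, 784)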

Part 1: Initialization - Deep Network (4p)

We start by replicating the results of Xavier Glorot and Kaiming He and their colleagues on deep network weight initialization (see also lecture 6).

Create a fully connected network with many (e.g. 50) wide linear layers (e.g. 512 neurons per layer). Modify the forward method to also collect the activation statistics, i.e. the mean and std of the activation values per layer. We will ignore the bias term in the linear layers for this exercise.

  • Pass a batch of random vectors (zero mean, unit variance, large batch size) through the network and display the activation statistics.
  • Add the cross-entropy loss to the output of the network and back-propagate the gradients given randomly generated labels (we do not perform any training in this scenario). Display the mean and std of the gradients per layer, similarly to the activation statistics. A minimal sketch of both steps is given below.
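A minimal sketch of such a network and of the two steps above (the class name DeepNet, the defaults, and the way the statistics are stored are our own illustrative choices; the initializations listed next would be applied on top, e.g. via the init_weights method shown further below):

    import torch
    import torch.nn as nn

    class DeepNet(nn.Module):
        def __init__(self, depth=50, width=512, activation=torch.tanh):
            super().__init__()
            self.layers = nn.ModuleList(
                [nn.Linear(width, width, bias=False) for _ in range(depth)])
            self.activation = activation

        def forward(self, x):
            self.act_mean, self.act_std = [], []        # activation statistics per layer
            for layer in self.layers:
                x = self.activation(layer(x))
                self.act_mean.append(x.mean().item())
                self.act_std.append(x.std().item())
            return x

    net = DeepNet()
    x = torch.randn(4096, 512)                          # zero mean, unit variance, large batch
    logits = net(x)
    labels = torch.randint(0, 512, (4096,))             # random labels, no training performed
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    grad_stats = [(l.weight.grad.mean().item(), l.weight.grad.std().item())
                  for l in net.layers]                  # gradient statistics per layer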

Experiment with the following combinations of activation functions and layer weight matrix $W$ initializations:

  1. Original heuristic: tanh activation, initialize $W_{ij} \sim U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$, where $U[-a, a]$ is the uniform distribution in the interval $(-a, a)$ and $n$ is the size of the previous layer. This heuristic was typically used before Glorot et al. introduced their initialization.
  2. Xavier uniform: tanh activation, $W_{ij} \sim U\left[-\sqrt{\frac{6}{n + m}}, \sqrt{\frac{6}{n + m}}\right]$, where $n$ is the number of input features and $m$ the number of output features.
  3. Xavier normal: tanh activation, $W_{ij} \sim \mathcal{N}(0, \frac{2}{n + m})$, where $\mathcal{N}(\mu, \sigma^2)$ is the Normal distribution with mean $\mu$ and variance $\sigma^2$. This is an alternative initialization introduced in Glorot et al..
  4. Xavier uniform/normal with ReLU activation.
  5. Kaiming uniform: ReLU activation, $W_{ij} \sim U\left[-\sqrt{\frac{6}{n}}, \sqrt{\frac{6}{n}}\right]$. This initialization is designed to work with the non-symmetric ReLU activation.
  6. Kaiming normal: ReLU activation, $W_{ij} \sim \mathcal{N}(0, \frac{2}{n})$.

You may implement the initializations yourself, or feel free to call the PyTorch versions. A simple way to add weight initialization is like this:

    def __init__(self, init_type, ...):
        ...
        self.init_type = init_type
        self.apply(self.init_weights)   # runs init_weights on every sub-module

    def init_weights(self, m):
        if isinstance(m, nn.Linear):
            fan_in, fan_out = m.weight.shape[1], m.weight.shape[0]
            if self.init_type == 'xavier_uniform':
                a = (6.0 / (fan_in + fan_out)) ** 0.5
                nn.init.uniform_(m.weight, -a, a)
            elif self.init_type == 'kaiming_normal':
                nn.init.normal_(m.weight, 0.0, (2.0 / fan_in) ** 0.5)
            # ... handle the remaining variants analogously, or call the built-in
            # nn.init.xavier_uniform_, nn.init.kaiming_normal_, etc.
            if m.bias is not None:          # the linear layers may be created with bias=False
                nn.init.zeros_(m.bias)

Discuss your observations.

Part 2: Initialization - Shallow Network (4p)

In the lecture, you learned that exploding/vanishing gradients and activations are typically a problem of very deep networks. The plot in lecture 6, slide 3 even suggests that for a shallow network with ReLU non-linearities the activation values stay within reasonable bounds. Here we will study a very small network on a small toy example and demonstrate that this does not have to be the case if one is not careful enough.

Prepare your training:

  • The provided template contains a simple 2D data generator, CircleDataGenerator. Use its generate_sample method to create a training set of size 100.
  • Then create a small fully connected network with two hidden layers (e.g. with 6 and 3 neurons, respectively) and ReLU non-linearities. [EDIT: this time you do need a bias term in the first layer, otherwise you cannot classify the normalized circular data, and you definitely need a bias in the last layer.] Build it as a class inherited from torch.nn.Module. Use the initializations from Part 1.
  • Use a training loop similar to the one from the previous lab, with the following specifics (a minimal sketch is given after this list):
    • Use the AdamW optimizer and choose an appropriate initial learning rate (e.g. 3e-4, as suggested by Karpathy).
    • Set the batch size to the training set size (i.e. we are doing GD).
    • Use the cross entropy loss.
    • Let the training run for a reasonable number of steps, so that the phenomena discussed further can be easily observed (e.g. 5000).
  • Sample a (big enough) validation set and also report the validation loss and validation classification accuracy.
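A minimal sketch of this setup, assuming CircleDataGenerator can be imported from template.py and that generate_sample(n) returns a (points, labels) pair of tensors with 2D points and integer class labels (check the template for the actual interface):

    import torch
    import torch.nn as nn
    from template import CircleDataGenerator       # assumed location of the generator

    class SmallNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2, 6), nn.ReLU(),         # bias needed in the first layer
                nn.Linear(6, 3), nn.ReLU(),
                nn.Linear(3, 2))                    # bias needed in the last layer
            # apply one of the Part 1 initializations here, e.g. via self.apply(...)

        def forward(self, x):
            return self.net(x)

    gen = CircleDataGenerator()
    x_train, y_train = gen.generate_sample(100)     # assumed (points, labels) return value
    x_val, y_val = gen.generate_sample(1000)        # big enough validation set

    model = SmallNet()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for step in range(5000):                        # full-batch gradient descent
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x_train), y_train)
        loss.backward()
        optimizer.step()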

Conduct the following experiments:

  1. First, try to train the network to its best performance.
  2. Remind yourself of Andrej Karpathy's advice: examine your data and check that the initial loss is around $-\log(1/n_{classes})$. Normalize the data and try training the network again to its best (a short snippet for steps 2 and 3 is given after this list).
  3. Multiply the normalized data by 0.01 and try training again to its best.
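The data manipulations in steps 2 and 3 amount to something like this (a sketch; x_train and x_val refer to the tensors from the setup sketch above, and the validation data must be transformed with the training-set statistics):

    # step 2: normalize with training-set statistics
    mean, std = x_train.mean(dim=0), x_train.std(dim=0)
    x_train_norm = (x_train - mean) / std
    x_val_norm = (x_val - mean) / std

    # step 3: scale the normalized data down by a factor of 100
    x_train_small = 0.01 * x_train_norm
    x_val_small = 0.01 * x_val_norm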

Discuss the differences in training before and after the normalization and with the data scaled down. To visualize the results, use the provided plot_decision_boundary function.

Btw, you may also see the effect of incorrect initialization in some of these experiments ;)

Part 3: Regularization - Dropout (2p)

Finally, we will examine dropout regularization. Instead of the toy dataset, we will use the MNIST dataset of hand-written digits.

  1. The provided template includes basic data-loading code in the class MNISTData. When used for the first time, it downloads the dataset. It normalizes the data to zero mean and unit variance (recall the data-scaling problem from the previous section) and takes only 5000 training and 10000 validation samples to make the coding iterations faster. While we are developing our method, we should not touch the test part of the data, so that we do not overfit to it by looking at the test results over too many attempts.

  2. Create another fully connected network, this time with input size 784 (= 28×28), two hidden layers of size 800, and an output layer with 10 units. A similar network was used in the original Dropout paper. Use dropout after every hidden layer and also apply it to the input. For initialization, use N(0, 0.1) for the weights and zeros for the biases, as in the original paper (a sketch of such a network is given after this list).

  3. For optimization, the easiest is to use AdamW optimizer (more on it in the coming lecture) with “standard rule-of-thumb” initial learning rate 3e-4. This will allow us not to focus on the optimization part.

  4. Compare the two variants: one with 50% dropout and one without dropout (the paper also uses 20% dropout on the input; feel free to test that too if you want). Measure the training and validation loss as well as the training and validation classification errors in percent. Discuss the ability of dropout to regularize the training based on your observations.
    Hint: Make sure you switch the network to the model.eval() mode when doing the evaluations.

  5. Finally, try to use the model trained with dropout as an ensemble. Instead of switching off the dropout during evaluation, keep the model in model.train() mode, pass the data through the model repeatedly, and average the output logits before making the final classification. Try small ensembles (tens of passes) and also really large ones (thousands). Discuss the performance of this model compared to the one with dropout switched off (see the sketch below).
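Below is a minimal sketch of the network from step 2 and of the two evaluation modes from steps 4 and 5. The names (DropoutMLP, x_val, y_val, n_members) and the dropout probabilities passed as arguments are our own illustrative choices, the validation tensors are assumed to come from MNISTData, and N(0, 0.1) is interpreted here with 0.1 as the standard deviation:

    import torch
    import torch.nn as nn

    class DropoutMLP(nn.Module):
        def __init__(self, p_input=0.2, p_hidden=0.5):
            super().__init__()
            self.net = nn.Sequential(
                nn.Dropout(p_input),                      # dropout applied to the input
                nn.Linear(784, 800), nn.ReLU(), nn.Dropout(p_hidden),
                nn.Linear(800, 800), nn.ReLU(), nn.Dropout(p_hidden),
                nn.Linear(800, 10))
            self.apply(self.init_weights)

        @staticmethod
        def init_weights(m):
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0.0, 0.1)       # N(0, 0.1), 0.1 taken as the std
                nn.init.zeros_(m.bias)

        def forward(self, x):
            return self.net(x.flatten(start_dim=1))       # (B, 1, 28, 28) -> (B, 784)

    model = DropoutMLP()                                  # DropoutMLP(0.0, 0.0) for the no-dropout variant
    # ... train with torch.optim.AdamW(model.parameters(), lr=3e-4) as in the previous parts ...

    # step 4: standard evaluation -- dropout is switched off in eval mode
    model.eval()
    with torch.no_grad():
        err_eval = (model(x_val).argmax(dim=1) != y_val).float().mean() * 100

    # step 5: keep dropout active and average the logits over repeated passes
    model.train()
    n_members = 100                                       # try tens up to thousands
    with torch.no_grad():
        logits = torch.stack([model(x_val) for _ in range(n_members)]).mean(dim=0)
    err_ensemble = (logits.argmax(dim=1) != y_val).float().mean() * 100
    model.eval()                                          # switch back after the ensemble evaluation

With dropout kept active, every forward pass samples a different thinned sub-network, so averaging the logits over many passes approximates the ensemble interpretation of dropout discussed in the paper.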