Lab 4: Training from Scratch: Initialization & Regularization

In this lab we will train several neural networks from scratch. We keep the problems small so that one training iteration is relatively short; in practice, you may encounter problems where training takes days or even weeks. We will also focus less on performance and instead explore how the stability of training depends on initialization and regularization. The take-away message of this lab could be: training an NN will very rarely crash the way a badly written program would. Instead, it will still train and produce some, typically inferior(!), results.

Download the template.py file with a few useful pieces of code (see below for their use).

Practical Advice

Whenever you train deep neural networks, you have to be very careful when interpreting the results. In contrast to code that crashes when there is a bug, deep networks tend to train without complaint even when the data is wrong (this happened even while creating this lab!), the labels are incorrect, the network is not connected correctly, or the optimization is badly initialized. You may even observe reasonable progress, but the results will just not be as good as you wanted. Or the training progress will look fine, but the model will not generalize to the test data. Over the years, DL practitioners have collected many practical hints on how to avoid falling into these traps. We strongly recommend that you read and follow the recipe for training neural networks by Andrej Karpathy. We can confirm that most of the mistakes mentioned there really happen!

And one more tip: very useful packages to try are lovely_tensors and lovely_numpy for debugging and einops for tensor manipulation. Give them a try ;)

Part 1: Initialization - Shallow Network (3p)

In the lecture, you have learned that the exploding/vanishing gradients and activations are typically a problem of very deep networks. The plot in lecture 6, slide 3 even suggests that for a shallow network with ReLU non-linearities the values of activations are within reasonable bounds. Here we will study a very small network on a small toy example and demonstrate that this does not need to be the case when one is not careful enough.

Prepare your training:

The provided template contains a simple 2D data generator, CircleDataGenerator. Use its generate_sample method to create a training set of size 100.

Then create a small fully connected network with two hidden layers (e.g. with 6 and 3 neurons respectively) and ReLU non-linearities. Build it as a class inherited from torch.nn.Module. For the purpose of this experiment, make it parametrized by the sigma of the Normal distribution used to initialize the layer weights. One way to achieve this is:
```
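    def __init__(self, sigma, ...):
        ...
        self.sigma = sigma
        # apply() visits every submodule, so call it after the layers are created
        self.apply(self.init_weights)

    def init_weights(self, m):
        if isinstance(m, nn.Linear):
            # weights ~ N(0, sigma^2), biases = 0
            nn.init.normal_(m.weight, mean=0.0, std=self.sigma)
            nn.init.zeros_(m.bias)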
```
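For orientation, here is a minimal sketch of how the fragment above could sit inside a complete module. The class name `SmallNet`, the argument names, and the assumption of 2D inputs and two output classes are ours for illustration; adapt them to the template's data.

```python
import torch
from torch import nn


class SmallNet(nn.Module):
    """Two hidden layers with ReLU; weights drawn from N(0, sigma^2)."""

    def __init__(self, sigma, in_dim=2, hidden=(6, 3), n_classes=2):
        super().__init__()
        self.sigma = sigma
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], n_classes),  # raw logits, no softmax here
        )
        # Re-initialize all linear layers once they exist.
        self.apply(self.init_weights)

    def init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=self.sigma)
            nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.layers(x)


torch.manual_seed(0)
net = SmallNet(sigma=0.5)
logits = net(torch.randn(100, 2))  # stand-in inputs, not the lab data
```

The network returns raw logits because the cross entropy loss in PyTorch applies the log-softmax itself.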

Use a similar training loop as in the previous lab with the following specifics:
- Select a fixed learning rate (e.g. 0.1).
- Use the SGD optimizer and set the batch size to the training set size (i.e. we are effectively doing full-batch gradient descent).
- Use the cross entropy loss.
- Let the training run for a reasonable number of steps, so that the phenomena discussed further can be easily observed (e.g. 1000).

Sample a (big enough) validation set and also report the validation loss and validation classification accuracy.
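The setup described above could be sketched roughly as follows. Since the interface of CircleDataGenerator is not shown here, the function `toy_circle` is a hypothetical stand-in that labels points by whether they fall inside the unit circle; the network uses PyTorch's default initialization and is only a placeholder for your sigma-parametrized model.

```python
import torch
from torch import nn

torch.manual_seed(0)


def toy_circle(n):
    """Hypothetical stand-in data: label 1 inside the unit circle, else 0."""
    x = torch.rand(n, 2) * 4 - 2
    y = (x.norm(dim=1) < 1.0).long()
    return x, y


x_trn, y_trn = toy_circle(100)    # training set of size 100
x_val, y_val = toy_circle(5000)   # larger validation set

model = nn.Sequential(
    nn.Linear(2, 6), nn.ReLU(),
    nn.Linear(6, 3), nn.ReLU(),
    nn.Linear(3, 2),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

n_steps = 1000
history = []  # (step, train loss, val loss, val accuracy)
for step in range(n_steps):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x_trn), y_trn)  # whole training set in one batch
    loss.backward()
    optimizer.step()

    if step % 100 == 0 or step == n_steps - 1:
        model.eval()
        with torch.no_grad():
            val_logits = model(x_val)
            val_loss = loss_fn(val_logits, y_val).item()
            val_acc = (val_logits.argmax(dim=1) == y_val).float().mean().item()
        history.append((step, loss.item(), val_loss, val_acc))
        print(f"step {step:4d}  train {loss.item():.4f}  "
              f"val {val_loss:.4f}  val acc {val_acc:.3f}")
```

Keeping the logged values in a list such as `history` makes it easy to plot the loss curves for different values of sigma side by side.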

Once again, make sure you follow Andrej Karpathy's advice ;) Now is a good time to examine the data and to progress in small steps.

Conduct the following experiment with the sigma parameter:

Find three values:
- one small enough that it slows the training initially but still allows it to arrive at a good solution,
- one big enough to disturb the training so much that, although it still converges, it does not fully recover from the initialization,
- one good enough that neither of the above two effects appears.
Discuss the three values and their effect on training. Examine the achieved loss, its progress and plot the decision boundary for every setup (you may use the provided plot_decision_boundary function). In particular, consider the advice from Karpathy's recipe to check the loss at init to be around $-\log(1/n_{classes})$.
Compare with the Glorot (Xavier in PyTorch) initialization (implement it yourself or use the one implemented in PyTorch in torch.nn.init).
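The loss-at-init check can be sketched as below. Note that $-\log(1/n_{classes}) = \log(n_{classes})$, which is about 0.693 for two classes. The helper `init_loss`, the random stand-in inputs and labels, and the assumption of two classes are ours; the particular sigma values are arbitrary examples, not the three values you are asked to find.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)
n_classes = 2                                   # assumption: binary toy problem
x = torch.randn(1000, 2)                        # stand-in inputs
y = torch.randint(0, n_classes, (1000,))        # stand-in labels


def init_loss(sigma=None):
    """Cross entropy of an untrained 2-6-3-2 net; sigma=None uses Glorot."""
    net = nn.Sequential(
        nn.Linear(2, 6), nn.ReLU(),
        nn.Linear(6, 3), nn.ReLU(),
        nn.Linear(3, n_classes),
    )
    for m in net.modules():
        if isinstance(m, nn.Linear):
            if sigma is None:
                nn.init.xavier_normal_(m.weight)
            else:
                nn.init.normal_(m.weight, mean=0.0, std=sigma)
            nn.init.zeros_(m.bias)
    with torch.no_grad():
        return nn.functional.cross_entropy(net(x), y).item()


expected = -math.log(1.0 / n_classes)
tiny_sigma_loss = init_loss(1e-3)
print(f"expected at init: {expected:.4f}")
print(f"sigma = 1e-3:     {tiny_sigma_loss:.4f}")
print(f"sigma = 10:       {init_loss(10.0):.4f}")
print(f"Glorot:           {init_loss(None):.4f}")
```

With a very small sigma and zero biases the logits are nearly zero, so the predicted distribution is almost uniform and the loss sits at the expected value. A loss far from it at step 0 is a hint that the initialization scale is off.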

After that, check that a similar effect can be observed when scaling the data itself. Include your observations on data scaling in the report. Are there other simple ways of breaking the convergence of the training?

Part 2: Initialization - Deep Network (3p)

Next we will replicate the result discussed in the lecture for deep networks. We will show that a naive initialization of a deep network may lead to exploding gradients:

Adapt the network from the previous task to have 20 layers. Make each layer contain more neurons (~100); this will be useful for averaging the statistics.
Modify the forward method to also collect the activation statistics, i.e. the mean and std of the absolute values of the activations.
Draw a much larger sample set from the same data generator as before and run it through an untrained model to collect the statistics.
Report and discuss the activation statistics when the weights are initialized using N(0, 1) and when they are initialized using the Glorot and He (Kaiming in PyTorch) initializations.
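One possible way to collect the statistics is sketched below. The class name `DeepNet`, the reading of "20 layers" as 19 hidden linear layers plus an output layer, and the Gaussian stand-in inputs are assumptions for illustration; use the template's generator for the actual experiment.

```python
import torch
from torch import nn


class DeepNet(nn.Module):
    """Deep ReLU MLP that records per-layer activation statistics."""

    def __init__(self, init_fn, depth=20, width=100, in_dim=2, n_classes=2):
        super().__init__()
        dims = [in_dim] + [width] * (depth - 1)
        self.hidden = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:])
        )
        self.out = nn.Linear(width, n_classes)
        for m in self.modules():
            if isinstance(m, nn.Linear):
                init_fn(m.weight)
                nn.init.zeros_(m.bias)
        self.stats = []

    def forward(self, x):
        self.stats = []  # one (mean |a|, std |a|) pair per hidden layer
        for layer in self.hidden:
            x = torch.relu(layer(x))
            a = x.detach().abs().double()  # float64 avoids overflow in std
            self.stats.append((a.mean().item(), a.std().item()))
        return self.out(x)


inits = {
    "N(0, 1)": lambda w: nn.init.normal_(w, mean=0.0, std=1.0),
    "Glorot": nn.init.xavier_normal_,
    "He": lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu"),
}

torch.manual_seed(0)
x = torch.randn(10000, 2)  # stand-in for a large sample from the generator
results = {}
with torch.no_grad():
    for name, init_fn in inits.items():
        net = DeepNet(init_fn)
        net(x)
        results[name] = net.stats
        (m0, s0), (m1, s1) = net.stats[0], net.stats[-1]
        print(f"{name:8s} first layer: mean {m0:.3e} std {s0:.3e} | "
              f"last layer: mean {m1:.3e} std {s1:.3e}")
```

Plotting the recorded means against the layer index on a logarithmic axis makes the differences between the three initializations easy to compare.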

Part 3: Regularization - Dropout (4p)

Finally, we will examine the dropout regularization. Instead of the toy dataset, we will use the MNIST dataset of hand-written digits.

The provided template includes basic data loading code in the class MNISTData. When used for the first time, it downloads the dataset. Note that we normalize the data to zero mean and unit variance (see the data scaling problem in the previous section) and take only 5000 training and 10000 validation samples to make the coding iterations faster. As long as we are developing our method, we shouldn't touch the test part of the data, so that we do not overfit to it through too many attempts.
Create another fully-connected network, this time with input size 784 (=28×28), two hidden layers of size 800 and the output layer with 10 units. A similar network was used in the original Dropout paper. Use dropout after every hidden layer and apply it also to the input. For initialization use N(0, 0.1) for the weights and zero for biases as in the original paper.
For optimization, the easiest option is to use the AdamW optimizer (more on it in the coming lecture) with the “standard rule-of-thumb” initial learning rate of 3e-4. This will allow us not to focus on the optimization part.
Compare the two variants: one with 50% dropout and another without dropout (the paper uses also 20% dropout on the input). Measure the training and validation loss as well as training and validation classification errors in percent.
Hint: Make sure you switch the network to the model.eval() mode when doing the evaluations.
Discuss the ability of dropout to regularize the training based on your observations.
Finally, try to use the model trained with dropout as an ensemble. Instead of switching off the dropout during the evaluation, keep the model in model.train() mode, pass the data through the model repeatedly and average the output logits before doing the final classification. Try small and also really large ensembles. Discuss the performance of this model compared to the one with dropout switched off.
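The network and the ensemble evaluation described above might look roughly like this. The helper names `make_dropout_net` and `ensemble_logits` are hypothetical, and we read N(0, 0.1) as a standard deviation of 0.1, which is an assumption. The random tensor only stands in for a batch of MNIST images.

```python
import torch
from torch import nn


def make_dropout_net(p_hidden=0.5, p_input=0.2):
    """784-800-800-10 MLP with dropout on the input and after hidden layers."""
    net = nn.Sequential(
        nn.Flatten(),
        nn.Dropout(p_input),
        nn.Linear(784, 800), nn.ReLU(), nn.Dropout(p_hidden),
        nn.Linear(800, 800), nn.ReLU(), nn.Dropout(p_hidden),
        nn.Linear(800, 10),
    )
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=0.1)  # assumed std = 0.1
            nn.init.zeros_(m.bias)
    return net


@torch.no_grad()
def ensemble_logits(model, x, n_members):
    """Average the logits of n_members stochastic passes with dropout on."""
    was_training = model.training
    model.train()  # keeps dropout active, each pass samples a new mask
    mean_logits = sum(model(x) for _ in range(n_members)) / n_members
    model.train(was_training)  # restore the previous mode
    return mean_logits


torch.manual_seed(0)
net = make_dropout_net()
x = torch.randn(8, 1, 28, 28)  # stand-in batch, not real MNIST data

net.eval()
with torch.no_grad():
    pred_single = net(x).argmax(dim=1)               # dropout switched off
pred_ensemble = ensemble_logits(net, x, 50).argmax(dim=1)
```

Setting both probabilities to 0.0 in `make_dropout_net` gives the variant without dropout, so the same code path can be used for the comparison.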
