In this lab we will train several neural networks from scratch. We keep the problems small so that one training iteration stays relatively short. In practice, you can expect to encounter problems that require days or even weeks of training. Here we will also not focus so much on performance, but will rather explore the stability of training with respect to initialization and regularization. The take-away message of this lab could be: training of NNs will very rarely crash the way a badly written program does. Instead, it will still train and produce some, typically inferior(!), results.
Download the template.py file with a few useful pieces of code (see below for their use).
Whenever training deep neural networks, one has to be very careful when interpreting the results. In contrast to code that crashes when there is a bug, deep networks tend to train without complaint even when the data are wrong (this happened even while creating this lab!), the labels are incorrect, the network is not correctly connected, or the optimization is badly initialized. You may even observe reasonable progress, but the results will just not be as good as you wanted. Or the training progress will look fine, but the model will not generalize to the test data. Over the years, DL practitioners have collected many practical hints on how to avoid falling into these traps. We very much recommend that you read and follow the recipe for training neural networks by Andrej Karpathy. We can confirm that most of the mistakes mentioned there really happen!
And one more tip: very useful packages to try are lovely_tensors and lovely_numpy for debugging, and einops for tensor manipulations. Give them a try ;)
In the lecture, you have learned that exploding/vanishing gradients and activations are typically a problem of very deep networks. The plot in lecture 6, slide 3 even suggests that for a shallow network with ReLU non-linearities the values of the activations stay within reasonable bounds. Here we will study a very small network on a small toy example and demonstrate that this need not be the case if one is not careful enough.
Prepare your training:
Generate the data with the provided CircleDataGenerator. Use its generate_sample method to create a training set of size 100.
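The actual CircleDataGenerator is in the template; if you want to picture what such data might look like, here is a hypothetical stand-in (the interface and the exact distribution are assumptions, not the template's code): 2-D points labeled by whether they fall inside a circle.

```python
import torch

def generate_sample(n, radius=1.0):
    """Hypothetical stand-in for CircleDataGenerator.generate_sample:
    2-D points in [-2, 2]^2, labeled 1 inside the circle, 0 outside."""
    x = torch.rand(n, 2) * 4 - 2
    y = (x.pow(2).sum(dim=1) < radius ** 2).long()
    return x, y

x_train, y_train = generate_sample(100)
print(x_train.shape, y_train.shape)  # torch.Size([100, 2]) torch.Size([100])
```

Plotting such a sample (e.g. a scatter plot colored by label) is exactly the kind of "look at your data first" step Karpathy's recipe recommends.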
Build your model as a torch.nn.Module. For the purpose of this experiment, make it parametrized by the sigma of the Normal distribution used to initialize the layer weights. One way to achieve this is:

```python
def __init__(self, sigma, ...):
    ...
    self.sigma = sigma
    self.apply(self.init_weights)

def init_weights(self, m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight.data, 0, self.sigma)
        nn.init.zeros_(m.bias.data)
```
Once again, make sure you follow Andrej Karpathy's advice ;) Now is a good time to examine the data and to progress in small steps.
Conduct the following experiment with the sigma parameter:
Train the network and visualize the result (e.g. with the provided plot_decision_boundary function). In particular, consider the advice from Karpathy's recipe to check the loss at init to be around $-\log(1/n_{classes})$.
Try also other initializations (see torch.nn.init).
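The loss-at-init check above can be sketched as follows (the small MLP architecture and sigma value are assumptions for the toy task, not prescribed by the assignment): with small random weights the logits are near zero, the softmax output is near uniform, and the cross-entropy loss should sit close to $-\log(1/n_{classes}) = \log 2 \approx 0.693$ for two classes.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
n_classes = 2

# Small MLP with weights drawn from N(0, sigma^2), sigma = 0.01.
net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, n_classes))
for m in net.modules():
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, 0.0, 0.01)
        nn.init.zeros_(m.bias)

x = torch.randn(100, 2)
y = torch.randint(0, n_classes, (100,))
loss = nn.CrossEntropyLoss()(net(x), y)
print(loss.item(), -math.log(1 / n_classes))  # both close to 0.693
```

If the loss at init is far from this value, something about the initialization, the data scale, or the loss computation deserves a closer look.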
After that, check that a similar effect can be observed when scaling the data themselves. Include your observations on data scaling in the report. Are there other simple ways of breaking the convergence of the training?
Next we will replicate the result discussed in the lecture for deep networks. We will show that a naive initialization of a deep network may lead to exploding gradients:
Modify the forward method to also collect the activation statistics, i.e. the mean and std of the absolute value of the activations.
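A minimal sketch of the exploding-activations effect (width, depth, and the sigma=1 init are illustrative choices): stacking many naively initialized linear+ReLU layers and recording the activation statistics per layer shows the magnitudes blowing up exponentially with depth.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
width, depth = 100, 20

x = torch.randn(256, width)
stats = []
for _ in range(depth):
    layer = nn.Linear(width, width)
    nn.init.normal_(layer.weight, 0.0, 1.0)   # naive N(0, 1) initialization
    nn.init.zeros_(layer.bias)
    x = torch.relu(layer(x))
    stats.append(x.abs().std().item())        # std of |activations| per layer

print(stats[0], stats[-1])  # the last layer is many orders of magnitude larger
```

Each layer scales the activation magnitude by roughly a constant factor greater than one, so the growth compounds exponentially; a properly scaled init (e.g. Kaiming) keeps this factor near one.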
Finally, we will examine the dropout regularization. Instead of the toy dataset, we will use the MNIST dataset of hand-written digits.
Use the provided MNISTData class. When used for the first time, it downloads the dataset. Notice that we normalize the data to zero mean and unit variance (see the data-scaling problem in the previous section) and take only 5000 training and 10000 validation samples to make the coding iterations faster. While we are developing our method, we should not touch the test part of the data, so as not to overfit to it through too many attempts.
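One possible shape of a dropout classifier for this task (the architecture and dropout rate are assumptions, not prescribed by the assignment) is a small fully-connected network with dropout after each hidden activation:

```python
import torch
import torch.nn as nn

# Illustrative MNIST classifier with dropout after each hidden layer.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

print(model(torch.randn(4, 1, 28, 28)).shape)  # torch.Size([4, 10])
```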
Remember to switch the model to model.eval() mode when doing the evaluations.

Then turn the dropout model into an ensemble: keep the model in model.train() mode, pass the data through the model repeatedly, and average the output logits before doing the final classification. Try small and also really large ensembles. Discuss the performance of this model compared to the one with dropout switched off.
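The averaging step can be sketched as follows (the tiny model and the ensemble size are illustrative): with the model kept in train mode, dropout stays active, so repeated forward passes give different stochastic predictions whose averaged logits form the ensemble output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny dropout model, for illustration only.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(32, 3))
x = torch.randn(5, 10)

model.train()                       # keep dropout active on purpose
with torch.no_grad():
    n_passes = 100                  # ensemble size; try small and large values
    logits = torch.stack([model(x) for _ in range(n_passes)]).mean(dim=0)
pred = logits.argmax(dim=1)

model.eval()                        # deterministic evaluation, for comparison
with torch.no_grad():
    pred_det = model(x).argmax(dim=1)
print(pred, pred_det)
```

Comparing the two predictions for various ensemble sizes is exactly the experiment asked for above: as the number of passes grows, the averaged-logits prediction typically stabilizes.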