In this lab we will train several neural networks from scratch. We keep the problems small so that one training iteration takes relatively little time; in practice, you may expect to meet problems that require days or even weeks of training. We will also not focus so much on performance, but rather explore the stability of the training with respect to initialization and regularization. The take-away message of this lab could be: training of NNs will very rarely crash the way a badly written program does. Instead, it will still train and produce some, typically inferior(!), results.
Download the template.py file with a few useful pieces of code (see below for their use).
Whenever training deep neural networks, one has to be very careful when interpreting the results. In contrast to code that crashes when there is a bug, deep networks tend to train without complaint even when the data are wrong (this happened even while creating this lab!), the labels are incorrect, the network is not connected correctly, or the optimization is badly initialized. You may even observe reasonable progress, but the results will just not be as good as you wanted. Or the training progress will look fine, but the model will not generalize to the test data. Over the years, DL practitioners have collected a lot of practical hints on how to avoid falling into these traps. We very much recommend that you read and follow the recipe for training neural networks by Andrej Karpathy. We can confirm that most of the mistakes mentioned there really happen!
And one more tip: very useful packages to try are lovely_tensors and lovely_numpy for debugging, and einops for tensor manipulations. Give them a try ;)
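To illustrate (assuming the packages are installed via pip; the tensor shapes here are made up), lovely_tensors turns tensor printing into a compact statistics summary and einops makes reshaping explicit:

```python
import torch
import lovely_tensors as lt
from einops import rearrange

lt.monkey_patch()  # from now on, printing a tensor shows shape, dtype, basic stats and NaN/Inf warnings

x = torch.randn(8, 3, 32, 32)                 # a made-up batch of images
print(x)                                      # compact summary instead of a wall of numbers

flat = rearrange(x, 'b c h w -> b (c h w)')   # flatten each image into a feature vector
print(flat.shape)                             # torch.Size([8, 3072])
```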
In the lecture, you have learned that exploding/vanishing gradients and activations are typically a problem of very deep networks. The plot in lecture 6, slide 3 even suggests that for a shallow network with ReLU non-linearities the values of the activations stay within reasonable bounds. Here we will study a very small network on a small toy example and demonstrate that this need not be the case if one is not careful enough.
Prepare your training:
- Use the CircleDataGenerator class from the template and its generate_sample method to create a small 2D toy dataset.
- Implement a small fully-connected network as a torch.nn.Module subclass and initialize all its linear layers from a zero-mean normal distribution with standard deviation sigma:

```python
def __init__(self, sigma, ...):
    ...
    self.sigma = sigma
    self.apply(self.init_weights)

def init_weights(self, m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight.data, 0, self.sigma)
        nn.init.zeros_(m.bias.data)
```
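For orientation, here is a minimal sketch of how these pieces might fit together; the architecture below (2D input, two hidden layers, two output classes) is only an assumption, the template may prescribe a different one:

```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    """A small fully-connected classifier; the layer sizes are illustrative only."""
    def __init__(self, sigma, hidden=6):
        super().__init__()
        self.sigma = sigma
        self.layers = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )
        self.apply(self.init_weights)

    def init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight.data, 0, self.sigma)
            nn.init.zeros_(m.bias.data)

    def forward(self, x):
        return self.layers(x)
```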
Once again, make sure you follow Andrej Karpathy's advice ;) This is a good time to examine the data and to progress in small steps.
Conduct the following experiment with the sigma parameter:
- Train the network for several values of sigma, ranging from very small to very large.
- For each value, visualize the result with the plot_decision_boundary function from the template and observe how the learned decision boundary changes.
- Compare your manual initialization with the standard initialization schemes provided in torch.nn.init.
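A possible skeleton of this experiment is sketched below; the exact interfaces of CircleDataGenerator, generate_sample and plot_decision_boundary in template.py may differ, and the sigma values, optimizer and number of steps are just an example:

```python
import torch
import torch.nn as nn

# assumed to come from the provided template.py; the exact signatures are a guess
from template import CircleDataGenerator, plot_decision_boundary

gen = CircleDataGenerator()
x, y = gen.generate_sample(200)                 # assumed: 2D inputs and integer class labels
x = torch.as_tensor(x, dtype=torch.float32)
y = torch.as_tensor(y, dtype=torch.long)

for sigma in [0.01, 1.0, 10.0]:                 # example values, try your own range
    model = SimpleNet(sigma=sigma)              # the network sketched above
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(1000):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    print(f'sigma={sigma}: final training loss {loss.item():.4f}')
    plot_decision_boundary(model, x, y)         # assumed signature
```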
After that, check that a similar effect can be observed when scaling the data themselves. Include your observations on data scaling in the report. Are there other simple ways of breaking the convergence of the training?
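For instance, one simple way to test the data-scaling effect (the scale factor is arbitrary):

```python
# keep the initialization fixed and blow up the inputs instead
x_scaled = 1000.0 * x                  # same labels y, rescaled inputs
model = SimpleNet(sigma=0.1)           # the network sketched above
# ... train exactly as before, but on (x_scaled, y), and compare the decision boundaries
```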
Next we will replicate the result discussed in the lecture for deep networks. We will show that a naive initialization of a deep network may lead to exploding gradients: stack many linear layers with ReLU non-linearities and, in the forward method, record the magnitude of the activations at every layer so that you can observe how it grows with depth.
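A minimal sketch of such a deep stack, recording per-layer activation statistics inside forward (the depth, width and sigma values are illustrative only):

```python
import torch
import torch.nn as nn

class DeepNet(nn.Module):
    """A deliberately deep stack of linear+ReLU layers with naive N(0, sigma) initialization."""
    def __init__(self, sigma, depth=20, width=100):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(2 if i == 0 else width, width) for i in range(depth)]
        )
        for layer in self.layers:
            nn.init.normal_(layer.weight.data, 0, sigma)
            nn.init.zeros_(layer.bias.data)

    def forward(self, x):
        self.act_std = []                        # per-layer activation spread, stored for inspection
        for layer in self.layers:
            x = torch.relu(layer(x))
            self.act_std.append(x.std().item())
        return x

x = torch.randn(128, 2)
net = DeepNet(sigma=1.0)                         # naive initialization, far too large for this width
net(x)
print(net.act_std[:3], '...', net.act_std[-3:])  # the magnitudes grow by orders of magnitude with depth
```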
Finally, we will examine dropout regularization. Instead of the toy dataset, we will use the MNIST dataset of hand-written digits. Use the MNISTData class from the template to load the data and add dropout layers to your network. Keep in mind that dropout behaves differently at training and test time: call model.train() before training and model.eval() before evaluating, otherwise the reported results will be misleading.
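A sketch of the dropout part; the MNISTData interface is not shown and the architecture is an assumption, the point is only the nn.Dropout layer and the train()/eval() switching:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                   # illustrative architecture only
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Dropout(p=0.5),                   # randomly zeroes activations, but only in train() mode
    nn.Linear(256, 10),
)

def evaluate(model, x, y):
    model.eval()                         # disables dropout: the full network is used, deterministically
    with torch.no_grad():
        acc = (model(x).argmax(dim=1) == y).float().mean().item()
    model.train()                        # re-enable dropout before continuing training
    return acc

# during training, keep the model in train() mode so that dropout is active,
# and call evaluate(...) periodically on held-out data loaded via MNISTData
model.train()
```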