In this lab we will train several neural networks from scratch. We keep the problems small so that one training iteration is relatively short; in practice, you may encounter problems where training a network takes days or even weeks. We will also not focus on performance as much, but will rather explore the stability of training with respect to initialization and regularization. The take-away message of this lab could be: training of NNs will very rarely crash the way a badly written program does. Instead, it will usually still train and produce some, typically inferior(!), results.
Download the template.py file with a few useful pieces of code (see below for their use).
Whenever training deep neural networks, one has to be very careful when interpreting the results. In contrast to code that crashes when there is a bug, deep networks tend to train without complaint even when the data are wrong (this happened even when creating this lab!), the labels are incorrect, the network is not connected correctly, or the optimization is badly initialized. You may even observe reasonable progress, but the results will just not be as good as you wanted. Or the training progress will look fine, but the model will not generalize to the test data. Over the years, DL practitioners have collected many practical hints on how to avoid falling into these traps. We very much recommend that you read and follow the recipe for training neural networks by Andrej Karpathy. We can confirm that most of the mistakes mentioned there really happen!
And one more tip: very useful packages to try are lovely_tensors and lovely_numpy for debugging, and einops for tensor manipulations. Give them a try ;)
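For illustration, a tiny sketch of how these packages are typically used (the tensor x is a made-up example):

import torch
import lovely_tensors as lt
from einops import rearrange

lt.monkey_patch()   # after this, printing a tensor shows a compact summary (shape, dtype, mean, std, NaN/Inf)

x = torch.randn(8, 3, 32, 32)                   # example batch of image-like tensors
print(x)                                        # concise summary instead of a wall of numbers
x_flat = rearrange(x, 'b c h w -> b (c h w)')   # flatten each image into a vector
print(x_flat.shape)                             # torch.Size([8, 3072])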
We start by replicating the results of Xavier Glorot and Kaiming He and their colleagues on the initialization of deep network weights (see also lecture 6).
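As a reminder of what these papers propose (cf. lecture 6): for a linear layer with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs, the weights are drawn with variance

$$\operatorname{Var}(W_{ij}) = \frac{2}{n_{\text{in}} + n_{\text{out}}} \quad \text{(Xavier/Glorot)}, \qquad \operatorname{Var}(W_{ij}) = \frac{2}{n_{\text{in}}} \quad \text{(Kaiming/He, for ReLU)}.$$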
Create a fully connected network with many (e.g. 50) wide linear layers (e.g. 512 neurons per layer). Modify the forward method to also collect the activation statistics, i.e. the mean and std of the activation values per layer. We will ignore the bias term in the linear layers for this exercise.
Plot the collected statistics, e.g. as an errorbar plot of the per-layer activation mean ± std against the layer index (see the sketch below).
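A minimal sketch of how such a network and the statistics collection might look; the class and variable names here are one possible choice, not prescribed by the template:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

class DeepMLP(nn.Module):
    def __init__(self, depth=50, width=512, activation=nn.ReLU):
        super().__init__()
        # bias is ignored in this exercise
        self.layers = nn.ModuleList(
            [nn.Linear(width, width, bias=False) for _ in range(depth)])
        self.act = activation()

    def forward(self, x):
        means, stds = [], []
        for layer in self.layers:
            x = self.act(layer(x))
            means.append(x.mean().item())   # per-layer activation statistics
            stds.append(x.std().item())
        return x, means, stds

# forward a random batch and plot the statistics
model = DeepMLP()
x = torch.randn(256, 512)
with torch.no_grad():
    _, means, stds = model(x)
plt.errorbar(range(len(means)), means, yerr=stds)
plt.xlabel('layer')
plt.ylabel('activation mean ± std')
plt.show()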
Experiment with the following combinations of activation functions and layer weight matrix $W$ initializations:
ReLU and tanh activations, combined with standard normal, Xavier (Glorot), and Kaiming (He) initializations.
You may implement the initializations yourself, or feel free to call the PyTorch versions. A simple way to add weight initialization is like this:
def __init__(self, init_type, ...):
    ...
    self.init_type = init_type
    self.apply(self.init_weights)

def init_weights(self, m):
    if isinstance(m, nn.Linear):
        if self.init_type == 'name of init type':
            nn.init.uniform_(m.weight.data, -a, a)
            # or nn.init.normal_(m.weight.data, 0, sigma)
        nn.init.zeros_(m.bias.data)
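For completeness, a hedged sketch of how the different initializations could be selected with PyTorch's built-in nn.init functions; the init_type labels are arbitrary and not part of the template:

import torch.nn as nn

def make_init_fn(init_type):
    # returns a function suitable for model.apply(); the labels below are arbitrary
    def init_weights(m):
        if isinstance(m, nn.Linear):
            if init_type == 'normal':
                nn.init.normal_(m.weight, mean=0.0, std=1.0)            # naive N(0, 1)
            elif init_type == 'xavier':
                nn.init.xavier_normal_(m.weight)                        # Var = 2 / (n_in + n_out)
            elif init_type == 'kaiming':
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')  # Var = 2 / n_in
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    return init_weights

# usage: model.apply(make_init_fn('kaiming'))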
Discuss your observations.
In the lecture, you have learned that exploding/vanishing gradients and activations are typically a problem of very deep networks. The plot in lecture 6, slide 3 even suggests that for a shallow network with ReLU non-linearities the activation values stay within reasonable bounds. Here we will study a very small network on a small toy example and demonstrate that this need not be the case when one is not careful enough.
Prepare your training:
Use the provided CircleDataGenerator class and its generate_sample method to create a small 2D toy dataset (see the sketch below).
Implement a small fully connected classifier as a torch.nn.Module.
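A possible skeleton for this part; it assumes that CircleDataGenerator and its generate_sample method come from the provided template.py, and the architecture, sample count and hyperparameters are placeholders only:

import torch
import torch.nn as nn

from template import CircleDataGenerator   # provided; its interface is assumed here

class TinyNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))            # two-class logits

    def forward(self, x):
        return self.net(x)

gen = CircleDataGenerator()
x, y = gen.generate_sample(1000)             # assumed: x float inputs (N, 2), y integer class labels (N,)

model = TinyNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for step in range(1000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()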
Conduct the following experiments: (i) train the network on the raw, unnormalized data, (ii) train it on the data normalized to zero mean and unit variance, and (iii) train it on the raw data scaled down by a constant factor (a sketch of the three data variants is given below). Discuss the differences in training before and after the normalization and with the data scaled down. To visualize the results, use the provided plot_decision_boundary function.
Hint: The performance of your network is expected to be quite good for at least one of the three experiments.
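To make the three data variants concrete, a small sketch (the scaling factor is an arbitrary illustration, and x stands in for the generated inputs):

import torch

x = torch.randn(1000, 2) * 100.0            # stand-in for the generated (unnormalized) inputs

# (i) raw data, as produced by the generator
x_raw = x

# (ii) normalized data: zero mean, unit variance per input dimension
x_norm = (x - x.mean(dim=0)) / x.std(dim=0)

# (iii) the raw data scaled down by a constant factor
x_small = x * 1e-3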
Finally, we will examine dropout regularization. Instead of the toy dataset, we will use the MNIST dataset of hand-written digits.
Use the provided MNISTData class to load the data. Do not forget to switch the model between model.train() and model.eval() modes: dropout is only active during training and must be switched off for evaluation.
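A minimal sketch of a dropout network and the train/eval switching; the architecture and the stand-in batch below are illustrative only, in the lab the data come from the provided MNISTData class:

import torch
import torch.nn as nn

class DropoutMLP(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 512), nn.ReLU(), nn.Dropout(p),
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p),
            nn.Linear(512, 10))

    def forward(self, x):
        return self.net(x)

model = DropoutMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# stand-in batch; in the lab, load real images and labels instead
x = torch.randn(64, 1, 28, 28)
y = torch.randint(0, 10, (64,))

model.train()                  # dropout active: units are randomly zeroed
loss = loss_fn(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()

model.eval()                   # dropout inactive: deterministic forward pass
with torch.no_grad():
    acc = (model(x).argmax(dim=1) == y).float().mean()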