In this lab, we: (i) learn how to build a simple neural network (NN) in pure NumPy, and (ii) build and train Convolutional Neural Networks (CNNs) using the PyTorch framework. In particular, we will train several different CNN classifiers of fashion articles from 28×28 FashionMNIST grayscale images.
Today, there are many resources available about deep learning and CNNs. You can find some useful ones at the bottom of this page. However, hold off on reading them all at first. In this lab, we go slowly and add one component after another. There are links to relevant explanations in the text itself, so whenever you meet a new term, you also learn its meaning. You may want to revisit these links or check some other resources after finishing the lab.
General information for Python development.
To fulfil this assignment, you need to submit these files (all packed in a single .zip file) into the upload system:
- numpy_nn.py – with implemented ReLU, Linear, Sigmoid and SE layers
- numpy_nn_training.png
- numpy_nn_classification.png
- pytorch_cnn.py – with MyNet and classify
- model.pt – your trained network
Use the template of the assignment. When preparing the zip file for the upload system, do not include any directories; the files have to be in the zip file root.
The challenge: The network will be automatically evaluated by the upload system and ranked on the online scoreboard (updated in real time). The points gained from the assignment depend on the rank of your algorithm (assuming your code passes all the AE tests):
The deadline for the submission is Sun Jan 10 23:59; then the points will be assigned. Every later submission is worth 6 points. You have to be better than the baseline to complete this lab.
And the winners of the CNN challenge 2020/21 are… :)
We start by implementing a few simple layers in pure NumPy to get an understanding of what is going on inside machine learning frameworks like PyTorch, TensorFlow, etc.
Recall that a neural network is an acyclic graph with individual nodes being either simple perceptrons (with a non-linearity) or other 'layers'. A layer is an object with two main functionalities: (i) it can pass the data forward during the prediction/classification phase – implemented in the layer.forward method, and (ii) it computes gradients with respect to (w.r.t.) its inputs, which are then passed backward during back-propagation training – implemented in the layer.backward method. This allows chaining many layers into possibly large graphs and then efficiently computing the gradient of the loss function w.r.t. every parameter.
We highly recommend the lecture by Karpathy (or the lecture by Shekhovtsov) for a clear explanation of gradient propagation during back-propagation. In short, one first propagates the input data through the forward calls of each layer, sequentially from the first to the last layer. Then, starting at the last layer, one computes the gradients (with the backward function) and passes them to the preceding layer; this is repeated in every layer. After all the gradients are computed, we train the network by gradient descent.
You will see that, out of all the layers we consider, only the fully-connected layer (or linear layer, as we call it) has trainable parameters. During training, we update these parameters using the partial derivatives of the loss w.r.t. the parameters of the network. You might consider the parameters as another extra input of the layer (along with the outputs from the previous layer). For a layer with trainable parameters, there is also a method layer.grads in the template which should return the gradients of the loss w.r.t. its parameters.
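To make the forward/backward/grads interplay concrete, here is a minimal sketch of one training step for a sequential model, assuming the layer.forward/layer.backward/layer.grads convention described above and the Linear layer you will implement below. The attribute names W and b and the exact calling convention of the loss are assumptions, so follow the numpy_nn.py template rather than this sketch.

```python
def training_step(layers, loss, x, target, learning_rate):
    # Forward pass: propagate the batch through all layers in order.
    for layer in layers:
        x = layer.forward(x)
    loss_value = loss.forward(x, target)

    # Backward pass: start at the loss and go through the layers in reverse;
    # each layer receives dL/d(its output) and returns dL/d(its input).
    grad = loss.backward()
    for layer in reversed(layers):
        grad = layer.backward(grad)

    # Gradient descent: only the Linear layers have trainable parameters.
    for layer in layers:
        if isinstance(layer, Linear):
            dW, db = layer.grads()
            layer.W -= learning_rate * dW   # W, b attribute names are an assumption
            layer.b -= learning_rate * db

    return loss_value
```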
Implement the following layers.
In the forward method, the Linear layer implements the function $f(\mathbf{x}) = \mathbf{W} \mathbf{x} + \mathbf{b}$. Hint: you may need to store the input $\mathbf{x}$ for later use in the backward method.
In the backward method, the gradients of the loss w.r.t. $\mathbf{W}$ and $\mathbf{b}$ are computed. The method returns the gradient of the loss w.r.t. the layer's input. Both forward and backward methods must work with several data samples stacked together (these are called batches or minibatches). Make sure that you check your shapes against the docstring specifications! Remember, CHECK YOUR SHAPES! Reshaping, adding extra dimensions or transposing the arrays may come in handy (you might want to check the NumPy - HOW TO thread at the forum).
Make sure that the gradient computation works for batched data and that the shapes match what the docstring tells you. If you are getting a weird error in BRUTE, it is very likely that your shapes are wrong!
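As an illustration only, a batched NumPy implementation might look like the sketch below. The shape convention assumed here (inputs of shape (batch, in_dim), weights of shape (out_dim, in_dim)) and the weight initialisation are assumptions – follow the docstrings in the template, not this sketch.

```python
import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim):
        # Small random initialisation; the template may prescribe a different one.
        self.W = np.random.randn(out_dim, in_dim) * 0.01
        self.b = np.zeros(out_dim)

    def forward(self, x):
        # x: (batch, in_dim) -> output: (batch, out_dim)
        self.x = x                       # remember the input for the backward pass
        return x @ self.W.T + self.b

    def backward(self, dL_dy):
        # dL_dy: (batch, out_dim), gradient of the loss w.r.t. the layer output
        self.dW = dL_dy.T @ self.x       # (out_dim, in_dim)
        self.db = dL_dy.sum(axis=0)      # (out_dim,)
        return dL_dy @ self.W            # (batch, in_dim), gradient w.r.t. the input

    def grads(self):
        return self.dW, self.db
```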
ReLU is a commonly used non-linear layer which computes $f(\mathbf{x}) = \max(\mathbf{0}, \mathbf{x})$. Both forward and backward methods are much simpler as the layer has no parameters.
Make sure it works for arbitrarily shaped inputs (even 3D, 4D, or more-D)!
And again, you will need to remember what happened during the forward pass in order to compute the gradient correctly.
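A possible sketch (again, only an illustration of the idea): store a mask of the positive entries in the forward pass and reuse it in the backward pass.

```python
import numpy as np

class ReLU:
    def forward(self, x):
        # Works for inputs of any shape (2D, 3D, 4D, ...).
        self.mask = x > 0                # remember where the input was positive
        return np.maximum(0.0, x)

    def backward(self, dL_dy):
        # The gradient passes through only where the input was positive.
        return dL_dy * self.mask
```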
You already know the logistic sigmoid from the logistic regression lab. The Sigmoid layer implements the function $f(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{x}}}$. As in the ReLU layer, there are no parameters in the Sigmoid layer. The backward method only computes the gradient of the loss w.r.t. the layer inputs.
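A possible sketch, using the identity $\sigma'(\mathbf{x}) = \sigma(\mathbf{x})(1 - \sigma(\mathbf{x}))$, so it is enough to remember the forward output:

```python
import numpy as np

class Sigmoid:
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))   # remember the output for backward
        return self.y

    def backward(self, dL_dy):
        return dL_dy * self.y * (1.0 - self.y)
```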
The squared error loss computes $(\mathbf{x}-\mathbf{y})^2$. The backward pass computes the gradient of the loss only w.r.t. the input $\mathbf{x}$.
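A possible sketch; whether the loss is summed or averaged over the batch is defined by the template's docstring, not by this example.

```python
import numpy as np

class SE:
    def forward(self, x, y):
        self.x, self.y = x, y            # remember both inputs for backward
        return (x - y) ** 2              # element-wise squared error

    def backward(self):
        # Gradient w.r.t. the prediction x only (y is the fixed target).
        return 2.0 * (self.x - self.y)
```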
Your task is to train a network for binary classification of two selected classes (digits) from the MNIST dataset. The data loading, model definition, and training are done in the if __name__ == '__main__': section of the numpy_nn.py template.
Experiment with the hyper-parameter settings (Slow training? Try increasing learning_rate. No convergence? Try decreasing learning_rate) and the model architecture. A typical model has several layers organised as Linear → ReLU → Linear → ReLU → … → Linear → Sigmoid, but many other options are possible.
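For example, the model might be assembled from the layers above as a plain sequence; the layer sizes here are purely illustrative and the template's actual model construction may differ.

```python
# A hypothetical architecture for binary classification of 28x28 images:
# 784 inputs -> 1 output probability. Sizes are only an example.
model = [
    Linear(784, 64),
    ReLU(),
    Linear(64, 16),
    ReLU(),
    Linear(16, 1),
    Sigmoid(),
]
loss = SE()
learning_rate = 0.1   # increase if training is slow, decrease if it diverges
```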
Try to achieve results like these.
Save the resulting training and classification figures as numpy_nn_training.png and numpy_nn_classification.png.
Our simple NumPy NN works well for a simple two-digit classification problem, but if we want to solve a more difficult task, it quickly becomes inefficient and it is better to use one of the highly optimized public libraries. We will work with the currently most popular CNN framework - PyTorch. Writing a neural network in PyTorch is straightforward - there are many layers readily available, and you can operate on the data passing through the network just like you would in plain NumPy.
A template for this part of the assignment is in pytorch_cnn.py.
Start by installing PyTorch - follow the instructions at https://pytorch.org/get-started/locally/. Radim Shpetleek™® recommends:
conda install pytorch torchvision cpuonly -c pytorch
Your task is to classify images of fashion articles. Fortunately, there is quite a lot of annotated data readily available on the internet: FashionMNIST. To simplify the data loading, use the torchvision FashionMNIST dataset, which downloads and prepares the data for you (see the template).
Unfortunately, during the testing phase, your customer (a company called BRUTE & SON) is using a private collection of real-world images. The testing set has the same classes and similar statistics as the training data. Nevertheless, even though the style of the article images in the FashionMNIST dataset is representative enough, the test images are not from the FashionMNIST test subset.
If you intend to pass this lab only, a custom architecture and 10% of the FashionMNIST dataset should be fine for training. However, if you want to earn some bonus points by training the best performing network, you should use data augmentation and make your network robust to the brightness and color shifts expected in the real-world data. We have tried with FashionMNIST data only and things did not go well. That's why we are hiring you after all ;)
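For example, torchvision transforms can add brightness/contrast jitter and small geometric perturbations at training time. The particular transforms and their parameter values below are only an illustration, not a recommended recipe.

```python
import torchvision.transforms as T
from torchvision.datasets import FashionMNIST

# Augmentations are applied to the PIL images before converting to tensors.
train_transform = T.Compose([
    T.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    T.ColorJitter(brightness=0.3, contrast=0.3),
    T.ToTensor(),
    T.Normalize((0.5,), (0.5,)),
])

train_set = FashionMNIST(root='data', train=True, download=True,
                         transform=train_transform)
```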
Examples of FashionMNIST data:
Let's start with a simple one-layer fully-connected network. After the first part of this assignment, you should understand how it works under the hood. Here we need to extend the binary classification to the multi-class case. Fortunately, this is quite easy using a softmax layer.
The template contains code for training a simple one-layer network FCNet with log-softmax regression on the output, trained using stochastic gradient descent.
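Such a one-layer network might look roughly like the sketch below; the actual FCNet in the template may differ in details, so treat this only as an illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # A single fully-connected layer: 28*28 pixels -> class scores.
        self.fc = nn.Linear(28 * 28, num_classes)

    def forward(self, x):
        x = x.view(x.shape[0], -1)               # flatten the batch of images
        return F.log_softmax(self.fc(x), dim=1)  # log-probabilities per class

# Training would then combine, e.g., torch.optim.SGD with F.nll_loss
# applied to the log-softmax outputs.
```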
Task: train the FCNet network using the code provided in the template.
One of the main disadvantages of using fully-connected layers on images is that they do not take into account the spatial structure of the image. Imagine that you randomly perturb the spatial arrangement of the image pixels (in both training and test data) and re-train the fully-connected network. These perturbed images become completely unlearnable for humans, since humans make a prior assumption about the spatial arrangement. Nevertheless, the expected test error of the re-trained network on this randomly perturbed dataset will be the same, since the network makes no such assumption and learns the spatial arrangement from scratch from the perturbed training data. When we learn on images, the architecture of the network should reflect their particular spatial arrangement.
We impose the spatial arrangement by introducing convolutional layers. A convolution works by shifting a local template (often called a convolution kernel or a local receptive field) over the image and computing its response at every position in the image. For example, when the input image is 28×28 and we compute a convolution with a 5×5 kernel, the resulting response image will be 24×24 (unless we pad the image with zeros). When learned, these templates often correspond to edge or corner detectors.
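You can check the output size quickly: a 5×5 convolution without padding shrinks each spatial dimension by 4 (28 − 5 + 1 = 24). A tiny PyTorch check (the number of output channels here is arbitrary):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5)  # no padding
x = torch.zeros(1, 1, 28, 28)        # a batch with one 28x28 grayscale image
print(conv(x).shape)                 # torch.Size([1, 8, 24, 24])
```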
Another disadvantage of the fully-connected layers is that the number of parameters grows quickly with new layers. This means significantly more parameters need to be learned and thus more data need to be used to avoid overfitting.
SimpleCNN
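A small convolutional network such as a SimpleCNN might combine a few convolution + ReLU + max-pooling blocks followed by a fully-connected classifier. The sketch below is only illustrative – it is not the required architecture and the channel counts are arbitrary.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5)   # 28x28 -> 24x24
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5)  # 12x12 -> 8x8
        self.fc = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)     # 24x24 -> 12x12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)     # 8x8 -> 4x4
        x = x.view(x.shape[0], -1)
        return F.log_softmax(self.fc(x), dim=1)
```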
Beware! You are supposed to fine-tune the network using your validation set. The number of uploads is not limited, but keep it small. If we see unreasonably many uploads for one student, the network could be disqualified as over-fitted to the test data!
Note that some students report a better user experience when training their PyTorch network in Google Colab.