Warning

This page is located in archive.

In this lab, we: (i) learn how to make a simple neural network (NN) in pure NumPy, and (ii) build and train Convolutional Neural Networks (CNNs) using the PyTorch framework. In particular, we will train several different CNN classifiers of handwritten digits (0-9) from 28×28 MNIST grayscale images.

Today, there are many resources available about deep learning and CNNs. You can find some useful ones at the bottom of this page. However, hold off on reading them all at first. In this lab, we proceed slowly, adding one component after another. There are links to relevant explanations in the text itself, so whenever you meet a new term, you also learn its meaning. You may want to revisit these links or check some other resources after finishing the lab.

General information for Python development.

To fulfil this assignment, you need to submit these files (all packed in a single `.zip` file) into the upload system:

- `numpy_nn.py` containing the following implemented classes:
  - `ReLU` - REctified Linear Unit layer
  - `Linear` - Linear (aka Fully Connected) layer
  - `Sigmoid` - Logistic Sigmoid layer
  - `SE` - Squared Error loss layer
- `numpy_nn_training.png`, `numpy_nn_classification.png`
- `pytorch_cnn.py` containing the following implemented classes and methods:
  - `MyNet` - Your CNN model
  - `classify` - classifies data using a trained network
- `model.pt` - trained MyNet state

**Use the template of the assignment.** When preparing a zip file for the upload system, **do not include any directories**; the files have to be in the zip file root.

**The challenge:** The network will be automatically evaluated by the upload system and ranked in the online score board. The gained points from the assignment depend on the rank of your algorithm (assuming your code passes all the AE tests):

- 1st place: 16 points
- 2nd place: 14 points
- 3rd place: 12 points
- 4th place: 10 points
- every submission with performance worse than the baseline: 0 points
- 8 points otherwise

**Deadline for the submission is Sun Jan 12 23:59.** The points will be assigned after the deadline. Every later submission earns 6 points. You have to be better than the baseline to complete this lab.

We start by implementing a few simple layers in pure NumPy to understand what is going on inside machine learning frameworks like PyTorch, TensorFlow, etc.

Recall that a neural network is an acyclic graph with individual nodes being either simple Perceptrons (with non-linearity) or other 'layers'. A layer is an object which has two main functionalities: (i) it can pass the data forward during the prediction/classification phase, implemented in the `layer.forward` method, and (ii) it computes gradients with respect to (w.r.t.) its inputs, which are then passed backward during back-propagation training, implemented in the `layer.backward` method.

We highly recommend the lecture by Karpathy for a clear explanation of gradient propagation during back-propagation training. In short, one starts by propagating the input data through the `forward` calls in each layer sequentially, from the first to the last layer. Then, starting at the last layer, one computes the gradients (with the `backward` method) and passes them backward through the layers in reverse order.

You will see that out of all the layers we consider, only the fully-connected layer (or linear layer, as we call it) has trainable parameters. During training, we update these parameters using the partial derivatives of the loss w.r.t. the parameters of the network. You might consider the parameters as another extra input of the layer (along with the outputs from the previous layer). For the details, see Karpathy's lecture.

For a layer with trainable parameters, there is also a `layer.grads` method, which returns the gradients of the loss w.r.t. its parameters.
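The forward/backward protocol described above can be sketched with a toy layer. `Scale`, `forward_pass`, and `backward_pass` are hypothetical names used only for illustration; they are not part of the assignment template, and the template's exact signatures may differ.

```python
import numpy as np


class Scale:
    """Toy layer (hypothetical, not part of the assignment): f(x) = a * x."""

    def __init__(self, a):
        self.a = a

    def forward(self, x):
        return self.a * x

    def backward(self, dy):
        # Chain rule: dL/dx = dL/dy * dy/dx, and dy/dx = a for this layer.
        return self.a * dy


def forward_pass(layers, x):
    # Propagate the data from the first layer to the last one.
    for layer in layers:
        x = layer.forward(x)
    return x


def backward_pass(layers, dy):
    # Propagate the gradient from the last layer back to the first one.
    for layer in reversed(layers):
        dy = layer.backward(dy)
    return dy


layers = [Scale(2.0), Scale(-3.0)]
x = np.array([[1.0, 2.0]])
y = forward_pass(layers, x)                   # [[-6., -12.]]
dx = backward_pass(layers, np.ones_like(y))   # [[-6., -6.]]
```

Each layer only needs to know its own local derivative; the chain rule is applied implicitly by calling `backward` in reverse order.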

**Implement the following layers.**

In the `forward` method, the `Linear` layer implements the function $f(\mathbf{x}) = \mathbf{W} \mathbf{x} + \mathbf{b}$. **Hint**: You may need to store the inputs $\mathbf{x}$ for later use in the `backward` method.

In the `backward` method, the gradient of the loss w.r.t. $\mathbf{W}$ and $\mathbf{b}$ is computed. The method returns the gradient of the loss w.r.t. the layer's input.
Both `forward` and `backward` methods must work with several data samples stacked together (these are called batches or minibatches).
Make sure that you **check your shapes against the docstring specifications!** Remember, **CHECK YOUR SHAPES!** Reshaping, adding extra dimensions or transposing the arrays may come in handy (you might want to check the NumPy - HOW TO thread at the forum).

Make sure that the gradients' computation is working for batched data and their **shapes are the same as the docstring tells you**. If you are getting a weird error in BRUTE, it is very likely that your **shapes are wrong!**
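As a rough illustration of the shape bookkeeping, here is a minimal sketch of a batched fully connected layer. It assumes that samples are stacked as rows, so the weights are stored as the transpose of $\mathbf{W}$ in the formula above, and that gradients are summed over the batch. The class name `LinearSketch` and these conventions are assumptions; the docstrings in the template take precedence.

```python
import numpy as np


class LinearSketch:
    """Fully connected layer for batched inputs (shape convention assumed)."""

    def __init__(self, n_inputs, n_units, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_inputs, n_units))
        self.b = np.zeros(n_units)

    def forward(self, X):
        # X: (n_samples, n_inputs) -> output: (n_samples, n_units)
        self.X = X  # remembered for the backward pass
        return X @ self.W + self.b

    def backward(self, dY):
        # dY: gradient of the loss w.r.t. the output, (n_samples, n_units)
        self.dW = self.X.T @ dY      # (n_inputs, n_units), summed over the batch
        self.db = dY.sum(axis=0)     # (n_units,)
        return dY @ self.W.T         # (n_samples, n_inputs)

    def grads(self):
        return [self.dW, self.db]


layer = LinearSketch(n_inputs=4, n_units=3)
X = np.random.default_rng(1).normal(size=(5, 4))
Y = layer.forward(X)
dX = layer.backward(np.ones_like(Y))
print(Y.shape, dX.shape, layer.dW.shape, layer.db.shape)
```

A finite-difference comparison of `dW` against small perturbations of `W` is a cheap way to catch transposition mistakes before uploading.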

ReLU is a commonly used non-linear layer which computes $f(\mathbf{x}) = \max(\mathbf{0}, \mathbf{x})$. Both `forward` and `backward` methods are much simpler as the layer has no parameters.

Make sure it works for arbitrarily shaped inputs (even 3D, 4D, or more-D)!

And again, you will need to remember what happened during the `forward` pass in order to compute the gradient correctly.
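One possible way to "remember what happened" is to store a boolean mask of the positive inputs. The sketch below is an illustration under that assumption (the class name `ReLUSketch` is hypothetical, and the gradient at exactly zero is taken to be zero); it relies only on element-wise operations, so it works for inputs of any dimensionality.

```python
import numpy as np


class ReLUSketch:
    def forward(self, X):
        # Remember where the input was positive; works for any array shape.
        self.mask = X > 0
        return np.where(self.mask, X, 0.0)

    def backward(self, dY):
        # The gradient passes through only where the input was positive.
        return dY * self.mask


relu = ReLUSketch()
X = np.array([[[-1.0, 2.0], [0.0, 3.0]]])   # a 3D input
Y = relu.forward(X)                         # [[[0., 2.], [0., 3.]]]
dX = relu.backward(np.full_like(X, 5.0))    # [[[0., 5.], [0., 5.]]]
```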

You already know the logistic sigmoid from the logistic regression lab. The `Sigmoid` layer implements the function $f(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{x}}}$. As in the ReLU layer, there are no parameters in the `Sigmoid` layer. The `backward` method only computes the gradient of the loss w.r.t. the layer inputs.
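Since the derivative of the sigmoid can be written through its own output, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, a sketch of the layer may simply cache the output of the forward pass. The class name `SigmoidSketch` is hypothetical, and the code does not guard against overflow warnings for very large negative inputs.

```python
import numpy as np


class SigmoidSketch:
    def forward(self, X):
        # Store the output: the derivative is expressed through it.
        self.Y = 1.0 / (1.0 + np.exp(-X))
        return self.Y

    def backward(self, dY):
        # d sigma / dx = sigma(x) * (1 - sigma(x))
        return dY * self.Y * (1.0 - self.Y)


sig = SigmoidSketch()
Y = sig.forward(np.array([[0.0, 100.0]]))   # approximately [[0.5, 1.0]]
dX = sig.backward(np.ones((1, 2)))          # approximately [[0.25, 0.0]]
```

The second value shows the saturation effect: for large inputs the gradient is practically zero.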

The squared error loss computes $(\mathbf{x}-\mathbf{y})^2$. The `backward` pass computes the gradient of the loss only w.r.t. the input $\mathbf{x}$.
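A minimal sketch of such a loss layer follows. The signatures are assumptions: here `forward` takes both the network output and the targets, and `backward` takes no arguments, which may not match the template. The gradient follows from $\frac{\partial}{\partial x}(x - y)^2 = 2(x - y)$.

```python
import numpy as np


class SESketch:
    def forward(self, X, T):
        # X: network outputs, T: targets (same shape); element-wise loss.
        self.diff = X - T
        return self.diff ** 2

    def backward(self):
        # d/dx (x - y)^2 = 2 (x - y)
        return 2.0 * self.diff


loss = SESketch()
L = loss.forward(np.array([[0.8], [0.1]]), np.array([[1.0], [0.0]]))
dX = loss.backward()
print(L.ravel(), dX.ravel())
```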

Your task is to train a network for a binary classification of two selected classes (numbers) from the MNIST dataset. The data loading, model definition, and training are done in the `if __name__ == '__main__':` section of the `numpy_nn.py` template.

Experiment with the hyper-parameter settings (Slow training? Try increasing `learning_rate`. No convergence? Try decreasing `learning_rate`) and the `model` architecture. A typical model has several layers organised as Linear → ReLU → Linear → ReLU → … → Linear → Sigmoid, but many other options are possible.

- Does adding more fully-connected layers help?
- Does having more output units in the fully-connected layer help?
- What activation functions work the best?
- Experiment with the batch size and watch how it influences the training.
- Feel free to implement other layers and non-linearities as well.
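The effect of `learning_rate` mentioned above can be seen even on a one-dimensional toy problem. The sketch below minimises $f(w) = w^2$ by plain gradient descent; the function `gradient_descent` and the chosen rates are illustrative assumptions, not values recommended for the assignment.

```python
def gradient_descent(learning_rate, w=1.0, steps=20):
    """Minimise f(w) = w**2 by plain gradient descent and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w                  # df/dw
        w = w - learning_rate * grad
    return w


results = {lr: gradient_descent(lr) for lr in (0.01, 0.4, 1.1)}
for lr, w in results.items():
    print(f"learning_rate={lr}: |w| after 20 steps = {abs(w):.3g}")
```

With the smallest rate the iterate is still far from the minimum after 20 steps, the middle rate converges quickly, and the largest rate makes the iterate grow, i.e. the training diverges.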

Try to achieve results like these.

Save the training images as `numpy_nn_training.png` and `numpy_nn_classification.png`.

Our simple NumPy NN works well for a simple two-digit classification problem, but for more difficult tasks it quickly becomes inefficient, and it is better to use one of the highly optimized public libraries. We will work with one of the currently most popular deep learning frameworks: PyTorch. Writing a neural network in PyTorch is straightforward: there are many layers readily available, and you can operate on the data passing through the network just like you would in plain NumPy.

A template for this part of the assignment is in `pytorch_cnn.py`.

Start by installing PyTorch - follow the instructions at https://pytorch.org/get-started/locally/.

Your task is to classify images of handwritten digits 0-9. Fortunately, there is quite a lot of annotated data readily available on the internet: MNIST. To simplify loading the data, use the torchvision MNIST dataset, which downloads and prepares the data for you (see the template).

Unfortunately, during the testing phase, your customer (a company called BRUTE & SON) is using a low-quality camera with a lot of noise of unknown characteristics. So, even though the handwriting style of the MNIST dataset is representative enough, the images are heavily corrupted (see examples below). We do not need to mention that the camera is proprietary and you will never have a chance to test it on your images, do we?

Your only chance is to use the data augmentation technique and try to mimic the effect of the camera. You may also try training without augmentation, of course, but we have tried and failed already. That's why we are hiring you after all ;)
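As a starting point for such augmentation, the sketch below adds Gaussian noise to a batch of images. This is only a guess at a noise model: the real camera noise is unknown, and the function `add_gaussian_noise` and its `sigma` value are hypothetical. The random batch stands in for MNIST images scaled to [0, 1].

```python
import torch


def add_gaussian_noise(images, sigma=0.3, generator=None):
    """Hypothetical augmentation: additive Gaussian noise, clipped to [0, 1]."""
    noise = torch.randn(images.shape, generator=generator) * sigma
    return (images + noise).clamp(0.0, 1.0)


g = torch.Generator().manual_seed(0)
batch = torch.rand(8, 1, 28, 28, generator=g)   # stand-in for MNIST images
noisy = add_gaussian_noise(batch, sigma=0.3, generator=g)
print(noisy.shape, float(noisy.min()), float(noisy.max()))
```

Applying a freshly sampled corruption in every training iteration, rather than once in advance, gives the network a different noisy version of each image in every epoch.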

Examples of the clean MNIST data:

Examples of the noisy data from the camera:

Let's start with a simple one-layer fully-connected network. After the first part of this assignment, you should understand how it works under the hood. Here we need to extend the binary classification to a multi-class one. Fortunately, this is quite easy using a softmax layer.

The template contains code for training a simple one-layer network `FCNet` with log-softmax regression on the output, trained using stochastic gradient descent.
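For orientation, a one-layer network with a log-softmax output and a single SGD step might look roughly as follows. `OneLayerNet` is a hypothetical stand-in, not the template's `FCNet`, and the random tensors only imitate the shape of an MNIST batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OneLayerNet(nn.Module):
    """Hypothetical stand-in for the template's FCNet."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(28 * 28, 10)

    def forward(self, x):
        x = x.flatten(start_dim=1)                # (N, 1, 28, 28) -> (N, 784)
        return F.log_softmax(self.fc(x), dim=1)   # log-probabilities, 10 classes


torch.manual_seed(0)
model = OneLayerNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.rand(16, 1, 28, 28)       # random stand-in for an MNIST batch
labels = torch.randint(0, 10, (16,))

log_probs = model(images)
loss = F.nll_loss(log_probs, labels)     # negative log-likelihood loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Combining `log_softmax` with `nll_loss` is equivalent to the cross-entropy loss on the raw outputs of the linear layer.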

**Tasks:**

- Train the `FCNet` network on the MNIST data (see the `train` function in the template)

- Add another fully-connected layer with 1000 hidden units and with sigmoid non-linearity. Do you get better results?

- Try to add more layers. Does the network improve with more and more layers?

- How many weights do you learn in each case?

- Try to substitute the sigmoid non-linearity with Rectified Linear Unit (ReLU). It helps to avoid the vanishing gradient problem.

- Experiment with the number of layers, number of hidden units and try to get the best possible result.
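To answer the question about the number of learned weights, the parameters of a PyTorch model can be counted directly. The helper `count_parameters` is a hypothetical name; the two models below only mirror the one-layer network and the variant with 1000 hidden units mentioned in the tasks.

```python
import torch.nn as nn


def count_parameters(model):
    # Sum the number of elements of all trainable tensors (weights and biases).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


one_layer = nn.Linear(784, 10)
two_layer = nn.Sequential(nn.Linear(784, 1000), nn.Sigmoid(), nn.Linear(1000, 10))

print(count_parameters(one_layer))   # 7850   = 784*10 + 10
print(count_parameters(two_layer))   # 795010 = 784*1000 + 1000 + 1000*10 + 10
```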

One of the main disadvantages of using fully-connected layers on images is that they do not take into account the spatial structure of the image. Imagine that you randomly permute the spatial arrangement of the image pixels (the same way in both training and test data) and re-train the fully-connected network. These perturbed images become completely unlearnable for humans, since humans make a prior assumption about the spatial arrangement. Nevertheless, the expected testing error of the re-trained network on this randomly perturbed dataset will be the same, since the network does not make any such assumptions and learns the spatial arrangement from scratch from the perturbed training data. When we learn on images, the architecture of the network should reflect their particular spatial arrangement.

We impose the spatial arrangement by introducing convolutional layers. The convolution works by shifting a local template (often called a convolution kernel or a local receptive field) over the image and computing its response at every single position in the image. For example, when the input image is 28×28 and we compute the convolution with a 5×5 kernel, then the resulting response image will be 24×24 (unless we pad the image with zeros). When learned, these templates often correspond to edge or corner detectors.
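The output size in this example follows from the usual formula $\lfloor (n + 2p - k) / s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$, and stride $s$. The helper `conv_output_size` below is a hypothetical name used to check the arithmetic.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


print(conv_output_size(28, kernel=5))                       # 24
print(conv_output_size(28, kernel=5, padding=2))            # 28
print(conv_output_size(28, kernel=3, stride=2, padding=1))  # 14
```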

Another disadvantage of the fully-connected layers is that the number of parameters grows quickly with new layers. This means significantly more parameters need to be learned and thus more data need to be used to avoid overfitting.

**Tasks:**

- Train a CNN with one convolutional layer (3×3 kernel) followed by a ReLU non-linearity and a fully connected layer with log-softmax criterion (see `SimpleCNN` provided in the template).

Notice that the kernel is not evaluated at every position but only at every second one (stride=2). This makes the second layer smaller while keeping most of the information present (as nearby convolutions result in similar values). We also added padding=1, which adds zeros around the image before computing the convolution. This way, the size of the output stays the same when stride=1 and is halved when stride=2.

- Are you getting better results than with fully connected networks? Does taking into account the spatial arrangement help?

- How many parameters are you learning now? Compare this to the case of the two layered fully connected network.

- Add one more convolutional layer with ReLU (again with stride=2).

- Visualise the learned filters in the first convolutional layer (they are 3×3 matrices).

- To train a CNN, one still needs a lot of data as the number of parameters being estimated is large. To avoid over-fitting, two other techniques are commonly used: max-pooling and dropout. Replace stride=2 with stride=1 followed by max-pooling, and add a dropout layer before the fully connected layer.

- Experiment :)

**Your goal is to train the best possible network (implement it in `MyNet`) and submit it to the upload system for the competition.** You can play with:

- longer training
- number of layers
- number and size of the convolutional kernels
- batch size
- data augmentation (to get more data)
- pre-training the network on another (larger) dataset
- learning rate schedule
- adding batch normalization
- use GPUs to speed up the training
- … feel free to use any trick you find online or which you invent yourself.
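As one possible starting point for these experiments, the sketch below replaces stride=2 with stride=1 plus max-pooling and puts dropout before the fully connected layer, as suggested in the tasks above. `PooledCNN`, its channel counts, and the dropout probability are illustrative assumptions; it is neither the template's `SimpleCNN` nor a tuned `MyNet`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PooledCNN(nn.Module):
    """Hypothetical example, not the template's SimpleCNN or a tuned MyNet."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(2)          # halves the spatial size
        self.dropout = nn.Dropout(p=0.5)     # active only in training mode
        self.fc = nn.Linear(16 * 7 * 7, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))       # (N, 8, 14, 14)
        x = self.pool(F.relu(self.conv2(x)))       # (N, 16, 7, 7)
        x = self.dropout(x.flatten(start_dim=1))   # (N, 784)
        return F.log_softmax(self.fc(x), dim=1)


model = PooledCNN()
model.eval()                                 # disables dropout for inference
with torch.no_grad():
    out = model(torch.zeros(4, 1, 28, 28))
print(out.shape)                             # torch.Size([4, 10])
```

Remember to switch between `model.train()` and `model.eval()`: dropout must be active during training and disabled during validation and classification.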

**Beware!** You are supposed to fine-tune the network using your validation set. The number of uploads is not limited, but keep it small. If we see unreasonably many uploads for one student, the network could be disqualified as over-fitted to the test data!

courses/be5b33rpz/labs/cnn/start.txt · Last modified: 2019/12/18 23:37 by sochmjan