Warning
This page is located in the archive.

HW 1 - gesture recognition from static images

Requirements:
OS: Native Linux distribution or a Linux virtual machine (preferably Ubuntu 16 or later, but 14 will probably work fine)
Python: Python 3. Python 2 might work as well, but you should really be using Python 3 nowadays.
Other libraries: PyTorch, OpenCV. Both libraries can be installed in a matter of minutes, with one command through pip, conda, etc.
Webcam: Any webcam should work.
Optional: Install CUDA if you have an Nvidia GPU (contact me if you have issues with it). You can get a 10-30x speedup over the CPU when using CUDA. Also, if you are Linux-savvy, feel free to install PyTorch from source (it's pretty simple). You might get an additional 1-3x speedup, especially on CPU, due to the use of SIMD instructions such as AVX-512.


Simple static gesture recognition using convolutional neural networks.

In this homework assignment you will train a CNN to classify 7 different hand gestures. The point of this exercise is to familiarize you with the pipeline and procedure for training a deep learning classifier. We will be using Python as the programming language and PyTorch as the deep learning framework.

The assignment consists of the following tasks:

  • Gathering a dataset
  • Implementing a CNN architecture
  • Training and testing the CNN


Download the prepared scripts here and familiarize yourself with them.


Gathering a dataset:

Gather an adequate amount of examples for each gesture. For this you can use the prepared class in the cam_control.py file, which implements functions for capturing images from a webcam. An embedded laptop webcam will work fine. In the “main” file you can set the appropriate boolean flags to enable gathering/training/testing. If you can't get the camera working on your machine for some reason, feel free to use a colleague's machine to perform the dataset gathering.

Tips on gathering a dataset:

  • Practice the gestures beforehand
  • Make sure the gestures include variety
  • Make sure the background is static
  • Perform the gathering in small batches of 50-150 images, with breaks in between
  • Make sure that your hand gesture takes up at least 50% of the camera space

Note
Remember that the NN will fit to the simplest feature that allows it to distinguish the classes. In other words, if you smile the whole time you perform gesture A and frown during gesture B, the NN might completely ignore your gestures and fit to your facial expression.

When using the data gathering script in main.py, any newly gathered data will be appended to the existing dataset.


Implementing a CNN architecture:

Implement a CNN which maps (128,128) grayscale float input images with range [0,1] to a class label. You will be training 6 gestures (“front”, “back”, “right”, “left”, “up”, “down”), which means that you require 7 labels, with the last one being for the “no-gesture” class, “noop”. A template for the CNN class is given in networks.py. You are to fill in the initialization, the forward() function, and the training, testing and prediction procedures for which templates are given. You are free to implement whatever architecture you want. A good place to start would be 3-4 convolutional layers with ReLU and max-pooling in between, followed by 2 fully connected layers with the last layer having 7 units, one for each class. Use a cross-entropy loss between the output units and the labels, and an optimizer with momentum to train the network.
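The suggested starting architecture could be sketched as follows. This is only one possible instantiation, and the class name GestureNet is illustrative; the actual template and its names are in networks.py:

```python
import torch
import torch.nn as nn

class GestureNet(nn.Module):
    """4 conv layers with ReLU + max-pooling, then 2 fully connected layers."""

    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 128 -> 64
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64 -> 32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
        )
        self.fc1 = nn.Linear(64 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, n_classes)  # raw logits; CrossEntropyLoss applies softmax

    def forward(self, x):
        # x: (batch, 1, 128, 128) float tensor in [0, 1]
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.fc2(torch.relu(self.fc1(x)))
```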

Add your implementation to the networks.py file.


Training and testing the CNN:

Train your architecture on your dataset and note both the training and test errors. Depending on your architecture, the number of iterations you choose and your machine, training should take roughly 7-30 minutes on CPU and 20-60 seconds on GPU. You can then use the last part of the main script to test the real-life performance of your trained classifier on a webcam stream: it will draw an arrow in the direction of the label predicted by your classifier, as well as print out the label.
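A minimal training loop with the suggested cross-entropy loss and a momentum optimizer might look like the sketch below. The function name train and its signature are illustrative; fill in the actual templates in networks.py:

```python
import torch
import torch.nn as nn

def train(net, loader, epochs=10, lr=1e-2):
    """Train `net` with SGD + momentum and a cross-entropy loss.

    Returns the mean training loss of the last epoch."""
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    mean_loss = float("inf")
    for epoch in range(epochs):
        total, n = 0.0, 0
        for images, labels in loader:  # images: (B, 1, 128, 128) float, labels: (B,) long
            opt.zero_grad()
            loss = loss_fn(net(images), labels)
            loss.backward()
            opt.step()
            total += loss.item() * images.size(0)
            n += images.size(0)
        mean_loss = total / n
        print(f"epoch {epoch}: mean train loss {mean_loss:.3f}")
    return mean_loss
```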

Note:
To test your NN architecture, gather a small dataset and run your network on it. The first sign that everything is OK is that you are able to almost perfectly fit (memorize) the dataset, even if the test error is bad.

Also, use the various techniques that you have learned in the lectures to prevent overfitting: L2 regularization, dropout, early stopping (this one helps a lot here; don't perform too many iterations).
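All three techniques are a few lines in PyTorch. The sketch below uses a stand-in linear model rather than your CNN, and the early_stop helper is one possible way to implement the idea, not something from the provided scripts:

```python
import torch
import torch.nn as nn

# Dropout goes between the fully connected layers (a stand-in model for your CNN):
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(128 * 128, 128), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 7),
)

# L2 regularization is the optimizer's weight_decay parameter:
opt = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)

def early_stop(val_losses, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience
```

Remember to call net.eval() before testing so that dropout is disabled, and net.train() before resuming training.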

Optional: In networks.py, in the classifier class, implement a method which returns a confusion matrix for the test data (or a subset thereof) so that you can see which gestures the classifier confuses the most. This can help you think of ways to modify the gestures so that they are not confused.
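A confusion matrix is simply a count of (true label, predicted label) pairs. One possible sketch, written as a free function rather than the method the assignment asks for:

```python
import torch

@torch.no_grad()
def confusion_matrix(net, loader, n_classes=7):
    """Rows index the true label, columns the predicted label."""
    cm = torch.zeros(n_classes, n_classes, dtype=torch.long)
    net.eval()
    for images, labels in loader:
        preds = net(images).argmax(dim=1)
        for t, p in zip(labels, preds):
            cm[t, p] += 1
    return cm
```

A perfect classifier produces a diagonal matrix; large off-diagonal entries tell you which pair of gestures to redesign.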


Evaluation and grading

Your code should be uploaded by the deadline and will be evaluated individually. It is mandatory that your dataset loads and goes through the training and test procedures successfully.


What results to expect and possible improvements

If you have gathered a decent dataset and correctly implemented the CNN and training procedures, you will have noticed that you can fit the training set almost perfectly, with a mean cross-entropy of < 0.1, while the mean cross-entropy for the test set is around 0.2-0.6 (depending heavily on how good the dataset is). This is obviously an overfit. The solution in this case is very simple: gather a much larger dataset (of uncorrelated examples). Even then, the detector will likely not work that well if the camera is moved to a different location or used with different people. However, there are much better approaches which take care of these problems.

One such approach would be to use a pose-estimation framework trained on a very large and varied dataset to predict the complete pose of the joints of the human body and the fingers of the hand. Such a framework can predict the (x,y) coordinates of the shoulder, elbow, wrist and fingers, which can then be used as input to a neural network trained to recognize gestures.

courses/b3b33vir/tutorials/hw1/start.txt · Last modified: 2018/11/05 12:35 by petrito1