Support Vector Machines (SVM)

In the previous lab we have switched from the generative approach of classifier design to the discriminative one. Particularly, we have implemented the Perceptron algorithm. In this lab, we will continue examining the discriminative domain by implementing popular Support Vector Machines (SVM) classifier.

The perceptron algorithm has despite its simplicity several drawbacks. The first drawback is that it finds just some separating hyperplane (decision boundary), but ideally we would like to find the optimal separating hyperplane (at least in some sense). Another important drawback is, that perceptron cannot deal with noisy data.

The SVM are designed to handle both of these drawbacks. Given an annotated training data set $$ \mathcal{T} = \{ (\mathbf{x}_i, y_i)\}_{i = 1}^{m} \quad,\ \text{where} \ \mathbf{x}_i \in \mathbb{R}^n,\ y_i \in \{-1, 1\} $$ the SVM primary task is formulated as $$ \mathbf{w^*}, b^*, \xi_i^* = \arg\min_{\mathbf{w}, b, \xi_i} \left( \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i=1}^{m}{\xi_i} \right) $$ subject to

\begin{align} \langle\mathbf{w}, \mathbf{x}_i\rangle + b & \ge +1 - \xi_i, \quad y_i = +1, \\ \langle\mathbf{w}, \mathbf{x}_i\rangle + b & \le -1 + \xi_i, \quad y_i = -1, \\ \xi_i & \ge 0,\ \forall i. \end{align}

Here, and in the following text, $\langle\mathbf{a}, \mathbf{b}\rangle$ denotes dot product $\mathbf{a}^\top \mathbf{b}$. As we can see, the SVM is posed purely as an optimization task.

The first term in the above optimisation task maximises so called margin, $2/||\mathbf{w}||$, by actually minimising $||\mathbf{w}||$. The margin is the distance between the separating hyperplane and the closest training points (intuitively, the further is the separating hyperplane from the training data, the better it generalises to unseen data). Thus the SVM are sometimes referred to as Maximum margin classifier. More detailed explanation can be found in [1, Section 3 - Linear Support Vector Machines].

The sum in the minimisation task is over so called slack variables, $\xi_i$. They allow some training points to be incorrectly classified (compare the separating hyperplane equation in the optimisation constraints with the one in Perceptron) by paying certain penalty controlled by the user specified constant $C$. This enables SVM to deal with noisy data (linearly non-separable due to noise).

Although intuitive, the primal task formulation is difficult to optimize. Thus, the parameters of the optimal separating hyperplane are usually sought by solving the dual problem.

The reason for solving rather the dual task is, that the separating hyperplane is fully specified only by a small number of points (compared to the usually very numerous training set), called support vectors. The derivation of the dual task can be found e.g. in [1] or [2]. Note that to obtain the dual formulation, the student should be familiar with the Lagrange multipliers.

The dual task is formulated as $$ \mathbf{\alpha^*} = \arg\max_{\mathbf{\alpha}} \left( \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i \alpha_j y_i y_j \langle\mathbf{x}_i, \mathbf{x}_j\rangle \right), $$

subject to

\begin{align} C \ge \alpha_i & \ge 0,\quad i = 1, \dots, m \\ \sum_{i=1}^{m} \alpha_i y_i & = 0 \end{align}

The transformation of the dual solution to the primal variables is done as follows:

\begin{align} \mathbf{w} & = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i, \\ b & = \frac{1}{| I |} \sum_{i \in I} (y_i - \sum_{j=1}^{m} \alpha_j y_j < \mathbf{x}_j, \mathbf{x}_i > ), \end{align}

where $I$ denotes the set of support vector indices.

Note that the original feature vectors $\mathbf{x}_i$ appear in the dual task in the form of dot product. This is very important and will be used in the next assignment!

Let us rewrite the dual task in matrix form:

$$ \mathbf{\alpha^*} = \arg\max_{\mathbf{\alpha}} \left( \langle\mathbf{\alpha}, \mathbf{1}\rangle - \frac{1}{2} \mathbf{\alpha}^\top \mathbf{H} \mathbf{\alpha} \right), $$

subject to

\begin{align} \langle\mathbf{\alpha}, \mathbf{y}\rangle & = 0, \\ \mathbf{0} \le \mathbf{\alpha} & \le \mathbf{C} \end{align}

As you can see, the dual task is formulated as an Quadratic programming task.

Having found the parameters of the optimal separating hyperplane, the classification of some data $\mathbf{x}$ using SVM is performed by a simple dot product and testing the sign of the result

\begin{align} h(\mathbf{x}; \mathbf{w}) & = \text{sign}(\langle\mathbf{x}, \mathbf{w}\rangle + b), \end{align}

which is very fast.

To fulfil this assignment, you need to submit these files (all packed in a single .zip file) into the upload system:

answers.txt - answers to the Assignment Questions
assignment07.m - a script for data initialisation, calling of the implemented functions and plotting of their results (for your convenience, will not be checked)
my_svm.m - a function which implements the learning of the linear svm classifier
classif_lin_svm.m - a function which for given data outputs classified labels.
compute_TstErr.m - a function computing mean error on test data from cross-validation
compute_measurements_2d - a function which computes 2D features for given structure with data
linear_svm.png, ocr_svm_tst.png, ocr_svm_classif.png - images specified in the tasks

Start by downloading the template of the assignment.

Taks, Part 1: Soft margin SVM

Complete function [ w, b, sv_idx ] = my_svm( X, y, C, options ), which solves the SVM optimization task in its dual form. For solving the quadratic program use the gsmo mex-function from the STPR Toolbox. Output of this function are parameters of the separating hyperplane $\mathbf{w},\ b$ and indices of support vectors sv_idx.

Hint: Look carefully at the gsmo help and compare it with the dual problem formulation.

pre-compiled gsmo: linux64, win64, mac64, linux32, win32, or if you want to compile it yourself look at How to Compile MATLAB Mex Files.
Complete the template function classif = classif_lin_svm(X, model).
The function applies trained SVM on given data and outputs classification labels for all vectors in $X$.
$X$ is a matrix [$n \times m$], where $n$ is a dimensionality of features and $m$ is the number of examples and model is a structure with fields w and b. The output is vector [1 $\times m$] $y \in \{-1, 1\}$.

Try out function my_svm on toy dataset (data_33rpz_svm_toy.mat) and experiment with the values of parameter $C$.
Hint: You can limit maximal number of iterations for gsmo algorithm by passing field .tmax in the options structure. Default value is inf (which might for linearly non-separable data cause infinite loop for certain values of $C$).

X = [1, 2, 1, -1 -1 -2; 1, 1, 2, -1, -2, -1];
y = [1, 1, 1, -1, -1, -1];
C = 1;
 
[w, b, sv_idx]  = my_svm(X, y, C)
 
Settings of QP solver
nrhs   : 11
nlhs   : 5
tmax   : 2147483647
tolKKT : 0.001000
n      : 6
verb   : 1
t=1, KKTviol=2.000000, tau=0.250000, tau_lb=0.000000, tau_ub=1.000000, Q_P=-0.250000
t=2, KKTviol=0.000000, tau=0.250000, tau_lb=0.000000, tau_ub=1.000000, Q_P=-0.250000
 
w =
 
    0.5000
    0.5000
 
 
b =
 
     0
 
 
sv_idx =
 
     1
     4

Set the value of $C$ such that the resulting classifier has zero error on the training dataset (i.e. the data which you have used for training the SVM - data_33rpz_svm_toy.mat), keep the indices of support vectors in this case. You will need them to answer one of the assignment questions.

Hint 1: You can use function classif_lin_svm and compare its output with the vector $y$ used for training the SVM.
Hint 2: For plotting you can use function linclass, ppatterns and pboundary from the SPTR Toolbox:
```
model.W = w;
model.b = b;
model.fun = 'linclass';
ppatterns(X, y); 
pboundary(model);
```
Plot results for $C = 1$ and emphasize the support vectors. Save the figure linear_svm.png and submit it with your solution.

Tasks, Part 2: Optimal parameter selection

Similarly to the Parzen Window task, the optimal value of the hyper-parameter $C$ can be found using the cross-validation technique. To measure the quality of the parameter we will compute the average error on test data over all folds and select such value of $C$, which minimizes this error.

Do the following:

Prepare the OCR data for SVM training:
1. Complete the template function [X, y] = compute_measurements_2d(input_struct).
  The function computes 2D features from the images and you need to fill in the necessary normalization (it is needed for gsmo to converge). The output of this function is
  - $X$ is a matrix $2 \times m$, $X = [\mathbf{a}; \mathbf{b}]$, $\mathbf{a} = [a_1, \dots, a_m]$, $\mathbf{b} = [b_1, \dots, b_m]$
    - $a_i = $ (sum of pixel intensities in the left half of the image i) - (sum of pixel intensities in the right half of the image i)
    - $b_i = $ (sum of pixel intensities in the top half of the image i) - (sum of pixel intensities in the bottom half of the image i)
    - Normalize both $\mathbf{a},\ \mathbf{b}$ as follows:
      $\mathbf{c}_{\text{norm}} = \frac{\mathbf{c} - \mathbf{\bar{c}}}{\sigma_{\mathbf{c}}}, \quad c \in \{\mathbf{a}, \mathbf{b} \}, $
      where $\mathbf{\bar{c}}$ denotes the average of $\mathbf{c}$ and $\sigma_{\mathbf{c}}$ denotes the standard deviation of $\mathbf{c}$. Normalization of data is very important. The normalization which we use here is referred to as standardization and is commonly used in machine learning. Note that by this normalization, the feature points are transformed in such way, that they have zero mean and unit variance. When classifying new data you should first standardize them with the same $\mathbf{\bar{c}}$ and $\sigma_{\mathbf{c}}$.
  - $y$ is a vector $1 \times m$ of labels $y \in \{-1, 1\}$
    Hint: Labels that had originally value 2 will now have value -1.
Estimate the optimal value of the parameter $C$ using the 10-fold cross-validation. The procedure is as follows:
1. Split the training set into 10 subsets using the crossval function (the one from STPR Toolbox!).
2. Complete the template function TstErr = compute_TstErr(itrn, itst, X, y, C).
  The function trains an SVM classifier on $X(\text{itrn{i}})$ and computes the error $\text{Err(i)}$ (i.e. number of miss-classifications) on $X(\text{itst{i}})$. The output is the mean test error $\text{TstErr}$ over all 10 folds.
3. Use the function compute_TstErr to perform cross-validation for $C \in \{0.001, 0.01, 0.1, 1, 10 \}$
4. Find the optimal value of $C$, i.e. such $C$ from the given set that minimizes the average test error. You will need this value in the questions below.
Train an SVM classifier with optimal $C$.
Compute the classification error of the trained SVM classifier on the test data. You will need this value in the questions below.
Plot the optimal separating hyperplane (optimal $C$) together with the test data in one plot and save it as ocr_svm_tst.png

Hint: see assignment07.m where the plotting is already prepared for you.
Show the classification of the letters with the optimal classifier as in the previous labs and save it as ocr_svm_classif.png.

Assignment Questions

Fill the correct answers to your answers.txt file. For the following questions it is assumed that you run rand('seed', 42); always before the splitting of the training set using crossval function. The dataset split will be unambiguously defined.

Imagine that some bad guy deleted all training data points except for the support vectors, but you still want to train SVM on this data. How will the absence of the majority of the training data points influence the resulting separating hyperplane compared to the one trained on full data?
- a) The absence of data points will have negative impact on the optimality.
- b) One cannot simply train SVM when some data points are missing.
- c) It will be exactly the same.
- d) The absence of data points will have positive impact on the optimality.
How can we enforce the soft-margin SVM to behave like a hard margin one (i.e. not allowing errors on training set)?
- a) Set value of $C$ to $0$.
- b) Set value of $C$ to $N$ (where $N$ is the size of the training set).
- c) It is not possible.
- d) Set value of $C$ to $\infty$.
What are the indices of the support vectors from task part 1 (point 4) for the case when training data are flawlessly classified into positive and negative class?
What is the optimal value of $C$ from the crossvalidation (Part 2, 2.IV)?
What is the classification error of the trained SVM on test data (Part 2, 4)?

Bonus task

Try to combine several binary SVM classifiers into a multiclass one. Use the “one versus all” technique (see e.g. this link for the general idea).

Use MNIST data to create a multiclass classifier of images of numerals. The file contains 30 000 examples of normalized images of digits stored in matrix $X$ and corresponding labels (the label is the digit itself) $y \in \{0, 1, \dots, 9\}$. Use actual pixel values as features, in this case your feature space will be 784 dimensional.

To display one example, you can use this code:

imshow(reshape(X(:, 1), imsize)', []);

Display the resulting classification (i.e. the montage of all images classified as $0, 1, \dots, 9$).

Hint: use all relevant functions, which you have implemented in this assignment for solving this task.
Hint: use some reasonable subset of data for training and testing, training with 30 000 examples would need extensive amounts of ram and would take too much time. Preserve proportions of classes when doing subsampling.

Recommended Literature

[1] Christopher J. C. Burges. A Tutorial On Support Vector Machines for Pattern Recognition
[2] Text of exercise from previous course
[3] Lecture slides
[4] Quadratic programming
[5] Lagrange multiplier

Table of Contents