Search
In the previous lab we have switched from the generative approach of classifier design to the discriminative one. Particularly, we have implemented the Perceptron algorithm. In this lab, we will continue examining the discriminative domain by implementing popular Support Vector Machines (SVM) classifier.
The perceptron algorithm has despite its simplicity several drawbacks. The first drawback is that it finds just some separating hyperplane (decision boundary), but ideally we would like to find the optimal separating hyperplane (at least in some sense). Another important drawback is, that perceptron cannot deal with noisy data.
The SVM are designed to handle both of these drawbacks. Given an annotated training data set $$ \mathcal{T} = \{ (\mathbf{x}_i, y_i)\}_{i = 1}^{m} \quad,\ \text{where} \ \mathbf{x}_i \in \mathbb{R}^n,\ y_i \in \{-1, 1\} $$ the SVM primary task is formulated as $$ \mathbf{w^*}, b^*, \xi_i^* = \arg\min_{\mathbf{w}, b, \xi_i} \left( \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i=1}^{m}{\xi_i} \right) $$ subject to
\begin{align} \langle\mathbf{w}, \mathbf{x}_i\rangle + b & \ge +1 - \xi_i, \quad y_i = +1, \\ \langle\mathbf{w}, \mathbf{x}_i\rangle + b & \le -1 + \xi_i, \quad y_i = -1, \\ \xi_i & \ge 0,\ \forall i. \end{align}
Here, and in the following text, $\langle\mathbf{a}, \mathbf{b}\rangle$ denotes dot product $\mathbf{a}^\top \mathbf{b}$. As we can see, the SVM is posed purely as an optimization task.
The first term in the above optimisation task maximises so called margin, $2/||\mathbf{w}||$, by actually minimising $||\mathbf{w}||$. The margin is the distance between the separating hyperplane and the closest training points (intuitively, the further is the separating hyperplane from the training data, the better it generalises to unseen data). Thus the SVM are sometimes referred to as Maximum margin classifier. More detailed explanation can be found in [1, Section 3 - Linear Support Vector Machines].
The sum in the minimisation task is over so called slack variables, $\xi_i$. They allow some training points to be incorrectly classified (compare the separating hyperplane equation in the optimisation constraints with the one in Perceptron) by paying certain penalty controlled by the user specified constant $C$. This enables SVM to deal with noisy data (linearly non-separable due to noise).
Although intuitive, the primal task formulation is difficult to optimize. Thus, the parameters of the optimal separating hyperplane are usually sought by solving the dual problem.
The reason for solving rather the dual task is, that the separating hyperplane is fully specified only by a small number of points (compared to the usually very numerous training set), called support vectors. The derivation of the dual task can be found e.g. in [1] or [2]. Note that to obtain the dual formulation, the student should be familiar with the Lagrange multipliers.
The dual task is formulated as $$ \mathbf{\alpha^*} = \arg\max_{\mathbf{\alpha}} \left( \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i \alpha_j y_i y_j \langle\mathbf{x}_i, \mathbf{x}_j\rangle \right), $$
subject to
\begin{align} C \ge \alpha_i & \ge 0,\quad i = 1, \dots, m \\ \sum_{i=1}^{m} \alpha_i y_i & = 0 \end{align}
In case of hard-margin SVM, the transformation of the dual solution to the primal variables is done as follows:
\begin{align} \mathbf{w} & = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i, \\ b & = \frac{1}{| I |} \sum_{i \in I} (y_i - \sum_{j=1}^{m} \alpha_j y_j < \mathbf{x}_j, \mathbf{x}_i > ), \end{align}
where $I = \{i: 0 < \alpha_i \}$ denotes the set of support vector indices. This however is not correct for soft-margin SVM, where only the $\{i: 0 < \alpha_i < C\}$ indices are used to compute the bias (i.e. the support vectors directly on the margin). Unfortunately, it can easily happen that there is no vector directly on the margin. Because of this, the bias term has to be computed in a more complicated way (see libSVM paper (Section 4.1.5) or an “easy to follow” explanation (exercise 4 solution)). The bias computation code is provided in the assignment template (compute_bias).
compute_bias
Also for numerical reasons, it is more stable to compare $\alpha_i$ with a small non-zero constant (use $10^{-10}$) instead of zero.
Note that the original feature vectors $\mathbf{x}_i$ appear in the dual task in the form of dot product. This is very important and will be used in the next assignment!
Let us rewrite the dual task in matrix form:
$$ \mathbf{\alpha^*} = \arg\max_{\mathbf{\alpha}} \left( \langle\mathbf{\alpha}, \mathbf{1}\rangle - \frac{1}{2} \mathbf{\alpha}^\top \mathbf{H} \mathbf{\alpha} \right), $$
\begin{align} \langle\mathbf{\alpha}, \mathbf{y}\rangle & = 0, \\ \mathbf{0} \le \mathbf{\alpha} & \le \mathbf{C} \end{align}
As you can see, the dual task is formulated as an Quadratic programming task.
Having found the parameters of the optimal separating hyperplane, the classification of some data $\mathbf{x}$ using SVM is performed by a simple dot product and testing the sign of the result
\begin{align} h(\mathbf{x}; \mathbf{w}) & = \text{sign}(\langle\mathbf{x}, \mathbf{w}\rangle + b), \end{align}
which is very fast.
To fulfil this assignment, you need to submit these files (all packed in a single .zip file) into the upload system:
.zip
answers.txt
svm1.ipynb
svm1.py
my_svm
classif_linear_svm
compute_test_error
compute_measurements_2d
linear_svm.png, ocr_svm_tst.png, ocr_svm_classif.png
Use template of the assignment. When preparing a zip file for the upload system, do not include any directories, the files have to be in the zip file root.
my_svm( X, y, C, options )
gsmo
sv_idx
classif_linear_svm(X, model)
model
w
b
data_33rpz_svm_toy.npz
tmax
inf
X = np.array([[1, 2, 1, -1, -1, -2], [1, 1, 2, -1, -2, -1]]) y = np.array([1, 1, 1, -1, -1, -1]) C = np.inf; [w, b, sv_idx] = my_svm(X, y, C) print('\n', 'w: ', w, '\n', 'b: ', b, '\n', 'sv_idx:', sv_idx) # -> t=1, KKTviol=2.000000, tau=0.250000, tau_lb=0.000000, tau_ub=inf, Q_P=-0.250000 # -> w: [0.5 0.5] # -> b: 0.0 # -> sv_idx: [0 3]
classif_lin_svm
plot_points
plot_boundary
linear_svm.png
Similarly to the Parzen Window task, the optimal value of the hyper-parameter $C$ can be found using the cross-validation technique. To measure the quality of the parameter we will compute the average error on test data over all folds and select such value of $C$, which minimizes this error.
Do the following:
compute_measurements_2d(data)
crossval
compute_test_error(itrn, itst, X, y, C)
ocr_svm_tst.png
ocr_svm_classif.png
Fill the correct answers to your answers.txt file. For the following questions it is assumed that you run np.rand.seed(42) always before the splitting of the training set using crossval function. The dataset split will be unambiguously defined.
Try to combine several binary SVM classifiers into a multiclass one. Use the “one versus all” technique (see e.g. this link for the general idea).
Use MNIST data to create a multiclass classifier of images of numerals. The file contains 30 000 examples of normalized images of digits stored in matrix $X$ and corresponding labels (the label is the digit itself) $y \in \{0, 1, \dots, 9\}$. Use actual pixel values as features, in this case your feature space will be 784 dimensional.
Display the resulting classification (i.e. the montage of all images classified as $0, 1, \dots, 9$).
Hint: use all relevant functions, which you have implemented in this assignment for solving this task. Hint: use some reasonable subset of data for training and testing, training with 30 000 examples would need extensive amounts of ram and would take too much time. Preserve proportions of classes when doing subsampling.
[1] Christopher J. C. Burges. A Tutorial On Support Vector Machines for Pattern Recognition [2] Text of exercise from previous course [3] Lecture slides [4] Quadratic programming [5] Lagrange multiplier