In the previous lab, we implemented the soft-margin SVM algorithm. We saw that, in contrast to the Perceptron algorithm, the SVM can handle noisy training data (i.e. data that would be linearly separable were it not for errors in labels or measurements). This is a useful and important property of a classifier, but can we also handle data that are genuinely not linearly separable? Can we apply the same trick as was used with the Perceptron algorithm, i.e. dimensionality lifting (also known as straightening the feature space)?
In this lab, we will extend the SVM algorithm to non-linear classification. We will use the so-called kernel trick, which enables us to apply a linear classifier to linearly non-separable data by increasing the dimensionality of the feature space.
Given a fixed nonlinear feature space mapping $\Phi(\mathbf{x})$, the kernel function is defined as follows $$ k(\mathbf{x}, \mathbf{x}') = \Phi(\mathbf{x})^\top\Phi(\mathbf{x}') $$
As we can see, the kernel is a symmetric function of its arguments, i.e. $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$. The simplest kernel is obtained when $\Phi(\mathbf{x})$ is the identity mapping $\Phi(\mathbf{x}) = \mathbf{x}$ resulting in the linear kernel $$ k_L(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top\mathbf{x}'$$
Recall the classification function from the previous task $ f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i^\top \mathbf{x} + b$ and the corresponding learning task:
The soft-margin SVM dual task $$ \vec{\alpha} = \arg\max_{\vec{\alpha}} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j, $$ subject to $$ \begin{align} 0 \le \alpha_i & \le C,\quad i = 1, \dots, m \\ \sum_{i=1}^{m} \alpha_i y_i & = 0 \end{align} $$
Note that in both the learning task and the classification function the original data appear only in the form of dot products ($\mathbf{x}_i^\top\mathbf{x}_j$ and $\mathbf{x}_i^\top\mathbf{x}$). So we could say that we were using a kernel SVM with the linear kernel $k_L$.
Kernel trick (or kernel substitution)
The kernel trick applies to any algorithm in which the input vectors $\mathbf{x}$ appear only in the form of dot products. Since a kernel is itself a dot product in the feature space, $\Phi(\mathbf{x})^\top\Phi(\mathbf{x}')$, we can replace every dot product $\mathbf{x}^\top\mathbf{x}'$ with $k(\mathbf{x}, \mathbf{x}')$, either by choosing the mapping $\Phi$ or by choosing the kernel directly.
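Applied to the soft-margin SVM recalled above, the substitution can be written out as follows. This is a sketch in the notation of this lab; only the dot products change, while the constraints stay the same.

$$ f(\mathbf{x}) = \sum_{i=1}^m \alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b $$

$$ \vec{\alpha} = \arg\max_{\vec{\alpha}} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j), \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \dots, m, \quad \sum_{i=1}^{m} \alpha_i y_i = 0 $$

Note that the weight vector $\mathbf{w} = \sum_{i=1}^m \alpha_i y_i \Phi(\mathbf{x}_i)$ now lives in the feature space and is never computed explicitly; the classifier is evaluated through kernel values only.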
The biggest advantage of using a kernel is that we do not even have to know the feature space mapping $\Phi(\mathbf{x})$ explicitly. In fact, it might even be a mapping to an infinite-dimensional space!
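As a small illustration of a kernel whose mapping can still be written down, consider 2D inputs $\mathbf{x} = (x_1, x_2)^\top$ and the degree-2 polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^\top\mathbf{x}')^2$ (this particular form is chosen here as an example). Expanding the square gives

$$ (1 + x_1 x_1' + x_2 x_2')^2 = 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = \Phi(\mathbf{x})^\top\Phi(\mathbf{x}'), $$

$$ \Phi(\mathbf{x}) = \left(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)^\top $$

Evaluating the kernel therefore costs a single 2D dot product, whereas the explicit route needs six-dimensional feature vectors. For higher degrees the gap grows quickly, and for the RBF kernel the corresponding feature space is infinite-dimensional.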
Let us define the Gram matrix $\mathbf{K} = \Phi\Phi^\top$, where $\Phi$ is the matrix whose $i$-th row is $\Phi(\mathbf{x}_i)^\top$, so that $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. Mercer's theorem gives a necessary and sufficient condition for a function $k$ to be a valid kernel: for every finite set of points $\mathbf{x}_1, \dots, \mathbf{x}_m$, the Gram matrix $\mathbf{K}$ has to be symmetric and positive semi-definite (i.e. $ \mathbf{z}^\top\mathbf{K}\mathbf{z} \ge 0,\quad \forall \mathbf{z} \in \mathbb{R}^m $).
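The two conditions can be checked numerically for a concrete Gram matrix. The snippet below is a minimal sketch for the linear kernel on the toy data used in the `getKernel` examples of this assignment; the variable names and the tolerance `1e-10` are assumptions made for illustration, not part of the template.

```matlab
% Toy data from the getKernel examples: each column is one example.
X = [1, 2, 1, -1, -1, -2;
     1, 1, 2, -1, -2, -1];

% Linear-kernel Gram matrix: K(i,j) = X(:,i)' * X(:,j), size 6x6.
K = X' * X;

% Symmetry: K must equal its transpose.
is_sym = isequal(K, K');

% Positive semi-definiteness: all eigenvalues non-negative.
% A small negative tolerance (assumed value) absorbs rounding errors.
lambda = eig(K);
is_psd = all(lambda >= -1e-10);
```

Passing this check for one data set does not prove that a function is a valid kernel, but failing it for any data set proves that it is not.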
Here are some commonly used kernels you will need for the assignment:
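The following forms are consistent with the `getKernel` example outputs shown in this assignment. Treat the exact parametrisation as an assumption and check it against the template.

$$ k_L(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top\mathbf{x}' $$

$$ k_P(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^\top\mathbf{x}')^d $$

$$ k_{RBF}(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2\sigma^2}\right) $$

For instance, for $\mathbf{x} = (1, 1)^\top$ and $\mathbf{x}' = (2, 1)^\top$ these give $k_L = 3$, $k_P = (1 + 3)^2 = 16$ for $d = 2$, and $k_{RBF} = \exp(-1/2) \approx 0.6065$ for $\sigma = 1$.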
Try changing the parameter $C$ and the RBF kernel parameter $\sigma$ to see how they influence the result.
To fulfil this assignment, you need to submit these files (all packed in a single .zip file) into the upload system:
answers.txt
assignment_08.m
getKernel.m
my_kernel_svm.m
classif_kernel_svm.m
compute_kernel_TstErr.m
flower_rbf.png, flower_polynomial.png, ocr_polynomial_kernel_tst.png, ocr_svm_classif.png, mnist_tst_classif.png
Start by downloading the template of the assignment.
[ K ] = getKernel( Xi, Xj, options)
options.kernel
X = [1, 2, 1, -1, -1, -2;
     1, 1, 2, -1, -2, -1];
y = [1, 1, 1, -1, -1, -1];

K = getKernel(X, X, c2s({'kernel', 'rbf', 'sigma', 1.0}))
K =
    1.0000    0.6065    0.6065    0.0183    0.0015    0.0015
    0.6065    1.0000    0.3679    0.0015    0.0001    0.0000
    0.6065    0.3679    1.0000    0.0015    0.0000    0.0001
    0.0183    0.0015    0.0015    1.0000    0.6065    0.6065
    0.0015    0.0001    0.0000    0.6065    1.0000    0.3679
    0.0015    0.0000    0.0001    0.6065    0.3679    1.0000

K = getKernel(X, X, c2s({'kernel', 'polynomial', 'd', 2}))
K =
     9    16    16     1     4     4
    16    36    25     4     9    16
    16    25    36     4    16     9
     1     4     4     9    16    16
     4     9    16    16    36    25
     4    16     9    16    25    36

K = getKernel(X, X, c2s({'kernel', 'linear'}))
K =
     2     3     3    -2    -3    -3
     3     5     4    -3    -4    -5
     3     4     5    -3    -5    -4
    -2    -3    -3     2     3     3
    -3    -4    -5     3     5     4
    -3    -5    -4     3     4     5
[ model ] = my_kernel_svm( X, y, C, options)
model
options
model.b
model.fun
getKernel
X = [1, 2, 1, -1, -1, -2;
     1, 1, 2, -1, -2, -1];
y = [-1, 1, 1, 1, -1, -1];
C = inf;
options = c2s({'verb', 1, 'tmax', inf, 'kernel', 'rbf', 'sigma', 0.01});
model = my_kernel_svm(X, y, C, options)

Settings of QP solver
nrhs : 11
nlhs : 5
tmax : 2147483647
tolKKT : 0.001000
n : 6
verb : 1
t=1, KKTviol=2.000000, tau=1.000000, tau_lb=0.000000, tau_ub=inf, Q_P=-1.000000
t=2, KKTviol=2.000000, tau=1.000000, tau_lb=0.000000, tau_ub=inf, Q_P=-2.000000
t=3, KKTviol=2.000000, tau=1.000000, tau_lb=0.000000, tau_ub=inf, Q_P=-3.000000
t=4, KKTviol=0.000000, tau=1.000000, tau_lb=0.000000, tau_ub=inf, Q_P=-3.000000

model =
        fun: 'classif_kernel_svm'
         sv: [2x6 double]
          y: [-1 1 1 1 -1 -1]
      alpha: [6x1 double]
    options: [1x1 struct]
          b: 0
[ classif ] = classif_kernel_svm( X, model)
X
pboundary
% X and model are the same as defined above
classif = classif_kernel_svm(X, model)
classif =
    -1     1     1     1    -1    -1
flower.mat
flower_rbf.png
flower_polynomial.png
As in the previous lab, our kernel SVM formulation contains a hyper-parameter $C$ which has to be tuned. Moreover, by using a kernel we have introduced another hyper-parameter ($\sigma$ of the $k_{RBF}$ kernel or $d$ of the polynomial kernel). This makes selecting good values for all hyper-parameters somewhat more difficult. In this part, we will implement a 2D cross-validation, i.e. a cross-validation over a grid of ($C$, kernel parameter) pairs.
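Once the cross-validated error has been computed for every pair on the grid, picking the best pair amounts to locating the minimum of a 2D matrix. The snippet below is a minimal sketch of that last step only; the candidate values and the error matrix are made-up placeholders, and the variable names are hypothetical rather than taken from the template.

```matlab
% Hypothetical candidate values (not the ones prescribed by the assignment).
Cs = [0.1, 1, 10];
ds = [1, 2, 3, 4];

% Placeholder error matrix: rows index C, columns index d.
% In the assignment each entry would be a cross-validated test error.
errs = [0.30, 0.12, 0.10, 0.15;
        0.25, 0.08, 0.05, 0.09;
        0.27, 0.10, 0.07, 0.11];

% Smallest error and its linear (column-major) index.
[min_err, lin_idx] = min(errs(:));

% Convert the linear index back to (row, column) subscripts.
[i_C, i_d] = ind2sub(size(errs), lin_idx);

best_C = Cs(i_C);
best_d = ds(i_d);
```

With these placeholder numbers the minimum lies in row 2, column 3, so `best_C` is 1 and `best_d` is 3. If several grid points tie, `min` returns the first one in column-major order.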
[ TstErr ] = compute_kernel_TstErr(itrn, itst, X, y, C, options)
my_kernel_svm
classif_kernel_svm
ocr_data_2D_trn.mat
mat
crossval
compute_kernel_TstErr
ind2sub
sub2ind
ocr_data_2D_tst.mat
ocr_polynomial_kernel_tst.png
ocr_svm_classif.png
Finally, we will apply the above methods to a real-world example similar to what you may encounter in typical pattern recognition problems. We will use the MNIST database of handwritten numerals and train an SVM classifier with the RBF kernel to distinguish two numerals, 0 and 1. In this case the dimensionality of the features is much higher, as we are going to use the pixel intensities directly (i.e. we have 784-dimensional measurements). Thus we cannot plot the separating hyperplane, but we can still train the SVM, perform the cross-validation and classify.
Note that we have already normalized the data for you so that each example has zero mean and unit variance. It is also important to mention that pixel intensities are not the best features we could use. Better results could be obtained with more sophisticated features, e.g. image gradients, local binary patterns, histograms of oriented gradients, or combinations of several features. We refer to Feature detection as a starting point for those who would like a deeper understanding of feature acquisition.
We have also limited the training set to a relatively small number of examples, while the test set is much bigger. This is, of course, not typical, as one should normally use as many training examples as possible. The reason is purely educational: it keeps the computational cost low, as we do not expect you to have access to a grid computing system. In machine learning it is quite usual for training a classifier to take several hours (or even days). Also note that while training can be very time-consuming, evaluation on the test data is very fast.
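For reference, a per-example normalization of this kind could look as follows. This is only a sketch under stated assumptions: examples are stored as columns, the sample standard deviation (normalized by N-1) is used, and the tiny data matrix and variable names are hypothetical. The provided MNIST files are already normalized, so this step is not needed in the assignment.

```matlab
% Hypothetical data: 4 features (rows), 3 examples (columns).
X = [0, 10, 2;
     2, 20, 4;
     4, 30, 6;
     6, 40, 8];

mu = mean(X, 1);       % 1x3 row vector of per-example means
sd = std(X, 0, 1);     % 1x3 row vector of per-example standard deviations

% Subtract the mean and divide by the standard deviation, column by column.
% bsxfun is used so that the code also runs on older MATLAB versions.
Xn = bsxfun(@rdivide, bsxfun(@minus, X, mu), sd);
```

After this step every column of `Xn` has zero mean and unit standard deviation, up to rounding.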
mnist_01_trn.mat
mnist_01_tst.mat
mnist_tst_classif.png
show_mnist_classification
Fill in the correct answers in your answers.txt file.
question2: [C, d]
question4: [C, sigma]