One of the most popular classifiers is the SVM (support vector machine). Consider $n$ observations $\mathbf x_1,\dots,\mathbf x_n\in\mathbb R^d$. For simplicity, assume binary classification with labels $y_1,\dots,y_n\in\{-1,1\}$, and that the bias is already embedded in the data. SVM searches for a separating hyperplane $\mathbf w^\top \mathbf x=0$ by introducing an auxiliary variable $\xi_i\ge 0$ for each observation $i$ that measures how much the observation violates the margin. It achieves this using the condition $$ \xi_i \ge 1 - y_i\mathbf w^\top \mathbf x_i. $$ Since we will want to minimize $\xi_i$, it attains its minimum value $\xi_i=0$ precisely when $y_i\mathbf w^\top \mathbf x_i\ge 1$. Thus, for the label $y_i=1$ we want the condition $\mathbf w^\top \mathbf x_i\ge 1$, and similarly for the negative label. This is a stronger condition than the classical $\mathbf w^\top \mathbf x_i\ge 0$. In other words, SVM tries to keep well-classified observations at a certain distance from the separating hyperplane, which adds robustness.
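
Equivalently, the smallest feasible slack for observation $i$ is the hinge loss $\max\{0,\,1-y_i\mathbf w^\top\mathbf x_i\}$. A minimal sketch in Julia (the function name is ours; observations are stored in the rows of `X`):

```julia
# Smallest feasible slacks ξᵢ = max(0, 1 - yᵢ·wᵀxᵢ) for all observations.
# X is n×d (observations in rows), y ∈ {-1,1}ⁿ, w ∈ ℝᵈ.
hinge_slacks(X, y, w) = max.(0, 1 .- y .* (X * w))
```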

The entire optimization problem can then be written as \begin{align*} \operatorname*{minimize}_{\mathbf w,\xi}\qquad &C\sum_{i=1}^n\xi_i + \frac12\|\mathbf w\|^2 \\ \text{subject to}\qquad &\xi_i \ge 1 - y_i\mathbf w^\top \mathbf x_i, \\ &\xi_i\ge 0. \end{align*} The objective function contains the regularization term $\frac12\|\mathbf w\|^2$ and the regularization constant $C>0$. Although the problem can be solved in this formulation, it is usually converted to the dual formulation \begin{align*} \operatorname*{maximize}_{\mathbf z}\qquad &-\frac12\mathbf z^\top Q\mathbf z + \sum_{i=1}^nz_i \\ \text{subject to}\qquad &0\le z_i\le C, \end{align*} where the matrix $Q$ has elements $q_{ij}=y_iy_j\mathbf x_i^\top \mathbf x_j$. There is a connection between the primal and dual problems. First, their optimal values coincide. Second, once we solve the dual problem, the solution of the primal problem is recovered by $\mathbf w=\sum_{i=1}^ny_iz_i\mathbf x_i$.
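
Note that $Q$ can be written compactly as $Q=\operatorname{diag}(\mathbf y)\,XX^\top\operatorname{diag}(\mathbf y)$, where the observations form the rows of $X$. A small Julia sketch (the variable names are ours):

```julia
# Qᵢⱼ = yᵢ yⱼ xᵢᵀxⱼ; `y .* X` scales the i-th row of X by yᵢ.
Xy = y .* X
Q = Xy * Xy'
```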

This problem can be solved by the coordinate descent method, which updates only one component of the vector $\mathbf z$ in each iteration. Therefore, in each iteration we fix some $i$ and replace the optimization over $\mathbf z$ by the optimization over $\mathbf z+d\mathbf e_i$, where $\mathbf e_i$ is the vector of zeros with a one in component $i$. In each iteration we then solve \begin{align*} \operatorname*{maximize}_d\qquad &-\frac12(\mathbf z+d\mathbf e_i)^\top Q(\mathbf z+d\mathbf e_i) + \sum_{j=1}^nz_j + d \\ \text{subject to}\qquad &0\le z_i + d\le C. \end{align*} This optimization problem is simple since $d\in\mathbb R$, and it has a closed-form solution. This procedure is repeated for a given number of epochs, where one epoch performs the above-mentioned iterations for $i=1$, $i=2$, up to $i=n$.
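
To see the closed form (a short derivation we add for clarity): dropping the terms independent of $d$ and using the symmetry of $Q$, the objective equals $-\frac{q_{ii}}{2}d^2 + \bigl(1-(Q\mathbf z)_i\bigr)d$ up to a constant, a concave parabola in $d$. Setting its derivative to zero and then enforcing the box constraint by clipping gives the update $$ d^\star=\frac{1-(Q\mathbf z)_i}{q_{ii}},\qquad z_i\leftarrow\min\bigl\{\max\{z_i+d^\star,\,0\},\,C\bigr\}. $$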

Implement the functions `Q = computeQ(X, y)`, `w = computeW(X, y, z)`, `z = solve_SVM_dual(Q, C; max_epoch)` and `w = solve_SVM(X, y, C; kwargs...)` with the input and output arguments listed below. In more detail, these arguments are:

- `X`: matrix of $n\times d$ input data;
- `y`: vector of $n$ input labels;
- `C`: positive regularization constant;
- `w`: vector of the solution of the primal problem $\mathbf w$;
- `z`: vector of the solution of the dual problem $\mathbf z$;
- `Q`: matrix of the dual problem $Q$.

The `computeQ` function computes the matrix `Q`, and the `computeW` function computes `w` from the dual solution. The `solve_SVM_dual` function takes the parameters of the dual problem as inputs and performs `max_epoch` epochs (therefore `n*max_epoch` iterations) of the coordinate descent method. The solution should be initialized as $\mathbf z=0$. The `solve_SVM` function combines the previous functions into one; a possible skeleton is sketched below.
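
One possible skeleton of the four functions, as a hedged sketch rather than a reference solution (the default value of `max_epoch` is our assumption; observations are in the rows of `X`):

```julia
# Qᵢⱼ = yᵢyⱼ xᵢᵀxⱼ, built as diag(y)·X·Xᵀ·diag(y).
computeQ(X, y) = (y .* X) * (y .* X)'

# w = Σᵢ yᵢzᵢxᵢ, the primal solution recovered from the dual one.
computeW(X, y, z) = X' * (y .* z)

function solve_SVM_dual(Q, C; max_epoch = 100)
    n = size(Q, 1)
    z = zeros(n)                              # required initialization z = 0
    for _ in 1:max_epoch                      # one epoch = one pass over i = 1,…,n
        for i in 1:n
            d = (1 - Q[i, :]' * z) / Q[i, i]  # closed-form 1-D maximizer
            z[i] = clamp(z[i] + d, 0, C)      # keep 0 ≤ zᵢ ≤ C
        end
    end
    return z
end

solve_SVM(X, y, C; kwargs...) = computeW(X, y, solve_SVM_dual(computeQ(X, y), C; kwargs...))
```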

Finally, write the function `w = iris(C; kwargs...)` which loads the iris dataset, takes `versicolor` as the positive class and `virginica` as the negative class, uses `PetalLength` and `PetalWidth` as features, normalizes these inputs, adds a bias (as the last column of $X$), and finally uses the functions written above to compute the optimal separating hyperplane `w`.
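
A sketch of this function, assuming the data is loaded via the RDatasets.jl package (which provides iris with `PetalLength`, `PetalWidth`, and `Species` columns) and that "normalize" means standardization to zero mean and unit variance; both are our assumptions:

```julia
using RDatasets, Statistics

function iris(C; kwargs...)
    df = dataset("datasets", "iris")
    df = df[df.Species .!= "setosa", :]               # keep the two requested classes
    y = [s == "versicolor" ? 1.0 : -1.0 for s in df.Species]
    X = Matrix{Float64}(df[:, [:PetalLength, :PetalWidth]])
    X = (X .- mean(X; dims = 1)) ./ std(X; dims = 1)  # one possible normalization
    X = hcat(X, ones(size(X, 1)))                     # bias as the last column
    return solve_SVM(X, y, C; kwargs...)
end
```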

Writing the following tests is not necessary; however, they can help you debug your code. For example, you can:

- Verify the correctness of the dual solution by trying a large number of different values of $d$.
- Verify the equality of the optimal values of the primal and dual problems (a sketch of such a check follows this list).
- Verify correct propagation of `kwargs` by replacing them with different values of `max_epoch`.
- Anything else.
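
A hedged sketch of the second check (`check_duality` is our hypothetical helper name; it relies on the functions defined above, and the tolerance is our choice):

```julia
using LinearAlgebra

# Compare the primal objective at w with the dual objective at z.
# With few epochs the duality gap may still be visibly nonzero.
function check_duality(X, y, C; kwargs...)
    Q = computeQ(X, y)
    z = solve_SVM_dual(Q, C; kwargs...)
    w = computeW(X, y, z)
    primal = C * sum(max.(0, 1 .- y .* (X * w))) + norm(w)^2 / 2
    dual = -z' * Q * z / 2 + sum(z)
    return isapprox(primal, dual; rtol = 1e-4)
end
```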
