====== Homework 3 ======

For the third homework, the maximum number of uploads is limited to 5. A detailed explanation of the submission procedure is on its [[./start|own page]].

==== SVM classifier training ====

One of the most popular classifiers is the [[https://en.wikipedia.org/wiki/Support-vector_machine|SVM]] (support vector machine). Let us have $n$ observations $\mathbf x_1,\dots,\mathbf x_n\in\mathbb R^d$. For simplicity, let us assume that this is a binary classification task with labels $y_1,\dots,y_n\in\{-1,1\}$, and that the bias is already embedded in the data. SVM searches for a linear separating hyperplane $\mathbf w^\top \mathbf x$ by introducing an auxiliary variable $\xi_i\ge 0$ for each observation $i$ that measures its misclassification error. It achieves this via the condition
$$
\xi_i \ge 1 - y_i\mathbf w^\top \mathbf x_i.
$$
Since we will minimize $\xi_i$, its minimum value $\xi_i=0$ is reached whenever $y_i\mathbf w^\top \mathbf x_i\ge 1$. Thus, for the label $y_i=1$ we want the condition $\mathbf w^\top \mathbf x_i\ge 1$, and similarly for the negative label. This is a stronger condition than the classical $\mathbf w^\top \mathbf x_i\ge 0$. In other words, SVM tries to keep well-classified observations at a certain distance from the separating hyperplane, which adds robustness. The entire optimization problem can then be written as
\begin{align*}
\operatorname*{minimize}_{\mathbf w,\xi}\qquad &C\sum_{i=1}^n\xi_i + \frac12||\mathbf w||^2 \\
\text{subject to}\qquad &\xi_i \ge 1 - y_i\mathbf w^\top \mathbf x_i, \\
&\xi_i\ge 0.
\end{align*}
The objective function now also contains the regularization term $\frac12||\mathbf w||^2$ with the regularization constant $C>0$. Although the problem can be solved in this formulation, it is usually converted to the dual formulation
\begin{align*}
\operatorname*{maximize}_{\mathbf z}\qquad &-\frac12\mathbf z^\top Q\mathbf z + \sum_{i=1}^nz_i \\
\text{subject to}\qquad &0\le z_i\le C,
\end{align*}
where the matrix $Q$ has elements $q_{ij}=y_iy_j\mathbf x_i^\top \mathbf x_j$. There is a close connection between the primal and dual problems. First, their optimal values coincide. Second, once the dual problem is solved, the solution of the primal problem is obtained by $\mathbf w=\sum_{i=1}^ny_iz_i\mathbf x_i$.

The dual problem can be solved using [[https://en.wikipedia.org/wiki/Coordinate_descent|coordinate descent]]. This method updates only one component of the vector $\mathbf z$ in each iteration. Therefore, in each iteration we fix some $i$ and replace the optimization over $\mathbf z$ by the optimization over $\mathbf z+d\mathbf e_i$, where $\mathbf e_i$ is the $i$-th standard basis vector (all zeros with a one at component $i$). In each iteration we then solve
\begin{align*}
\operatorname*{maximize}_d\qquad &-\frac12(\mathbf z+d\mathbf e_i)^\top Q(\mathbf z+d\mathbf e_i) + \sum_{j=1}^nz_j + d \\
\text{subject to}\qquad &0\le z_i + d\le C.
\end{align*}
This subproblem is simple because $d\in\mathbb R$ is a scalar, and it admits a closed-form solution. The procedure is repeated for a given number of epochs, where one epoch performs the above iterations for $i=1$, $i=2$, up to $i=n$. A minimal sketch of one such epoch is shown below.
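To make the scalar subproblem concrete, here is a minimal sketch of one coordinate-descent epoch in Julia. It relies on a derivation that is not spelled out in the assignment: setting the derivative of the concave objective with respect to $d$ to zero gives the unconstrained maximizer $d=(1-(Q\mathbf z)_i)/q_{ii}$, and the box constraint is then handled by clipping $z_i+d$ to $[0,C]$. The function name is only illustrative, not part of the required interface.

<code julia>
using LinearAlgebra

# Illustrative sketch: one epoch of coordinate descent on the SVM dual.
# z is the current dual iterate, Q the dual matrix, C the regularization constant.
function dual_epoch!(z, Q, C)
    n = length(z)
    for i in 1:n
        # Unconstrained maximizer of the scalar subproblem in d.
        d = (1 - dot(Q[i, :], z)) / Q[i, i]
        # Enforce the box constraint 0 ≤ z_i + d ≤ C by clipping.
        z[i] = clamp(z[i] + d, 0, C)
    end
    return z
end
</code>

Repeating this update for ''max_epoch'' epochs, starting from $\mathbf z=0$, corresponds to the procedure described above.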
==== Input ====

Implement the functions ''Q = computeQ(X, y)'', ''w = computeW(X, y, z)'', ''z = solve_SVM_dual(Q, C; max_epoch)'' and ''w = solve_SVM(X, y, C; kwargs...)''. In more detail, their arguments are:
  * ''X'': $n\times d$ matrix of input data;
  * ''y'': vector of $n$ input labels;
  * ''C'': positive regularization constant;
  * ''w'': vector with the solution of the primal problem $\mathbf w$;
  * ''z'': vector with the solution of the dual problem $\mathbf z$;
  * ''Q'': matrix of the dual problem $Q$.

The ''computeQ'' function computes the matrix ''Q'', and the ''computeW'' function computes ''w'' from the dual solution. The ''solve_SVM_dual'' function takes the parameters of the dual problem as inputs and performs ''max_epoch'' epochs (therefore ''n*max_epoch'' iterations) of the coordinate descent method. The solution should be initialized as $\mathbf z=0$. The ''solve_SVM'' function combines the previous functions into one.

Finally, write the function ''w = iris(C; kwargs...)'' which loads the iris dataset, considers ''versicolor'' as the positive class and ''virginica'' as the negative class, uses ''PetalLength'' and ''PetalWidth'' as features, normalizes these inputs, adds a bias (as the last column of $X$), and finally uses the functions written above to compute the optimal separating hyperplane ''w''.

==== Recommended testing ====

Writing the following tests is not necessary. However, it can help you to debug your code. For example, you can:
  * Verify the correctness of the dual solution by trying a large number of different values of $d$.
  * Verify the equality of the optimal values of the primal and dual problems.
  * Verify correct propagation of ''kwargs'' by replacing them with different values of ''max_epoch''.
  * Anything else.
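As an illustration of the second testing suggestion, here is a small sketch that compares the primal and dual objective values on random data. It assumes the interface described in the Input section (''computeQ'', ''solve_SVM_dual'', ''computeW''), the random data and parameter values are arbitrary, and the snippet is only a debugging aid, not a required part of the submission.

<code julia>
using LinearAlgebra, Random

# Sketch: check that the primal and dual objective values (approximately) coincide.
# Assumes computeQ, solve_SVM_dual and computeW follow the interface above.
Random.seed!(0)
n, d, C = 50, 3, 1.0
X = randn(n, d)
y = sign.(X * randn(d) .+ 0.1)        # arbitrary labels in {-1, 1}

Q = computeQ(X, y)
z = solve_SVM_dual(Q, C; max_epoch = 1000)
w = computeW(X, y, z)

# Primal objective: C * Σ ξ_i + ||w||²/2 with ξ_i = max(0, 1 - y_i * w'x_i).
ξ = max.(0, 1 .- y .* (X * w))
primal = C * sum(ξ) + 0.5 * norm(w)^2

# Dual objective: -z'Qz/2 + Σ z_i.
dual = -0.5 * z' * Q * z + sum(z)

println("primal = ", primal, ", dual = ", dual)
</code>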