In all the previous labs, we were trying to design a classifier $q(\mathbf{x}) : \mathcal{X} \rightarrow D$ by using the a priori probabilities $p_K(k)$ and the probability density functions $p_{X|k}(x|k)$. We were minimising the Bayesian risk or some more complex criterion. This kind of approach is commonly referred to as generative, because we try to estimate the full probabilistic model of all variables (we try to find the probabilistic model which “generated” the training samples).
However, this is sometimes not practical or even possible, e.g. we cannot obtain enough samples for a good estimate of $p_{X|k}(\mathbf{x}|k)$, so instead we try to approximate the decision function $q(\mathbf{x})$ directly, without any knowledge of the underlying probability distribution. This approach is referred to as discriminative, because we are only interested in training the classifier to make correct decisions and we do not care about the actual probabilistic model.
In the discriminative approach, we define a criterion (the empirical risk) $R_{\mbox{emp}}(q)$ which estimates the risk of the classifier using the training set $$\mathcal{T} = \{(\mathbf{x}_i, y_i); \mathbf{x}_i\in R^n, y_i\in\{0,1\}\}_{i=1,\ldots,N}. $$ Typically, $R_{\mbox{emp}}(q)$ is computed as the classification error of the classifier $q$ on the training set $\mathcal{T}$, and we try to find a classifier $q^*$ which minimizes the empirical risk: $$R_{\mbox{emp}}(q) = \frac{1}{N}\sum_{i=1}^N W(q(\mathbf{x}_i), y_i), $$ $$q^* = \arg\min_q R_{\mbox{emp}}(q), $$ where $W$ denotes the loss function (typically the zero-one loss function is used). Note that no probability densities appear in the discriminative formulation of the task. In this lab we will also strictly use bold letters for vectors and regular font for scalars.
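For illustration, a minimal sketch (not part of the assignment) of how the empirical risk with the zero-one loss could be computed for numpy arrays of predictions and labels:

```python
import numpy as np

def empirical_risk_01(predictions, labels):
    # fraction of misclassified samples, i.e. R_emp with the zero-one loss W
    return np.mean(predictions != labels)

# two mistakes out of five samples -> R_emp = 0.4
print(empirical_risk_01(np.array([0, 1, 1, 0, 1]), np.array([0, 0, 1, 1, 1])))
```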
See the general information for Python development.
To fulfil this assignment, you need to submit these files (all packed in a single .zip file) into the upload system:
- .zip archive containing:
  - perceptron.ipynb
  - perceptron.py with implementations of the functions:
    - perceptron
    - lift_dimension
    - classif_quadrat_perc
  - perceptron_linear.png
  - perceptron_quadratic.png
Use the template of the assignment. When preparing the zip file for the upload system, do not include any directories; the files have to be in the root of the zip file.
Today's task is to implement the perceptron algorithm for learning a linear classifier. In the case of a linear classifier for two classes ($|K|=2$), the discriminative function $q(\mathbf{x}): \mathcal{X} \rightarrow \{0, 1\} $ can be expressed as
$$ q(\mathbf{x}) = \left\{ \begin{array}{l l} 0 & \quad \mathbf{x} \cdot \mathbf{w} + b > 0 \\ 1 & \quad \mathbf{x} \cdot \mathbf{w} + b < 0 \end{array} \right. $$
where $\mathbf{x} = (x_1, \ldots, x_n)^T$, $\mathbf{w} = (w_1, \ldots, w_n)^T$, and $\mathbf{x} \cdot \mathbf{w} = \sum_{i=1}^n x_i w_i$ denotes the dot product and the dimensionality of $\mathcal{X}$ is $n$. In geometrical interpretation, the expression $\mathbf{x} \cdot \mathbf{w} + b = 0$ is an equation of a line (or a hyperplane in higher dimensions) which divides the space into two subspaces and the value of the discriminative function $q(\mathbf{x})$ represents in which of the two subspaces the vector $\mathbf{x}$ is located.
Thus, the task of finding a linear classifier for two classes can be formulated as finding parameters $\mathbf{w}$ and $b$ so that the following system of linear inequalities holds for each item $(\mathbf{x}_i, y_i)$ in the training set: $$\mathbf{w} \cdot \mathbf{x}_i + b > 0 \quad \mbox{if}\;y_i = 0$$ $$\mathbf{w} \cdot \mathbf{x}_i + b < 0 \quad \mbox{if}\;y_i = 1$$
The solution of the system exists only if the training data are linearly separable, in which case the solution can be found using the perceptron algorithm.
The system of inequalities can be equivalently rewritten as $$\mathbf{v} \cdot \mathbf{z}_i > 0$$ if we substitute $$\mathbf{v} = [\mathbf{w}; b]$$ $$\mathbf{z}_i = [\mathbf{x}_i; 1] \quad \mbox{if}\;y_i = 0$$ $$\mathbf{z}_i = -[\mathbf{x}_i; 1] \quad \mbox{if}\;y_i = 1$$ (try it yourself on a piece of paper - all that happened here is that we made a 'clever' substitution so that we still have a system of $N$ inequalities, but now all inequalities have the same format).
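A small illustrative sketch of this substitution in numpy (not required by the assignment; it assumes the convention, used in the examples below, that samples are stored in the columns of X):

```python
import numpy as np

# toy data: 2 features per column, labels 0/1
X = np.array([[1, 1, 2, 2],
              [1, 3, 5, 6]])
y = np.array([0, 0, 1, 1])

# append the constant 1 to every sample: [x_i; 1]
Z = np.vstack([X, np.ones(X.shape[1])])
# flip the sign of the samples from class 1
Z[:, y == 1] *= -1

# now (w, b) solves the original inequalities iff v = [w; b] satisfies v . z_i > 0
v = np.array([2., -3., 9.])   # a (w, b) that separates this toy data (cf. the example in the tasks below)
print(v @ Z > 0)              # -> [ True  True  True  True]
```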
Now we can easily describe the algorithm:
The Perceptron Algorithm
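In the notation above (transformed vectors $\mathbf{z}_i$ and $\mathbf{v} = [\mathbf{w}; b]$), a standard formulation of the algorithm is the following; the exact wording used in the lecture and in [1] may differ in details:

1. Initialise $\mathbf{v} \leftarrow \mathbf{0}$ (or any starting vector).
2. Find a training vector $\mathbf{z}_i$ for which $\mathbf{v} \cdot \mathbf{z}_i \leq 0$.
3. If no such vector exists, stop; $\mathbf{v}$ satisfies all the inequalities and yields $\mathbf{w}$ and $b$.
4. Otherwise update $\mathbf{v} \leftarrow \mathbf{v} + \mathbf{z}_i$ and return to step 2.

If the training data are linearly separable, the algorithm is guaranteed to terminate after a finite number of updates; otherwise it never stops, which is why the implementation below takes a maximum number of iterations as a parameter.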
Tasks
Implement the function w, b = perceptron(X, y, max_iterations), where:
- X is a matrix of training samples (one sample per column),
- y is a vector of class labels (0 or 1),
- max_iterations is the maximum number of iterations of the algorithm,
- w, b = nan, nan is returned when a solution is not found within max_iterations iterations.
```python
X = np.array([[1, 1, 2, 2], [1, 3, 5, 6]])
y = np.array([0, 0, 1, 1])
w, b = perceptron(X, y, 100)
print('your solution may differ\nw: ', w, '\nb: ', b)
# -> your solution may differ
# -> w: [ 2. -3.]
# -> b: 9.0

X = np.array([[3, 1, 2, 2], [2, 3, 5, 6]])
y = np.array([0, 0, 1, 0])
w, b = perceptron(X, y, 100)
print('w: ', w, '\nb: ', b)
# -> w: nan
# -> b: nan
```
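A minimal sketch of one possible implementation, under the same conventions as the example above (samples in the columns of X, labels 0/1); it is only an illustration, not the reference solution:

```python
import numpy as np

def perceptron(X, y, max_iterations):
    # transform to the homogeneous formulation: z_i = +-[x_i; 1]
    Z = np.vstack([X, np.ones(X.shape[1])])
    Z[:, y == 1] *= -1

    v = np.zeros(Z.shape[0])
    for _ in range(max_iterations):
        violated = np.nonzero(v @ Z <= 0)[0]   # indices with v . z_i <= 0
        if violated.size == 0:
            return v[:-1], v[-1]               # all inequalities satisfied: w, b
        v = v + Z[:, violated[0]]              # perceptron update with a violated sample
    return np.nan, np.nan                      # no solution found within max_iterations
```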
As we have seen in the last point of the first task, for some data no linear discriminative function separating the two classes exists.
In order to deal with this problem, we will extend the perceptron algorithm so that it learns a quadratic discriminative function (see [2]). We define a mapping function $\phi: \mathcal{X} \rightarrow \mathcal{Z}$ which projects the training samples $\mathbf{x} \in \mathcal{X}=\mathbb{R}^2$ into a higher-dimensional space $\mathcal{Z}$.
In our task, we will consider the discriminative function in the form $$q'(\mathbf{x}) = w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_1x_2 + w_5x_2^2 + b$$
By comparing this discriminative function $q'(\mathbf{x})$ to the linear case $q(\mathbf{x})$, we can derive the formula for the feature vector $\mathbf{z} \in \mathcal{Z}$ in the lifted space:
$$\mathbf{z} = \phi(\mathbf{x})= (x_1, x_2, x_1^2, x_1x_2, x_2^2)^T$$
The entire extension of the perceptron algorithm then goes as follows: first lift the training samples into $\mathcal{Z}$ using $\phi$, then run the ordinary (linear) perceptron on the lifted data to obtain $\mathbf{w} \in \mathbb{R}^5$ and $b$, and finally classify a new sample $\mathbf{x}$ by evaluating the sign of $\mathbf{w} \cdot \phi(\mathbf{x}) + b$.
Implement the function Z = lift_dimension(X), which maps the input samples (columns of X) into the lifted space according to $\phi$:
```python
X = np.array([[1, 0, 2], [3, 5, 1]])
Z = lift_dimension(X)
print(Z)
# -> [[ 1  0  2]
# ->  [ 3  5  1]
# ->  [ 1  0  4]
# ->  [ 3  0  2]
# ->  [ 9 25  1]]
```
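A possible sketch of the lifting function, again assuming that the samples are stored in the columns of X:

```python
import numpy as np

def lift_dimension(X):
    # map 2-D samples (columns of X) to the 5-D features (x1, x2, x1^2, x1*x2, x2^2)
    x1, x2 = X[0, :], X[1, :]
    return np.vstack([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])
```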
Implement the function Y = classif_quadrat_perc(tst, model), where:
- tst is a matrix of test samples (one sample per column),
- Y is a vector of predicted labels (0 or 1),
- model holds the parameters of the classifier trained by the perceptron in the lifted space.
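A sketch of how the classification in the lifted space could look; it assumes a lift_dimension as sketched above and a model represented as a dictionary with keys 'w' and 'b', which may differ from the representation used in the template:

```python
import numpy as np

def classif_quadrat_perc(tst, model):
    # lift the test data and apply the linear rule in the lifted space;
    # model is assumed here to be a dict with keys 'w' and 'b'
    Z = lift_dimension(tst)
    # q'(x) > 0 -> class 0, q'(x) < 0 -> class 1
    return (model['w'] @ Z + model['b'] < 0).astype(int)
```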
Plot the decision boundary of the quadratic classifier using the provided pboundary function and the trained parameters w and b, and save the figure as perceptron_quadratic.png.
Hint: plt.ginput() can be used to enter 2D points interactively in a matplotlib figure.
1. Why do we use a discriminative model instead of a generative one?
Solution
a) Yes, but no :)
b) Yes.
c) No. We could not train a discriminative model without labels. Also, we have labels for the perceptron (a discriminative model).
d) No. We just used a discriminative model (perceptron) on linearly non-separable data in the previous section (perceptron_quadratic.png).
2. Which statements about the empirical risk $R_{\mbox{emp}}(q)$ are correct? Select all that apply.
a) No. When trained on "easy" examples, the empirical risk (risk on training data) can be 0, but the Bayesian risk > 0.
b) No. It can be equal to 0 (no training error).
c) No. It is possible that we have overfitted to the training set and the performance on test data will be poor.
d) Yes. (That is why we do validation / cross-validation.)
3. When training a linear discriminative function on linearly non-separable data, the perceptron algorithm will
a) No. It does not even know the data are not separable.
b) Yes. It will never know if it is getting closer to a solution, or if no solution is possible.
c) No. ^^
d) No. Just no.
4. When training a perceptron with a quadratic discriminative function $q'(\mathbf{x})$ (i.e. in a lifted feature space) on linearly non-separable data, the perceptron algorithm will
a) No. It will follow the perceptron algorithm, which does not specify exiting with errors.
b) No. Maybe it will, but it depends on the training data.
c) No. Maybe it will, but it depends on the training data.
d) Yes. ^^ If the training data can be linearly separated in the lifted feature space, it will find a solution in a finite number of steps, otherwise it will run indefinitely.
Choose one or both of the following bonus tasks.
Implement Kozinec's algorithm for learning a linear discriminant function. For a set of points $\{x^j, j \in J\}$, Kozinec's algorithm finds a vector $\alpha$ which satisfies the set of linear inequalities $$ \langle\alpha, x^j\rangle > 0, \quad j \in J, $$ i.e. the problem is the same as for the perceptron. Kozinec's algorithm creates a sequence of vectors $\alpha_1, \alpha_2, \ldots, \alpha_t, \alpha_{t+1}, \ldots$ in the following way.
The algorithm: Vector $\alpha_1$ is any vector from the convex hull of the set $\{x^j, j \in J\}$, e.g. one of the vectors $x^j$, $j \in J$. If the vector $\alpha_t$ has already been computed, $\alpha_{t+1}$ is obtained using the following rules:
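For reference, a standard statement of these rules (this should correspond to Chapter 5 of [3]; consult the book for the precise formulation): if $\langle\alpha_t, x^j\rangle > 0$ holds for all $j \in J$, then $\alpha_{t+1} = \alpha_t$ and the algorithm stops, since $\alpha_t$ is a solution. Otherwise, let $x^t$ be any vector with $\langle\alpha_t, x^t\rangle \le 0$; the new vector $\alpha_{t+1}$ is the point of the segment between $\alpha_t$ and $x^t$ that is closest to the origin, $$\alpha_{t+1} = (1-k)\,\alpha_t + k\,x^t, \qquad k = \arg\min_{k\in[0,1]} \bigl\|(1-k)\,\alpha_t + k\,x^t\bigr\|.$$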
Output: identical to the output of Task 1 in the non-bonus part. Hint: the Kozinec algorithm is described in Chapter 5 of [3].
Implement the multi-class perceptron learning algorithm. The multi-class classifier is formulated as $$ f(\mathbf{x}) = \arg\max_{y\in\cal Y}\,\bigl(\mathbf{w}_y^\top\mathbf{x}+b_y\bigr), $$ where $\cal W=\bigl\{(\mathbf{w}_y, b_y)\mid y\in\cal Y\bigr\}$ is a set of classifier parameters and $\cal Y$ is a set of classes.
The training algorithm was not presented at the lectures, but it is simple and elegant as follows:
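A hedged sketch of the standard multi-class perceptron update (the formulation in [6] may differ in details; it assumes labels are integers 0, ..., num_classes-1 and samples are stored in the columns of X):

```python
import numpy as np

def multiclass_perceptron(X, y, num_classes, max_iterations):
    # one (w_y, b_y) pair per class, prediction by argmax of the linear scores
    n, N = X.shape
    W = np.zeros((num_classes, n))
    b = np.zeros(num_classes)
    for _ in range(max_iterations):
        errors = 0
        for i in range(N):
            y_hat = int(np.argmax(W @ X[:, i] + b))
            if y_hat != y[i]:
                # reward the correct class, penalise the wrongly predicted one
                W[y[i]] += X[:, i]
                b[y[i]] += 1
                W[y_hat] -= X[:, i]
                b[y_hat] -= 1
                errors += 1
        if errors == 0:
            return W, b            # all training samples classified correctly
    return np.nan, np.nan          # no error-free pass within max_iterations
```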
Repeat steps 1 and 2 for data with 3 or more classes.
Plot the multi-class decision boundaries.
[1] Duda, Hart, Stork: Pattern Classification, Chapter 5.5 (algorithm 4, page 230)
[2] Nonlinear perceptron (short support text for labs)
[3] Michail I. Schlesinger, Vaclav Hlavac: Ten Lectures on Statistical and Structural Pattern Recognition. Kluwer, 2002
[4] Jiri Matas: Learning and Linear Classifiers (slides)
[5] V. Franc: Perceptron Algorithm (in Czech)
[6] R. Sara: Multi-class Perceptron