# (Feature-based) Classification Seminar

Part of the data has been generously provided by the Eyedea Recognition company.

This is supplementary material for (feature-based) classification and decision making under uncertainty for the bachelor’s track X33KUI seminars. Assignments for either first or second seminar are just suggested, not mandatory. For those interested in a more detailed understanding, we suggest books [1]], [[courses:be5b33kui:labs:machine_learning:start#[3]|3] that are easily readable even for a bachelor student. Book [5] is slightly more complicated, but supplies an excellent, deep knowledge on the subject. For practical tasks, we suggest the Matlab toolbox [4], which can be acquired for free.

## 1 Seminar 1

### 1.0 MATLAB intro

To install MATLAB follow the instructions on http://www.fel.cvut.cz/en/user-info/matlab.html.

### 1.1 Bayesian decision-making

#### Exercise 1.1

We know the statistical distribution of male and female heights (see the following table). Infer the most likely gender of a person 168 cm (L) tall.

$$\begin{array}{c||l|l|l|l|l|l||c} \begin{subarray}{c} x \\ \text{cm} \end{subarray} & \begin{subarray}{c} \text{XS} \\ \text{(0–100)} \end{subarray} & \begin{subarray}{c} \text{S} \\ \text{(100–125)} \end{subarray} & \begin{subarray}{c} \text{M} \\ \text{(125–150)} \end{subarray} & \begin{subarray}{c} \text{L} \\ \text{(150–175)} \end{subarray} & \begin{subarray}{c} \text{XL} \\ \text{(175–200)} \end{subarray} & \begin{subarray}{c} \text{XXL} \\ \text{(200-}\infty\text{)} \end{subarray} & \sum \\ \hline \hline P(x|\text{male}) & 0.05 & 0.15 & 0.2 & 0.25 & 0.3 & 0.05 & \boldsymbol 1 \\ \hline P(x|\text{female}) & 0.05 & 0.1 & 0.3 & 0.3 & 0.25 & 0.0 & \boldsymbol 1 \\ \hline \end{array}$$

#### Exercise 1.2

Solve this task for a person that has been picked at random from a group with 60 % of males and 40 % of females.

#### Exercise 1.3

You are given a 10×10 pixel image, and each of the pixels can have one of 256 possible values for intensity. How many features are provided? How many possible values are there for the whole set of features (values in the observation space)? How many possible strategies are there?
You are given an image, and you are told that there is a letter in it $S = \{A,\ldots,Z\}$.

#### Exercise 1.4

Propose a classifier for the previous exercise, which assumes features to be conditionally independent.

#### Exercise 1.5

Can we possibly find an error-free decision strategy (make no mistakes)? For what group of people is the inference based on height of a person particularly bad?

#### Exercise 1.6

We have a bag of old coins that have been subject to various levels of use. Coins of different values may therefore be of different sizes. The value of a coin can be still observed by looking at it. We have a task to partition the coins to the groups of same value according to their values based on their weight. We know that there are coins of values 1, 2, 5 CZK, that is $s \in \{1,2,5\}$. The loss function is:

$$l(s, d) = |h_d - h_s|$$

where $h_s$ is the value of a selected coin and $h_d$ is our decision about the value of the coin.

We have fast and simple to use scales with a 5 g precision. Suggest a classification strategy that would minimize the loss and work. By work we mean actually classifying the coins manually. This may be the case if we have to achieve an absolute precision. This is not possible for a vending machine though, as it may accept coins of various values.
Let’s estimate the expected weight of each of the coins based on an experiment. Pick 100 coins, weigh them and record their values. As a result, we are going to have a training multiset1). We may end up with a following table:

$$\begin{array}{c|rrrrr|c} s/x & \text{5 g} & \text{10 g} & \text{15 g} & \text{20 g} & \text{25 g} & \sum \\ \hline \text{1 CZK} & 15 & 10 & 3 & 0 & 0 & \boldsymbol{28} \\ \text{2 CZK} & 7 & 13 & 16 & 6 & 1 & \boldsymbol{43} \\ \text{5 CZK} & 0 & 1 & 2 & 11 & 15 & \boldsymbol{29} \\ \hline \sum & 22 & 24 & 21 & 17 & 16 & \boldsymbol{100} \\ \end{array}$$

$$\begin{array}{c|rrrrr|c} P(s,x) & \text{5 g} & \text{10 g} & \text{15 g} & \text{20 g} & \text{25 g} & \sum \\ \hline \text{1 CZK} & 0.15 & 0.10 & 0.03 & 0 & 0 & \boldsymbol{0.28} \\ \text{2 CZK} & 0.07 & 0.13 & 0.16 & 0.06 & 0.01 & \boldsymbol{0.43} \\ \text{5 CZK} & 0 & 0.01 & 0.02 & 0.11 & 0.15 & \boldsymbol{0.29} \\ \hline \sum & 0.22 & 0.24 & 0.21 & 0.17 & 0.16 & \boldsymbol{1} \\ \end{array}$$

How many possible strategies are there? Our scales say the coin is 10 g, which class would you choose?

## 2 Seminar 2

### 2.1 Nearest neighbour classifier

#### Exercise 2.1

Create a single nearest neighbour classifier (1-NN) for a person’s gender given with height and age as features. If we changed units for height, would the classifier change?

Note: 1-NN is a relatively good classifier, even though it is quite simple. If we know the actual p. d. [2], then it is possible to show that the probability of a wrong classification is at most twice as bad asymptotically as the size of the training multiset approaches $\infty$.

### 2.2 Linear Classifier

Another way to avoid the need to estimate and represent probabilities is to design discrimination functions $g_s(x)$2) directly. One of the options is the linear classifier, which uses discrimination function in the following form:

$$g_s(\vec{x})=\vec{w}_s^{\top}\vec{x}+b_s$$

where $\vec{w}_s$ is the weight vector and the scalar $b_s$ for a translation3) along the y-axis (also called a bias). An object $\vec{x}$ is classified as a member of the class $s$, whose discrimination function $g_s(\vec{x})$ values is greater than those of the other classes ($argmax_s g_s(\vec{x}$)). Therefore, this task transforms into an optimization problem, where we are looking for parameters of the linear discrimination functions, which minimize some criterion, e.g., number of wrong classifications on the given training multiset.

#### Exercise 2.2

Is it possible for the training multiset in the Figure 1 to be classified by a linear classifier perfectly? If so, sketch such a linear classifier.

Figure 1: Training multiset for a 2D feature space with 4 classes.

## 3 Programming exercise – alphanumeric characters recognition

The task is to classify some letters of car registration plates. Let’s assume that a plate has already been localized in the picture (see Figures 4 and 5). Letter images are normalized to be 10×10 pixels. You have a randomly chosen training data at your disposal. Letter images are represented by row vectors of brightness values of concatenated columns of pixels (see Figure 3). Each row in the input MAT-files represents one image. The train.mat file contains feature vectors and the train_labels.mat file the respective class labels. Images are just for convenience, the important data is in the MAT-files. The result of classification should have the same format as the variable in the train_labels.mat. Each row should contain a class for the letter on the same line in train.mat.
It is important to realize that this is usually everything what a customer has available. After you claim you are ready with your code, your customer, seminar teacher in this case, uses testing data to evaluate your work. The testing data are going to be located in files test.mat and test_labels.mat.

For the implementation use MATLAB. The prepared files are available here kui_classification_students.zip.

### 3.2 Individual Exercise

Implement the following algorithms in order to complete the exercise explained in the previous paragraph.

A Nearest Neighbour Classifier Implement a 1-NN classifier. Show that it performs this task on the training multiset.

Bayesian Classification Implement a naïve Bayes classifier. Assume conditional independence of intensities for each of the pixels. Therefore, we have the following formula for each of the classes:

$$P(\vec{x}|s)=P(x_1|s) \cdot P(x_2|s) \dotsc P(x_n|s)$$

where $x_i$ represents the intensity of the $i$−th pixel.

Note To prevent zero probabilities as a consequence of a relatively small training set, it is necessary to add small positive value to $P(x_i|s)$ for all $i$. If you do not do that, only one zero probability $P(x_i|s)$ cause that $P(\vec{x}|s)=0$.

### 3.3 Example Solution – The Perceptron Algorithm

You have one example solution at your disposal, which uses a perceptron for classification. An example of how to use it can be found in the main.m script.

In the script, it is shown how to load input files, train the perceptron, use the trained perceptron for classification, and get results (a confusion matrix) of the classification.

The basic functions are perceptronLearn(), perceptronClassify() and confusionMatrix().

Figure 2: Examples of normalized letter images of registration plates.

Figure 3: Pixels are represented by a row vector of concatenated columns. First comes the first column, then the second etc. It is obvious that dark columns of the letter J, which are in the extreme right part of the image, are at the end of the data vector.

### 3.4 Auxiliary code description

##### 3.4.1 Main script

The script main calls learning and classification functions for all classifiers. It also loads input data and calls the function for printing results of the classifications. This script makes all needed callings of the functions. However, do not forget comment the callings of the functions that are not implemented at the beginning.

##### 3.4.2 Perceptron classifier

This is the concrete implementation of the Perceptron classifier. See the implementation and the comments for more details.

Perceptron learning

perceptron = perceptronLearn(train, train_labels) - the inputs of the function are the training data and their labels. The output of the function is the structure with the learned classifier. The function firstly makes mapping from the char class labels to the number class labels (“conversion table”) and convert the char labels to the number labels. Finally there is the learning itself as per the following algorithm:

1. Set $\vec{w}_y=\vec{0}$ and $b_y=0$ for all $y \in Y$ ($Y$ - set of all possible labels).
2. Pick a random incorrectly classified input. If there is no such input, then STOP, because the learning has finished, by finding parameters for an errorfree classification of the input data.
3. Let $\vec{x}_t,y_t$ is an improperly classified input and $\hat{y}$ is a classification of $\vec{x}_t$ using the current classifier. Adapt the parameters of the classifier according to the following formulae:
\eqalign{ \vec{w}_{y_t} &= \vec{w}_{y_t} + \vec{x}_t \cr b_{y_t} &= b_{y_t} + 1 \cr \vec{w}_{\hat{y}} &= \vec{w}_{\hat{y}} - \vec{x}_t \cr b_{\hat{y}}&= b_{\hat{y}} - 1 }
4. Continue with step 2.

Perceptron classification

classLabels = perceptronClassify(perceptron, test) - the inputs of the function are the structure with the learned classifier and the testing data. The output of the function is an array with the class labels. Classification follows according to

$$\hat{y}=\arg \max\limits_{y \in Y} \vec{x}_t^\top\vec{w}_y+b_y$$

with the result being converted from the number labels to the char labels and then returned as an output.

##### 3.4.3 Confusion Matrix

confusionMatrix() - the function prints result of the classification and the confusion matrix. The inputs of the function are given labels and the labels obtained by the classification.

##### 3.4.4 NN classifier

The functions nnLearn and nnClassify for the 1-nearest neighbor classifier ready for implementation.

nn = nnLearn(train, trainLabels) - the inputs of the function are the training data and their labels. The output of the function is the structure with the learned classifier.

classLabels = nnClassify(nn,test) - the inputs of the function are the structure with the learned classifier and the testing data. The output of the function is an array with the class labels.

##### 3.4.5 Naive Bayes classifier

The functions bayesLearn and bayesClassify for the naive Bayes classifier ready for implementation.

bayes = bayesLearn(train, train_labels) - the inputs of the function are the training data and their labels. The output of the function is the structure with the learned classifier.

classLabels = bayesClassify(bayes, test) - the inputs of the function are the structure with the learned classifier and the testing data. The output of the function is an array with the class labels.

### 3.5 Evaluation

• Implementation of the 1-NN classifier [0–5] pts.
• Implementation of the naive Bayes classifier assuming the conditional independence of the probabilities [0–10] pts.
• Robust implementation capable of handling arbitrary number of classes or feature vectors [0–2] pt.
• Bonus points for an effective, interesting implementation and additional features [0–1] pt.

## References

##### [1]

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer Science+Bussiness Media, New York, NY, 2006.

##### [2]

T.M. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, January 1967.

##### [3]

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern classification. Wiley Interscience Publication. John Wiley, New York, 2nd edition, 2001.

##### [4]

Vojtěch Franc and Václav Hlaváč. Statistical pattern recognition toolbox for Matlab. Research Report CTU–CMP–2004–08, Center for Machine Perception, K13133 FEE. Czech Technical University, Prague, Czech Republic, June 2004. http://cmp.felk.cvut.cz/cmp/software/stprtool/index.html.

##### [5]

Michail I. Schlesinger and Václav Hlaváč. Ten Lectures on Statistical and Structural Pattern Recognition. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.

## Examples of Practice

Figure 4: Example of an automatic text localization in an image. Further information can be found at http://cmp.felk.cvut.cz/~zimmerk/lpd/index.html.

Figure 5: Example of a commercial application for registration plate recognition in a video. Demo videos can found at http://cmp.felk.cvut.cz/cmp/courses/X33KUI/Videos/RP_recognition/ and you have also seen a demonstration of the first algorithm in the first seminar.

1)
Multiset means that e.g. coin of 2 CZK and 5 g may occur repeatedly in training set.
2)
Discrimination functions exists even for Bayesian classifiers. Discuss.
3)
We have used slightly different naming conventions in the lectures $g_s(\vec{x})=\vec{b}_s^\top\vec{x}+c_s$.