====== Classification/recognition (last assignment) ======

The machine learning task has two parts: **symbol classification** and **determination of the optimal classifier parameter**. Check the [[https://cw.felk.cvut.cz/upload/|Upload system]] for the due dates and note that the two tasks are submitted **separately**. The classifiers are expected to be implemented by the students themselves; the use of advanced pattern-recognition libraries is forbidden. If unsure, ask the TA. Data were provided by [[http://www.eyedea.cz|Eyedea Recognition]]; some data come from public resources.

==== Problem ====

The task is to design a classifier / character-recognition program. The input is a small grayscale image of one handwritten character (a letter or a digit); the output is a class decision, i.e. the recognition of the character in the image. You are given training data: a set of images together with their correct classifications. This is usually all that the customer provides. After you prepare the code, the customer, in this case represented by the instructor, will evaluate your work on different test data. We recommend dividing the provided data into a training and a test set. Your resulting code will be tested on new data within the AE system.

==== Data ====

The images are in PNG format in one folder, which also contains the file ''truth.dsv'' ([[https://en.wikipedia.org/wiki/Delimiter-separated_values|dsv format]]). The file names are not related to the file contents. Each line of ''truth.dsv'' has the form ''file_name.png:character'', e.g. ''img_3124.png:A''. The separator character is '':'', which never appears in a file name. The file names contain only letters, digits, or underscores (_).

  * {{ :courses:b3b33kui:cviceni:strojove_uceni:train_1000_10.zip |}} images 10x10, 20 classes, 50 exemplars for each
  * {{ :courses:b3b33kui:cviceni:strojove_uceni:train_1000_28.zip |}} images 28x28, 10 classes, 100 exemplars for each
  * {{ :courses:b3b33kui:cviceni:strojove_uceni:train_700_28.zip |}} images 28x28, 10 classes, varying number of exemplars

==== Interface specification ====

Implement k-NN and Naive Bayes classifiers. The main code will be in ''classifier.py'':

<code>
>> python3.8 classifier.py -h
usage: classifier.py [-h] [-k K] [-b] [-o name] train_path test_path

Learn and Classify image data

positional arguments:
  train_path  Path to the training data
  test_path   Path to the testing data

optional arguments:
  -h, --help  show this help message and exit
  -k K        run K-NN classifier
  -b          run Naive Bayes classifier
  -o name     name (with path) of the output dsv file with the results
</code>

Example:

<code>
python3 classifier.py -k 3 -o classification.dsv ./train_data ./test_data
</code>

runs the 3-NN classifier (training and classification) and saves the results as ''classification.dsv''. The saved file must have the same format as ''truth.dsv''.

==== Examples of use ====

{{:courses:be5b33kui:labs:machine_learning:02_03_04_obr5.png?800|Automatic text localization from pictures. More information available at [[http://cmp.felk.cvut.cz/~zimmerk/lpd/index.html|http://cmp.felk.cvut.cz/~zimmerk/lpd/index.html]].}}

Fig. 3: //Automatic text localization from pictures. More information available at [[http://cmp.felk.cvut.cz/~zimmerk/lpd/index.html|http://cmp.felk.cvut.cz/~zimmerk/lpd/index.html]].//

{{:courses:be5b33kui:labs:machine_learning:02_03_04_obr6.png?800|Industry application for license plate recognition. Videos are available at http://cmp.felk.cvut.cz/cmp/courses/X33KUI/Videos/RP_recognition.}}

Fig. 4: //Industry application for license plate recognition.
Videos are available at [[http://cmp.felk.cvut.cz/cmp/courses/X33KUI/Videos/RP_recognition|http://cmp.felk.cvut.cz/cmp/courses/X33KUI/Videos/RP_recognition]].//

====== Selection of optimal classifier ======

Very often, there are many classifiers that can be used for an ML task, and we have to decide which one is best for the task at hand. The zip archive {{ :courses:b3b33kui:cviceni:strojove_uceni:classif_result_tables.zip |classif_result_tables.zip}} contains all the files needed for this task. The task shall be solved in Python. You are asked to upload a **pdf report** and one function related to the part **Safety first**.

We have 5 different trained binary classifiers. The result of each classifier depends on the value of the parameter $\alpha$. Thus, the output of a given classifier can be expressed as a function $C({\bf x}, \alpha) \in \{0,1\}$, where ${\bf x}$ is the feature vector of the sample we want to classify. We tested all classifiers on a test set $X = \{{\bf x}_0, {\bf x}_1, \dots, {\bf x}_{99}\}$. At the same time, we tried all possible values of $\alpha \in \{\alpha_0, \alpha_1, \dots, \alpha_{49}\}$. For each classifier $i \in \{1, 2, \dots, 5\}$ we obtain a table with values $C_i({\bf x}_j, \alpha_k) \in \{0,1\}$, where $j \in \{0, 1, \dots, 99\}$ and $k \in \{0, 1, \dots, 49\}$ (see //C1//, //C2//, //C3//, //C4//, //C5// in ''classif_result_tables.zip''). The true labels of the samples ${\bf x}_0, {\bf x}_1, \dots, {\bf x}_{99}$ from the test set are also available (see //GT// in ''classif_result_tables.zip'').

=== Selection of appropriate parameter ===

In this section, suppose that the classifiers are used for binary classification of images (e.g. whether a dog is in the picture or not). For classifier 1 (table //C1//), determine the best value of the parameter from $\{\alpha_0, \alpha_1, \dots, \alpha_{49}\}$. Be aware that you do not know the concrete task for which the classifier will be used.
Therefore, it is necessary to use a sufficiently general approach. In other words, the chosen parameter should not be one that is optimal for a particular task but globally inefficient for most other tasks. In a short **pdf report** (definitely shorter than one A4 page) explain your choice of the parameter (use terms such as sensitivity, false positive rate, ROC curve, etc.). Include in the report a figure of the ROC curve with a marked point corresponding to the optimal value of the parameter.

=== Top secret! ===

Imagine that you are agent 00111 and you want to use your fingerprint to secure some top-secret documents. The data are very sensitive, so it is better to delete them than to secure them poorly. You also know that you will always have enough time to unlock the data. Five trained classifiers (with different $\alpha$ values) are available. The input of each classifier is a fingerprint scan. For your fingerprint, the desired output of the classifier is 1 (the data will be unlocked); for any other fingerprint it is 0. All classifiers were tested on the test set $X$ for all possible values of the parameter $\alpha$. The results are saved in tables //C1//, //C2//, //C3//, //C4//, //C5// (see above). Ground-truth values (the real fingerprint affiliation) of the different scans are also available (see //GT//). Select the most suitable classifier and its $\alpha$ parameter. In the **pdf report** state your choice and explain the criteria you used.

=== Safety first ===

This part is a continuation of the previous part, **Top secret!**. A colleague, also an agent, will send you his classifier, which also depends on the parameter $\alpha$. However, you are not sure about his loyalty, as he may be a double agent. Thus, it will be necessary to determine whether his classifier is better than the one you selected in the previous section.
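The quantities this section works with (the ROC curve in the report, the choice of $\alpha$, and the comparison of classifiers) can all be computed directly from the tables. A minimal sketch, assuming each table //C_i// is loaded as a list of rows ''C[j][k]'' with values in {0, 1} (row ''j'' = sample ${\bf x}_j$, column ''k'' = value $\alpha_k$) and //GT// as a 0/1 list; the distance-to-corner criterion shown is only one common task-agnostic choice, not the required one:

```python
def roc_points(C, gt):
    # C[j][k] is the decision C_i(x_j, alpha_k) in {0, 1};
    # gt[j] is the true label of sample x_j.
    # Returns one (FPR, TPR) point per alpha value.
    points = []
    for k in range(len(C[0])):
        tp = fp = fn = tn = 0
        for j, truth in enumerate(gt):
            pred = C[j][k]
            if truth == 1:
                tp += pred        # true positive
                fn += 1 - pred    # false negative
            else:
                fp += pred        # false positive
                tn += 1 - pred    # true negative
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points


def closest_to_ideal(points):
    # One common task-agnostic criterion: the alpha index whose ROC
    # point lies closest to the ideal corner (FPR, TPR) = (0, 1).
    return min(range(len(points)),
               key=lambda k: points[k][0] ** 2 + (1.0 - points[k][1]) ** 2)
```

Plotting the returned points (FPR on the x-axis, TPR on the y-axis) gives the ROC curve requested in the report; which summary of these points you optimize is exactly the judgment the report should defend.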
For security reasons, you will have to make the decision about his classifier using a function created in advance. The input of the function will be a table //C6// with the results of the classification on the test set for the different $\alpha$ parameters (same format as //C1//, //C2//, etc.), possibly together with other input parameters of your choice. The output of the function should be the decision whether the new classifier is better than the one you selected yourself (''true'' if the obtained classifier is better than the previous one, ''false'' otherwise). In the **pdf report** explain the criteria the function uses. Submit the function as well.

====== References ======

Christopher M. Bishop. //Pattern Recognition and Machine Learning.// Springer Science+Business Media, New York, NY, 2006.

T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. //IEEE Transactions on Information Theory,// 13(1):21–27, January 1967.

Richard O. Duda, Peter E. Hart, and David G. Stork. //Pattern Classification.// Wiley-Interscience Publication. John Wiley, New York, 2nd edition, 2001.

Vojtěch Franc and Václav Hlaváč. //Statistical Pattern Recognition Toolbox for Matlab.// Research Report CTU–CMP–2004–08, Center for Machine Perception, K13133 FEE, Czech Technical University, Prague, Czech Republic, June 2004. http://cmp.felk.cvut.cz/cmp/software/stprtool/index.html.

Michail I. Schlesinger and Václav Hlaváč. //Ten Lectures on Statistical and Structural Pattern Recognition.// Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.

===== Evaluation =====

Both tasks are evaluated separately.

==== Symbol classification - Evaluation ====

Automated Evaluation (AE) will only check that your code works and show you the correctly classified ratio on the small AE dataset. The actual points, however, will be awarded in a later ("tournament") run on a large dataset.

  * The closest neighbor classifier (1-NN) is evaluated according to the table below.
[0–3 points]
  * The Naive Bayes classifier is also evaluated according to the table below. [0–5 points]
  * Code quality: [0–2 points]

^ 1-NN ^^
^ correctly classified ^ points ^
| >95% | 3 |
| >80% | 2 |
| >60% | 1 |
| <=60% | 0 |

^ Naive Bayes classifier ^^
^ correctly classified ^ points ^
| >82% | 5 |
| >75% | 4 |
| >70% | 3 |
| >65% | 2 |
| >60% | 1 |
| >55% | 0.5 |
| <=55% | 0 |

==== Selection of optimal classifier -- Evaluation ====

  * Send a PDF report in which you determine the optimal parameter. Points are awarded based on the chosen parameter and on the method used to choose it. [0–5 points]
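Before submitting, you can estimate the correctly classified ratio from the tables above yourself by holding out part of the provided training data and comparing your output dsv against the corresponding ''truth.dsv'' lines. A minimal sketch (the dsv format follows the assignment; this helper is not part of the required interface):

```python
def load_dsv(path):
    # Read a dsv file with lines of the form "file_name.png:character".
    labels = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                name, label = line.split(":")
                labels[name] = label
    return labels


def accuracy(pred_path, truth_path):
    # Fraction of images whose predicted character matches the truth.
    pred = load_dsv(pred_path)
    truth = load_dsv(truth_path)
    correct = sum(1 for name, label in truth.items()
                  if pred.get(name) == label)
    return correct / len(truth)
```

For example, an accuracy above 0.95 on your own held-out split suggests the 1-NN classifier is in the full-points band, though the tournament dataset may of course differ.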