====== Machine Learning - Symbol Classification [Recognition] (last assignment p1) ======

The machine learning task has two parts: **symbol classification** and **determination of the optimal classifier parameter**. Check the [[https://cw.felk.cvut.cz/upload/|Upload system ]] for the due date, and note that the two parts are submitted **separately**.

You are expected to implement the classifiers yourself; the use of advanced pattern recognition libraries is forbidden. If unsure, ask the TA.

The data are provided by [[http://www.eyedea.cz|Eyedea Recognition]]; some data come from public resources.

==== Problem ====

The task is to design a classifier, i.e. a character recognition program. The input is a small grayscale image of one handwritten character (a letter or a digit); the output is a class decision, i.e. the character recognized in the image.

You are given training data: a set of images together with their correct classification. This is usually all that the customer provides. After you prepare the code, the customer, in this case represented by the instructor, will evaluate your work on different test data. We recommend dividing the provided data into a training set and a test set. Your resulting code will be tested on new data within the AE system.

==== Data ====

The images are in PNG format in one folder, which also contains the file ''truth.dsv'' ([[https://en.wikipedia.org/wiki/Delimiter-separated_values|dsv format]]). The file names are not related to the file content. Each line of ''truth.dsv'' has the form ''file_name.png:character'', e.g. ''img_3124.png:A''. The separator character is '':'', which never occurs in a file name. File names contain only letters, digits, or underscores (_).

  * {{ :courses:b3b33kui:cviceni:strojove_uceni:train_1000_10.zip |}} images 10x10, 20 classes, 50 exemplars for each
  * {{ :courses:b3b33kui:cviceni:strojove_uceni:train_1000_28.zip |}} images 28x28, 10 classes, 100 exemplars for each
  * {{ :courses:b3b33kui:cviceni:strojove_uceni:train_700_28.zip |}} images 28x28, 10 classes, varying number of exemplars

==== Interface specification ====

Implement the k-NN and Naive Bayes classifiers. The main code will be in ''classifier.py''.

  >> python3.8 classifier.py -h
  usage: classifier.py [-h] (-k K | -b) [-o filepath] train_path test_path
  
  Learn and classify image data.
  
  positional arguments:
    train_path   path to the training data directory
    test_path    path to the testing data directory
  
  optional arguments:
    -h, --help   show this help message and exit
    -k K         run k-NN classifier (if k is 0 the code may decide about proper K by itself)
    -b           run Naive Bayes classifier
    -o filepath  path (including the filename) of the output .dsv file with the results

Example:

  python3 classifier.py -k 3 -o classification.dsv ./train_data ./test_data

This trains a 3-NN classifier on ''./train_data'', classifies the images in ''./test_data'', and saves the results as ''classification.dsv''. The saved file must have the same format as ''truth.dsv''. The classifier creates the file ''classification.dsv'' in the test data directory.

==== Solution structure ====

If you are not sure how to solve this task, we offer the following tips. Your solution will probably consist of the following partial steps:

  * [[.:argparse|Command-line arguments processing]] (including basic solution skeleton)
  * [[.:listdir|Listing a directory content]]
  * [[.:readcsv|Reading .dsv file]]
  * [[.:image|Reading .png image in the form of numerical vector]]
  * [[.:dist|Distance of two images]]
  * For working with image data, we suggest using [[https://numpy.org/doc/stable/reference/generated/numpy.array.html|numpy.array]]. If you would like to use other libraries, do not forget to test your solution in BRUTE early, or ask your lab instructor whether your library is too exotic.
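
The command-line interface from the help text above can be reproduced with the standard ''argparse'' module. The following is only a minimal sketch, not the official skeleton: the function name ''build_parser'' and the default value of ''-o'' are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage line in the specification."""
    parser = argparse.ArgumentParser(description="Learn and classify image data.")
    # Exactly one of -k / -b has to be given: (-k K | -b)
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-k", type=int,
                      help="run k-NN classifier (if k is 0 the code may "
                           "decide about proper K by itself)")
    mode.add_argument("-b", action="store_true",
                      help="run Naive Bayes classifier")
    # The default output name is an assumption, not part of the specification.
    parser.add_argument("-o", metavar="filepath", default="classification.dsv",
                        help="path (including the filename) of the output "
                             ".dsv file with the results")
    parser.add_argument("train_path", help="path to the training data directory")
    parser.add_argument("test_path", help="path to the testing data directory")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Compare with None, because "-k 0" is a valid (falsy) value.
    if args.k is not None:
        print("k-NN with k =", args.k)
    else:
        print("Naive Bayes")
```

Note that ''args.k'' should be tested with ''is not None'' rather than by truthiness, since ''-k 0'' is explicitly allowed.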
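
The remaining steps (reading the ''.dsv'' file, measuring the distance of two images, and the k-NN decision itself) might be sketched as follows. The names ''read_dsv'', ''squared_distance'' and ''knn_classify'' are hypothetical, and the sketch assumes that each image has already been loaded and flattened into a sequence of pixel values. It uses plain Python only to stay self-contained; with ''numpy.array'' the distance computation would typically be vectorized.

```python
from collections import Counter


def read_dsv(stream, delimiter=":"):
    """Parse lines of the form 'file_name.png:character' into a dict."""
    labels = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # The separator never occurs in a file name, so split once from the left.
        name, label = line.split(delimiter, 1)
        labels[name] = label
    return labels


def squared_distance(a, b):
    """Squared Euclidean distance of two equally long pixel vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(train_vectors, train_labels, query, k=1):
    """Return the majority label among the k training vectors nearest to query."""
    # Indices of the training vectors, sorted from nearest to farthest.
    order = sorted(range(len(train_vectors)),
                   key=lambda i: squared_distance(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # On a tie, most_common keeps the label that was encountered first,
    # i.e. the label of the nearer neighbour.
    return votes.most_common(1)[0][0]
```

The squared distance is used instead of the Euclidean distance because it gives the same ordering of neighbours and avoids the square root. Other distance measures may work better on this data; choosing one is part of the assignment.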
==== Examples of use ====

{{:courses:be5b33kui:labs:machine_learning:02_03_04_obr5.png?800|Automatic text localization from pictures. More information available at [[http://cmp.felk.cvut.cz/~zimmerk/lpd/index.html|http://cmp.felk.cvut.cz/~zimmerk/lpd/index.html]].}}

Fig. 3: //Automatic text localization from pictures. More information available at [[http://cmp.felk.cvut.cz/~zimmerk/lpd/index.html|http://cmp.felk.cvut.cz/~zimmerk/lpd/index.html]].//

{{:courses:be5b33kui:labs:machine_learning:02_03_04_obr6.png?800|Industry application for license plate recognition. Videos are available at http://cmp.felk.cvut.cz/cmp/courses/X33KUI/Videos/RP_recognition.}}

Fig. 4: //Industry application for license plate recognition. Videos are available at [[http://cmp.felk.cvut.cz/cmp/courses/X33KUI/Videos/RP_recognition|http://cmp.felk.cvut.cz/cmp/courses/X33KUI/Videos/RP_recognition]].//

====== References ======

Christopher M. Bishop. //Pattern Recognition and Machine Learning.// Springer Science+Business Media, New York, NY, 2006.

T.M. Cover and P.E. Hart. Nearest neighbor pattern classification. //IEEE Transactions on Information Theory,// 13(1):21–27, January 1967.

Richard O. Duda, Peter E. Hart, and David G. Stork. //Pattern classification.// Wiley Interscience Publication. John Wiley, New York, 2nd edition, 2001.

Vojtěch Franc and Václav Hlaváč. //Statistical pattern recognition toolbox for Matlab.// Research Report CTU–CMP–2004–08, Center for Machine Perception, K13133 FEE. Czech Technical University, Prague, Czech Republic, June 2004. http://cmp.felk.cvut.cz/cmp/software/stprtool/index.html.

Michail I. Schlesinger and Václav Hlaváč. //Ten Lectures on Statistical and Structural Pattern Recognition.// Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.

==== Symbol classification - Evaluation ====

Automated Evaluation (AE) only checks that your code works and shows you the ratio of correctly classified images on the small AE dataset.
However, the actual points will be awarded in a later ("tournament") run on a large dataset.

  * The nearest neighbor classifier (1-NN) is evaluated according to the table below: [0–3 points]
  * The Naive Bayes classifier is also evaluated according to the table below: [0–5 points]
  * Auto-evaluation: [0–1 points]
  * Code quality: [0–1 points]

^ 1-NN ^^
| correctly classified | points |
| >95% | 3 |
| >80% | 2 |
| >60% | 1 |
| <=60% | 0 |

^ Naive Bayes classifier ^^
| correctly classified | points |
| >82% | 5 |
| >75% | 4 |
| >70% | 3 |
| >65% | 2 |
| >60% | 1 |
| >55% | 0.5 |
| <=55% | 0 |
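
To estimate your score before submitting, you can compute the ratio of correctly classified images on your own held-out test set and look it up in the tables above. The sketch below is only an illustration: the names ''accuracy'', ''points'', ''ONE_NN_TABLE'' and ''NAIVE_BAYES_TABLE'' are hypothetical, and it assumes that the thresholds are strict ("greater than"), as written in the tables. How the tournament run computes the ratio exactly is not specified here.

```python
# (threshold, points) pairs copied from the tables, highest threshold first.
ONE_NN_TABLE = [(0.95, 3), (0.80, 2), (0.60, 1)]
NAIVE_BAYES_TABLE = [(0.82, 5), (0.75, 4), (0.70, 3),
                     (0.65, 2), (0.60, 1), (0.55, 0.5)]


def accuracy(truth, predicted):
    """Fraction of files in truth whose predicted label is correct.

    Both arguments are dicts {file_name: label}; a file missing from
    predicted counts as misclassified.
    """
    correct = sum(1 for name, label in truth.items()
                  if predicted.get(name) == label)
    return correct / len(truth)


def points(acc, table):
    """Points for a given accuracy; thresholds are strict (acc > threshold)."""
    for threshold, pts in table:
        if acc > threshold:
            return pts
    return 0
```

For example, an accuracy of exactly 0.95 for 1-NN falls into the ">80%" row under this reading, because 95% is not strictly greater than 95%.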