====== Spam filter - step 4 ======

Create 3 simple non-adaptive filters (paranoid, naive, and random) and evaluate their quality.

/**
[[.unit_testing|Tests]] for step 4:
  * for step 4 only {{:courses:a4b99rph:cviceni:spam:test4_simplefilters.zip|}} or
  * together with tests for the preceding steps {{:courses:a4b99rph:cviceni:spam:test4_all.zip|}}.
**/

===== Preparation =====

Required features of Python:
  * You should already know how to work with text files.
  * You should know how to get a directory listing using the function [[http://docs.python.org/py3k/library/os.html?highlight=listdir#os.listdir|os.listdir()]].

You should think about and write down on a piece of paper:
  * How is a spam filter actually used?
  * What is the difference (from the implementation standpoint) between a learning filter and a non-learning filter?
  * Is there any part which all of the spam filters have in common?

**Optional (for more advanced programmers):** Read how //inheritance// works in Python OOP. You can find more information here:
  * in the official [[https://docs.python.org/3/tutorial/classes.html#inheritance|Python tutorial]], or
  * in {[a4b99rph:Wentworth2012]}, [[http://openbookproject.net/thinkcs/python/english3e/inheritance.html|chapter 23]].

===== Simple filters =====

Tasks:
  * **In module ''simplefilters.py''**, create 3 classes representing 3 simple filters:
    * ''NaiveFilter'', which classifies all emails as ''OK'',
    * ''ParanoidFilter'', which classifies all emails as ''SPAM'', and
    * ''RandomFilter'', which assigns the labels ''OK'' and ''SPAM'' randomly.
  * **Optional:** If these 3 filters have some functionality in common, try to extract it into a common ancestor called ''BaseFilter'' **in module ''basefilter.py''**.

Why do we need it?
  * These simple filters demonstrate the skeleton of a filter and show the parts common to all filters. They also serve as baseline filters that can be compared using the functions from step 3.
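The tasks above could be approached roughly as in the following sketch. It is only one possible layout, not the required solution. Several things in it are assumptions: the helper method ''classify()'' is a hypothetical name, each line of ''!prediction.txt'' is assumed to have the form ''<filename> <label>'', and files whose names start with ''!'' are assumed to be special files rather than emails. For brevity the common ancestor and the three filters are shown in one block; in your solution ''BaseFilter'' would live in ''basefilter.py''.

```python
import os
import random

PREDICTION_FILE = "!prediction.txt"


class BaseFilter:
    """Shared skeleton: subclasses only decide the label of a single email."""

    def train(self, corpus_dir):
        # Non-learning filters have nothing to set up.
        pass

    def classify(self, email_name):
        # Hypothetical hook; each concrete filter overrides it.
        raise NotImplementedError

    def test(self, corpus_dir):
        # Assumption: names starting with '!' are special files, not emails.
        names = sorted(n for n in os.listdir(corpus_dir) if not n.startswith("!"))
        out_path = os.path.join(corpus_dir, PREDICTION_FILE)
        with open(out_path, "w", encoding="utf-8") as f:
            for name in names:
                # Assumed line format: "<filename> <label>"
                f.write(f"{name} {self.classify(name)}\n")


class NaiveFilter(BaseFilter):
    def classify(self, email_name):
        return "OK"


class ParanoidFilter(BaseFilter):
    def classify(self, email_name):
        return "SPAM"


class RandomFilter(BaseFilter):
    def classify(self, email_name):
        return random.choice(["OK", "SPAM"])
```

With this layout, the only difference between the three filters is the body of ''classify()''; walking the directory and writing the prediction file happen in one place.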
==== Specifications ====

To facilitate later automatic testing of the final filter, we require your filter to be named ''MyFilter'' and defined in module ''filter.py''. In this step, however, you shall create 3 classes called ''NaiveFilter'', ''ParanoidFilter'', and ''RandomFilter'', placed in a module named ''simplefilters.py''.

A filter will be represented by a class with at least 2 public methods: ''train()'' and ''test()''. Filters unable to learn from data will probably have an empty ''train()'' method. The rest of the class structure is up to you.

Method ''train()'':
^ Inputs  | A path to the training corpus, i.e. to a directory with emails that also contains the ''!truth.txt'' file. (Irrelevant for the simple filters.) |
^ Outputs | None. |
^ Effects | Sets up the internal data structures of the filter so that they can later be used to classify emails using the ''test()'' method. |

Method ''test()'':
^ Inputs  | A path to the corpus to be evaluated. (The directory will not contain the ''!truth.txt'' file.) |
^ Outputs | None. |
^ Effects | Creates the ''!prediction.txt'' file containing the predictions of the filter. |

>{{page>courses:a4b99rph:internal:cviceni:spam:tyden08#spolecny_predek&editbtn}}

>{{page>courses:a4b99rph:internal:cviceni:spam:tyden08#jednoduche_filtry&editbtn}}

===== Evaluating the quality of simple filters =====

Create a simple script that computes the quality of a specified filter. The script shall:
  * import the class of the chosen filter,
  * call method ''train()'' on the first dataset,
  * call method ''test()'' on the second dataset,
  * call function ''compute_quality_for_corpus()'' for the second corpus,
  * print out the quality, and
  * remove the file ''!prediction.txt'' from the corpus.
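
The steps of the evaluation script might be wrapped in a single function, as in the sketch below. The function name ''evaluate_filter()'' and its parameters are hypothetical. The quality function is passed in as an argument because the module in which your ''compute_quality_for_corpus()'' from step 3 lives is not fixed here; the sketch only assumes that it takes the corpus directory and returns the quality.

```python
import os

PREDICTION_FILE = "!prediction.txt"


def evaluate_filter(filter_class, train_dir, test_dir, quality_fn):
    """Train and test a filter, return its quality, and clean up afterwards.

    quality_fn is expected to behave like compute_quality_for_corpus():
    it receives the path of the tested corpus and returns the quality.
    """
    spam_filter = filter_class()
    spam_filter.train(train_dir)
    spam_filter.test(test_dir)
    try:
        quality = quality_fn(test_dir)
    finally:
        # Remove the prediction file even if computing the quality failed.
        prediction_path = os.path.join(test_dir, PREDICTION_FILE)
        if os.path.exists(prediction_path):
            os.remove(prediction_path)
    return quality


# Possible usage; the module name 'quality' and the corpus paths are assumptions:
# from simplefilters import NaiveFilter
# from quality import compute_quality_for_corpus
# print(evaluate_filter(NaiveFilter, "1", "2", compute_quality_for_corpus))
```

Putting the cleanup in a ''finally'' clause keeps the corpus free of stale ''!prediction.txt'' files even when the quality computation raises an exception.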