======Spam filter specifications======

The spam filter must be implemented in Python.

=====Specifications=====

Your task is to create a class ''MyFilter'' in the ''filter.py'' module (file) which
  * can take a training corpus as input via its ''train(train_corpus_dir)'' method,
  * can learn the parameters of the filtering method from the training emails (if you decide to do so), and
  * evaluates all the emails in a testing corpus when its ''test(test_corpus_dir)'' method is called, i.e. it creates the ''!prediction.txt'' file in the testing corpus directory. The ''test()'' method must work even if ''train()'' has not been called before it.

The corpora can contain some special files. Their names always start with **!** (e.g. //!truth.txt//) and they do not contain any email messages.

More detailed information follows in the sections below.

===== MyFilter Class Usage =====

The ''MyFilter'' class shall be defined in a module called ''filter.py''. The class will be used as follows:

  from filter import MyFilter
  filter = MyFilter()
  filter.train('/path/to/training/corpus') # This folder will contain the !truth.txt file
  filter.test('/path/to/testing/corpus')   # The method shall create the !prediction.txt file in this folder

Since the ''test()'' method must work without a prior call to ''train()'', the following usage is also allowed (and must be supported):

  from filter import MyFilter
  filter = MyFilter()
  filter.test('/path/to/testing/corpus')   # The method shall create the !prediction.txt file in this folder

When computing the quality of your filter, we will always call the ''train()'' method before ''test()''.

==== Notes ====

  - **Time limit**. Learning combined with evaluation of the corpora should not take more than 5 minutes; filtering itself should take much less time.

=====Training=====

  * If your spam filter is not able to learn, it can just ignore the training corpus. If you follow our recommendation, your //MyFilter// class should have a //train()// method; it may, however, be empty.
  * You can be sure that the training corpus folder will contain the file //!truth.txt// with the true class of each email in the corpus.

=====Testing=====

  * Your script **has to create the file !prediction.txt** in the folders of all testing corpora. This file has to contain the filename and the filter's prediction (OK or SPAM) for each email file in the folder, one file per line.
  * Your script of course **cannot use the information in file ''!truth.txt''** in the folders of testing corpora. There will not even be such a file.

=====Possible filter variants=====

  - A "hardwired" filter, without the ability to learn. This filter will probably have an empty ''train()'' method. The ''test()'' method will then work according to procedures chosen by the author.
  - A filter using some external information/data. An example is a pre-trained filter or a filter using some external dictionary. This filter should also have an empty ''train()'' method. The ''test()'' method will then decide based on the information saved in external file(s) uploaded together with the filter's source code. A reason for this "architecture" may be, e.g., a time-consuming ''train()'' method: the filter can be trained offline beforehand. (A student who wants the points for the filter's ability to learn has to demonstrate this ability to the lecturer during lab exercises.)
  - A learning filter. This filter has a non-empty //train()// method and is fast enough to extract the needed information from hundreds of emails within the time limit.

===== Submission 1 =====

You shall hand in a ZIP archive with the module ''quality.py'' and possibly other modules needed by ''quality.py''.
**These files shall be placed in the root of the archive, and the archive should not contain any folders.** If you have followed the suggestions, your archive should probably contain the files ''quality.py'', ''confmat.py'', ''utils.py'', and maybe others.

Only the function ''compute_quality_for_corpus()'' (i.e. the solution of [[courses:ae4b99rph:labs:spam:step3|step 3]]) will be subject to testing in this phase. The goal of this submission is to ensure that you all have a function which correctly computes the quality of the filter.

===== Submission 2 =====

Hand in a ZIP archive with your filter and all other files it needs to run. **These files should be in the root of the archive; the archive should not contain any subdirectories.** If you followed our instructions, your archive should contain the following files:

  - ''filter.py''. The implementation of your filter.
  - ''basefilter.py''. If you found some functionality common to all the filters and extracted it into a class ''BaseFilter'' from which your filter class inherits, you must also include the ''basefilter.py'' file.
  - ''corpus.py'' and ''trainingcorpus.py''. You have most likely figured out that the ''train()'' method of your filter uses the ''TrainingCorpus'' class, while the ''test()'' method uses the ''Corpus'' class. If you use them, you have to turn them in as well.
  - ''utils.py''. Your ''TrainingCorpus'' class quite likely uses the function ''read_classification_from_file'' from the ''utils.py'' module.
  - Any other files your filter needs to work.

**Do not hand in:**

  * modules which are not directly used by your filter, e.g. modules ''quality'' or ''confmat'',
  * tests for individual steps.
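
To illustrate the ''train()''/''test()'' interface described in the Specifications and Testing sections, here is a minimal sketch of a learning filter. It is not a reference solution. It assumes that each line of ''!truth.txt'' and ''!prediction.txt'' has the form ''<filename> <OK|SPAM>'' separated by whitespace (the exact separator is not stated above) and that a corpus is a flat folder of email files. The helper ''is_special()'' and the majority-class strategy are hypothetical choices made only for this example.

```python
import os

SPAM_TAG = "SPAM"
OK_TAG = "OK"


def is_special(filename):
    """Special corpus files start with '!' and hold no email message."""
    return filename.startswith("!")


class MyFilter:
    """Toy filter: predicts the majority class seen during training."""

    def __init__(self):
        # Used when train() was never called, so test() still works.
        self.default_prediction = OK_TAG

    def train(self, train_corpus_dir):
        # ASSUMPTION: every line of !truth.txt is "<filename> <OK|SPAM>".
        truth_path = os.path.join(train_corpus_dir, "!truth.txt")
        counts = {OK_TAG: 0, SPAM_TAG: 0}
        with open(truth_path, "rt", encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2 and parts[1] in counts:
                    counts[parts[1]] += 1
        if counts[SPAM_TAG] > counts[OK_TAG]:
            self.default_prediction = SPAM_TAG
        else:
            self.default_prediction = OK_TAG

    def test(self, test_corpus_dir):
        # Collect email files only; skip special files and subfolders.
        email_names = sorted(
            name for name in os.listdir(test_corpus_dir)
            if not is_special(name)
            and os.path.isfile(os.path.join(test_corpus_dir, name))
        )
        prediction_path = os.path.join(test_corpus_dir, "!prediction.txt")
        # One "<filename> <prediction>" pair per line.
        with open(prediction_path, "wt", encoding="utf-8") as f:
            for name in email_names:
                f.write(f"{name} {self.default_prediction}\n")
```

A real filter would replace the majority-class rule with a decision based on the content of each email; the file handling around it could stay much the same.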