Spam filter task specifications

Specifications

Your task is to create a class MyFilter in module (file) filter.py which

takes a training corpus as an input with its MyFilter.train(train_corpus_dir) method,
can use the training corpus to fit the spam filter to data (if you decide to do so), and
evaluates all the emails in testing corpora by calling the MyFilter.test(test_corpus_dir) method, which creates the !prediction.txt file in the testing corpus directory. The method must be able to work even if the train() method is not called before test().

A corpus can contain some special files. For us, those files always have names starting with ! (e.g. !truth.txt) and do not contain any email messages.

More detailed information is given in the following sections.

MyFilter Class Usage

Class MyFilter shall be defined in a module called filter.py. The class will be used as follows:

from filter import MyFilter
 
filter = MyFilter()
filter.train('/path/to/training/corpus')  # This folder will contain the !truth.txt file
filter.test('/path/to/testing/corpus')    # The method shall create the !prediction.txt file in this folder

Since the test() method shall be able to work without a prior call to the train() method, the following usage is also allowed (and must be supported):

from filter import MyFilter
 
filter = MyFilter()
filter.test('/path/to/testing/corpus')    # The method shall create the !prediction.txt file in this folder

When computing the quality of your filter, we will always call the train() method before test().
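
A minimal sketch of a class that satisfies both usage patterns above might look as follows. It is only a placeholder, not a recommended filter: it labels every email as OK. The `trained` attribute is a hypothetical name, and writing each line as the filename followed by a space and the label is an assumption, since the exact separator is not specified here.

```python
import os


class MyFilter:
    """Placeholder filter: labels every email OK until real logic is added."""

    def __init__(self):
        # Default state, so that test() works without a prior train() call.
        self.trained = False  # hypothetical attribute name

    def train(self, train_corpus_dir):
        # A learning filter would read the emails and !truth.txt here.
        self.trained = True

    def test(self, test_corpus_dir):
        # Special files start with '!' and are not emails, so skip them.
        names = sorted(
            name for name in os.listdir(test_corpus_dir)
            if not name.startswith('!')
            and os.path.isfile(os.path.join(test_corpus_dir, name))
        )
        path = os.path.join(test_corpus_dir, '!prediction.txt')
        with open(path, 'w', encoding='utf-8') as f:
            for name in names:
                # Assumed line format: "<filename> <label>"
                f.write(f'{name} OK\n')
```

Because all state is initialised in `__init__`, calling `test()` on a fresh instance behaves the same as calling it after `train()`.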

Notes

Time limit. Learning combined with evaluation of the corpora should not take more than 5 minutes; filtering itself should take much less time.

Training

If you do not plan to create a learning spam filter, you can just ignore the training corpus. The specification still requires class MyFilter to implement the train() method, but the method can be empty.
You can be sure that the training corpus folder will contain the file !truth.txt with information about the true class of the corpus emails. The testing corpus, on the other hand, will not contain this file.
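
As an illustration, a train() method could load the true classes roughly like this. The sketch assumes that each line of !truth.txt holds a filename and a label (OK or SPAM) separated by whitespace; that format is an assumption, not something stated in this section. The signature given to read_classification_from_file is likewise assumed, and load_truth is a hypothetical helper.

```python
import os


def read_classification_from_file(path):
    """Return a dict mapping email filename to its label ('OK' or 'SPAM')."""
    classification = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            name, label = parts
            classification[name] = label
    return classification


def load_truth(train_corpus_dir):
    """Hypothetical helper: read !truth.txt from a training corpus folder."""
    return read_classification_from_file(
        os.path.join(train_corpus_dir, '!truth.txt')
    )
```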

Testing

Method test() must create the file !prediction.txt in the folder of each testing corpus. This file must contain the filename and the prediction of the filter (OK or SPAM) for each email file in the folder, one file per line.
Your filter, of course, cannot use the information in file !truth.txt in the folders of testing corpora. There will be no such file.
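
Before submitting, it may help to check that the !prediction.txt file is complete. The following sketch does this under the assumption that each line has the form "filename label" separated by whitespace; check_predictions and VALID_LABELS are hypothetical names, not part of the assignment.

```python
import os

VALID_LABELS = {'OK', 'SPAM'}


def check_predictions(corpus_dir):
    """Return a list of problems found; an empty list means the file looks fine."""
    # Every file not starting with '!' is treated as an email.
    emails = {
        name for name in os.listdir(corpus_dir)
        if not name.startswith('!')
    }
    problems = []
    seen = set()
    path = os.path.join(corpus_dir, '!prediction.txt')
    with open(path, 'r', encoding='utf-8') as f:
        for number, line in enumerate(f, start=1):
            parts = line.split()
            if len(parts) != 2 or parts[1] not in VALID_LABELS:
                problems.append(f'line {number}: expected "<filename> OK|SPAM"')
                continue
            if parts[0] in seen:
                problems.append(f'line {number}: duplicate entry for {parts[0]}')
            seen.add(parts[0])
    for name in sorted(emails - seen):
        problems.append(f'missing prediction for {name}')
    for name in sorted(seen - emails):
        problems.append(f'prediction for unknown file {name}')
    return problems
```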

Possible filter variants

A “hardwired” filter, without the ability to learn. This filter will probably have an empty method train(). The test() method will then employ some decision procedures chosen by the author.
A filter using some external information/data, for example a pre-trained filter or a filter using an external dictionary. This filter should also have an empty train() method. The test() method then decides based on the information saved in external file(s) uploaded together with the filter's source code. One reason for this “architecture” may be a time-demanding train() method: the filter can be trained offline beforehand. (A student who wants the points for the filter's ability to learn has to demonstrate this ability to the lecturer during the lab exercises.)
A learning filter. This filter has a non-empty method train() and it is fast enough, so that it can extract the needed information from hundreds of emails within the time limit.

Submission 1

You shall hand in a ZIP archive with module quality.py and possibly other modules needed by quality.py. These files shall be placed in the root of the archive, and the archive should not contain any folders. If you have followed the suggestions, your archive will probably contain files quality.py, utils.py, and maybe others.

Only the function compute_quality_for_corpus() (i.e. the solution of step 3) will be subject to testing in this phase. The goal of this submission is to ensure that you all have a function which correctly computes the quality of the filter.

Submission 2

Hand in a ZIP archive with your filter and all other files it needs to run. These files should be in the root of the archive; the archive should not contain any subdirectories. If you followed our instructions, your archive should contain the following files:

filter.py. The implementation of your filter.
basefilter.py. If you found some common functionality for all the filters and extracted it in class BaseFilter from which your filter class inherits, you must also include the basefilter.py file.
utils.py. Some of your classes quite likely use functions read_classification_from_file and write_classification_to_file from utils.py module.
Any other files your filter needs to work.

Do not hand in:

modules which are not directly used by your filter, e.g. modules quality or confmat,
tests for individual steps.
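
One possible way to build an archive without subdirectories is sketched below using Python's standard zipfile module. The function names build_flat_archive and archive_is_flat are hypothetical, and the list of files to pack is up to you.

```python
import os
import zipfile


def build_flat_archive(archive_path, file_paths):
    """Pack the given files into the root of a new ZIP archive."""
    with zipfile.ZipFile(archive_path, 'w') as archive:
        for path in file_paths:
            # An arcname without a directory part keeps the file in the root.
            archive.write(path, arcname=os.path.basename(path))


def archive_is_flat(archive_path):
    """Return True if no entry in the archive lies inside a folder."""
    with zipfile.ZipFile(archive_path) as archive:
        return all('/' not in name for name in archive.namelist())
```

For example, build_flat_archive('submission.zip', ['filter.py', 'utils.py']) would place both modules directly in the archive root, and archive_is_flat('submission.zip') can then confirm the result.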
