
Spam filter - step 4

Create 3 simple non-adaptive filters (naive, paranoid, and random) and evaluate their quality.

Preparation

You should think about and write down on a piece of paper:

How is a spam filter actually used?
What is the difference (from the implementation standpoint) between a learning filter and a non-learning filter?
Is there any part which all of the spam filters have in common?
Is it better to create a spam filter as a function or as a class with methods and properties?
What are the minimal requirements for such an implementation? What does it have to be able to do, and what inputs and what outputs does it have to have?

Optional (for more advanced programmers): Read how inheritance works in OOP and how it is used in Python. You can find more information here:

in the official Python tutorial, or
in [Downey2009], chapter 18, section 18.7, or
in [Wentworth2012], chapter 23.

Simple filters

Tasks:

In module simplefilters.py, create 3 classes representing 3 simple filters:
- NaiveFilter which classifies all the emails as OK,
- ParanoidFilter which classifies all the emails as SPAM, and
- RandomFilter which assigns the labels OK and SPAM randomly.
- Optional: If these 3 filters have some functionality in common, try to extract it into a common ancestor called BaseFilter in module basefilter.py.
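The decision rules of the three filters can be sketched as follows. This is only an illustration of the classification logic, not the required class structure: the method name `decide()` and the constants `OK` and `SPAM` are hypothetical names chosen for this sketch, and the file handling required by the specification is left out on purpose.

```python
import random

# Label strings as used in the task description above.
OK = "OK"
SPAM = "SPAM"


class NaiveFilter:
    """Trusts everything: every email is labelled OK."""

    def decide(self, email_name):  # hypothetical helper, not part of the spec
        return OK


class ParanoidFilter:
    """Trusts nothing: every email is labelled SPAM."""

    def decide(self, email_name):
        return SPAM


class RandomFilter:
    """Picks one of the two labels uniformly at random."""

    def decide(self, email_name):
        return random.choice((OK, SPAM))
```

None of the three rules looks at the email content, which is why these filters do not need any training.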

Why do we need it?

These simple filters will demonstrate the skeleton of a filter and will show the parts common to all filters. They will also serve as baselines: we can measure their quality using the functions from step 3 and compare later filters against them.

Specifications

To facilitate later automatic testing of the final filter, we require your filter to be named MyFilter and defined in module filter.py. In this step, however, you shall create 3 classes called NaiveFilter, ParanoidFilter, and RandomFilter, placed in a module named simplefilters.py.

A filter will be represented by a class with at least 2 public methods: train() and test(). Filters unable to learn from data will probably have an empty train() method. The rest of the class structure is up to you.

Method train():

Inputs	A path to a training corpus, i.e. to a directory with emails that also contains the `!truth.txt` file. (Irrelevant for the simple filters.)
Outputs	None.
Effects	Setup of the inner data structures of the filter, so that they can be later used to classify emails using the `test()` method.

Method test():

Inputs	A path to a corpus to be evaluated. (The directory will not contain the `!truth.txt` file.)
Outputs	None.
Effects	Creates the `!prediction.txt` file containing the predictions of the filter.
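One possible way to meet the train()/test() interface above is a common base class that handles the file output and leaves only the decision to its subclasses. The sketch below rests on several assumptions that the specification does not state: that `!prediction.txt` contains one `filename label` pair per line, that special files in a corpus start with `!`, and that a hook method called `predict()` is a convenient split. The names `predict()` and `AlwaysOkFilter` are hypothetical and exist only for this illustration.

```python
import os

PREDICTION_FILE = "!prediction.txt"


class BaseFilter:
    """Shared skeleton: subclasses only decide the label of one email."""

    def train(self, train_corpus_dir):
        # Non-learning filters have nothing to set up.
        pass

    def predict(self, email_path):  # hypothetical hook, not part of the spec
        raise NotImplementedError

    def test(self, test_corpus_dir):
        lines = []
        for name in sorted(os.listdir(test_corpus_dir)):
            # Assumption: special files such as !truth.txt start with "!".
            if name.startswith("!"):
                continue
            path = os.path.join(test_corpus_dir, name)
            if not os.path.isfile(path):
                continue
            # Assumption: one "filename label" pair per line.
            lines.append(f"{name} {self.predict(path)}\n")
        out_path = os.path.join(test_corpus_dir, PREDICTION_FILE)
        with open(out_path, "w", encoding="utf-8") as f:
            f.writelines(lines)


class AlwaysOkFilter(BaseFilter):
    """Tiny demonstration subclass (hypothetical name)."""

    def predict(self, email_path):
        return "OK"
```

With this split, each concrete filter overrides a single method, and a learning filter would additionally override train().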

Evaluating the quality of simple filters

Create a simple script that computes the quality of a specified filter. The script shall:

import the class of the chosen filter,
call method train() on the first dataset,
call method test() on the second dataset,
call function compute_quality_for_corpus() for the second corpus,
print out the quality, and
remove the file !prediction.txt from the corpus.
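The steps above can be sketched as a single function. To keep the example self-contained, the quality function is passed in as a parameter instead of being imported; in your script you would pass your own compute_quality_for_corpus() from step 3. The name `evaluate_filter` and its parameters are hypothetical, and the sketch assumes that the second corpus contains a `!truth.txt` file so that the quality can actually be computed.

```python
import os

PREDICTION_FILE = "!prediction.txt"


def evaluate_filter(filter_class, train_dir, test_dir, compute_quality):
    """Train and test a filter, print its quality, and clean up afterwards."""
    filt = filter_class()
    filt.train(train_dir)   # first dataset
    filt.test(test_dir)     # second dataset, creates !prediction.txt
    try:
        quality = compute_quality(test_dir)
        print(quality)
    finally:
        # Remove the prediction file even if the quality computation fails.
        prediction_path = os.path.join(test_dir, PREDICTION_FILE)
        if os.path.exists(prediction_path):
            os.remove(prediction_path)
    return quality
```

A call might then look like `evaluate_filter(NaiveFilter, train_dir, test_dir, compute_quality_for_corpus)`, with both names imported from wherever your modules from this step and step 3 live.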
