Spam filter - step 5
- Preparation
- Training data corpus
  - Specifications

Spam filter - step 5

Create class TrainingCorpus by deriving it from class Corpus. The class shall encapsulate a corpus with known true classification of the email messages, i.e. it shall represent a corpus usable for filter training.

Tests for step 5:

test5_trainingcorpus.zip contains the tests for step 5 only,
test5_all.zip contains the tests for step 5 together with the tests for the preceding steps.

Class TrainingCorpus is not obligatory and its implementation is not fixed. You can implement only those methods that you find useful. The provided tests target all the methods mentioned below; if you decide not to implement them all, delete (or comment out) the related tests in class TrainingCorpusClass.

Preparation

By now, you should know everything you need to successfully implement the TrainingCorpus class. The only remaining thing to prepare:

Think of what the class shall be able to do so that it simplifies the training of your filter.

Training data corpus

Task:

In module trainingcorpus.py, create class TrainingCorpus.

Why do we need it?

Class TrainingCorpus shall simplify the creation of learning filters. It will allow you to walk through the corpus while using the known labels of the emails, which are stored in file !truth.txt.

Specifications

Specifications for this class are not fixed; it is up to you to decide what methods you need. The following methods can serve as inspiration (and the tests provided for this class assume the existence of these methods):

Method get_class(filename) returns the true label (OK or SPAM) of the email message stored in the file named filename.
Methods is_ham(filename) and is_spam(filename) return a Boolean value (True or False) telling whether the message stored in the file named filename is a ham or a spam, respectively.
Methods spams() and hams() return generators which allow us to walk through the spams and hams in the training corpus, similarly to what method emails() does for the Corpus class.
etc.

It is entirely up to you if you want to implement any of these methods.
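The methods listed above could be sketched roughly as follows. This is only one possible shape, not a prescribed solution, and it rests on several assumptions that are not stated on this page: the Corpus class shown here is a simplified stand-in for your own class from the earlier step, its emails() method is assumed to yield (filename, body) pairs and to skip files whose names start with '!', and each line of !truth.txt is assumed to hold a file name and its label separated by whitespace. The helper names _read_truth and _emails_with_label are hypothetical.

```python
import os


class Corpus:
    """Simplified stand-in for the Corpus class from the earlier step."""

    def __init__(self, path):
        self.path = path

    def emails(self):
        # Assumption: yield (filename, body) pairs and skip special files
        # such as !truth.txt, whose names start with '!'.
        for name in sorted(os.listdir(self.path)):
            if name.startswith('!'):
                continue
            with open(os.path.join(self.path, name), encoding='utf-8') as f:
                yield name, f.read()


class TrainingCorpus(Corpus):
    """Corpus whose emails have known true labels (OK or SPAM)."""

    def __init__(self, path):
        super().__init__(path)
        self.truth = self._read_truth()

    def _read_truth(self):
        # Assumption: each line of !truth.txt reads "<filename> <label>".
        truth = {}
        truth_path = os.path.join(self.path, '!truth.txt')
        with open(truth_path, encoding='utf-8') as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    filename, label = parts
                    truth[filename] = label
        return truth

    def get_class(self, filename):
        # Raises KeyError for a file that has no entry in !truth.txt.
        return self.truth[filename]

    def is_ham(self, filename):
        return self.get_class(filename) == 'OK'

    def is_spam(self, filename):
        return self.get_class(filename) == 'SPAM'

    def _emails_with_label(self, label):
        # Reuse the inherited emails() generator and keep only the
        # messages whose true label matches; unlabeled files are skipped.
        for name, body in self.emails():
            if self.truth.get(name) == label:
                yield name, body

    def hams(self):
        return self._emails_with_label('OK')

    def spams(self):
        return self._emails_with_label('SPAM')
```

With such a class, the training code of a filter could loop over `TrainingCorpus(path).spams()` and `TrainingCorpus(path).hams()` separately, for example to collect word counts for each class. Reading !truth.txt once in the constructor keeps later get_class calls cheap, but whether that is the right trade-off depends on how your filter uses the corpus.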
