Spam filter - step 2

Create the class Corpus to encapsulate a directory with emails. Add a method for easily iterating over the individual emails.

Tests for step 2: test2_corpus.zip

Preparation

  • You should already know how to work with text files.
  • You should know how to get a directory listing using the os.listdir() function.
  • You should know how to create a generator using the yield statement (see the example in chapter 9.10 of the official Python tutorial).
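As a quick refresher on the last two points, the following minimal sketch combines os.listdir() with a generator. The function name list_visible, the rule of skipping names starting with a dot, and the temporary directory are illustrative assumptions and are not part of the assignment.

```python
import os
import tempfile


def list_visible(directory):
    """Yield the names of entries in `directory` one at a time.

    Hypothetical helper for illustration only: names starting
    with '.' are skipped, and the rest are yielded in sorted order.
    """
    for name in sorted(os.listdir(directory)):
        if not name.startswith('.'):
            yield name


# Build a throwaway directory so that the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('b.txt', 'a.txt', '.hidden'):
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write('hello')
    # The generator produces values lazily; list() consumes them all.
    names = list(list_visible(tmp))

print(names)
```

Because list_visible contains yield, calling it does not run the loop immediately; the body executes step by step as the caller asks for the next value.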

Corpus

Task:

  • In module corpus.py, create class Corpus encapsulating a folder with files (emails)

Why do we need it?

  • Class Corpus will be useful for the batch evaluation of new emails, and it will also serve as the basis for class TrainingCorpus, which is created in one of the next steps.

Specifications

Class Corpus (in module corpus.py) encapsulates the folder with emails and will help us walk through them. It must have the following properties:

  • During initialization, it will receive a path to the folder with emails.
  • The class must have a method emails(), which must be a generator. Not all the files in the directory contain emails: there may also be special files with names starting with !. The method must ignore these files. The method will let us use Corpus like this:

# Create corpus from a directory
corpus = Corpus('/path/to/directory/with/emails')
count = 0
# Go through all emails and print the filename and the message body
for fname, body in corpus.emails():
    print(fname)
    print(body)
    print("-------------------------")
    count += 1
print("Finished:", count, "files processed.")
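One possible shape of such a class is sketched below. It is a minimal illustration under stated assumptions, not the reference solution: the attribute name path_to_emails is our own choice, the whole file content is treated as the message body, every entry in the folder is assumed to be a regular file, and the files are assumed to be readable as utf-8.

```python
import os


class Corpus:
    """Sketch of a corpus: a directory whose files are emails."""

    def __init__(self, path_to_emails):
        # Only remember the directory; files are read lazily in emails().
        # (The attribute name is an assumption made for this sketch.)
        self.path_to_emails = path_to_emails

    def emails(self):
        """Yield a (filename, body) pair for every email in the directory."""
        for fname in os.listdir(self.path_to_emails):
            # Special files whose names start with '!' are not emails.
            if fname.startswith('!'):
                continue
            full_path = os.path.join(self.path_to_emails, fname)
            # Assumption: the whole file content is the message body.
            with open(full_path, 'r', encoding='utf-8') as f:
                body = f.read()
            yield fname, body
```

Since emails() is a generator, only one message is held in memory at a time, which matters when the folder contains many files.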

The bodies of some email messages contain Unicode characters, which is why we use the utf-8 encoding to represent them. However, printing with print(body) can raise an exception (typically UnicodeEncodeError), depending on the system and shell you use to run the script above. The console that receives the output has its own encoding, which is often not utf-8, so print may be asked to output a character that the console encoding cannot represent.

One possible solution is to print the email body with print(body.encode()). The encode() method converts the message text, which may contain Unicode characters, into a sequence of bytes (the bytes datatype), which can be printed anywhere. In place of each problematic Unicode character you will see escape sequences of the form \xNN, one for each of the 2 to 4 bytes that encode it. This does us no harm.
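The effect of encode() can be tried on a short string, as in the sketch below. The sample text is made up for illustration; any string with non-ASCII characters behaves the same way.

```python
# Sample text with non-ASCII characters (made up for this example).
body = 'Příliš žluťoučký kůň'

# str -> bytes; with no argument, encode() uses UTF-8.
encoded = body.encode()

# Printing bytes shows escapes such as \xc5\x99 in place of 'ř',
# so the output contains ASCII characters only.
print(encoded)

# Every non-ASCII character in this sample takes two bytes in UTF-8,
# so the byte sequence is longer than the text.
print(len(body), len(encoded))

# Nothing is lost: decoding the bytes gives the original text back.
assert encoded.decode('utf-8') == body
```

The printed form is harder to read, but it never fails on a console with a limited encoding.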

courses/ae4b99rph/labs/spam/step2.txt · Last modified: 2013/10/29 15:11 by svobodat