Spam filter - step 2

Create a class Corpus that encapsulates a directory of emails. Add methods that make it easy to walk through the individual emails.

Tests for step 2: test2_corpus.zip

Preparation

Corpus

Task:

Why do we need it?

Specifications

Class Corpus (in module corpus.py) encapsulates the folder with emails and will help us walk through them. It must have the following properties:

# Create corpus from a directory
corpus = Corpus('/path/to/directory/with/emails')
count = 0
# Go through all emails and print the filename and the message body
for fname, body in corpus.emails():
    print(fname)
    print(body)
    print("-------------------------")
    count += 1
print("Finished:", count, "files processed.")
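
To make the expected behaviour concrete, here is a minimal sketch of a class that would support the usage shown above. It is not the reference solution. It assumes that every regular file in the directory is an email, that subdirectories can be skipped, and that all files are stored in UTF-8; the sorted order is chosen only to make the output reproducible.

```python
import os


class Corpus:
    """Sketch of a class wrapping a directory of email files."""

    def __init__(self, path):
        # Only remember the directory; nothing is read until emails() is called.
        self.path = path

    def emails(self):
        """Yield (filename, body) pairs, one per file in the directory."""
        for fname in sorted(os.listdir(self.path)):
            fpath = os.path.join(self.path, fname)
            # Assumption: every regular file is an email; subdirectories are skipped.
            if not os.path.isfile(fpath):
                continue
            # Assumption: the emails are stored as UTF-8 text.
            with open(fpath, 'r', encoding='utf-8') as f:
                yield fname, f.read()
```

Writing emails() as a generator means only one message is held in memory at a time, which matters when the directory contains many emails.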

The bodies of some email messages contain Unicode characters; that is why we use the UTF-8 encoding, which can represent them. However, printing a message with print(body) can raise an exception (UnicodeEncodeError). Whether it does depends on the system and shell you use to run the script above. The console that receives the output has its own encoding, which is often not UTF-8, so print may be asked to output a character that the console encoding cannot represent.
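
The problem can be reproduced without relying on a particular console. The sketch below checks whether a string can be encoded with a given encoding, which is the step at which print fails. The helper can_print, the sample body, and the choice of 'ascii' as a stand-in for a restrictive console encoding are all illustrative assumptions.

```python
import sys

# The encoding print() uses for this console; it varies by system and shell.
print(sys.stdout.encoding)

body = 'Price: 5 \u20ac'  # hypothetical email body containing the euro sign


def can_print(text, encoding):
    """Return True if text can be encoded with the given console encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_print(body, 'utf-8'))  # True
print(can_print(body, 'ascii'))  # False
```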

One possible solution is to print the email body with print(body.encode()). The encode() method converts the message, which may contain Unicode characters, into a sequence of bytes (the bytes datatype, encoded as UTF-8 by default), and a bytes object can be printed on any console. In place of each problematic Unicode character you will see escape sequences for the 2 to 4 bytes that encode it, for example \xc3\xa9 instead of é. Line breaks will likewise appear as \n, and the whole output is wrapped in b'...'. This does us no harm.
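
A short illustration of what the encoded output looks like, using a made-up message body (the string is an assumption for the example, not taken from the corpus):

```python
body = 'Caf\u00e9 for 5 \u20ac'  # 'Café for 5 €'

# encode() defaults to UTF-8 and returns a bytes object.
encoded = body.encode()

# Printing bytes shows every non-ASCII byte as a \xNN escape, so only ASCII reaches the console.
print(encoded)  # b'Caf\xc3\xa9 for 5 \xe2\x82\xac'

# The original text is not lost; decoding restores it.
print(encoded.decode() == body)  # True
```

Here é takes two bytes and € takes three.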