Create a class Corpus that encapsulates a directory of emails. Add a method that makes it easy to iterate over the individual emails.
Tests for step 2: test2_corpus.zip
yield
Task:
corpus.py
Why do we need it?
corpus
TrainingCorpus
Class Corpus (in module corpus.py) encapsulates the folder with emails and will help us walk through them. It must have the following properties:
emails()
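The class described above could be sketched roughly as follows. This is only one possible shape, not the reference solution: it assumes every regular file in the directory is an email stored as utf-8 text, and it assumes the constructor simply remembers the directory path. The emails() method is written as a generator, so each file is read only when the loop asks for it.

```python
import os


class Corpus:
    """Wraps a directory of email files (an illustrative sketch)."""

    def __init__(self, path):
        # Assumption: the constructor only stores the directory path.
        self.path = path

    def emails(self):
        """Yield (filename, body) pairs, one per file in the directory."""
        # Sorting is an assumption made here to get a stable order.
        for fname in sorted(os.listdir(self.path)):
            full_path = os.path.join(self.path, fname)
            # Skip subdirectories and anything else that is not a file.
            if not os.path.isfile(full_path):
                continue
            with open(full_path, encoding="utf-8") as f:
                yield fname, f.read()
```

Because emails() yields one pair at a time, the whole corpus never has to be held in memory at once.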
from corpus import Corpus

# Create corpus from a directory
corpus = Corpus('/path/to/directory/with/emails')
count = 0
# Go through all emails and print the filename and the message body
for fname, body in corpus.emails():
    print(fname)
    print(body)
    print("-------------------------")
    count += 1
print("Finished:", count, "files processed.")
The bodies of some email messages contain non-ASCII characters; that is why we use the utf-8 encoding, which can represent them. However, print(body) may still raise an exception (a UnicodeEncodeError), depending on the system and shell you use to run the script above. The console that receives the output has its own encoding, which is often not utf-8, so print may be asked to output a character that the console encoding cannot represent.
One possible solution is to print the email body with print(body.encode()). The encode() method converts the message, which may contain non-ASCII characters, into a sequence of bytes (the bytes datatype), using utf-8 by default, and a bytes object can be printed on any console. In place of each problematic character you will see 2 to 4 escape sequences such as \xc3\xbd, one per byte, and line breaks will be shown as \n. This does us no harm.
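The effect of encode() can be tried on a short string. The sketch below also shows an alternative that keeps the text readable: the helper console_safe is a hypothetical name introduced here, not part of the assignment, and it replaces only the characters the target encoding cannot represent with backslash escapes.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with unencodable characters replaced by escapes."""
    # Assumption: fall back to ASCII if the stream reports no encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # "backslashreplace" turns each unencodable character into an escape
    # sequence instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


body = "Dobr\u00fd den"  # contains one non-ASCII character

# The bytes object shows the character as two utf-8 byte escapes.
print(body.encode())

# Only the problematic character is escaped; the rest stays as text.
print(console_safe(body, "ascii"))
```

Printing the bytes object is the simplest fix; the helper is worth considering only if the escaped line breaks make the output too hard to read.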