====== Spam filter - step 2 ======
Create the class ''Corpus'' to encapsulate a directory of emails, and add a method that makes it easy to walk through the individual emails.
[[.unit_testing|Tests]] for step 2: {{:courses:a4b99rph:cviceni:spam:test2_corpus.zip|}}
=====Preparation=====
* You should already know how to work with text files.
* You should know how to get a directory listing using the function [[http://docs.python.org/py3k/library/os.html?highlight=listdir#os.listdir|os.listdir()]].
* You should know how to create a generator using the ''yield'' keyword (see the example in chapter [[http://docs.python.org/py3k/tutorial/classes.html#generators|9.10]] of the official Python tutorial).
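As a quick refresher on the last two points, the following minimal sketch combines ''os.listdir()'' with a generator. The function name ''listed_names'' and the throwaway directory are made up for this illustration only; they are not part of the assignment.

```python
import os
import tempfile


def listed_names(directory):
    """Yield the names found in `directory`, one at a time, in sorted order."""
    # `listed_names` is a hypothetical helper used only in this illustration.
    for name in sorted(os.listdir(directory)):
        yield name  # execution pauses here until the caller asks for the next name


# Build a throwaway directory with two empty files so there is something to list.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "a.txt"):
        open(os.path.join(tmp, name), "w").close()
    names = list(listed_names(tmp))

print(names)  # ['a.txt', 'b.txt']
```

Note that ''os.listdir()'' returns names in arbitrary order, which is why the sketch sorts them before yielding.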
=====Corpus=====
Task:
* In module ''corpus.py'', create class ''Corpus'' encapsulating a folder with files (emails)
Why do we need it?
* Class ''Corpus'' will be useful for the batch evaluation of new emails, and it will also serve as the basis for class ''TrainingCorpus'', which you will create in one of the next steps.
==== Specifications ====
Class ''Corpus'' (in module ''corpus.py'') encapsulates the folder with emails and will help us walk through them. It must have the following properties:
* During initialization, it will receive a path to the folder with emails.
* The class must have a method ''emails()'', which must be a generator yielding a pair (file name, message body) for each email. Not all files in the directory contain emails: there may also be special files with names starting with **!**. The method must ignore these files. It will let us use ''Corpus'' like this:
from corpus import Corpus

# Create corpus from a directory
corpus = Corpus('/path/to/directory/with/emails')
count = 0
# Go through all emails and print the filename and the message body
for fname, body in corpus.emails():
    print(fname)
    print(body)
    print("-------------------------")
    count += 1
print("Finished:", count, "files processed.")
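One way the specification above could be met is sketched below. It is not the reference solution. It assumes that the directory contains only regular files (no subdirectories) and that every email is stored as ''utf-8'' text; the attribute name ''path_to_emails'' is an arbitrary choice.

```python
import os


class Corpus:
    """Wraps a directory of email files (a sketch, not the reference solution)."""

    def __init__(self, path_to_emails):
        # The attribute name is an arbitrary choice for this sketch.
        self.path_to_emails = path_to_emails

    def emails(self):
        """Yield (filename, body) pairs, skipping special files starting with '!'."""
        for fname in os.listdir(self.path_to_emails):
            if fname.startswith("!"):
                continue  # special file, not an email
            full_path = os.path.join(self.path_to_emails, fname)
            # Assumption: every remaining entry is a regular utf-8 text file.
            with open(full_path, "r", encoding="utf-8") as f:
                yield fname, f.read()
```

Because ''emails()'' is a generator, each file is opened and read only when the loop asks for the next item, so the whole corpus never has to be held in memory at once.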
The bodies of some email messages contain non-ASCII Unicode characters; that is why we read them using the ''utf-8'' encoding. **However, printing them with ''print(body)'' may raise an exception (typically ''UnicodeEncodeError'')!** Whether it does depends on the system and shell you use to run the above script. The console to which the output is printed has its own encoding, which is often different from ''utf-8''. The ''print'' function may then try to print a character that the console encoding cannot represent.
One possible solution is to print the email body with ''print(body.encode())''. The ''encode()'' method converts the message, which may contain non-ASCII characters, into a sequence of bytes (the ''bytes'' datatype). The printed representation of a ''bytes'' object uses only ASCII characters, so it can be displayed on any console. In place of each problematic character, you will see 2 to 4 escape sequences of the form ''\xNN'', one for each byte of its ''utf-8'' encoding. This is harmless for our purposes.
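The effect of ''encode()'' can be seen on a short sample string. The sample text and the second variant using ''backslashreplace'' are additions made for this illustration, not part of the assignment.

```python
body = "Dobr\u00fd den"  # sample text "Dobrý den" with one non-ASCII character

encoded = body.encode()  # str.encode() defaults to UTF-8 and returns bytes
print(encoded)           # b'Dobr\xc3\xbd den'

# Alternative (an assumption, not required by the assignment): keep a str,
# but replace every non-ASCII character with a backslash escape.
ascii_only = body.encode("ascii", errors="backslashreplace").decode("ascii")
print(ascii_only)        # Dobr\xfd den
```

The first variant shows the two ''utf-8'' bytes of the character, while the second shows its Unicode code point; either output is safe to print on a console with any encoding.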