Warning
This page is located in archive.

Spam filter - step 1

We are going to create a function that reads the information from the file !truth.txt or !prediction.txt into a dictionary.

Preparation

  • Working with a dictionary (see [Pilgrim2004], chapter 2.7, or [Wentworth2012], chapter 20).
    • How to create an empty dictionary.
    • How to add a key-value pair.
    • How to read a value of a key.
    • How to iterate over the items of a dictionary using the items() method:
      eng_to_cz = {'cat': 'kocka', 'dog': 'pes', 'house': 'dum' }
      for eng, cz in eng_to_cz.items():
          print(eng, ',', cz)
  • Working with (text) files (see [Pilgrim2009], chapter 11, or [Wentworth2012], chapter 13).
    • How to open and close a text file.
    • How to use the with statement.
    • Reading a file line by line.
    • Reading the whole file contents as a single string.
  • How to use the section
    if __name__ == "__main__":
    (see [Pilgrim2009], chapter 1.10).
  • The split() method of strings (see the Python documentation for str.split()).
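
The preparation topics above can be tried out together in one small script. The following is only a minimal sketch: the helper names count_lines(), read_whole() and dictionary_demo() are hypothetical and not part of the assignment, and the UTF-8 encoding is an assumption.

```python
import os
import tempfile


def count_lines(path):
    """Read a text file line by line and return the number of lines."""
    count = 0
    # The with statement closes the file automatically, even after an error.
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            count += 1
    return count


def read_whole(path):
    """Return the whole file contents as a single string."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()


def dictionary_demo():
    """Create an empty dictionary, add two pairs, and read one value back."""
    eng_to_cz = {}               # empty dictionary
    eng_to_cz['cat'] = 'kocka'   # add a key-value pair
    eng_to_cz['dog'] = 'pes'
    # Return the value stored under 'cat' and the items sorted by key.
    return eng_to_cz['cat'], sorted(eng_to_cz.items())


if __name__ == "__main__":
    # This section runs only when the file is executed as a script,
    # not when it is imported as a module.
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    with open(path, 'w', encoding='utf-8') as f:
        f.write('email01 OK\nemail02 SPAM\n')
    print(count_lines(path))
    # split() without arguments splits on any run of whitespace.
    print(read_whole(path).split())
    print(dictionary_demo())
    os.remove(path)
```

Calling split() without arguments also discards the newline at the end of each line, which is convenient when a line holds two words separated by a space.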

Reading classification from a file

Task:

  • In a module called utils.py, create a function read_classification_from_file() that will read the mail classes from a text file.

Why do we need it:

  • We will need this function both when building a learning filter and when evaluating the quality of a filter.

Specifications

Function read_classification_from_file() (in module utils.py) has to conform to the following specifications:

Input: The path to the text file (most likely either !truth.txt or !prediction.txt).
Output: A dictionary mapping each email filename in the corpus to its label, either SPAM or OK.

The function loads a text file containing one pair of strings per line, separated by a single space, like this:

email01 OK
email02 OK
email03 SPAM
email1234 OK
...
and creates a dictionary (the order of individual “rows” in the following listing is not important):
{'email1234': 'OK', 'email03': 'SPAM', 'email02': 'OK', 'email01': 'OK'}

If the file is empty, it returns an empty dictionary.
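
One possible way to meet this specification is sketched below. It is not the reference solution: silently skipping blank or malformed lines and opening the file as UTF-8 are assumptions that the specification does not state.

```python
import os
import tempfile


def read_classification_from_file(path):
    """Return a dictionary mapping each email filename to its label."""
    classification = {}
    with open(path, 'r', encoding='utf-8') as f:  # encoding is an assumption
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                # Assumption: skip blank or malformed lines.
                continue
            filename, label = parts
            classification[filename] = label
    return classification


if __name__ == "__main__":
    # Small demonstration on a temporary file instead of a real !truth.txt.
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    with open(path, 'w', encoding='utf-8') as f:
        f.write('email01 OK\nemail02 OK\nemail03 SPAM\n')
    print(read_classification_from_file(path))
    os.remove(path)
```

An empty file never enters the loop body, so the function returns the empty dictionary it started with, as required.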

courses/ae4b99rph/labs/spam/step1.txt · Last modified: 2013/10/29 15:07 by svobodat