======Spam filter - step 1======

We are going to create a function that reads the information from the ''!truth.txt'' or ''!prediction.txt'' file into a //dictionary// data structure.

/** * [[.unit_testing|Tests]] for step 1: {{:courses:a4b99rph:cviceni:spam:test1_readclassification.zip|}} **/

=====Preparation=====

++++ Working with a dictionary |
  * See {[a4b99rph:Pilgrim2009]}, chapter [[http://www.diveinto.org/python3/native-datatypes.html#dictionaries|2.7]], or {[a4b99rph:Wentworth2012]}, chapter [[http://openbookproject.net/thinkcs/python/english3e/dictionaries.html|20]].
  * How to create an empty dictionary.
  * How to add a key-value pair.
  * How to read the value stored under a key.
  * How to iterate over the dictionary items using the ''items()'' method:

  eng_to_cz = {'cat': 'kocka', 'dog': 'pes', 'house': 'dum'}
  for eng, cz in eng_to_cz.items():
      print(eng, ',', cz)

++++

++++ Working with (text) files |
  * See {[a4b99rph:Pilgrim2009]}, chapter [[http://diveinto.org/python3/files.html|11]], or {[a4b99rph:Wentworth2012]}, chapter [[http://openbookproject.net/thinkcs/python/english3e/files.html|13]].
  * How to open and close a text file.
  * How to use the ''with'' statement.
  * Reading a file line by line.
  * Reading the whole file contents as a single string.
++++

++++ The usage of section ''if __name__ == "__main__":'' |
  * See {[a4b99rph:Pilgrim2009]}, chapter [[http://diveinto.org/python3/your-first-python-program.html#runningscripts|1.10]].
++++

++++ Method ''split()'' of string values |
  * See the Python docs for [[http://docs.python.org/py3k/library/stdtypes.html?highlight=split#str.split|str.split()]].
++++

===== Reading classification from a file =====

Task:
  * In a module called ''utils.py'', create a function ''read_classification_from_file()'' that will read the mail classes from a text file.

Why do we need it:
  * We will need this function when creating a learning filter and when evaluating the quality of a filter.
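One possible shape for such a function is sketched below. It combines the preparation topics (the ''with'' statement, reading line by line, ''str.split()'', and adding key-value pairs to a dictionary). This is only an illustration under stated assumptions, not the reference solution: the UTF-8 encoding and the decision to skip lines that do not contain exactly two fields are assumptions, and the temporary file in the ''%%__main__%%'' section exists only for the demonstration.

```python
import os
import tempfile


def read_classification_from_file(fpath):
    """Return a dict mapping each email filename to its label."""
    classification = {}  # start with an empty dictionary
    # 'with' closes the file automatically, even if an error occurs.
    # Assumption: the file is UTF-8 encoded text.
    with open(fpath, 'r', encoding='utf-8') as f:
        for line in f:  # read the file line by line
            # split() without arguments splits on any whitespace
            # and discards the trailing newline.
            parts = line.split()
            if len(parts) != 2:
                continue  # assumption: skip blank or malformed lines
            filename, label = parts
            classification[filename] = label  # add a key-value pair
    return classification


if __name__ == "__main__":
    # Quick manual check; runs only when the module is executed as a script.
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, '!truth.txt')
        with open(demo_path, 'w', encoding='utf-8') as f:
            f.write('email01 OK\nemail02 OK\nemail03 SPAM\nemail1234 OK\n')
        print(read_classification_from_file(demo_path))
```

Skipping malformed lines keeps the sketch short; raising an error instead would be an equally reasonable choice.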
==== Specifications ====

Function ''read_classification_from_file()'' (in module ''utils.py'') has to conform to the following specifications:

^ ''read_classification_from_file(fpath)'' ^^
^ Input | The path to the text file (most likely either ''!truth.txt'' or ''!prediction.txt'') |
^ Output | A dictionary containing either the ''SPAM'' or the ''OK'' label for each filename in the email corpus. |

The function loads a text file containing a pair of strings per line, separated by a single space, like this:

  email01 OK
  email02 OK
  email03 SPAM
  email1234 OK

and creates a dictionary (the order of individual "rows" in the following listing is not important):

  {'email1234': 'OK', 'email03': 'SPAM', 'email02': 'OK', 'email01': 'OK'}

If the file is empty, it returns an empty dictionary.

> {{page>courses:a4b99rph:internal:cviceni:spam:tyden07#read_classification_from_file&editbtn}}

===== Writing classification (predictions) to a file =====

Task:
  * In module ''utils.py'', create a function ''write_classification_to_file()'' that will write the (usually predicted) mail classes to a text file.

Why do we need it:
  * The function will come in handy when writing the filter; it can be used to create the ''!prediction.txt'' file.

==== Specifications ====

Function ''write_classification_to_file()'' (in module ''utils.py'') should conform to the following specifications:

^ ''write_classification_to_file(cls_dict, fpath)'' ^^
^ Inputs | (1) A dictionary containing the email file names as keys and the email classes (''SPAM'' or ''OK'') as values. |
^ | (2) The path to the text file that shall be created. |
^ Output | None. |

The following code

  >>> cls_dict = {'email1234': 'OK', 'email03': 'SPAM', 'email02': 'OK', 'email01': 'OK'}
  >>> fpath = '1/!prediction.txt'
  >>> write_classification_to_file(cls_dict, fpath)

shall create the file ''!prediction.txt'' in directory ''1'' (the directory must exist) with the following contents:

  email01 OK
  email02 OK
  email03 SPAM
  email1234 OK

The actual order of individual rows in the file is not important. If ''cls_dict'' is empty, the function shall create an empty file.
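A minimal sketch of a function that meets this specification might look as follows. It is an illustration rather than the reference solution: the UTF-8 encoding is an assumption, and the temporary directory in the ''%%__main__%%'' section stands in for directory ''1'' so that the example runs anywhere.

```python
import os
import tempfile


def write_classification_to_file(cls_dict, fpath):
    """Write one 'filename label' pair per line to fpath; returns None."""
    # Mode 'w' creates the file (or truncates an existing one), so an
    # empty dictionary produces an empty file.
    # Assumption: UTF-8 encoding; the directory of fpath must already exist.
    with open(fpath, 'w', encoding='utf-8') as f:
        for filename, label in cls_dict.items():
            f.write(f'{filename} {label}\n')


if __name__ == "__main__":
    cls_dict = {'email1234': 'OK', 'email03': 'SPAM',
                'email02': 'OK', 'email01': 'OK'}
    # A temporary directory is used here instead of directory '1'.
    with tempfile.TemporaryDirectory() as tmp_dir:
        fpath = os.path.join(tmp_dir, '!prediction.txt')
        write_classification_to_file(cls_dict, fpath)
        with open(fpath, encoding='utf-8') as f:
            print(f.read(), end='')
```

The rows come out in the dictionary's iteration order, which is acceptable because the specification does not prescribe any particular order.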