====== Computer Lab 09, Spam Filter I ======

  * Q/A
  * Intro to spam filter
  * Practical exercises

===== Spam filter task - introduction =====

  * Read the [[courses:be5b33prg:homeworks:spam:start|problem definition]], [[courses:be5b33prg:homeworks:spam:introduction|introduction]], and [[courses:be5b33prg:homeworks:spam:specifications|specifications]].
  * [[courses:be5b33prg:homeworks:spam:data|Data and their format]]
  * Spam filter: [[courses:be5b33prg:homeworks:spam:step1|step 1]]

===== Practical work =====

==== Statistics for numbers in a file ====

Assume we have a text file (e.g. ''numbers.txt'') containing integers separated by spaces:

  1 2 1 3 1 4

In module ''filestats.py'', create a function ''compute_file_statistics()'' that takes a path to a text file as its argument, reads in all the numbers, and returns a named tuple ''Statistics'' with fields ''mean'', ''median'', ''min'', and ''max''. The ''Statistics'' named tuple shall be defined as:

  Statistics = namedtuple('Statistics', 'mean median min max')

Suggestions:

  * You should implement another function, e.g. ''compute_statistics()'', that accepts a list of numbers as input and produces the required data structure. Then the main function can just read the data in and pass them to this function.
  * Note that for a collection with an even number of items, the median is defined as the average of the two middle items (when the collection is sorted).

=== Usage example ===

  >>> from filestats import compute_file_statistics
  >>> compute_file_statistics('numbers.txt')
  Statistics(mean=2.0, median=1.5, min=1, max=4)

==== Countries and capitals ====

Let's have a text file, e.g. ''capitals.csv'' (the .csv extension stands for "comma-separated values"), containing a pair of strings on each line.
The first string is the name of a country, the second string is the name of its capital:

  Czech Republic,Prague
  USA,Washington
  Germany,Berlin
  Russia,Moscow

In module ''geography.py'', create a function ''load_capitals()'' that takes a path to a file containing countries and their capitals as an argument, and reads it into a dictionary.

=== Usage example ===

  >>> from geography import load_capitals
  >>> capitals = load_capitals('capitals.csv')
  >>> print(capitals)
  {'Czech Republic': 'Prague', 'USA': 'Washington', 'Germany': 'Berlin', 'Russia': 'Moscow'}

The order of the individual key-value pairs may be different.

==== Collection with unique elements? ====

In module ''utils.py'', create a function ''all_elements_unique()'' that checks whether a collection (given as input to the function) has all items unique.

=== Usage example ===

  >>> from utils import all_elements_unique
  >>> all_elements_unique('abcdef')
  True
  >>> all_elements_unique([1, 2, 7])
  True
  >>> all_elements_unique('abracadabra')
  False
  >>> all_elements_unique([1, 1, 2, 7])
  False

==== Unique words ====

In module ''texttools.py'', create a function ''get_unique_words(fpath1, fpath2)'' that takes paths to 2 files and returns a 2-tuple:

  * the set of words found in the first file but not in the second, and
  * the set of words found in the second file but not in the first.

=== Example usage ===

Given e.g. the following files, ''text1.txt'':

  When I was one, I had just begun.

and ''text2.txt'':

  When I was two, I was nearly new.

the result of executing the function may look like this:

  >>> from texttools import get_unique_words
  >>> first, second = get_unique_words('text1.txt', 'text2.txt')
  >>> print(first)
  {'one', 'had', 'just', 'begun'}
  >>> print(second)
  {'two', 'nearly', 'new'}

Again, the order of individual words in the printouts of the resulting sets may differ.

===== Homework =====

Solve [[courses:be5b33prg:homeworks:files|homework: working with files]]. See the deadline in UploadSystem.
And:

  * Implement [[courses:be5b33prg:homeworks:spam:step1|step 1]] of the Spam filter task.
  * Prepare for [[courses:be5b33prg:homeworks:spam:step2|step 2]].
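
For the ''Statistics for numbers in a file'' exercise above, the median rule for an even number of items is easy to get wrong by one index. The following is only a minimal sketch of one possible approach, not a reference solution. It assumes the input list is non-empty and that the file contains only whitespace-separated integers. The helper name ''compute_statistics()'' follows the suggestion in the exercise.

```python
from collections import namedtuple

Statistics = namedtuple('Statistics', 'mean median min max')


def compute_statistics(numbers):
    """Return a Statistics tuple for a non-empty list of numbers (assumption)."""
    ordered = sorted(numbers)          # the median needs sorted data
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle item is the median.
        median = ordered[middle]
    else:
        # Even count: average of the two middle items.
        median = (ordered[middle - 1] + ordered[middle]) / 2
    mean = sum(ordered) / n
    return Statistics(mean, median, ordered[0], ordered[-1])


def compute_file_statistics(fpath):
    """Read whitespace-separated integers from fpath and compute statistics."""
    with open(fpath, 'r', encoding='utf-8') as f:
        numbers = [int(token) for token in f.read().split()]
    return compute_statistics(numbers)
```

Keeping the computation in ''compute_statistics()'' lets you test it with plain lists, without creating any files.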