====== Computer Lab 09, Spam Filter I ======
* Q/A
* Intro to spam filter
* Practical exercises
===== Spam filter task - introduction =====
* Read the [[courses:be5b33prg:homeworks:spam:start|problem definition]], [[courses:be5b33prg:homeworks:spam:introduction|introduction]], [[courses:be5b33prg:homeworks:spam:specifications|specifications]]
* [[courses:be5b33prg:homeworks:spam:data|Data and their format]]
* Spam filter: [[courses:be5b33prg:homeworks:spam:step1|step 1]]
===== Practical work =====
==== Statistics for numbers in a file ====
Assume we have a text file (e.g. ''numbers.txt'') containing integers separated by spaces:
1 2 1 3 1 4
In module ''filestats.py'', create a function ''compute_file_statistics()'' that takes a path to a text file as its argument, reads all the numbers in it, and returns a named tuple ''Statistics'' with fields ''mean'', ''median'', ''min'', ''max''. The ''Statistics'' named tuple shall be defined as:
Statistics = namedtuple('Statistics', 'mean median min max')
Suggestions:
* You should implement another function, e.g. ''compute_statistics()'', that accepts a list of numbers as input and produces the required data structure. Then the main function only needs to read the data in and pass them to this function.
* Note that for a collection with an even number of items, the median is defined as the average of the 2 middle items (when the collection is sorted).
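The suggestions above can be sketched as follows. This is only one possible solution, not the reference one. It assumes the file is non-empty and contains nothing but whitespace-separated integers; an empty file would cause a division by zero.

```python
from collections import namedtuple

Statistics = namedtuple('Statistics', 'mean median min max')


def compute_statistics(numbers):
    """Return a Statistics tuple for a non-empty list of numbers."""
    ordered = sorted(numbers)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of items: the median is the single middle item.
        median = ordered[middle]
    else:
        # Even number of items: average the two middle items.
        median = (ordered[middle - 1] + ordered[middle]) / 2
    return Statistics(mean=sum(ordered) / n, median=median,
                      min=ordered[0], max=ordered[-1])


def compute_file_statistics(fpath):
    """Read whitespace-separated integers from a file and summarize them."""
    with open(fpath, 'rt', encoding='utf-8') as f:
        # split() without arguments handles spaces and line breaks alike.
        numbers = [int(token) for token in f.read().split()]
    return compute_statistics(numbers)
```

Keeping the computation in ''compute_statistics()'' makes it testable with plain lists, without creating any files. The standard ''statistics'' module also provides ''mean()'' and ''median()'', should you prefer not to compute them by hand.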
=== Usage example ===
>>> from filestats import compute_file_statistics
>>> compute_file_statistics('numbers.txt')
Statistics(mean=2.0, median=1.5, min=1, max=4)
==== Countries and capitals ====
Suppose we have a text file, e.g. ''capitals.csv'' (the .csv extension stands for "comma-separated values"), containing a pair of strings on each line. The first string is the name of a country, the second string is the name of its capital:
Czech Republic,Prague
USA,Washington
Germany,Berlin
Russia,Moscow
In module ''geography.py'', create a function ''load_capitals()'' that takes a path to a file containing countries and their capitals as its argument, and reads the file into a dictionary that maps each country to its capital.
=== Usage example ===
>>> from geography import load_capitals
>>> capitals = load_capitals('capitals.csv')
>>> print(capitals)
{'Czech Republic': 'Prague', 'USA': 'Washington', 'Germany': 'Berlin', 'Russia': 'Moscow'}
The order of the individual key-value pairs may be different.
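A minimal sketch of one way to do this uses the standard ''csv'' module. Skipping lines that do not have exactly two fields is an assumption made here, not part of the assignment.

```python
import csv


def load_capitals(fpath):
    """Read 'country,capital' lines into a dict {country: capital}."""
    capitals = {}
    with open(fpath, 'rt', encoding='utf-8', newline='') as f:
        for row in csv.reader(f):
            if len(row) != 2:
                # Assumption: silently skip blank or malformed lines.
                continue
            country, capital = row
            capitals[country.strip()] = capital.strip()
    return capitals
```

For a file this simple, splitting each line with ''line.strip().split(',')'' would work just as well; ''csv.reader'' merely also copes with quoted fields.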
==== Collection with unique elements? ====
In module ''utils.py'', create a function ''all_elements_unique()'' that checks whether all items of a collection (given as input to the function) are unique.
=== Usage example ===
>>> from utils import all_elements_unique
>>> all_elements_unique('abcdef')
True
>>> all_elements_unique([1, 2, 7])
True
>>> all_elements_unique('abracadabra')
False
>>> all_elements_unique([1, 1, 2, 7])
False
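One possible approach, shown here as a sketch, compares the number of items with the size of the corresponding set. It assumes that all items are hashable (strings, numbers, tuples); a list of lists, for example, would need a different approach.

```python
def all_elements_unique(collection):
    """Return True if no item occurs more than once in the collection."""
    items = list(collection)  # materialize, so that iterators work too
    # A set drops duplicates, so the sizes match only if all items are unique.
    return len(set(items)) == len(items)
```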
==== Unique words ====
In module ''texttools.py'', create a function ''get_unique_words(fpath1, fpath2)'' which takes paths to 2 files and returns a 2-tuple:
* set of words which were found in the first file but not in the second, and
* set of words found in the second file, but not in the first one.
=== Example usage ===
Given e.g. the following files ''text1.txt'' (the first two lines below) and ''text2.txt'' (the last two lines below):
When I was one,
I had just begun.
When I was two,
I was nearly new.
the result of executing the function may look like this:
>>> from texttools import get_unique_words
>>> first, second = get_unique_words('text1.txt', 'text2.txt')
>>> print(first)
{'one', 'had', 'just', 'begun'}
>>> print(second)
{'two', 'nearly', 'new'}
Again, the order of individual words in the printouts of the resulting sets may differ.
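A minimal sketch that reproduces the example above relies on set difference. The helper ''load_words()'' is a hypothetical name, and two details are assumptions rather than requirements: words are separated by whitespace, and only leading and trailing punctuation is stripped. Letter case is left unchanged.

```python
import string


def load_words(fpath):
    """Return the set of words in a file (hypothetical helper)."""
    with open(fpath, 'rt', encoding='utf-8') as f:
        text = f.read()
    words = set()
    for token in text.split():
        # Assumption: drop punctuation around the word, e.g. 'one,' -> 'one'.
        word = token.strip(string.punctuation)
        if word:
            words.add(word)
    return words


def get_unique_words(fpath1, fpath2):
    """Return (words only in file 1, words only in file 2)."""
    words1 = load_words(fpath1)
    words2 = load_words(fpath2)
    # Set difference keeps the items of the left set missing from the right.
    return words1 - words2, words2 - words1
```

If words differing only in case should count as the same word, ''load_words()'' could add ''word.lower()'' instead of ''word''.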
===== Homework =====
Work on the next graded [[courses:be5b33prg:homeworks:files|homework: working with files]].
And:
* Implement [[courses:be5b33prg:homeworks:spam:step1|step 1]] of Spam filter task.
* Prepare for [[courses:be5b33prg:homeworks:spam:step2|step 2]].