====== Computer Lab 09, Spam Filter I ======
  * Q/A
  * Intro to spam filter
  * Practical exercises


===== Spam filter task - introduction =====
  * Read the [[courses:be5b33prg:homeworks:spam:start|problem definition]], [[courses:be5b33prg:homeworks:spam:introduction|introduction]], [[courses:be5b33prg:homeworks:spam:specifications|specifications]]
  * [[courses:be5b33prg:homeworks:spam:data|Data and their format]] 
  * Spam filter: [[courses:be5b33prg:homeworks:spam:step1|step 1]]

===== Practical work =====

==== Statistics for numbers in a file ====
Assume we have a text file (e.g. ''numbers.txt'') containing integers separated by spaces:

<code>
1 2 1 3 1 4
</code>

In module ''filestats.py'', create a function ''compute_file_statistics()'' that takes a path to a text file as its argument, reads in all the numbers, and returns a named tuple ''Statistics'' with fields ''mean'', ''median'', ''min'', ''max''. The ''Statistics'' named tuple shall be defined as:

<code python>
Statistics = namedtuple('Statistics', 'mean median min max')
</code>

Suggestions:
  * You should implement another function, e.g. ''compute_statistics()'', that accepts a list of numbers as input and produces the required data structure. Then the main function only needs to read the data in and pass them to this function.
  * Note that for a collection with an even number of items, the median is defined as the average of the 2 middle items (when the collection is sorted).

=== Usage example ===
<code python>
>>> from filestats import compute_file_statistics
>>> compute_file_statistics('numbers.txt')
Statistics(mean=2.0, median=1.5, min=1, max=4)
</code>
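
One possible way to follow the suggestions above is sketched below. It is only an illustration, not a reference solution: it assumes the file is non-empty, contains only integers separated by whitespace, and is encoded in UTF-8. The helper name ''compute_statistics()'' comes from the suggestion; everything else inside it is our own choice.

<code python>
from collections import namedtuple

Statistics = namedtuple('Statistics', 'mean median min max')


def compute_statistics(numbers):
    """Return Statistics for a non-empty list of numbers (assumption: not empty)."""
    ordered = sorted(numbers)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the median is the single middle item.
        median = ordered[middle]
    else:
        # Even count: the median is the average of the 2 middle items.
        median = (ordered[middle - 1] + ordered[middle]) / 2
    return Statistics(mean=sum(ordered) / n, median=median,
                      min=ordered[0], max=ordered[-1])


def compute_file_statistics(fpath):
    """Read whitespace-separated integers from a file and compute their statistics."""
    with open(fpath, 'r', encoding='utf-8') as f:
        # split() without arguments splits on spaces and newlines alike.
        numbers = [int(token) for token in f.read().split()]
    return compute_statistics(numbers)
</code>

Keeping the computation in ''compute_statistics()'' separate from the file reading makes it easy to test the statistics on plain lists, without creating any files.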

==== Countries and capitals ====
Assume we have a text file, e.g. ''capitals.csv'' (the .csv extension stands for "comma-separated values"), containing a pair of strings on each line, separated by a comma. The first string is the name of a country, the second string is the name of its capital:

<code>
Czech Republic,Prague
USA,Washington
Germany,Berlin
Russia,Moscow
</code>

In module ''geography.py'', create a function ''load_capitals()'' that takes a path to a file containing countries and their capitals as an argument, and reads it into a dictionary that maps each country to its capital.

=== Usage example ===
<code python>
>>> from geography import load_capitals
>>> capitals = load_capitals('capitals.csv')
>>> print(capitals)
{'Czech Republic': 'Prague', 'USA': 'Washington', 'Germany': 'Berlin', 'Russia': 'Moscow'}
</code>
The order of the individual key-value pairs may be different.
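
A minimal sketch of such a function might look as follows. It assumes that every non-empty line contains exactly one comma separating the country from the capital, that the names themselves contain no commas or quotes, and that the file is encoded in UTF-8; a real CSV file with quoted fields would need more careful parsing.

<code python>
def load_capitals(fpath):
    """Read 'country,capital' lines from a file into a dictionary."""
    capitals = {}
    with open(fpath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                # Skip blank lines, e.g. a trailing empty line (our assumption).
                continue
            # Split at the first comma only: country on the left, capital on the right.
            country, capital = line.split(',', 1)
            capitals[country] = capital
    return capitals
</code>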


==== Collection with unique elements? ====
In module ''utils.py'', create a function ''all_elements_unique()'' that checks whether all items of a collection (given as input to the function) are unique.

=== Usage example ===
<code python>
>>> from utils import all_elements_unique
>>> all_elements_unique('abcdef')
True
>>> all_elements_unique([1, 2, 7])
True
>>> all_elements_unique('abracadabra')
False
>>> all_elements_unique([1, 1, 2, 7])
False
</code>
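
One short way to approach this, shown here only as a sketch, is to compare the number of items with the number of distinct items. It assumes that all items are hashable (strings, numbers, tuples), so that they can be stored in a ''set''; a collection of lists, for example, would need a different approach.

<code python>
def all_elements_unique(collection):
    """Return True if no item occurs more than once in the collection."""
    # Copy into a list first so that any iterable works, not only sequences.
    items = list(collection)
    # A set keeps each distinct item once; equal sizes mean no duplicates.
    return len(set(items)) == len(items)
</code>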


==== Unique words ====
In module ''texttools.py'', create a function ''get_unique_words(fpath1, fpath2)'' which takes paths to 2 files and returns a 2-tuple:
  * set of words which were found in the first file but not in the second, and
  * set of words found in the second file, but not in the first one.

=== Usage example ===
Given e.g. the following files ''text1.txt'' and ''text2.txt'':
<code>
When I was one,
I had just begun.
</code>

<code>
When I was two,
I was nearly new.
</code>

the result of executing the function may look like this:

<code python>
>>> from texttools import get_unique_words
>>> first, second = get_unique_words('text1.txt', 'text2.txt')
>>> print(first)
{'one', 'had', 'just', 'begun'}
>>> print(second)
{'two', 'nearly', 'new'}
</code>

Again, the order of individual words in the printouts of the resulting sets may differ.
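
The task can be sketched with set differences, as shown below. This is an illustration under several assumptions of our own: words are separated by whitespace, punctuation attached to the beginning or end of a word is stripped, letter case is kept as it is (so ''When'' and ''when'' would count as different words), and the files are encoded in UTF-8. The helper ''read_words()'' is a hypothetical name, not part of the assignment.

<code python>
import string


def read_words(fpath):
    """Return the set of words in a file, with surrounding punctuation removed."""
    with open(fpath, 'r', encoding='utf-8') as f:
        tokens = f.read().split()
    # strip() removes punctuation only at the ends of each token, e.g. 'one,' -> 'one'.
    words = {token.strip(string.punctuation) for token in tokens}
    # A token made of punctuation only becomes an empty string; drop it.
    words.discard('')
    return words


def get_unique_words(fpath1, fpath2):
    """Return (words only in the first file, words only in the second file)."""
    words1 = read_words(fpath1)
    words2 = read_words(fpath2)
    # Set difference keeps the items of the left set that are missing from the right one.
    return words1 - words2, words2 - words1
</code>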
===== Homework =====

<WRAP round important>
Work on the next graded [[courses:be5b33prg:homeworks:files|homework: working with files]].
</WRAP>

And:

  * Implement [[courses:be5b33prg:homeworks:spam:step1|step 1]] of Spam filter task.
  * Prepare for [[courses:be5b33prg:homeworks:spam:step2|step 2]].